Monday, May 19, 2025

What is a random variable?

Just because we can't predict something exactly doesn't mean we can't say anything about it at all

There are all kinds of problems where we can't say exactly what the value of something is, but we can still say useful things about it. Here are some examples.

  • The number of goals scored in a football or hockey match.  We might not be able to predict the number of goals scored in a particular match, but we can say something:
    • We know that the number of goals must be an integer greater than or equal to 0.
    • We know that the number of goals is likely to be low and that high scores are unlikely; seeing two goals is far more likely than seeing 100 goals.
  • The number of people buying tickets at a movie theater. We know this will depend on the time of year, the day of the week, the weather, the movies playing, and so on, but even allowing for these factors, there's randomness. People might go on dates (or cancel them) or decide on a whim to see a movie. In this case, we know the minimum number of tickets is zero, the maximum is the number of seats, and that only an integer number of tickets can be sold.
  • The speed of a car on the freeway. Plainly, this is affected by a number of factors, but there's also randomness at play. We know the speed will be a real number greater than zero. We know that in the absence of traffic, it's more likely the car will be traveling at the speed limit than at, say, 20 mph.
  • The score you get by rolling a dice.
(Image: Dietmar Rabich / Wikimedia Commons / “Würfel, gemischt -- 2021 -- 5577” / CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/)

In all these cases, we're trying to measure something, but randomness is at play, which means we can't predict an exact result, but we can still make probabilistic predictions. We can also do math with these predictions, which means we can use them to build computer models and make predictions about how a system might behave.

The variables we're trying to measure are called random variables and I'm going to describe what they are in this blog post. I'm going to start by providing some background ideas we'll need to understand, then I'm going to show you why random variables are useful.

What is a mathematical function?

Functions are going to be important to this story, so bear with me.

In math, a function is some operation where you give it some input and it produces some output. The classic examples you may remember are the trigonometric functions like \(\sin(x)\), \(\cos(x)\), and \(\tan(x)\). A function could have several inputs; for example, this is a function: \(z = a_0 + a_1 x^1 + a_2 y^3\).
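
To make this concrete, here's a minimal Python sketch of that two-input function (the coefficient values are just for illustration):

# A function: inputs go in, one output comes out
# z = a0 + a1*x + a2*y**3, with illustrative coefficients
def z(x, y, a0=1.0, a1=2.0, a2=0.5):
    return a0 + a1 * x + a2 * y**3

print(z(1.0, 2.0))  # 1.0 + 2.0*1.0 + 0.5*8.0 = 7.0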

Functions are very common in math, so much so that it can be a little hard to spot them, as we'll see.

Describing randomness - distributions

A probability distribution is a math function that tells you how likely each outcome of a process is. For example, a traffic light can be red, yellow, or green. How likely is it that the next traffic light I come to will be red, yellow, or green? It must be one of them, so the probabilities must sum to one, but we know that yellow is shorter than red or green, so yellow is less likely. We can reason about the relative likelihood of red versus green in the same way.

Probability distributions can get very complicated, but many of them follow well-known patterns. For example:

  • the score from rolling an unbiased dice follows a discrete uniform distribution
  • the number of goals scored in a hockey or football match is known to be well-modeled by a (discrete) Poisson distribution
  • male (or female) heights are well-modeled by a (continuous) normal distribution

There are hundreds of known distributions, but in practice, only a few are "popular".
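
All three of the distributions above are available in scipy.stats. Here's a short sketch; the Poisson mean of 2.5 goals per match and the height parameters (mean 175 cm, standard deviation 7 cm) are illustrative assumptions, not fitted values:

import numpy as np
from scipy.stats import randint, poisson, norm

# Discrete uniform: the score of an unbiased dice (randint's upper bound is exclusive)
dice = randint(1, 7)
print(dice.pmf(np.arange(1, 7)))      # each face has probability 1/6

# Poisson: goals in a match, assuming a mean of 2.5 goals per game
goals = poisson(mu=2.5)
print(goals.pmf(np.arange(0, 6)))     # P(0 goals), P(1 goal), ...

# Normal: adult male height in cm, assuming mean 175 and standard deviation 7
height = norm(loc=175, scale=7)
print(height.pdf(175))                # the density peaks at the mean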

Discrete or continuous

There are two types of measurement we typically take: continuous and discrete.

Discrete measurements are things that come in discrete chunks, for example, the number of sheep in a flock, the number of goals in a match, the number of people in a movie theater, and so on. Categorical variables are "sort of" discrete, for example the colors of a traffic light, though they are a special case.

Continuous measurements are things that can take any value (including any number of digits after the decimal point). For example, the speed of a car on the freeway could be 72.15609... mph, someone's height might be 183.876... cm and so on. 

This seems clear, but sometimes we muddy the waters a bit. Let's say we're measuring height and we measure in whole cm. This transforms the measurement from a continuous one to a discrete one.

There are two types of probability distribution: continuous and discrete. We use continuous distributions for continuous quantities and discrete for discrete quantities. You should note that in the real world, it's often not this simple.

Random variables

A random variable is a math function whose output depends on some random process. The values of the random variable follow a probability distribution. Here are some examples of observations that we can describe using random variables:

  • the lifetime of a lightbulb
  • goals scored
  • the result of rolling a dice
  • the speed of cars on a freeway
  • the height of a person
  • sales revenue

Dice are easy to understand, so I'll use them as an example. We don't know what the result of throwing a dice will be, but we know the probability distribution is discrete uniform, so the probability of throwing a 1 is \(\dfrac{1}{6}\), the probability of throwing a 2 is \(\dfrac{1}{6}\), and so on. Let's say we're gambling on dice, betting $1 and winning $6 if our number comes up. Using random variable math, we can work out what our gain or loss might be. In the dice example, it's trivial, but in other cases, it gets harder and we need some more advanced math.
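
Here's a minimal sketch of that bet, assuming the $6 payout includes the returned $1 stake (so a win nets $5 and a loss costs $1):

import numpy as np

faces = np.arange(1, 7)
probs = np.ones(6) / 6          # discrete uniform distribution

# Net gain for each outcome when betting $1 on a single number (say, a six)
gain = np.where(faces == 6, 5, -1)

expected_gain = np.sum(gain * probs)
print(expected_gain)            # 0.0 -- a fair game under these assumptions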

Every random variable has a set of all possible results, which can be finite or infinite; it's called the sample space. The sample space is a set denoted by \(\Omega\). For the dice example, the sample space is simply:

\[\Omega = \{1,2,3,4,5,6\}\]

For a continuous quantity, like the lifetime of a bulb:

\[\Omega = \{x \mid x \in \mathbb{R}, x \geq 0 \} \]

which means an infinite sample space. 

Infinite sample spaces, or large discrete sample spaces, mean we can't work things out by hand; we need more powerful math to do anything useful, and that's where things get hard.

A measurement (or observation) is the process of selecting a value from the sample space. Remember, the random variable has a probability distribution that tells you how likely different values are to be selected. 

Arithmetic with random variables - doing something useful

In this section and the next, I'll start to show you some interesting things you can do with random variables. To illustrate a key idea, we'll use a simple example. We'll work out the probability distribution for the combined scores we get by throwing two unbiased dice. 

We know the distribution is uniform for both dice, so we could work it out by hand like this:

Table 1: combining the scores of two dice

Dice 1 | Dice 2 | Combined score | Probability
1      | 1      | 2              | \(\dfrac{1}{36}\)
1      | 2      | 3              | \(\dfrac{1}{36}\)
1      | 3      | 4              | \(\dfrac{1}{36}\)
...    | ...    | ...            | ...
2      | 1      | 3              | \(\dfrac{1}{36}\)
2      | 2      | 4              | \(\dfrac{1}{36}\)
2      | 3      | 5              | \(\dfrac{1}{36}\)
...    | ...    | ...            | ...

The next step is adding up the probabilities for each combined score:

  • there's only one way of getting 2, so its probability is \(\dfrac{1}{36}\)
  • there are two ways of getting 3, so its probability is \(\dfrac{1}{36} + \dfrac{1}{36}\)
  • ...

This is really tedious, and obviously would be hugely expensive for a large sample space. There's a much faster way I'm going to show you.

To add two random variables, we use a process called convolution. This is a fancy way of saying that, for each possible combined score, we multiply together the probabilities of every pair of outcomes that produces that score and add up the products. Mathematically, it looks like this for discrete random variables, where \(f\) is the distribution for the first dice and \(g\) is the distribution for the second dice:

\[(f * g)[n] = \sum_{m}{f[m] \, g[n-m]}\]

In Python, we need to do it in two stages: work out the sample space and work out the probabilities. Here's some code to do it for two dice.  

import numpy as np

# Sample spaces and (uniform) probabilities for each dice
score1, score2 = np.arange(1, 7), np.arange(1, 7)
prob1, prob2 = np.ones(6) / 6, np.ones(6) / 6

# Possible combined scores run from 1 + 1 = 2 to 6 + 6 = 12
combo_score = list(range(score1[0] + score2[0], score1[-1] + score2[-1] + 1))
# Discrete convolution gives the probability of each combined score
combo_prob = np.convolve(prob1, prob2)

print(combo_score)
print(combo_prob)

This is easy to do by hand for two dice, but not when the data sets get a lot bigger; that's when we need computers.

The discrete case is easy enough, but the continuous case is harder and the math is more advanced. Let's take an example to make things more concrete. Imagine a company with two sales areas. An analyst is modeling them as continuous random variables. How do we work out the total sales? The answer is the continuous convolution of the two sales areas, which looks like this:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) \,d\tau\]

This is obviously a lot more complicated. It's so complicated, I'm going to spend a little time explaining how to do it.

Broadly speaking, there are three approaches to continuous convolution: special cases, symbolic calculation, and discrete approximations.

In a handful of cases, convolving two continuous random variables has known answers. For example, convolving normal distributions gives a normal distribution and convolving uniform distributions gives an Irwin-Hall distribution.

In almost all cases, it's possible to do a symbolic calculation using integration. You might think that something like SymPy could do it, but in practice, you need to do it by hand. Obviously, you need to be good at calculus. There are several textbooks that have some examples of the process and there are a number of discussions on StackOverflow. From what I've seen, college courses in advanced probability theory seem to have course questions on convolving random variables with different distributions and students have asked for help with them online. This should give you an inkling of the level of difficulty.

The final approach is to use discrete approximations to continuous functions and use discrete convolution. This tends to be the default in most cases.
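
Here's a rough sketch of that approach: sample two continuous distributions on a grid, convolve them with np.convolve, and rescale by the grid step. I'm using two normal distributions with made-up parameters so we can check the result against the known exact answer:

import numpy as np
from scipy.stats import norm

dx = 0.01                                  # grid step: smaller = more accurate
x = np.arange(-10, 10 + dx, dx)

f = norm.pdf(x, loc=1.0, scale=1.5)        # first distribution (illustrative)
g = norm.pdf(x, loc=2.0, scale=0.5)        # second distribution (illustrative)

# Discrete convolution approximates the continuous integral; scale by dx
h = np.convolve(f, g) * dx
z = 2 * x[0] + dx * np.arange(len(h))      # support of the sum

# For normals, the exact answer is N(1 + 2, sqrt(1.5^2 + 0.5^2))
exact = norm.pdf(z, loc=3.0, scale=np.sqrt(1.5**2 + 0.5**2))
print(np.max(np.abs(h - exact)))           # small approximation error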

Worked example with random variables: predicting revenue and net income

Let's say we want to model the total sales revenue (\(t\)) from several regions (\(s_0, s_1, ..., s_n\)) that are independent. We also have a model of expenses for the company as a whole (\(e\)). How can we model total revenue and net income?

Let's assume the sales revenue in each region is modeled by a random variable with a normal distribution. We have mean values \(\mu_0, \mu_1, ..., \mu_n\) and standard deviations \(\alpha_0, \alpha_1, ..., \alpha_n\). To get total sales, we have to do convolution:

\[t = s_0 * s_1 * ... * s_n\]

This sounds complicated, but for the normal distribution, there's a shortcut. Convolving a normal with a normal gives a normal: all we have to do is add the means and add the variances. So the total sales figure is a normal distribution with mean and variance:

\[\mu = \sum_{i=0}^{n}\mu_i\]

\[\alpha^2 = \sum_{i=0}^{n}\alpha_{i}^{2}\]
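
Here's a short sketch of that shortcut, with made-up numbers for three sales regions:

import numpy as np
from scipy.stats import norm

# Illustrative regional revenue models: means and standard deviations in $
mus = np.array([1.2e6, 0.8e6, 1.5e6])
sigmas = np.array([0.2e6, 0.1e6, 0.3e6])

# Convolving normals gives a normal: add the means, add the variances
total = norm(loc=mus.sum(), scale=np.sqrt(np.sum(sigmas**2)))

print(total.mean())             # expected total revenue
print(total.interval(0.95))     # a 95% range for total revenue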

Getting net income is a tiny bit harder. If you remember your accountancy textbooks, net income \(ni\) is:

\[ni = t - e\]

If expenses are modeled by the normal distribution, the answer here is just a variation of the process I used for combining sales. But what if expenses are modeled by some other distribution? That's where things get tough. 
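
Before tackling the harder case, here's the easy (normal) case as a short sketch, continuing the made-up numbers above: the means subtract, but the variances still add.

import numpy as np
from scipy.stats import norm

total_mu, total_var = 3.5e6, np.sum(np.array([0.2e6, 0.1e6, 0.3e6])**2)
exp_mu, exp_sigma = 3.0e6, 0.25e6          # assumed (illustrative) expense model

ni = norm(loc=total_mu - exp_mu, scale=np.sqrt(total_var + exp_sigma**2))
print(ni.mean())                # expected net income
print(ni.interval(0.95))        # a 95% range for net income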

Combining random variables with different probability distributions is hard. There's no good inventory of known solutions that I could find on the web. You can do the symbolic calculation by hand, but that requires a good grasp of calculus. You might think that something like SymPy would work, but at the time of writing, SymPy doesn't have a good way of doing it. The final option is a discrete approximation, but that's time-consuming to do. Bottom line: there's no easy solution if the distributions aren't all normal or aren't all uniform.

Division and multiplication with random variables

Most problems using random variables seem to boil down to adding them. If you need to multiply or divide random variables, there are ways to do it. The book "The Probability Lifesaver" by Steven J. Miller explains how.

Minimum, maximum, and expected values

I said that convolving random variables can be very hard, but getting some values is pretty straightforward.

The maximum of the sum of two random variables \(f\) and \(g\) is simply \(\max(f) + \max(g)\).

The minimum of the sum of two random variables \(f\) and \(g\) is simply \(\min(f) + \min(g)\).

What about the mean? It turns out, getting the mean is easy too. The mean value of a random variable is often called the expectation value and is the result of a function called \(E\), so the mean of a random variable \(X\) is \(E(X)\). The formula for the mean of the sum of two random variables is:

\[E(X + Y) = E(X) + E(Y)\]

In simple words, we add the means. 

Note I didn't say what the underlying distributions were. That's because it doesn't matter.

What if we apply some function to a random variable? It turns out, you can calculate the mean of a function of a random variable fairly easily and the arithmetic for combining multiple means is well known. There are pages on Wikipedia that will show you how to do it (in general, search for "linear combinations of expectation values" to get started).
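
For a discrete random variable, the mean of a function \(g\) of that variable is \(\sum g(x) P(X=x)\). Here's a tiny sketch using the dice score:

import numpy as np

faces = np.arange(1, 7)
probs = np.ones(6) / 6

# E[g(X)] for g(x) = x^2: weight each g(x) by its probability
mean_square = np.sum(faces**2 * probs)
print(mean_square)       # 91/6 ≈ 15.17, not the same as E[X]^2 = 3.5^2 = 12.25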

Bringing it all together

There are a host of business and technical problems where we can't give a precise answer, but we can model the distribution of answers using random variables. There's a ton of theory surrounding the properties and uses of random variables, but it does get hard. By combining random variables, we can build models of more complicated systems; for example, we could forecast the range of net incomes for a company for a year. In some cases (e.g. normal distributions), combining random variables is easy; in other cases, it takes us into the world of calculus or discrete approximations.

Yes, random variables are hard, but they're very powerful.

Wednesday, May 14, 2025

You need to use Manus

What is Manus - agentic AI

Manus is an AI agent capable of performing a number of high-level tasks that previously could only be done by humans. For example, it can research an area (e.g. a machine learning method) and produce an intelligible report, it can even turn a report into an interactive website. You can get started on it for free.

It created a huge fuss on its release, and rightly so. The capabilities it offers are ground-breaking. A few months on, it's got even better.

In this blog post, I'm going to provide you with some definitions, show you what Manus can do, give you some warnings, and provide you with some next steps.

If you want to get an invitation to Manus, contact me.

How it works 

We need some definitions here. 

An LLM (Large Language Model) is a huge computer model that's been trained on large bodies of text. That could be human language (e.g. English, Chinese) or it could be computer code (e.g. Python, JavaScript). An LLM can do things like:

  • extract meaning from text e.g. given a news article on a football match, it can tell you the score, who won, who lost, and other details from the text
  • predict the next word in a sentence or the next sentence in a paragraph
  • produce entire "works", for example, you can ask an LLM to write a play on a given theme.

An agent is an LLM that controls other LLMs without human intervention. For example, you might set it the task of building a user interface using react.js. The agent will interpret your task and break it down into several subtasks. It will then ask LLMs to build code for each subtask and stitch the code together. More importantly for this blog post, you can use an agent to build a report for you on a topic. The agent will break down your request into chunks, assign those chunks to LLMs, and build an answer for you. An example topic might be "build me a report on what to do during a 10 day vacation in Brazil".

Manus is an agentic AI. It will split your request into chunks, assign those chunks to LLMs (it could be the same LLM or it could be different ones depending on the task), and combine the results into a report.

An example

I gave the following instructions to Manus:

You are an experienced technical professional. You will write a report explaining how logistic regression works for your colleagues. Your report will be a Word document. Your report will include the following sections:

* Why logistic regression is important.

* The theory and math behind it.

* A worked example. This will include code in Python using the appropriate libraries.

You will include the various math formula using the correct notation. You will provide references where appropriate.

Here's how it got started:


After it started, I realized I needed to modify my instructions, here's the dialog:

It incorporated my request and added more sections.

Here's an example of how it kept me updated:

After 20 minutes, it produced a report in Word format. After reading the report, I realized I wanted to turn it into a blog post, so I asked Manus to give me the report as an HTML document, which it did.

I've posted the report as a blog post and you can read it here: https://blog.engora.com/2025/05/the-importance-of-logistic-regression.html

A critique of the Manus report

I'm familiar with logistic regression, so I can critique what Manus returned. I'd give it a B+. This may sound a bit harsh, but that's a very credible result for 20 minutes of effort. It's enough to get going with, but it's not enough on its own. Here's my assessment.

  • Writing style and use of English. Great. Better than most native English speakers.
  • Report organization. Great. Very clear and concise. Nicely formatted.
  • Technical correctness. I couldn't spot anything wrong with what it produced. It did leave out some important material, though, and had some oddities:
    • Logistic regression with more than two target classes: no mention of it.
    • The odds ratio can vary from 0 to +\(\infty\), but it didn't mention this. This is curious, as it pointed out that linear regression can vary from -\(\infty\) to +\(\infty\) in the prior paragraphs.
    • Too terse a description of the sigmoid function. It should have included a chart and a deeper discussion of some of the relevant properties of the function.
    • No meaningful discussion of decision boundaries (one mention, in not enough detail).
  • Formulas. A curious mixed bag. In some cases, it gave very good formulas using the standard symbols, and in other cases it gave code-like formulas. This might be because I told it I wanted a Word report. By default, it uses Markdown, and it may be better to keep the report in Markdown. It might be worth experimenting with telling it to use LaTeX for formulas.
  • Code. Great.
  • References. Not great. No links back to the several online books that discuss logistic regression in some detail. No links to academic papers. The references it did provide were kind of OK, but there weren't enough of them and, overall, they weren't high quality enough.

To fix some of these issues, I could have tweaked my prompt, for example, telling it to use academic references or giving it instructions to expand certain areas. This would cost more tokens. I could also have told it to use high-effort reasoning, which would have cost me more tokens.

Tokens in AI

Computation isn't free and that's especially true of AI. Manus, in common with many other AI services, uses a "token" model. This report cost me 511 tokens. Manus gives you a certain number of tokens for free, which is enough for experimentation but not enough for commercial use.

What's been written about it

Other people have written about Manus too. Here are some reviews:

Who owns Manus

Manus is owned by a Chinese company called Monica (also known as Butterfly Effect AI) based in Wuhan.

Some cautions

As with any LLM or agentic AI, I suggest that you do not share company confidential information or PII. This includes data, but also includes text. Some LLMs/agents will use any data (including text) you supply to help train their models. This might be OK, but it also might not be OK - proceed with caution.

Before you use any agentic AI or an LLM for "production" use, I suggest a legal and risk review.

  • What does their system do with the data you send it? Does it retain the data, does it train the model? Is it resold?
  • What does their system do with the output (e.g. final report, generated code)? 
  • Can you ask for your data to be removed from their model or system?

What this means - next steps

These types of agentic AI are game-changers. They will get you the information you need far faster and far cheaper than a human could. The information isn't perfect, and perhaps you wouldn't give it an A, but it's more than good enough to get started, and frankly, most humans don't produce A work.

If you're involved in any kind of knowledge work, you should be experimenting with Manus and its competitors. This technology has obvious implications for employment and if you think you might be affected, it behoves you to understand what's going on.

If you want to get started, reach out to me to get an invitation to Manus and get extra free tokens.

The Importance of Logistic Regression

Note

With the exception of this note, everything else on this blog post was automatically created by Manus. I'm providing it as an example of what you can create.

In this separate blog post, I explain how I created this report and I provide an evaluation of it.

If you wanted to get started with Manus, contact me and I'll share an invitation with you.

Mike

======================================

The Importance of Logistic Regression

Logistic regression stands as a cornerstone in the field of machine learning and statistics, primarily recognized for its efficacy in tackling binary classification problems. Its importance stems from a combination of its interpretability, efficiency, and the foundational understanding it provides for more complex algorithms. Unlike linear regression, which predicts continuous outcomes, logistic regression is specifically designed to predict the probability of an instance belonging to a particular class, typically one of two (e.g., yes/no, true/false, 0/1). This probabilistic output is crucial in many real-world scenarios where a clear-cut decision boundary is needed, but an understanding of the likelihood of each outcome is also valuable.

One of the key reasons for logistic regression’s widespread adoption is its relative simplicity and ease of implementation. It serves as an excellent starting point for individuals venturing into predictive modeling and classification tasks. The mathematical underpinnings, while involving concepts like the sigmoid function and log-odds, are generally more accessible than those of more sophisticated models like neural networks or support vector machines. This accessibility does not, however, detract from its power. Logistic regression can provide robust and accurate predictions, especially when the relationship between the independent variables and the log-odds of the dependent variable is approximately linear.

Furthermore, the interpretability of logistic regression models is a significant advantage. The coefficients derived from a trained logistic regression model can be directly interpreted in terms of the odds ratio. This allows practitioners to understand the influence of each independent variable on the likelihood of the outcome. For instance, in a medical diagnosis scenario, a logistic regression model can not only predict the probability of a patient having a certain disease but also quantify how factors like age, weight, or specific test results contribute to that probability. This level of insight is invaluable in fields where understanding the ‘why’ behind a prediction is as important as the prediction itself.

Logistic regression is also computationally efficient, making it suitable for large datasets and real-time applications. Training a logistic regression model is generally faster compared to more complex algorithms, and making predictions is also quick. This efficiency, combined with its good performance on many binary classification tasks, makes it a go-to algorithm for a wide range of applications. These applications span various domains, including medical diagnosis (e.g., predicting disease presence), finance (e.g., credit scoring, fraud detection), marketing (e.g., predicting customer churn or purchase likelihood), and social sciences (e.g., predicting voting behavior).

Moreover, logistic regression serves as a fundamental building block for understanding more advanced classification techniques. Many concepts introduced in logistic regression, such as the use of a link function (the sigmoid function), maximum likelihood estimation for parameter fitting, and the evaluation of model performance using metrics like accuracy, precision, recall, and AUC-ROC, are transferable to other machine learning algorithms. Therefore, a solid grasp of logistic regression provides a strong foundation for learning and applying more complex models.

In summary, the importance of logistic regression is multifaceted. It is a powerful yet relatively simple and interpretable classification algorithm that provides probabilistic outputs. Its computational efficiency, wide range of applications, and its role as a foundational concept in machine learning solidify its place as an essential tool in the data scientist’s and statistician’s toolkit. Whether used as a standalone model or as a baseline for comparison with more complex methods, logistic regression continues to be a highly relevant and valuable technique in the world of data analysis and predictive modeling.

The Theory and Math Behind Logistic Regression

Logistic regression, despite its name, is a statistical model used for binary classification tasks, meaning it predicts the probability of an instance belonging to one of two classes. The core idea is to model the probability that a given input point belongs to a certain class. To understand its mechanics, we need to delve into concepts like the odds, the logit function, the sigmoid (or logistic) function, and the method of maximum likelihood estimation for fitting the model.

From Linear Regression to Probabilities

Linear regression predicts a continuous output, y, based on a linear combination of input features, X. The equation for a simple linear regression with one feature is y = β₀ + β₁x. For multiple features, this becomes y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. However, the output of linear regression can range from -∞ to +∞, which is not suitable for probabilities that must lie between 0 and 1.

To address this, logistic regression transforms the linear combination of inputs using a function that maps any real-valued number into the (0, 1) interval. This function is the sigmoid function, also known as the logistic function.

The Sigmoid (Logistic) Function

The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

Here, ‘z’ represents the linear combination of input features and their corresponding coefficients (weights): z = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. The output of the sigmoid function, σ(z), is the estimated probability P(Y=1|X), i.e., the probability that the dependent variable Y is 1 (e.g., ‘pass’, ‘yes’, ‘disease present’) given the input features X. As z approaches +∞, e^(-z) approaches 0, and σ(z) approaches 1. Conversely, as z approaches -∞, e^(-z) approaches +∞, and σ(z) approaches 0. This S-shaped curve is ideal for modeling probabilities.

Odds and Log-Odds (Logit)

To understand the derivation of the logistic regression model, it’s helpful to consider the concept of odds. The odds of an event occurring is the ratio of the probability of the event occurring to the probability of it not occurring:

Odds = P(Y=1|X) / P(Y=0|X)

Since P(Y=0|X) = 1 - P(Y=1|X), we can write:

Odds = P(Y=1|X) / (1 - P(Y=1|X))

If we let p(X) = P(Y=1|X) = σ(z) = 1 / (1 + e^(-z)), then:

1 - p(X) = 1 - [1 / (1 + e^(-z))] = (1 + e^(-z) - 1) / (1 + e^(-z)) = e^(-z) / (1 + e^(-z))

So, the odds become:

Odds = [1 / (1 + e^(-z))] / [e^(-z) / (1 + e^(-z))] = 1 / e^(-z) = e^z

Now, taking the natural logarithm of the odds gives us the log-odds, also known as the logit function:

logit(p(X)) = ln(Odds) = ln(e^z) = z

Thus, we have:

ln(p(X) / (1 - p(X))) = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

This equation shows that the log-odds of the outcome is a linear function of the input features. This is the fundamental relationship that logistic regression models. The coefficients (β) can be interpreted in terms of the change in log-odds for a one-unit change in the corresponding feature, holding other features constant. Exponentiating a coefficient gives the odds ratio.

Model Fitting: Maximum Likelihood Estimation (MLE)

Unlike linear regression, where coefficients are typically estimated using Ordinary Least Squares (OLS), logistic regression coefficients are estimated using Maximum Likelihood Estimation (MLE). MLE is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the given data.

For a dataset with ‘n’ independent observations {(xᵢ, yᵢ)}, where xᵢ is the vector of features for the i-th observation and yᵢ is its binary outcome (0 or 1), the likelihood function L(β) is the product of the probabilities of observing each yᵢ given xᵢ and the parameters β:

L(β) = Πᵢ [p(xᵢ) ^ yᵢ] * [(1 - p(xᵢ)) ^ (1 - yᵢ)]

where p(xᵢ) = σ(β₀ + β₁x₁ᵢ + … + βₚxₚᵢ) is the predicted probability for the i-th observation.

It is often easier to work with the log-likelihood function, ll(β), because it converts the product into a sum:

ll(β) = ln(L(β)) = Σᵢ [yᵢ * ln(p(xᵢ)) + (1 - yᵢ) * ln(1 - p(xᵢ))]

Substituting p(xᵢ) = 1 / (1 + e^(-zᵢ)) and 1 - p(xᵢ) = e^(-zᵢ) / (1 + e^(-zᵢ)), where zᵢ = β₀ + β₁x₁ᵢ + … + βₚxₚᵢ, the log-likelihood becomes:

ll(β) = Σᵢ [yᵢ * zᵢ - ln(1 + e^(zᵢ))]

To find the values of β that maximize this log-likelihood function, we typically use iterative optimization algorithms like Gradient Ascent (since we are maximizing) or Newton-Raphson. These algorithms start with initial estimates for β and iteratively update them until the log-likelihood converges to a maximum. There is no closed-form solution for the β coefficients in logistic regression, unlike in linear regression.

Assumptions of Logistic Regression

While logistic regression is more flexible than linear regression, it still relies on a few key assumptions:

  1. Binary Dependent Variable: The dependent variable must be binary or dichotomous (e.g., 0/1, yes/no). For more than two categories, extensions like multinomial or ordinal logistic regression are used.
  2. Independence of Observations: The observations should be independent of each other. This is a common assumption for many statistical models.
  3. Linearity of Log-Odds: The relationship between the independent variables and the log-odds of the outcome is assumed to be linear. This can be checked using techniques like the Box-Tidwell test or by plotting residuals.
  4. Absence of Multicollinearity: There should be little or no multicollinearity among the independent variables. High multicollinearity can make it difficult to estimate the individual effects of the predictors.
  5. Large Sample Size: Logistic regression typically requires a reasonably large sample size to achieve stable and reliable estimates of the coefficients.

Understanding these theoretical and mathematical underpinnings is crucial for effectively applying logistic regression, interpreting its results, and diagnosing potential issues.

Worked Example: Logistic Regression in Python

This section provides a practical, step-by-step demonstration of how to implement logistic regression using Python. We will leverage popular libraries such as pandas for data manipulation, scikit-learn for machine learning tasks including model building and evaluation, and numpy for numerical operations. For this example, we will use the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which is conveniently available within scikit-learn. This dataset presents a binary classification problem: predicting whether a breast mass is malignant or benign based on several computed features from digitized images of fine needle aspirates (FNA).

1. Importing Necessary Libraries

The first step in any Python-based data science task is to import the required libraries. We will need pandas for creating and managing DataFrames, numpy for numerical computations (though its direct use might be minimal here, it underpins scikit-learn), and several modules from scikit-learn for data splitting, model implementation, preprocessing, and metrics.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

2. Loading and Exploring the Dataset

We load the breast cancer dataset using load_breast_cancer() from sklearn.datasets. The data and feature names are then used to create a pandas DataFrame for easier manipulation and inspection. The target variable, indicating whether a tumor is malignant (1) or benign (0), is added as a new column to this DataFrame.

# Load the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

Before proceeding with modeling, it is crucial to perform some initial exploratory data analysis (EDA). We display the first few rows of the DataFrame using df.head() to get a feel for the data, df.info() to understand the data types and check for missing values, and df["target"].value_counts() to see the distribution of the target classes.

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

This initial exploration helps confirm that the dataset is loaded correctly, identify the nature of the features (all appear to be numerical in this case), and understand the balance of the classes in the target variable, which is important for classification tasks.

3. Defining Features and Target Variable

Next, we separate the dataset into features (independent variables, denoted as X) and the target variable (dependent variable, denoted as y). X will contain all columns except the ‘target’ column, and y will consist solely of the ‘target’ column.

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

4. Splitting Data into Training and Testing Sets

To evaluate the performance of our logistic regression model on unseen data, we split the dataset into a training set and a testing set. The model will be trained on the training set, and its predictive performance will be assessed on the testing set. We use train_test_split from sklearn.model_selection for this purpose. A common split is 80% for training and 20% for testing. Setting random_state ensures that the split is the same every time the code is run, making the results reproducible. The stratify=y argument ensures that the proportion of the target classes is maintained in both the training and testing sets, which is particularly important for imbalanced datasets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

5. Feature Scaling

Many machine learning algorithms, including logistic regression (especially when using certain solvers like ‘lbfgs’ or when regularization is applied), perform better when the input numerical features are on a similar scale. Feature scaling standardizes the range of independent variables. We use StandardScaler from sklearn.preprocessing, which standardizes features by removing the mean and scaling to unit variance. The scaler is fit only on the training data to prevent data leakage from the test set, and then used to transform both the training and testing data.

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

6. Initializing and Training the Logistic Regression Model

With the data prepared, we can now initialize and train our logistic regression model. We create an instance of the LogisticRegression class from sklearn.linear_model. For this example, we specify the solver="liblinear", which is a good choice for smaller datasets and binary classification, and set random_state for reproducibility. The max_iter parameter is increased to ensure the solver has enough iterations to converge. The model is then trained using the fit() method with the scaled training features (X_train_scaled) and the training target variable (y_train).

# Initialize and train the Logistic Regression model
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

7. Making Predictions

Once the model is trained, we can use it to make predictions on the test set (X_test_scaled). The predict() method returns the predicted class labels (0 or 1 in this case). We also use the predict_proba() method to obtain the predicted probabilities for each class. This provides the likelihood of an instance belonging to class 0 (benign) and class 1 (malignant).

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

8. Evaluating the Model

Model evaluation is crucial to understand how well our logistic regression model performs. We use several common metrics for classification tasks:

  • Accuracy: This is the proportion of correctly classified instances. It is calculated using accuracy_score.
  • Confusion Matrix: This table provides a detailed breakdown of correct and incorrect classifications for each class (True Positives, True Negatives, False Positives, False Negatives). It is generated using confusion_matrix.
  • Classification Report: This report, generated by classification_report, includes precision, recall, F1-score, and support for each class. These metrics provide a more nuanced view of performance, especially if the classes are imbalanced.
    • Precision measures the accuracy of positive predictions (TP / (TP + FP)).
    • Recall (or Sensitivity) measures the model’s ability to identify all actual positives (TP / (TP + FN)).
    • F1-score is the harmonic mean of precision and recall, providing a single score that balances both.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

The output of these evaluations will indicate the model’s effectiveness. For instance, a high accuracy and balanced precision/recall scores suggest good performance.

9. Interpreting Predicted Probabilities

To further understand the model’s output, we can look at the predicted probabilities for a few samples from the test set. This shows the model’s confidence in its predictions.

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

Each row in y_pred_proba contains two probabilities: the first for class 0 (benign) and the second for class 1 (malignant). The predict() method typically assigns the class with the higher probability (usually based on a 0.5 threshold).

10. Interpreting Model Coefficients

Finally, we can examine the coefficients (weights) learned by the logistic regression model. These coefficients indicate the relationship between each feature and the log-odds of the outcome. A positive coefficient suggests that an increase in the feature’s value increases the log-odds of the outcome being class 1 (malignant), while a negative coefficient suggests the opposite. We can also exponentiate these coefficients to get odds ratios, which are often easier to interpret. An odds ratio greater than 1 means the odds of the outcome (malignant) increase with an increase in the feature, while an odds ratio less than 1 means the odds decrease.

# Interpreting Coefficients
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

This step provides insights into which features are most influential in the model’s predictions. It is important to remember that these interpretations are based on the scaled features if feature scaling was applied.

This worked example covers the end-to-end process of applying logistic regression, from data loading and preprocessing to model training, evaluation, and basic interpretation. The specific results (accuracy, coefficients, etc.) will depend on the dataset and the chosen parameters, but the methodology remains consistent.

# Python Worked Example for Logistic Regression

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

# Load the dataset
# The breast cancer dataset is a classic binary classification dataset.
# Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
# They describe characteristics of the cell nuclei present in the image.
# The target variable is whether the mass is malignant (1) or benign (0).
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

# Split the data into training and testing sets
# We use 80% of the data for training and 20% for testing.
# random_state is set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# Feature Scaling
# Logistic regression can benefit from feature scaling, especially when using solvers that are sensitive to feature magnitudes.
# StandardScaler standardizes features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model
# We use a simple logistic regression model with default parameters for this example.
# max_iter is increased to ensure convergence for some solvers.
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

# Evaluate the model
# Accuracy: The proportion of correctly classified instances.
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

# Confusion Matrix: A table showing the performance of a classification model.
# Rows represent the actual classes, and columns represent the predicted classes.
# TN | FP
# FN | TP
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

# Classification Report: Provides precision, recall, F1-score, and support for each class.
# Precision: TP / (TP + FP) - Ability of the classifier not to label as positive a sample that is negative.
# Recall (Sensitivity): TP / (TP + FN) - Ability of the classifier to find all the positive samples.
# F1-score: 2 * (Precision * Recall) / (Precision + Recall) - Weighted average of Precision and Recall.
# Support: The number of actual occurrences of the class in the specified dataset.
class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

# Interpreting Coefficients (Optional, but good for understanding)
# The coefficients represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable,
# holding other variables constant.
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

# To get odds ratios, we can exponentiate the coefficients
odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

print("\n--- End of Worked Example ---")

References

  1. GeeksforGeeks. (2025, February 3). Logistic Regression in Machine Learning. GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/understanding-logistic-regression/
  2. Rai, K. (2020, June 14). The math behind Logistic Regression. Analytics Vidhya on Medium. Retrieved from https://medium.com/analytics-vidhya/the-math-behind-logistic-regression-c2f04ca27bca
  3. Wikipedia contributors. (2024, May 9). Logistic regression. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Logistic_regression
  4. Scikit-learn developers. (n.d.). sklearn.linear_model.LogisticRegression. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  5. Scikit-learn developers. (n.d.). sklearn.datasets.load_breast_cancer. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
  6. Scikit-learn developers. (n.d.). sklearn.model_selection.train_test_split. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
  7. Scikit-learn developers. (n.d.). sklearn.preprocessing.StandardScaler. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  8. Scikit-learn developers. (n.d.). sklearn.metrics module. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
  9. Pandas development team. (n.d.). Pandas documentation. Pandas. Retrieved from https://pandas.pydata.org/pandas-docs/stable/
  10. NumPy developers. (n.d.). NumPy documentation. NumPy. Retrieved from https://numpy.org/doc/