Monday, May 19, 2025

What is a random variable?

Just because we can't predict something exactly doesn't mean we can't say anything about it at all

There are all kinds of problems where we can't say exactly what the value of something is, but we can still say useful things about it. Here are some examples.

  • The number of goals scored in a football or hockey match.  We might not be able to predict the number of goals scored in a particular match, but we can say something:
    • We know that the number of goals must be an integer greater than or equal to 0.
    • We know that the number of goals is likely to be low and that high scores are unlikely; seeing two goals is far more likely than seeing 100 goals.
  • The number of people buying tickets at a movie theater. We know this will depend on the time of year, the day of the week, the weather, the movies playing, and so on, but even allowing for these factors, there's randomness. People might go on dates (or cancel them) or decide on a whim to see a movie. In this case, we know the minimum number of tickets is zero, the maximum is the number of seats, and that only an integer number of tickets can be sold.
  • The speed of a car on the freeway. Plainly, this is affected by a number of factors, but there's also randomness at play. We know the speed will be a real number greater than or equal to zero. We know that in the absence of traffic, it's more likely the car will be traveling at the speed limit than at, say, 20 mph.
  • The score you get by rolling a dice.
(Dietmar Rabich / Wikimedia Commons / “Würfel, gemischt -- 2021 -- 5577” / CC BY-SA 4.0)

In all these cases, we're trying to measure something, but randomness means we can't predict an exact result. We can still make probabilistic predictions, though, and we can do math with those predictions, which means we can use them to build computer models and forecast how a system might behave.

The variables we're trying to measure are called random variables, and in this blog post I'm going to describe what they are. I'll start with some background ideas we'll need, then show you why random variables are useful.

What is a mathematical function?

Functions are going to be important to this story, so bear with me.

In math, a function is an operation that takes some input and produces some output. The classic examples you may remember are the trigonometric functions \(\sin(x)\), \(\cos(x)\), and \(\tan(x)\). A function can have several inputs; for example, this is a function of two inputs: \(z = a_0 + a_1 x + a_2 y^3\).
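Here's the same idea as code; a minimal sketch in Python (the coefficient values are arbitrary):

def z(x, y, a0=1.0, a1=2.0, a2=0.5):
    """A function of two inputs: z = a0 + a1*x + a2*y**3."""
    return a0 + a1 * x + a2 * y**3

print(z(1.0, 2.0))  # 1.0 + 2.0 * 1.0 + 0.5 * 8.0 = 7.0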

Functions are very common in math, so much so that it can be a little hard to spot them, as we'll see.

Describing randomness - distributions

A probability distribution is a math function that tells you how likely each outcome of a process is. For example, a traffic light can be red, yellow, or green. How likely is it that the next traffic light I come to will be red, yellow, or green? It must be one of them, so the probabilities must sum to one, and because yellow is shorter than red or green, yellow is less likely. We can reason about the relative likelihood of red and green in the same way.
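We can write this distribution down directly. Here's a minimal sketch in Python; the probabilities are made-up values for illustration:

# A discrete probability distribution for the next traffic light's color.
# The probabilities are assumed for illustration, not measured values.
light_distribution = {"red": 0.45, "yellow": 0.10, "green": 0.45}

# The probabilities over all possible outcomes must sum to one.
assert abs(sum(light_distribution.values()) - 1.0) < 1e-9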

Probability distributions can get very complicated, but many of them follow well-known patterns. For example:

  • the score from rolling an unbiased dice follows a discrete uniform distribution
  • the number of goals scored in a hockey or football match is known to be well-modeled by a (discrete) Poisson distribution
  • male (or female) heights are well-modeled by a (continuous) normal distribution
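You can draw samples from each of these distributions with NumPy. Here's a quick sketch; the Poisson mean and the height mean and standard deviation are assumed values:

import numpy as np

rng = np.random.default_rng(42)

dice = rng.integers(1, 7, size=5)         # discrete uniform on {1, ..., 6}
goals = rng.poisson(lam=2.5, size=5)      # Poisson; a mean of 2.5 goals is an assumption
heights = rng.normal(175.0, 7.0, size=5)  # normal; mean/sd in cm are assumptions

print(dice, goals, heights)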

There are hundreds of known distributions, but in practice, only a few are "popular".

Discrete or continuous

There are two types of measurement we typically take: continuous and discrete.

Discrete measurements are things that come in discrete chunks, for example, the number of sheep in a flock, the number of goals in a match, the number of people in a movie theater, and so on. Categorical variables are "sort of" discrete, for example the colors of a traffic light, though they are a special case.

Continuous measurements are things that can take any value (including any number of digits after the decimal point). For example, the speed of a car on the freeway could be 72.15609... mph, someone's height might be 183.876... cm and so on. 

This seems clear, but sometimes we muddy the waters a bit. Let's say we're measuring height and we measure in whole cm. This transforms the measurement from a continuous one to a discrete one.
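In code, the rounding makes the switch explicit:

height = 183.876           # a continuous measurement in cm
height_cm = round(height)  # recording whole cm makes it discrete: 184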

There are two types of probability distribution: continuous and discrete. We use continuous distributions for continuous quantities and discrete for discrete quantities. You should note that in the real world, it's often not this simple.

Random variables

A random variable is a math function whose output depends on some random process. The values of the random variable follow a probability distribution. Here are some examples of observations that we can describe using random variables:

  • the lifetime of a lightbulb
  • goals scored
  • the result of rolling a dice
  • the speed of cars on a freeway
  • the height of a person
  • sales revenue

Dice are easy to understand, so I'll use them as an example. We don't know what the result of throwing the dice will be, but we know the probability distribution is discrete uniform, so the probability of throwing a 1 is \(\dfrac{1}{6}\), the probability of throwing a 2 is \(\dfrac{1}{6}\), and so on. Let's say we're gambling on dice, betting $1 and winning $6 if our number comes up. Using random variable math, we can work out what our gain or loss might be, as the sketch below shows. In the dice example, it's trivial, but in other cases, it gets harder and we need some more advanced math.
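Here's the expected-value arithmetic for this bet, assuming the $6 payout includes the return of the $1 stake:

# Expected gain for the dice bet: receive $6 with probability 1/6,
# lose the $1 stake otherwise (assumes the payout includes the stake).
p_win = 1 / 6
expected_gain = p_win * 6 - 1  # = 0.0, so the bet is exactly fair
print(expected_gain)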

Random variables have a set of all possible results, called the sample space, which can be finite or infinite. The sample space is a set denoted by \(\Omega\). For the dice example, the sample space is simply:

\[\Omega = \{1,2,3,4,5,6\}\]

For a continuous quantity, like the lifetime of a bulb:

\[\Omega = \{x \mid x \in \mathbb{R}, x \geq 0\}\]

which is an infinite sample space (a bulb's lifetime can't be negative).

Infinite sample spaces, or large discrete sample spaces, mean we can't work things out by hand; we need more powerful math to do anything useful, and that's where things get hard.

A measurement (or observation) is the process of selecting a value from the sample space. Remember, the random variable has a probability distribution that tells you how likely different values are to be selected. 
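In code, a measurement is just a weighted draw from the sample space. Here's a sketch using the dice distribution:

import numpy as np

rng = np.random.default_rng()

sample_space = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6

# One "measurement": select a value from the sample space according
# to the probability distribution.
observation = rng.choice(sample_space, p=probabilities)
print(observation)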

Arithmetic with random variables - doing something useful

In this section and the next, I'll start to show you some interesting things you can do with random variables. To illustrate a key idea, we'll use a simple example. We'll work out the probability distribution for the combined scores we get by throwing two unbiased dice. 

We know the distribution is uniform for both dice, so we could work it out by hand like this:

Table 1: Combining the scores of two dice

Dice 1   Dice 2   Combined score   Probability
  1        1            2          \(\dfrac{1}{36}\)
  1        2            3          \(\dfrac{1}{36}\)
  1        3            4          \(\dfrac{1}{36}\)
  ...
  2        1            3          \(\dfrac{1}{36}\)
  2        2            4          \(\dfrac{1}{36}\)
  2        3            5          \(\dfrac{1}{36}\)
  ...

The next step is adding up the probabilities for each combined score:

  • there's only one way of getting 2, so its probability is \(\dfrac{1}{36}\)
  • there are two ways of getting 3 (1 + 2 and 2 + 1), so its probability is \(\dfrac{1}{36} + \dfrac{1}{36} = \dfrac{2}{36}\)
  • ...

This is really tedious, and it would obviously be hugely expensive for a large sample space. There's a much faster way, which I'm going to show you.

To add two random variables, we use a process called convolution. This is a fancy way of saying we combine every element of one random variable with every element of the other and add up the probabilities for each combined outcome. Mathematically, it looks like this for discrete random variables, where \(f\) is the distribution for the first dice and \(g\) the distribution for the second dice:

\[(f * g)[n] = \sum_{m=-M}^{M}{f[n-m]g[m]}\]

In Python, we need to do it in two stages: work out the sample space and work out the probabilities. Here's some code to do it for two dice.  

import numpy as np

# Sample spaces and probabilities for two fair dice.
score1, score2 = np.arange(1, 7), np.arange(1, 7)
prob1, prob2 = np.ones(6) / 6, np.ones(6) / 6

# The combined sample space runs from the smallest possible total (2)
# to the largest (12).
combo_score = list(range(score1[0] + score2[0], score1[-1] + score2[-1] + 1))

# Discrete convolution adds the two random variables' distributions.
combo_prob = np.convolve(prob1, prob2)

print(combo_score)
print(combo_prob)

This is easy to do by hand for two dice, but not when the data sets get a lot bigger; that's when we need computers.

The discrete case is easy enough, but the continuous case is harder and the math is more advanced. Let's take an example to make things more concrete. Imagine a company with two sales areas, where an analyst is modeling each area's sales as a continuous random variable. How do we work out the total sales? The answer is the continuous convolution of the two distributions:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) \,d\tau\]

This is obviously a lot more complicated. It's so complicated, I'm going to spend a little time explaining how to do it.

Broadly speaking, there are three approaches to continuous convolution: special cases, symbolic calculation, and discrete approximations.

In a handful of cases, convolving two continuous random variables has a known closed-form answer. For example, convolving normal distributions gives a normal distribution, and convolving uniform distributions gives an Irwin-Hall distribution.

In almost all cases, it's possible in principle to do a symbolic calculation using integration. You might think that something like SymPy could do it, but in practice, you need to do it by hand, so obviously, you need to be good at calculus. There are several textbooks with examples of the process, and there are a number of discussions on StackOverflow. From what I've seen, college courses in advanced probability theory set questions on convolving random variables with different distributions, and students ask for help with them online. This should give you an inkling of the level of difficulty.

The final approach is to use discrete approximations to the continuous distributions and then apply discrete convolution, as the sketch below shows. This tends to be the default in most cases.
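Here's a minimal sketch of the discrete-approximation approach: sample two normal densities on a grid, convolve them, and check the result against the known closed-form answer. The means and standard deviations are arbitrary choices:

import numpy as np
from scipy.stats import norm

# Discretize two normal densities on a common grid.
step = 0.01
x = np.arange(-10, 10, step)
f = norm.pdf(x, loc=1.0, scale=0.5) * step  # probability mass per grid cell
g = norm.pdf(x, loc=2.0, scale=1.0) * step

# Discrete convolution approximates the continuous convolution.
h = np.convolve(f, g)
x_h = 2 * x[0] + np.arange(len(h)) * step  # grid for the convolved distribution

# The exact answer is normal with mean 1.0 + 2.0 = 3.0.
print(np.sum(x_h * h))  # close to 3.0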

Worked example with random variables: predicting revenue and net income

Let's say we want to model the total sales revenue (\(t\)) from several regions (\(s_0, s_1, ...s_n\)) that are independent. We also have a model of expenses for the company as a whole (\(e\)). How can we model total revenue and net income?

Let's assume the sales revenue in each region is modeled by a random variable with a normal distribution. We have mean values \(\mu_0, \mu_1, ..\mu_n\) and standard deviations \(\sigma_0, \sigma_1, ...\sigma_n\). To get total sales, we have to do a convolution:

\[t = s_0 * s_1 * ... * s_n\]

This sounds complicated, but for the normal distribution, there's a short-cut: convolving normal with normal gives normal, so all we have to do is add the means and the variances. The total sales number is a normal distribution with mean and variance:

\[\mu = \sum_{i=0}^{n}\mu_i\]

\[\sigma^2 = \sum_{i=0}^{n}\sigma_{i}^{2}\]

Getting net income is a tiny bit harder. If you remember your accountancy textbooks, net income \(ni\) is:

\[ni = t - e\]

If expenses are modeled by a normal distribution, the answer here is just a variation of the process I used for combining sales: the difference of two normals is also normal, with the means subtracted and the variances added, as the sketch below shows. But what if expenses are modeled by some other distribution? That's where things get tough.
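Here's a sketch of the all-normal case; the regional means and standard deviations are made-up numbers:

import numpy as np

# Assumed regional sales revenue: means and standard deviations.
mu = np.array([100.0, 150.0, 80.0])
sigma = np.array([10.0, 12.0, 8.0])

# Total revenue: normal, with summed means and summed variances.
mu_t = mu.sum()
sigma_t = np.sqrt(np.sum(sigma**2))

# Expenses, also assumed normal (made-up numbers).
mu_e, sigma_e = 250.0, 20.0

# Net income = revenue - expenses: subtract the means, add the variances.
mu_ni = mu_t - mu_e
sigma_ni = np.sqrt(sigma_t**2 + sigma_e**2)

print(mu_ni, sigma_ni)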

Combining random variables with different probability distributions is hard. I couldn't find a good inventory of known solutions on the web. You can do the symbolic calculation by hand, but that requires a good grasp of calculus. You might think that something like SymPy would work, but at the time of writing, SymPy doesn't have a good way of doing it. The final option is a discrete approximation, but that's time-consuming to set up. Bottom line: there's no easy solution if the distributions aren't all normal or all uniform.

Division and multiplication with random variables

Most problems using random variables seem to boil down to adding them. If you need to multiply or divide random variables, there are ways to do it; the book "The Probability Lifesaver" by Steven J. Miller explains how.

Minimum, maximum, and expected values

I said that convolving random variables can be very hard, but getting some summary values is pretty straightforward.

The maximum possible value of the sum of two random variables \(f\) and \(g\) is simply \(max(f) + max(g)\).

The minimum possible value of the sum is simply \(min(f) + min(g)\).

What about the mean? It turns out getting the mean is easy too. The mean value of a random variable is often called its expectation value and is the result of a function called \(E\), so the mean of a random variable \(X\) is \(E(X)\). The formula for the mean of the sum of two random variables is:

\[E(X + Y) = E(X) + E(Y)\]

In simple words, we add the means. 

Note I didn't say what the underlying distributions were. That's because it doesn't matter.
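A quick simulation makes the point. Below, one dice is fair and the other is biased (the weights are made-up), and the means still add:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X: a fair dice. Y: a biased dice (the weights are made-up).
x = rng.choice(np.arange(1, 7), size=n)
y = rng.choice(np.arange(1, 7), size=n, p=[0.3, 0.2, 0.2, 0.1, 0.1, 0.1])

# E(X + Y) matches E(X) + E(Y), whatever the underlying distributions.
print((x + y).mean(), x.mean() + y.mean())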

What if we apply some function to a random variable? It turns out you can calculate the mean of a function of a random variable fairly easily, and the arithmetic for combining multiple means is well known. There are pages on Wikipedia that will show you how to do it (in general, search for "linearity of expectation" to get started).

Bringing it all together

There are a host of business and technical problems where we can't give a precise answer, but we can model the distribution of answers using random variables. There's a ton of theory surrounding the properties and uses of random variables, and it does get hard. By combining random variables, we can build models of more complicated systems; for example, we could forecast the range of net incomes for a company for a year. In some cases (e.g. normal distributions), combining random variables is easy; in other cases, it takes us into the world of calculus or discrete approximations.

Yes, random variables are hard, but they're very powerful.
