Thursday, March 20, 2025

Compliance!

Compliance

Compliance means a company and its employees follow the rules so the company doesn't get punished by regulators (e.g. fines), courts (e.g. adverse legal judgments), the market (a stock price drop), or all three. "Rules" means financial and privacy law, but also contractual obligations. On the face of it, this all sounds like something only the finance and legal departments need to worry about, but increasingly data people (analysts, data scientists, data engineers) need to follow compliance rules too. In this blog post, I'll explain why compliance applies to you (data people) and what you can do about it.

(Get compliance wrong, and someone like this may be in your future. InfoGibraltar, CC BY 2.0, via Wikimedia Commons)

I'm not a lawyer, so don't take legal advice from me. What you should do is read this blog post, think about gaps in your compliance processes, and talk to your legal team.

Private data on people

By now, most data people understand that data identifying individuals is covered by privacy laws and needs to be handled carefully. Data people also understand that there can be large fines for breaches or for mishandling data. Unfortunately, this understanding often isn't enough: privacy laws are broader and more complex than many technical staff realize.

(Private Property sign by Oast House Archive, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons)

Several data privacy laws (most notably, the GDPR) have extraterritorial provisions, meaning the law can apply to processing anywhere in the world. For example, a Mexican company processing data on French residents is covered by the GDPR even though the data processing takes place in Mexico. A company operating internationally must obey several sets of laws, which in practice means applying the strictest rules to everyone.

What is personally identifiable information (PII) sometimes isn't clear and can change suddenly. Most famously, the Court of Justice of the European Union (CJEU) ruled in the Breyer case that IP addresses can be PII under some circumstances. I'm not going to dive into the ruling here (you can look it up), but the court's logic is clear. What this ruling illustrates is that "common sense" views of what is and is not PII aren't good enough.  

The GDPR defines a subset of data on people as "special categories of personal data" which are subject to more stringent regulation (this guide has more details). This includes data on sexuality, religion, political activities etc. Once again, this seems obvious in theory, but in practice is much harder. For example, the name of someone's partner can reveal their sexuality and is therefore sensitive data.

There are two types of private data on people companies handle that are often overlooked. Employee data is clearly private, but is usually closely held for obvious reasons. Customer data in CRM systems is also private data on people but tends to be less protected. Most CRM systems have prospect and contact names, job titles, phone numbers etc. and I've even heard of systems that list customers' hobbies and interests. Data protection rules apply to these systems too.

I've only just scratched the surface of the rules surrounding processing data on people but hopefully I've made clear that things aren't as straightforward as they appear. A company can break the law and be fined if its staff (e.g. data analysts, data scientists, data engineers etc.) handle data in a way contrary to the law.

Trading based on confidential information

Many companies provide services to other companies, e.g. HR, payroll, internet, etc. This gives service providers' employees access to confidential information on their customers. If you're a service provider, should you let your employees make securities transactions based on confidential customer information?

(Harshitha BN, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

A hypothetical case can make the risks clearer. Let's imagine a payroll company provides services to other companies, including several large companies. A data analyst at the payroll company spots ahead of time that one of their customers is laying off a large number of its employees. The data analyst trades securities in that company based on this confidential information. Later on, the fact that the data analyst made those trades becomes publicly known.

There are several possible consequences here.

  • Depending on the jurisdiction, this may count as "insider trading" and be illegal. It could lead to arrests, with consequent bad publicity and reputational damage.
  • This could be a breach of contract and could lead to the service provider losing a customer.
  • At the very least, there will be commercial repercussions because the service provider has violated customer trust.

Imagine you're a company providing services to other companies. Regardless of the law, do you think it's a good idea for your employees to be buying or selling securities based on their confidential customer knowledge?

Legal contracts

This is a trickier area and gets companies into trouble. It's easiest if I give you a hypothetical case and point out the problems.

(Staselnik, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A company, ServiceCo, sells services into the mining industry in different countries. As part of its services, it sells a "MiningNetwork" product that lists mining companies and the names of people in various jobs in them (e.g. safety officers, geologists, and so on). It also produces regular reports on the mining industry that it makes available for free on its website as part of its marketing efforts; this is called the "Mining Today Report".

For sales prospecting purposes, the sales team buys data from a global information vendor called GlobalData. The data ServiceCo buys lists all the mines owned by different companies (including joint ventures etc.) and has information on those mines (locations, what's being mined, workforce size etc.). It also lists key employees at each of those mines. This data is very expensive, in part because it costs GlobalData a great deal of money to collect. The ServiceCo sales team incorporates the GlobalData data into their CRM and successfully goes prospecting. Although the data is expensive, the sales team are extracting value from it and it's worth it to them.

Some time later, a ServiceCo data analyst finds this data in an internal database and they realize it could be useful elsewhere. In conjunction with product management, they execute a plan to use it:

  • They augment the "MiningNetwork" product with GlobalData data. Some of this data ServiceCo already had, but the GlobalData data adds new mine sites and new people and is a very significant addition. The data added is taken directly from the GlobalData data without further processing.
  • They augment their free "Mining Today Report" with the GlobalData data. In this case, it's a very substantial upgrade, increasing the scope of the report by 50% or more. In some cases, the additions to the report are based on conclusions drawn from the GlobalData data, in other cases it's a direct lift (e.g. mine locations). 

Just prior to release, the analyst and the product manager report this work to the ServiceCo CTO and CEO in an internal pre-release demo call. The analyst is really happy to point out that this is a substantial new use for data that the company is paying a great deal of money for.

You are the CEO of ServiceCo. What do you do next and why?

Here's my answer. You ask the data analyst and the product manager if they've done a contract review with your legal team to check that this use of GlobalData's data is within the terms of the contract. You ask for the name of the lawyer they've worked with and you speak to the lawyer before the release goes out. If the answer isn't satisfactory, you stop the projects immediately regardless of any pre-announcements that have been made. 

Why?

These two projects could put the company in substantial legal jeopardy. When you buy data, it usually comes with an agreement specifying allowed uses. Anything else is forbidden. In this case, the data was bought for sales prospecting purposes from a large and experienced data supplier (GlobalData). It's very likely that usage of this data will be restricted to sales prospecting and for internal use only. Bear in mind, GlobalData may well be selling the same sort of data to mining companies and other companies selling to mining companies. So there are likely two problems here:

  1. The GlobalData data will be used for purposes beyond the original license agreement.
  2. The GlobalData data will be distributed to other companies free of charge (in the case of "Mining Today Report"), or for a charge ("MiningNetwork"), with no royalty payments to GlobalData. In effect, ServiceCo will go from being a user of GlobalData data to distributing GlobalData's data without paying them. ServiceCo will be doing this without an explicit agreement from GlobalData. This may well substantially damage GlobalData's business.

The second point is the most serious and could result in a lawsuit with substantial penalties.

The bottom line is simple. When you buy data, it comes with restrictions on how you use it. It's up to you to stick to the rules. If you don't, you may well get sued.

(I haven't mentioned "open source" data so far. Many freely available data sets have licensing provisions that forbid commercial use of the data. If that's the case, you can't use it for commercial purposes. Again, the onus is on you to check and comply.)

What can you do about it?

Fortunately, there are things you can do to manage the risk. Most of the actions revolve around having a repeatable process and/or controls. The nice thing about process and controls is that, if something does go wrong, you can often reduce the impact; for example, if you breach the GDPR, you can show you treated compliance seriously and argue for a lesser fine.

Let's look at some of the actions you should consider to manage data compliance risk.

Education

Everyone who handles data needs to go through training. This should include:

  • Privacy and PII training.
  • Trading on confidential information.
  • Rules around handling bought in data.

Initially, everyone needs to be trained, but that training needs to be refreshed every year or so. Of course, new employees must be trained too.

Restricted access/queries

Who has access to data needs to be regulated and controlled. For example, who needs to have access to CRM data? Plainly, the sales and marketing teams and the engineering people supporting the product, but who else? Who should not have access to the data? The first step here is to audit access, the second step is to control access, the third step is to set up a continuous monitoring process.

A piece that's often missed is controlling the nature of queries run on the data. The GDPR limits the processing of PII to purposes with a lawful basis, such as a legitimate business interest. An analyst may well run exploratory queries to see if the company could extract more value from the data, and that could be problematic. The solution here is education and supervision.

Encryption

There's an old cybersecurity mantra: "encrypt data at rest, encrypt data in transit". Your data needs to be encrypted with an appropriately secure, modern algorithm, and any hashed values stored using a scheme that isn't susceptible to rainbow table or similar attacks.

Related to encryption is the idea of pseudonymization. To put it simply, this replaces key PII with a string, e.g. "John Smith" might be replaced with "Qe234-6jDfG-j56da-9M02sd", similarly, we might replace his passport number with a string, his credit card number with a string, his IP address with a string, his account number, and so on. The mapping of this PII data to strings is via a database table with very, very restricted access.

As it turns out, almost all analysis you might want to do on PII data works equally well with pseudonymization. For example, let's say you're a consumer company and you want to know how many customers you have in a city. You don't actually need to know who they are, you just need counts. You can count unique strings just the same as you can count unique names. 

There's a lot more to say about this technique, but all I'm going to say now is that you should be using it.
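
As a sketch of the technique, here's what pseudonymization might look like in Python. The in-memory dict standing in for the mapping table, the names, and the token scheme are all my own illustrative inventions, not from any specific system; in production the mapping would live in that very tightly access-controlled database table.

```python
import secrets

# Illustrative stand-in for the restricted mapping table described above.
pseudonym_map = {}  # real PII value -> opaque token

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable random token."""
    if value not in pseudonym_map:
        pseudonym_map[value] = secrets.token_urlsafe(16)
    return pseudonym_map[value]

# Analysts see only tokens, but counting still works:
customers = ["John Smith", "Jane Doe", "John Smith"]
tokens = [pseudonymize(name) for name in customers]
unique_customers = len(set(tokens))  # 2, same as counting unique names
```

Because each real value always maps to the same token, counts, joins, and group-bys behave exactly as they would on the raw names.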

Audit

This is the same as any audit: you go through the organization with a set of questions and checks. An audit is a good idea as an initial activity, but tends to be disruptive. After the initial audit, I favor annual spot checks.

Standards compliance

There are a ton of standards out there covering data compliance: SOC 2, NIST, ISO 27000, FedRAMP, etc. It's highly likely that an organization will have to comply with one or more of them. You could try to deal with many compliance issues by conforming to a standard, but be aware that will still leave gaps. The problem with complying with a standard is that the certification becomes the goal rather than reducing risk. Standards are not enough.

Help line

A lot of these issues are hard for technical people to understand. They need ongoing support and guidance. A good idea is to ensure they know who to turn to to get help. This process needs to be quick and easy. 

(Something to watch out for is management retaliation. Let's say a senior analyst thinks a use of data breaches legal terms but their manager tells them to do nothing. The analyst reaches out to the legal team who confirms that the intended use is a breach. The manager cannot be allowed to retaliate against the analyst.)

The bottom line

As a technical person, you need to treat this stuff seriously. Assuming "common sense" can get you into a lot of trouble. Make friends with your legal team, they're there to help you.

Tuesday, March 18, 2025

Data science jokes

Data science jokes


(An OpenAI generated image of some data scientists laughing. There are two reasons why you know it's fake: they're all beautiful and they're all laughing at these jokes.)

Where do data scientists go unplanned camping?
In a random forest.

Who do they bring on their trip?
Their nearest neighbors.

What do zoo keepers and data scientists have in common?
They both import pandas.

Where do data scientists go camping to get away from it all?
In an isolation forest.

What's the difference between ML and AI?
If it's written in Python, then it's probably ML.
If it's written in PowerPoint, then it's probably AI.

A Machine Learning algorithm walks into a bar.
The bartender asks, "What'll you have?"
The algorithm says, "What's everyone else having?"

Data science is 80% preparing data, and 20% complaining about preparing data.

A SQL query walks into a bar, walks up to two tables, and asks, “Can I join you?”

How did the data scientist describe their favorite movie? It had a great training set.

Why do data scientists love parks?
Because of all the natural logs!

What’s the difference between an entomologist and a data scientist?
Entomologists classify bugs. Data scientists remove bugs from their classifiers.

Why did the data set go to therapy?
It had too many issues with its relationships!

Why does Python live on land?
Because it's above C-level.

One of these jokes was generated by OpenAI. Can you tell which one?

Monday, March 10, 2025

Everything you wanted to know about the normal distribution but were afraid to ask

Normal is all around you, and so is not-normal

The normal distribution is the most important statistical distribution. In this blog post, I'm going to talk about its properties, where it occurs, and why it's so very important. I'm also going to talk about how using the normal distribution when you shouldn't can lead to disaster and what you can do about it.

(Ainali, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A rose by any other name

The normal distribution has a number of different names in different disciplines:

  • Normal distribution. This is the name used by statisticians and data scientists.
  • Gaussian distribution. This is what physicists call it.
  • The bell curve. The name used by social scientists and by people who don't understand statistics.

I'm going to call it the normal distribution in this blog post, and I'd advise you to call it this too. Even if you're not a data scientist, using the most appropriate name helps with communication.

What it is and what it looks like

When we're measuring things in the real world, we see different values. For example, if we measure the heights of 10 year old boys in a town, we'd see some tall boys, some short boys, and most boys around the "average" height. We can work out what fraction of boys are a certain height and plot a chart of frequency on the y axis and height on the x axis. This gives us a probability or frequency distribution. There are many, many different types of probability distribution, but the normal distribution is the most important.

(As an aside, you may remember making histograms at school. These are "sort-of" probability distributions. For example, you might have recorded the height of all the children in a class, grouped them into height ranges, counted the number of children in each height range, and plotted the chart. The y axis would have been a count of how many children in that height range. To turn this into a probability distribution, the y axis would become the fraction of all children in that height range. )

Here's what a normal probability distribution looks like. Yes, it's the classic bell curve shape which is exactly symmetrical.


The formula describing the curve is quite complex, but all you need to know for now is that it's described by two numbers: the mean (often written \(\mu\)) and a standard deviation (often written \(\sigma\)). The mean tells you where the peak is and the standard deviation gives you a measure of the width of the curve. 

To greatly summarize: values near the mean are the most likely to occur and the further you go from the mean, the less likely they are. This lines up with our boys' heights example: there aren't many very short or very tall boys and most boys are around the mean height.

Obviously, if you change the mean or the standard deviation, you change the curve, for example, you can change the location of the mean or you can make the curve wider or narrower. It turns out changing the mean and standard deviation just scales the curve because of its underlying mathematical properties. Most distributions don't behave like this; changing parameters can greatly change the entire shape of the distribution (for example, the beta distribution wildly changes shape if you change its parameters). The normal scaling property has some profound consequences, but for now, I'll just focus on one. We can easily map all normal distributions to one standard normal distribution. Because the properties of the standard normal are known, we can easily do math on the standard normal. To put it another way, it greatly speeds up what we need to do.

Why the normal distribution is so important

Here are some normal distribution examples from the real world.

Let's say you're producing precision bolts. You need to supply 1,000 bolts of a precise specification to a customer. Your production process has some variability. How many bolts do you need to manufacture to get 1,000 good ones? If you can describe the variability using a normal distribution (which is the case for many manufacturing processes), you can work out how many you need to produce.

Imagine you're outfitting an army and you're buying boots. You want to buy the minimum number of boots while still fitting everyone. You know that many body dimensions follow the normal distribution (most famously, chest circumference), so you can make a good estimate of how many boots of different sizes to buy.

Finally, let's say you've bought some random stocks. What might the daily change in value be? Under usual conditions, the daily change in value is often modeled as normally distributed, so you can estimate what your portfolio might be worth tomorrow.

It's not just these three examples, many phenomena in different disciplines are well described by the normal distribution.

The normal distribution is also common because of something called the central limit theorem (CLT). Let's say I'm taking measurement samples from a population, e.g. measuring the speed of cars on a freeway. The CLT says that, for large enough samples (and finite variance), the distribution of the sample means will be approximately normal regardless of the underlying distribution. In the car speed example, I don't know how the speeds are distributed, but I can calculate a mean and know how certain I am that the mean value is the true (population) mean. This sounds a bit abstract, but it has profound consequences in statistics and means that the normal distribution comes up time and time again.
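
The CLT is easy to demonstrate with a quick simulation. This sketch uses illustrative numbers of my own (not from the post): it draws many samples from a heavily right-skewed exponential distribution and shows the sample means clustering tightly around the true mean, just as the CLT promises.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 samples, each of 50 "car speeds" drawn from a skewed
# exponential distribution with a population mean of 60.
speeds = rng.exponential(scale=60, size=(10_000, 50))
sample_means = speeds.mean(axis=1)

# Despite the skewed source distribution, the sample means pile up
# symmetrically around 60 in a near-normal shape.
print(round(sample_means.mean(), 1))
```

Plot a histogram of `sample_means` and you'll see the familiar bell shape emerge from decidedly non-bell-shaped raw data.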

Finally, it's important because it's so well-known. The math to describe and use the normal distribution has been known for centuries. It's been written about in hundreds of textbooks in different languages. More importantly, it's very widely taught; almost all numerate degrees will cover it and how to use it. 

Let's summarize why it's important:

  • It comes up in nature, in finance, in manufacturing etc.
  • It comes up because of the CLT.
  • The math to use it is standardized and well-known.

What useful things can I do with the normal distribution?

Let's take an example from the insurance world. Imagine an insurance company insures house contents and cars. Now imagine the claim distribution for cars follows a normal distribution and the claims distribution for house contents also follows a normal distribution. Let's say in a typical year the claims distributions look something like this (cars on the left, houses on the right).

(The two charts look identical except for the numbers on the x and y axis. That's expected. I said before that all normal distributions are just scaled versions of the standard normal. Another way of saying this is, all normal distribution plots look the same.)

What does the distribution look like for cars plus houses?

The long-winded answer is to use convolution (or even Monte Carlo). But because the house and car distributions are normal, we can just do:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

So we can calculate the combined distribution in a heartbeat. The combined distribution looks like this (another normal distribution, just with a different mean and standard deviation).

To be clear: this only works because the two distributions were normal.
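
Here's a quick numerical check of the shortcut, using illustrative claim numbers of my own rather than the ones in the charts: summing large samples from the two normals reproduces the analytic combined mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative claim distributions (not the post's actual figures).
cars = rng.normal(loc=2_000, scale=400, size=1_000_000)
houses = rng.normal(loc=5_000, scale=300, size=1_000_000)
combined = cars + houses

# The shortcut: add the means, add the variances.
mu_combined = 2_000 + 5_000                # 7,000
sigma_combined = (400**2 + 300**2) ** 0.5  # 500.0

print(round(combined.mean()), round(combined.std()))  # empirical
print(mu_combined, round(sigma_combined))             # analytic
```

The empirical and analytic values agree, which is the whole point: no convolution or Monte Carlo needed when both inputs are normal.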

It's not just adding distributions together. The normal distribution allows for shortcuts if we're multiplying or dividing etc. The normal distribution makes things that would otherwise be hard very fast and very easy.

Some properties of the normal distribution

I'm not going to dig into the math here, but I am going to point out a few things about the distribution you should be aware of.

The normal distribution goes from \(-\infty\) to \(+\infty\). The further away you get from the mean, the lower the probability; once you go several standard deviations away, the probability is quite small, but nevertheless it's still present. Of course, you can't show \(\infty\) on a chart, so most people cut off the x-axis at some convenient point. This might give the misleading impression that there's an upper or lower x-value; there isn't. If your data has upper or lower cut-off values, be very careful modeling it using a normal distribution. In this case, you should investigate other distributions like the truncated normal.
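
As a sketch of the truncated-normal alternative, here's how you might build one with scipy (the mean of 1,000 and standard deviation of 500 are purely illustrative numbers). Note that `truncnorm` takes its bounds in standard-deviation units relative to `loc` and `scale`.

```python
from scipy.stats import truncnorm

# A normal with mean 1,000 and sd 500, truncated at zero so no
# probability mass sits on impossible negative values.
mu, sigma, lower, upper = 1_000, 500, 0, float("inf")
a, b = (lower - mu) / sigma, (upper - mu) / sigma  # bounds in sd units
dist = truncnorm(a, b, loc=mu, scale=sigma)

print(dist.cdf(0))  # probability of a value below zero is 0
```

One side effect worth knowing: truncating shifts the mean slightly away from the cut-off, so `dist.mean()` is a little above 1,000 here.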

The normal distribution models continuous variables, e.g. variables like speed or height that can have any number of decimal places (but see my previous paragraph on \(\infty\)). However, it's often used to model discrete variables (e.g. number of sheep, number of runs scored, etc.). In practice, this is mostly OK, but again, I suggest caution.

Abuses of the normal distribution and what you can do

Because it's so widely known and so simple to use, people have used it where they really shouldn't. There's a temptation to assume the normal when you really don't know what the underlying distribution is. That can lead to disaster.

In the financial markets, people have used the normal distribution to predict day-to-day variability. The normal distribution predicts that large changes will occur with very low probability; these large changes are often called "black swan events". However, if the distribution isn't actually normal, black swan events can occur far more frequently than the normal distribution would predict. The reality is, financial market distributions are often not normal. This creates opportunities and risks. The assumption of normality has led to bankruptcies.

Assuming normality can lead to models making weird or impossible predictions. Let's say I assume the number of units sold for a product is normally distributed. Using previous years' sales, I forecast unit sales next year to be 1,000 units with a standard deviation of 500 units. I then create a Monte Carlo model to forecast next year's profits. Can you see what can go wrong here? Monte Carlo modeling uses random numbers. In the sales forecast example, there's a 2.28% chance the model will select a negative sales number, which is clearly impossible. Given that Monte Carlo models often use tens of thousands of simulations, it's extremely likely the final calculation will have been affected by impossible numbers. This kind of mistake is insidious and hard to spot, and even experienced analysts make it.
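
You can see this failure mode directly with a few lines of simulation; the 2.28% figure drops straight out of the sales-forecast numbers.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# The sales forecast: mean 1,000 units, standard deviation 500 units.
sales = rng.normal(loc=1_000, scale=500, size=100_000)

# Fraction of simulated years with impossible negative sales.
negative_fraction = (sales < 0).mean()
print(round(negative_fraction, 4))  # close to the analytic value below

# The analytic probability of a draw below zero.
print(round(norm.cdf(0, loc=1_000, scale=500), 4))  # 0.0228
```

With 100,000 simulations, that's roughly 2,280 runs silently feeding impossible negative sales into the profit calculation.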

If you're a manager, you need to understand how your team has modeled data. 

  • Ask what distributions they've used to model their data. 
  • Ask them why they've used that distribution and what evidence they have that the data really is distributed that way. 
  • Ask them how they're going to check their assumptions. 
  • Most importantly, ask them if they have any detection mechanism in place to check for deviation from their expected distribution.

History - where the normal came from

Rather unsatisfactorily, there's no clear "Eureka!" moment for the discovery of the distribution; it seems to have been the accumulation of the work of a number of mathematicians. Abraham de Moivre kicked off the process in 1733 but didn't formalize the distribution, leaving Gauss to explicitly describe it in 1809 [https://medium.com/@will.a.sundstrom/the-origins-of-the-normal-distribution-f64e1575de29].

Gauss used the normal distribution to model measurement errors and so predict the path of the asteroid Ceres [https://en.wikipedia.org/wiki/Normal_distribution#History]. This sounds a bit esoteric, but there's a point here that's still relevant. Any measurement-taking process involves some form of error. Assuming no systematic bias, these errors are well modeled by the normal distribution. So any unbiased measurement taking today (e.g. opinion polling, measurements of particle mass, measurement of precision bolts, etc.) uses the normal distribution to calculate uncertainty.

In 1810, Laplace placed the normal distribution at the center of statistics by formulating the Central Limit Theorem. 

The math

The probability distribution function is given by:

\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e ^ {-\frac{1}{2} ( \frac{x - \mu}{\sigma}) ^ 2  }\]

\(\sigma\) is the standard deviation and \(\mu\) is the mean. In the normal distribution, the mean is the same as the mode is the same as the median.

This formula is almost impossible to work with directly, but you don't need to. There are extensive libraries that will do all the calculations for you.
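
For example, scipy exposes the density and cumulative probabilities directly, so you never evaluate the formula by hand. The standard-normal values below are just familiar reference points.

```python
from scipy.stats import norm

# Density at the peak of the standard normal: 1/sqrt(2*pi).
print(norm.pdf(0, loc=0, scale=1))  # ~0.3989

# Cumulative probability below 1.96 standard deviations:
# the familiar 97.5% bound behind the "95% confidence interval".
print(norm.cdf(1.96))  # ~0.975
```

The `loc` and `scale` arguments are exactly the \(\mu\) and \(\sigma\) from the formula, so the same two calls work for any normal distribution.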

Adding normally distributed parameters is easy:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

Wikipedia has an article on how to combine normally distributed quantities (e.g. addition, multiplication, etc.); see https://en.wikipedia.org/wiki/Propagation_of_uncertainty.

Monday, March 3, 2025

Outliers have more fun

What's an outlier and why should you care?

Years ago I worked for a company that gave me a t-shirt that said "Outliers have more fun". I've no idea what it meant, but outliers are interesting, and not in a good way. They'll do horrible things to your data and computing costs if you don't get a handle on them.

Simply put, an outlier is one or more data items that are extremely different from your other data items. Here's a joke that explains the idea:

There's a group of office workers drinking in a bar in Seattle. Bill Gates walks in and suddenly, the entire bar starts celebrating. Why? Because on average, they'd all become multi-millionaires.

Obviously, Bill Gates is the outlier in the data set. In this post, I'm going to explain what outliers do to data and what you can do to protect yourself.

(Jessica  Tam, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons)

Outliers and the mean

Let's start by explaining the joke to death, because everyone enjoys that.

Before Bill Gates walks in, there are 10 people in the bar drinking. Their salaries are: $80,000, $81,000, $82,000, $83,000, $84,000, $85,000, $86,000, $87,000, $88,000, and $89,000 giving a mean of $84,500. Let's assume Bill Gates earns $1,000,000,000 a year. Once Bill Gates walks into the bar, the new mean salary is $90,985,909; which is plainly not representative of the bar as a whole. Bill Gates is a massive outlier who's pulled the average way beyond what's representative.

How susceptible your data is to this kind of outlier effect depends on the type and distribution of your data. If your data is scores out of 10, and a "typical" score is 5, the average isn't going to be pulled too far away by an outlier (because the maximum is 10 and the minimum is zero, which are not hugely different from the typical value of 5). If there's no upper or lower limit (e.g salaries, house prices, amount of debt etc.), then you're vulnerable, and you may be even more vulnerable if your distribution is right skewed (e.g. something like a log normal distribution).

What can you do if this is the case? Use the median instead. The median is the middle value. In our Seattle bar example, the median is $84,500 before Bill Gates walks in and $85,000 afterwards. That's not much of a change and is much more representative of the salaries of everyone. This is the reason why you hear "median salaries" reported in government statistics rather than "mean salaries".

If you do use the median, please be aware that it has different mathematical properties from the mean. It's fine as a measure of the average, but if you're doing calculations based on medians, be careful.
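
Here's the bar example worked through in Python, using the ten salaries listed above.

```python
import numpy as np

# The ten drinkers' salaries before Bill Gates arrives.
salaries = np.array([80, 81, 82, 83, 84, 85, 86, 87, 88, 89]) * 1_000

print(salaries.mean(), np.median(salaries))  # 84500.0 84500.0

# Bill Gates walks in with a $1,000,000,000 salary.
with_gates = np.append(salaries, 1_000_000_000)

# The mean explodes; the median barely moves.
print(round(with_gates.mean()), np.median(with_gates))  # 90985909 85000.0
```

One outlier moves the mean by six orders of magnitude while the median shifts by $500.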

Outliers and the standard deviation

The standard deviation is a representation of the spread of the data. The bigger the number, the wider the spread. In our bar example, before Bill Gates walks in, the standard deviation is $2,872. This seems reasonable as the salaries are pretty close together. After Bill Gates walks in, the standard deviation is $287,455,495 which is even bigger than the new mean. This number suggests all the salaries are quite different, which is not the case, only one is.

The standard deviation is susceptible to outliers in the same way the mean is, but for some reason, people often overlook it. I've seen people be very aware of outliers when they're calculating an average, but forget all about it when they're calculating a standard deviation.

What can you do? The answer here isn't as clear. A good choice is the interquartile range (IQR), but it's not the same measurement. The IQR represents the middle 50% of the data, whereas one standard deviation either side of the mean covers roughly 68% of normally distributed data. For the bar, the IQR is $4,500 before Bill Gates walks in and $5,000 afterwards. If you want a measure of dispersion, the IQR is a good choice; if you want a drop-in replacement for the standard deviation, you'll have to give it more thought.
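
The same comparison works for spread, reproducing the standard deviation and IQR figures from the bar example.

```python
import numpy as np

salaries = np.array([80, 81, 82, 83, 84, 85, 86, 87, 88, 89]) * 1_000
with_gates = np.append(salaries, 1_000_000_000)

def iqr(x):
    """Interquartile range: the spread of the middle 50% of the data."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

# The (population) standard deviation explodes when Bill Gates arrives...
print(round(salaries.std()))    # 2872
print(round(with_gates.std()))  # ~287,455,495

# ...while the IQR barely moves.
print(iqr(salaries), iqr(with_gates))  # 4500.0 5000.0
```

Like the median, the IQR shrugs off the outlier, which is exactly why it's the safer descriptive choice here.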

Why the median and IQR are not drop-in replacements for the mean and standard deviation

The mean and median are subtly different measures and have different mathematical properties. The same applies to standard deviation and IQR. It's important to understand the trade-offs when you use them.

Combining means is easy: there's a simple weighted-average formula that's been understood for hundreds of years. But we can't combine medians in the same way; the math doesn't work like that. Here's an example: imagine we have two bars, one with 10 drinkers earning a mean of $80,000, the other with 10 drinkers earning a mean of $90,000. The mean across the two bars is $85,000. We can do addition, subtraction, multiplication, division, and other operations with means. But if we know the median of the first bar is $81,000 and the median of the second bar is $89,000, we can't combine them. The same is true of the standard deviation and IQR: there are formulas to combine standard deviations, but not IQRs.
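A quick sketch with made-up numbers shows the point. The weighted-average formula recovers the pooled mean exactly, but applying the same formula to medians gives the wrong answer:

```python
import statistics

bar_a = [78_000, 79_000, 80_000, 81_000, 82_000, 83_000]  # made-up salaries
bar_b = [88_000, 89_000, 91_000, 92_000]                  # made-up salaries

# Means combine via a weighted average...
n_a, n_b = len(bar_a), len(bar_b)
combined_mean = (n_a * statistics.mean(bar_a)
                 + n_b * statistics.mean(bar_b)) / (n_a + n_b)
print(combined_mean)                   # 84300.0
print(statistics.mean(bar_a + bar_b))  # 84300.0: identical

# ...but the same trick fails for medians
weighted_medians = (n_a * statistics.median(bar_a)
                    + n_b * statistics.median(bar_b)) / (n_a + n_b)
print(weighted_medians)                  # 84300.0
print(statistics.median(bar_a + bar_b))  # 82500.0: not the same
```

The only way to get the pooled median is to recompute it from all the raw data.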

In the Seattle bar example, we wanted one number to represent the salaries of the people in the bar. The best average is the median and the best measure of spread is the IQR, the reason being outliers. However, if we wanted an average we could apply across multiple bars, or if we wanted to do some calculations using the average and spread, we'd be better off with the mean and standard deviation.

Of course, it all comes down to knowing what you want and why. Like any job, you've got to know your tools.

The effect of more samples

Sometimes, more data will save you. This is especially true if your data is normally distributed and outliers are very rare. If your data distribution is skewed, it might not help that much. I've worked with some data sets with massive skews and the mean can vary widely depending on how many samples you take. Of course, if you have millions of samples, then you'll mostly be OK.
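If you want to see this for yourself, here's a quick simulation. The log-normal distribution is chosen purely as an example of heavy right skew, and the seed is arbitrary:

```python
import random
import statistics

random.seed(42)  # fixed seed so the experiment is repeatable

def sample_means(n_repeats, sample_size):
    # Repeatedly draw a sample from a heavily right-skewed (log-normal)
    # distribution and record each sample's mean
    return [statistics.mean(random.lognormvariate(0, 2) for _ in range(sample_size))
            for _ in range(n_repeats)]

small = sample_means(100, 30)     # 100 experiments, 30 samples each
large = sample_means(100, 3000)   # 100 experiments, 3,000 samples each

# The sample means wander far more at the small sample size
print(max(small) - min(small))
print(max(large) - min(large))
```

With 30 samples per experiment, the mean jumps around wildly from run to run; with 3,000 it settles down, but the skew still makes it noisier than you might expect.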

Outliers and calculation costs

This warning won't apply to everyone. I've built systems where the computing cost depends on the range of the data (maximum - minimum). The bigger the range, the more the cost. Outliers in this case can drive computation costs up, especially if there's a possibility of a "Bill Gates" type effect that can massively distort the data. If this applies to your system, you need to detect outliers and take action.
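As a concrete illustration, here's a counting sort, which is just one example of a range-dependent algorithm: it allocates one bucket per possible value, so its time and memory cost is proportional to the range of the data, and a single outlier inflates it enormously:

```python
def counting_sort(values):
    # Cost is O(n + range): one bucket per possible value in [min, max]
    lo, hi = min(values), max(values)
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    result = []
    for offset, count in enumerate(counts):
        result.extend([lo + offset] * count)
    return result

data = [85, 84, 86, 83, 87]
print(max(data) - min(data) + 1)  # 5 buckets

data.append(1_000_000)            # one outlier...
print(max(data) - min(data) + 1)  # ...now nearly a million buckets
```

The same range sensitivity shows up in histogramming, fixed-width binning, and some numerical methods, which is why outlier detection can be a cost-control measure, not just a statistical one.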

Final advice 

If you have a small sample size (10 or fewer): use the median and the IQR.

If your data is highly right skewed: use the median and the IQR.

Remember the median and the IQR are not the same as the mean and the standard deviation and be extremely careful using them in calculations.

If your computation time depends on the range of the data, check for outliers.

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often doesn't know this has happened, or even worse, doesn't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get the benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Tossing a coin and getting a head, rolling a die and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.
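The card-drawing example can be checked by enumeration. The probability of a king on the second draw, given a king on the first, is 3/51 = 1/17, not the unconditional 4/52 = 1/13, so the draws are dependent:

```python
from fractions import Fraction
from itertools import permutations

# A deck abstracted to king / not-king; enumerate all ordered pairs of positions
deck = ['K'] * 4 + ['x'] * 48
pairs = list(permutations(range(52), 2))

first_king = sum(1 for a, b in pairs if deck[a] == 'K')
both_kings = sum(1 for a, b in pairs if deck[a] == 'K' and deck[b] == 'K')

p_first = Fraction(first_king, len(pairs))               # 4/52 = 1/13
p_second_given_first = Fraction(both_kings, first_king)  # 3/51 = 1/17

# The second draw's probability changes once we know the first: dependent events
print(p_first, p_second_given_first)  # 1/13 1/17
```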

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them.  For example, we know that smoking increases the risk of lung cancer (and therefore a life insurance claim), so how should an actuary price that risk? If they price it too low, the insurance company will pay out too much in claims, if they price it too high, the insurance company will lose business to cheaper competitors. In the early days when the link between smoking and cancer was discovered, how could an actuary know how to model the relationship?

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that suggests all they have to do is multiply the probabilities, which is misleading at best. I believe people make this mistake in practice because I've interviewed candidates with MS degrees in statistics who made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus, if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. Hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on. 

When Hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals, and more across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky-high and bankrupted several companies.

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove the independence assumption.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and a test for independence, as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a die and getting a 1 and a 2 (on the same roll)
  2. Rolling a die and getting (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A  \cap B) = 0\), it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) \times (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A  \cap B) = 1/6\)
  • \(P(A)P(B) = (1/3) \times (1/2) = 1/6\)
So the equation is true, therefore the events are independent.
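The two dice cases can be checked mechanically. This sketch uses exact fractions to avoid floating-point surprises:

```python
from fractions import Fraction

faces = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Probability that a single fair die roll lands in the event set
    return Fraction(len(event & faces), len(faces))

# Case 1: A = roll a 1, B = roll a 2 (same roll, so both can't happen at once)
A, B = {1}, {2}
print(prob(A & B) == prob(A) * prob(B))  # False: not independent

# Case 2: A = roll 1 or 2, B = roll an even number
A, B = {1, 2}, {2, 4, 6}
print(prob(A & B) == prob(A) * prob(B))  # True: independent
```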

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g. the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\); the most popular over the last few years has been Bayes' Theorem, which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 
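As a sketch of Bayes' Theorem in action, here's the snow/canceled-game example with made-up probabilities:

```python
from fractions import Fraction

# Made-up numbers for the snow/canceled-game example
p_cancel = Fraction(1, 10)             # P(A): the game is canceled
p_snow = Fraction(1, 5)                # P(B): heavy snow falls
p_snow_given_cancel = Fraction(9, 10)  # P(B|A): snow, given the game was canceled

# Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
p_cancel_given_snow = p_snow_given_cancel * p_cancel / p_snow
print(p_cancel_given_snow)  # 9/20
```

Snow alone raises the cancellation probability from 1/10 to 9/20, which is exactly the kind of dependence that multiplying raw probabilities would miss.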

Monday, February 3, 2025

Using AI (LLM) to generate data science code

What AI offers data science code generation and what it doesn't

Using generative AI for coding support has become increasingly popular for good reason; the productivity gain can be very high. But what are its limits? Can you use code gen for real data science problems?

(I, for one, welcome our new AI overlords. D J Shin, CC BY-SA 3.0 , via Wikimedia Commons)

To investigate, I decided to look at two cases: a 'simple' piece of code generation to build a Streamlit UI, and a technically complicated case that's more typical of data science work. I generated Python code and evaluated it for correctness, structure, and completeness. The results were illuminating, as we'll see, and I think I understand why they came out the way they did.

My setup is pretty standard: I'm using GitHub Copilot in Microsoft Visual Studio Code and GitHub Copilot directly from the website. In both cases, I chose the Claude model (more on why later).

Case 1: "commodity" UI code generation

The goal of this experiment was to see if I could automatically generate a good enough complete multi-page Streamlit app. The app was to have multiple dialog boxes on each page and was to be runnable without further modification.

Streamlit provides a simple UI for Python programs. It's several years old and extremely popular (meaning, there are plenty of code examples in Github). I've built apps using Streamlit, so I'm familiar with it and its syntax. 

The specification

The first step was a written English specification. I wrote a one-page Word document detailing what I wanted for every page of the app. I won't reproduce it here for brevity's sake, but here's a brief excerpt:

The second page is called “Load model”. This will allow the user to load an existing model from a file. The page will have some descriptive text on what the page does. There will be a button that allows a user to load a file. The user will only be able to load a single file with the file extension “.mdl”. If the user successfully loads a model, the code will load it into a session variable that the other pages can access. The “.mdl” file will be a JSON file and the software will check that the file is valid and follows some rules. The page will tell the user if the file has been successfully loaded or if there’s an error. If there’s an error, the page will tell the user what the error is.

In practice, I had to iterate on the specification a few times to get things right, but it only took a couple of iterations.

What I got

Code generation was very fast and the results were excellent. I was able to run the application immediately without modification and it did what I wanted it to do.

(A screen shot of part of the generated Streamlit app.)

It produced the necessary Python files, but it also produced:

  • a requirements.txt file - which was correct
  • a dummy JSON file for my data, inferred from my description
  • data validation code
  • test code

I didn't ask for any of these things, it just produced them anyway.

There were several downsides though. 

I found the VS Code interface a little awkward to use, for me the Github Copilot web page was a much better experience (except that you have to copy the code).

Slight changes to my specification sometimes caused large changes to the generated code. For example, I added a sentence asking for a new dialog box and the code generation incorrectly dropped a page from my app. 

It seemed to struggle with long "if-then" type paragraphs, for example "If the user has loaded a model ...LONG TEXT... If the user hasn't loaded a model ...LONG TEXT...".

The code was quite old-fashioned in several ways. Code generation created the app pages in a pages folder and prefixed the pages with "1_", "2_" etc. This is how the demos on the Streamlit website are structured, but it's not how I would do it, it's kind of old school and a bit limited. Notably, the code generation didn't use some of the newer features of Streamlit; on the whole it was a year or so behind the curve.

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I'm convinced by code generation here. Yes, it was a little behind the times and a little awkwardly structured, but it worked and it gave me something very close to what I wanted within a few minutes.

I could have written this myself (and I have done before), but I find this kind of coding tedious and time consuming (it would have taken me a day to do what I did using code gen in an hour). 

I will be using code gen for this type of problem in the future.

Case 2: data science code generation

What about a real data science problem, how well does it perform?

I chose to use random variables and quasi-Monte Carlo as something more meaty. The problem was to create two random variables and populate them with samples drawn from a quasi-Monte Carlo "random" number generator with a normal distribution. For each variable, work out the distribution (which we know should be normal). Combine the variables with convolution to create a third variable, and plot the resulting distribution. Finally, calculate the mean and standard deviation of all three variables.

The specification

I won't show it here for brevity, but it was a slightly longer than the description I gave above. Notably, I had to iterate on it several times.

What I got

This was a real mixed bag.

My first pass code generation didn't use quasi Monte Carlo at all. It normalized the distributions before the convolution for no good reason which meant the combined result was wrong. It used a histogram for the distribution which was kind-of OK. It did generate the charts just fine though. Overall, it was the kind of work a junior data scientist might produce.

On my second pass, I told it to use Sobol' sequences and I told it to use kernel density estimation to calculate the distribution. This time it did very well. The code was nicely commented too. Really surprisingly, it used the correct way of generating sequences (using dimensions).

(After some prompting, this was my final chart, which is correct.)

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I had to be much more prescriptive here to get what I wanted. The results were good, but only because I knew to tell it to use Sobol' sequences and I knew to tell it to use kernel density estimation.

Again, I'm convinced that code gen works.

Observations

The model

I tried the experiment with both Claude 3.5 and GPT 4o. Claude gave much better results. Other people have reported similar experiences.

Why this works and some fundamental limitations

Github has access to a huge code base, so the LLM is based on the collective wisdom of a vast number of programmers. However, despite appearances, it has no insight; it can't go beyond what others have done. This is why the code it produced for the Streamlit demo was old-fashioned. It's also why I had to be prescriptive for my data science case, for example, it just didn't understand what quasi Monte Carlo meant without additional prompting.

AI is known to hallucinate, and we see something of that here. You really have to know what you're doing to use AI-generated code. If you blindly implement AI-generated code, things are going to go badly for you very quickly.

Productivity

Code generation and support is a game changer. It ramps up productivity enormously. I've heard people say, it's like having a (free) senior engineer by your side. I agree. Despite the issues I've come across, code generation works "good enough".

Employment

This has obvious implications for employment. With AI code generation and AI coding support, you need fewer software engineers/analysts/data scientists. The people you do need are those with more insight and the ability to spot where the AI-generated code has gone wrong, which is bad news for more junior people or those entering the workforce. It may well be a serious problem for students seeking internships.

Let me say this plainly: people will lose their jobs because of this technology.

My take on the employment issue and what you can do

There's an old joke that sums things up. "A householder calls in a mechanic because their washing machine had broken down. The mechanic looks at the washing machine and rocks it around a bit. Then the mechanic kicks the machine. It starts working! The mechanic writes a bill for $200. The householder explodes, '$200 to kick a washing machine, this is outrageous!'. The mechanic thinks for a second and says, 'You're quite right. Let me re-write the bill'. The new bill says 'Kicking the washing machine $1, knowing where to kick the washing machine $199'." To put it bluntly, you need to be the kind of mechanic that knows where to kick the machine.


(You've got to know where to kick it. LG전자, CC BY 2.0 , via Wikimedia Commons)

Code generation has no insight. It makes errors. You have to have experience and insight to know when it's gone wrong. Not all human software engineers have that insight.

You should be very concerned if:
  • You're junior in your career or you're just entering the workforce.
  • You're developing BI-type apps as the main or only thing you do.
  • There are many people doing exactly the same software development work as you.
If that applies to you, here's my advice:
  • Use code generation and code support. You need to know first hand what it can do and the threat it poses. Remember, it's a productivity boost and the least productive people are the first to go.
  • Develop domain knowledge. If your company is in the finance industry, make sure you understand finance, which means knowing the legal framework etc. If it's drug discovery, learn the principles of drug discovery. Get some kind of certification (online courses work fine). Apply your knowledge to your work. Make sure your employer knows it.
  • Develop specialist skills, e.g. statistics. Use those skills in your work.
  • Develop human skills. This means talking to customers, talking to people in other departments.

Some takeaways

  • AI generated code is good enough for use, even in more complicated cases.
  • It's a substantial productivity boost. You should be using it.
  • It's a tool, not a magic wand. It does get things wrong and you need to be skilled enough to spot errors.

Friday, January 24, 2025

Python formatting

Python string formatters

I use Python and I output data for reports, which means I need to format strings precisely. I find the string formatters hard to use and resources to explain them are scattered over the web. So I decided to write up my own guide to using formatters. This is mainly for me to have a 'cheat sheet', but I hope you find some use for it too. Of course, I've liberally copied and pointed to the Python documentation.

(This is a Python blog post! Image source: Wikimedia Commons. License: Creative Commons.)

Overview

Python string formatters have this general form:

{identifier : format specifier}

The term identifier is something I made up for easier reference.

Identifiers

The identifier ties the string format to the arguments of the format statement. Identifiers can be positional (numbered) or named. Numbered identifiers refer to positional arguments (0 is the first argument), and you can use them in any order and re-use them like this:

'Today is {0}-{1}-{2} the year is {0}'.format(2020, 10, 22)

You can also use names as identifiers:

'Today is {year}-{month}-{day} the year is {year}'.format(year=2020, month=10, day=22)

and relatedly, you can unpack a dict of named values:

date = {'year': 2020, 'month': 10, 'day': 22}
'Today is {year}-{month}-{day} the year is {year}'.format(**date)

You can read more clever uses of identifiers here: https://docs.python.org/3.4/library/string.html#format-string-syntax

Format specifiers

There's an entire format specifier mini language: https://docs.python.org/3.4/library/string.html#formatspec

The general form is:
[[fill]align][sign][#][0][width][,][.precision][type]
  • fill is the character to use to fill padded spaces
  • align is the instruction on how to align the string (left, center, right)
  • sign, the + or - sign, only makes sense for numbers
  • # indicates an alternate form for conversion
  • 0 - used for sign aware zero padding for numbers
  • width - the width in characters of the field
  • , - use of the thousand separator
  • .precision - the number of digits after the decimal place
  • type - one of these special types: "b", "c", "d", "e", "E", "f", "F", "g", "G", "n", "o", "s", "x", "X", "%"
String presentation types:
  • 's' - String format. This is the default type for strings and may be omitted.
  • None - The same as 's'.
Integer presentation types:
  • 'b' - Binary format. Outputs the number in base 2.
  • 'c' - Character. Converts the integer to the corresponding unicode character before printing.
  • 'd' - Decimal integer. Outputs the number in base 10.
  • 'o' - Octal format. Outputs the number in base 8.
  • 'x' - Hex format. Outputs the number in base 16, using lower-case letters for the digits above 9.
  • 'X' - Hex format. Outputs the number in base 16, using upper-case letters for the digits above 9.
  • 'n' - Number. The same as 'd', except that it uses the current locale setting to insert the appropriate number separator characters.
  • None - The same as 'd'.
Floating-point presentation types:
  • 'e' - Exponent notation. Prints the number in scientific notation using the letter 'e' to indicate the exponent. The default precision is 6.
  • 'E' - Exponent notation. Same as 'e' except it uses an upper-case 'E' as the separator character.
  • 'f' - Fixed point. Displays the number as a fixed-point number. The default precision is 6.
  • 'F' - Fixed point. Same as 'f', but converts nan to NAN and inf to INF.
  • 'g' - General format. For a given precision p >= 1, this rounds the number to p significant digits and then formats the result in either fixed-point format or in scientific notation, depending on its magnitude. The precise rules are as follows: suppose that the result formatted with presentation type 'e' and precision p-1 would have exponent exp. Then if -4 <= exp < p, the number is formatted with presentation type 'f' and precision p-1-exp; otherwise, the number is formatted with presentation type 'e' and precision p-1. In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it. Positive and negative infinity, positive and negative zero, and nans are formatted as inf, -inf, 0, -0, and nan respectively, regardless of the precision. A precision of 0 is treated as equivalent to a precision of 1. The default precision is 6.
  • 'G' - General format. Same as 'g' except switches to 'E' if the number gets too large. The representations of infinity and NaN are uppercased, too.
  • 'n' - Number. The same as 'g', except that it uses the current locale setting to insert the appropriate number separator characters.
  • '%' - Percentage. Multiplies the number by 100 and displays it in fixed ('f') format, followed by a percent sign.
  • None - Similar to 'g', except that fixed-point notation, when used, has at least one digit past the decimal point. The default precision is as high as needed to represent the particular value. The overall effect is to match the output of str() as altered by the other format modifiers.

f-strings

These are a way of simplifying Python string formatting and really should be your preferred way of outputting strings. Very usefully, they allow you to embed expressions. Here are a couple of examples.

number = 10
print(f"The number is {number}")
The number is 10
print(f"The expression is {number + 100}")
The expression is 110
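Usefully, f-strings accept the same format specifier mini-language after a colon, so everything in the type tables above applies to them too:

```python
value = 1234567.891

# Thousands separator plus two decimal places
print(f"{value:,.2f}")     # 1,234,567.89

# Right-aligned in a 15-character field, rounded to a whole number
print(f"{value:>15,.0f}")  # '      1,234,568'

# Percentage formatting
print(f"{0.8636:.2%}")     # 86.36%
```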

Examples

'{:<30}'.format('left aligned')  # 'left aligned' padded with spaces to width 30
'{:*^30}'.format('centered')     # '***********centered***********'
"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)  # 'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
'{:,}'.format(1234567890)        # '1,234,567,890'
points, total = 19, 22
'Correct answers: {:.2%}'.format(points/total)  # 'Correct answers: 86.36%'