Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often doesn't know this has happened, or worse, doesn't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get the benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Tossing a coin and getting a head, throwing a die and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them.  For example, we know that smoking increases the risk of lung cancer (and therefore a life insurance claim), so how should an actuary price that risk? If they price it too low, the insurance company will pay out too much in claims, if they price it too high, the insurance company will lose business to cheaper competitors. In the early days when the link between smoking and cancer was discovered, how could an actuary know how to model the relationship?

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that suggests all they have to do is multiply the probabilities together, which is misleading at best. I believe people are making this mistake in practice because I've interviewed candidates with MS degrees in statistics who made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus; if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once, causing widespread catastrophic damage across different insurance categories (e.g. Hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence, an assumption that worked well for some time. In an average year, insurers knew roughly how many claims there would be for houses, cars, etc. Car insurance claims were independent of house insurance claims, which in turn were independent of municipal and corporate insurance claims, and so on.

When Hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals, and more across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky high and bankrupted several companies.

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward about 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous, but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager, which means investing in development to remove the independence assumption.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and a test for independence, as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a die and getting a 1 and a 2
  2. Rolling a die and getting (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A  \cap B) = 0\), it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) * (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A  \cap B) = 1/6\)
  • \(P(A)P(B) = (1/2) * (1/3) = 1/6\)
So the equation is true, therefore the events are independent.
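
If you want to check the arithmetic, here's a short Python sketch (my own illustration, not part of the original argument) that applies the product-rule test to both cases by enumerating the outcomes of a single fair die:

from fractions import Fraction

def prob(event, outcomes):
    # Probability of an event (a set of outcomes) under a fair die
    return Fraction(len(event & outcomes), len(outcomes))

die = {1, 2, 3, 4, 5, 6}
cases = {
    "case 1": ({1}, {2}),           # getting a 1, getting a 2
    "case 2": ({1, 2}, {2, 4, 6}),  # getting (1 or 2), getting (2, 4, or 6)
}

for name, (A, B) in cases.items():
    p_a, p_b = prob(A, die), prob(B, die)
    p_joint = prob(A & B, die)          # P(A and B) by direct enumeration
    independent = (p_joint == p_a * p_b)
    print(f"{name}: P(A)={p_a}, P(B)={p_b}, P(A and B)={p_joint}, independent={independent}")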

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g. the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\); the most popular over the last few years has been Bayes' Theorem, which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 
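
To make the conditional form concrete, here's a tiny sketch (my own example, not from the post) using the same fair die, with A = "the roll is a 2" and B = "the roll is even":

from fractions import Fraction

p_a = Fraction(1, 6)          # P(A): rolling a 2
p_b = Fraction(1, 2)          # P(B): rolling an even number
p_b_given_a = Fraction(1, 1)  # P(B | A): a 2 is always even

# Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)        # 1/3

# The conditional form of the joint probability: P(A and B) = P(A | B) P(B)
print(p_a_given_b * p_b)  # 1/6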

Monday, February 3, 2025

Using AI (LLM) to generate data science code

What AI offers data science code generation and what it doesn't

Using generative AI for coding support has become increasingly popular for good reason; the productivity gain can be very high. But what are its limits? Can you use code gen for real data science problems?

(I, for one, welcome our new AI overlords. D J Shin, CC BY-SA 3.0 , via Wikimedia Commons)

To investigate, I decided to look at two cases: a 'simple' piece of code generation to build a Streamlit UI, and a technically complicated case that's more typical of data science work. I generated Python code and evaluated it for correctness, structure, and completeness. The results were illuminating, as we'll see, and I think I understand why they came out the way they did.

My setup is pretty standard: I'm using GitHub Copilot in Microsoft Visual Studio Code and GitHub Copilot directly from the website. In both cases, I chose the Claude model (more on why later).

Case 1: "commodity" UI code generation

The goal of this experiment was to see if I could automatically generate a good enough complete multi-page Streamlit app. The app was to have multiple dialog boxes on each page and was to be runnable without further modification.

Streamlit provides a simple UI framework for Python programs. It's several years old and extremely popular (meaning there are plenty of code examples on GitHub). I've built apps using Streamlit, so I'm familiar with it and its syntax.

The specification

The first step was a written English specification. I wrote a one-page Word document detailing what I wanted for every page of the app. I won't reproduce it all here for brevity's sake, but here's a brief excerpt:

The second page is called “Load model”. This will allow the user to load an existing model from a file. The page will have some descriptive text on what the page does. There will be a button that allows a user to load a file. The user will only be able to load a single file with the file extension “.mdl”. If the user successfully loads a model, the code will load it into a session variable that the other pages can access. The “.mdl” file will be a JSON file and the software will check that the file is valid and follows some rules. The page will tell the user if the file has been successfully loaded or if there’s an error. If there’s an error, the page will tell the user what the error is.

In practice, I had to iterate on the specification a few times to get things right, but it only took a couple of iterations.
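
To make the spec excerpt concrete, here's a minimal hand-written sketch of the kind of page it describes (my illustration, not the generated code; the validation rule is a placeholder):

import json
import streamlit as st

st.title("Load model")
st.write("Load an existing model from a '.mdl' (JSON) file.")

# Only files with a .mdl extension can be selected
uploaded = st.file_uploader("Choose a model file", type=["mdl"])

if uploaded is not None:
    try:
        model = json.load(uploaded)
        if "name" not in model:                # placeholder for the real validation rules
            raise ValueError("missing 'name' field")
        st.session_state["model"] = model      # session variable shared with the other pages
        st.success(f"Model '{model['name']}' loaded successfully.")
    except (json.JSONDecodeError, ValueError) as err:
        st.error(f"Could not load model: {err}")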

What I got

Code generation was very fast and the results were excellent. I was able to run the application immediately without modification and it did what I wanted it to do.

(A screen shot of part of the generated Streamlit app.)

It produced the necessary Python files, but it also produced:

  • a requirements.txt file - which was correct
  • a dummy JSON file for my data, inferred from my description
  • data validation code
  • test code

I didn't ask for any of these things, it just produced them anyway.

There were several downsides though. 

I found the VS Code interface a little awkward to use; for me, the GitHub Copilot web page was a much better experience (except that you have to copy the code).

Slight changes to my specification sometimes caused large changes to the generated code. For example, I added a sentence asking for a new dialog box and the code generation incorrectly dropped a page from my app. 

It seemed to struggle with long "if-then" type paragraphs, for example "If the user has loaded a model ...LONG TEXT... If the user hasn't loaded a model ...LONG TEXT...".

The code was quite old-fashioned in several ways. Code generation created the app pages in a pages folder and prefixed the page files with "1_", "2_", etc. This is how the demos on the Streamlit website are structured, but it's not how I would do it; it's kind of old school and a bit limited. Notably, the code generation didn't use some of the newer features of Streamlit; on the whole, it was a year or so behind the curve.

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I'm convinced by code generation here. Yes, it was a little behind the times and a little awkwardly structured, but it worked and it gave me something very close to what I wanted within a few minutes.

I could have written this myself (and I have done before), but I find this kind of coding tedious and time consuming (it would have taken me a day to do what I did using code gen in an hour). 

I will be using code gen for this type of problem in the future.

Case 2: data science code generation

What about a real data science problem, how well does it perform?

I chose to use random variables and quasi-Monte Carlo as something more meaty. The problem was to create two random variables and populate them with samples drawn from a quasi-Monte Carlo "random" number generator with a normal distribution. For each variable, work out the distribution (which we know should be normal). Combine the variables with convolution to create a third variable, and plot the resulting distribution. Finally, calculate the mean and standard deviation of all three variables.

The specification

I won't show it here for brevity, but it was slightly longer than the description I gave above. Notably, I had to iterate on it several times.

What I got

This was a real mixed bag.

My first pass at code generation didn't use quasi-Monte Carlo at all. It normalized the distributions before the convolution for no good reason, which meant the combined result was wrong. It used a histogram for the distribution, which was kind-of OK. It did generate the charts just fine, though. Overall, it was the kind of work a junior data scientist might produce.

On my second pass, I told it to use Sobol' sequences and I told it to use kernel density estimation to calculate the distribution. This time it did very well. The code was nicely commented too. Really surprisingly, it used the correct way of generating sequences (using dimensions).

(After some prompting, this was my final chart, which is correct.)
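
For reference, here's roughly the approach I was after, written by hand as a sketch (it isn't the generated code, and the sample count and grid are my own choices):

import numpy as np
from scipy.stats import norm, gaussian_kde, qmc

# Scrambled Sobol' sequence with one dimension per random variable,
# mapped to a normal distribution via the inverse CDF
sampler = qmc.Sobol(d=2, scramble=True, seed=42)
u = sampler.random_base2(m=12)            # 2**12 = 4096 points in [0, 1)^2
u = np.clip(u, 1e-12, 1 - 1e-12)          # guard against ppf(0) = -infinity
x, y = norm.ppf(u[:, 0]), norm.ppf(u[:, 1])

# Summing samples of independent variables corresponds to convolving
# their distributions, so the sum stands in for the convolved variable
z = x + y

# Kernel density estimates of the three distributions (for plotting)
grid = np.linspace(-6, 6, 400)
densities = {name: gaussian_kde(v)(grid) for name, v in [("x", x), ("y", y), ("x+y", z)]}

for name, v in [("x", x), ("y", y), ("x+y", z)]:
    print(f"{name}: mean={v.mean():.3f}, std={v.std():.3f}")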

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I had to be much more prescriptive here to get what I wanted. The results were good, but only because I knew to tell it to use Sobol' sequences and kernel density estimation.

Again, I'm convinced that code gen works.

Observations

The model

I tried the experiment with both Claude 3.5 and GPT 4o. Claude gave much better results. Other people have reported similar experiences.

Why this works and some fundamental limitations

GitHub has access to a huge code base, so the LLM is based on the collective wisdom of a vast number of programmers. However, despite appearances, it has no insight; it can't go beyond what others have done. This is why the code it produced for the Streamlit demo was old-fashioned. It's also why I had to be prescriptive for my data science case; for example, it just didn't understand what quasi-Monte Carlo meant without additional prompting.

AI is known to hallucinate, and we see something of that here. You really have to know what you're doing to use AI-generated code. If you blindly implement AI-generated code, things are going to go badly for you very quickly.

Productivity

Code generation and support is a game changer. It ramps up productivity enormously. I've heard people say it's like having a (free) senior engineer by your side. I agree. Despite the issues I've come across, code generation is "good enough".

Employment

This has obvious implications for employment. With AI code generation and AI coding support, you need fewer software engineers/analysts/data scientists. The people you do need are those with more insight and the ability to spot where the AI-generated code has gone wrong, which is bad news for more junior people or those entering the workforce. It may well be a serious problem for students seeking internships.

Let me say this plainly: people will lose their jobs because of this technology.

My take on the employment issue and what you can do

There's an old joke that sums things up. "A householder calls in a mechanic because their washing machine has broken down. The mechanic looks at the washing machine and rocks it around a bit. Then the mechanic kicks the machine. It starts working! The mechanic writes a bill for $200. The householder explodes: '$200 to kick a washing machine? This is outrageous!'. The mechanic thinks for a second and says, 'You're quite right. Let me re-write the bill.' The new bill says: 'Kicking the washing machine, $1; knowing where to kick the washing machine, $199'." To put it bluntly, you need to be the kind of mechanic who knows where to kick the machine.


(You've got to know where to kick it. LG전자, CC BY 2.0 , via Wikimedia Commons)

Code generation has no insight. It makes errors. You have to have experience and insight to know when it's gone wrong. Not all human software engineers have that insight.

You should be very concerned if:
  • You're junior in your career or you're just entering the workforce.
  • You're developing BI-type apps as the main or only thing you do.
  • There are many people doing exactly the same software development work as you.
If that applies to you, here's my advice:
  • Use code generation and code support. You need to know first hand what it can do and the threat it poses. Remember, it's a productivity boost and the least productive people are the first to go.
  • Develop domain knowledge. If your company is in the finance industry, make sure you understand finance, which means knowing the legal framework etc. If it's drug discovery, learn the principles of drug discovery. Get some kind of certification (online courses work fine). Apply your knowledge to your work. Make sure your employer knows it.
  • Develop specialist skills, e.g. statistics. Use those skills in your work.
  • Develop human skills. This means talking to customers, talking to people in other departments.

Some takeaways

  • AI generated code is good enough for use, even in more complicated cases.
  • It's a substantial productivity boost. You should be using it.
  • It's a tool, not a magic wand. It does get things wrong and you need to be skilled enough to spot errors.

Friday, January 24, 2025

Python formatting

Python string formatters

I use Python to output data for reports, which means I need to format strings precisely. I find the string formatters hard to use, and the resources explaining them are scattered over the web. So I decided to write up my own guide to using formatters. This is mainly for me to have a 'cheat sheet', but I hope you find some use for it too. Of course, I've liberally copied from and pointed to the Python documentation.

(This is a Python blog post! Image source: Wikimedia Commons. License: Creative Commons.)

Overview

Python string formatters have this general form:

{identifier : format specifier}

The term identifier is something I made up for easier reference.

Identifiers

The identifier ties the placeholder in the string to the arguments of the format statement. Identifiers can be positional (numbered) or named. Numbered identifiers refer to the positional arguments of format, starting from zero; you can use them in any order and re-use them, like this:

'Today is {0}-{1}-{2} the year is {0}'.format(2020, 10, 22)

You can also use names as identifiers:

'Today is {year}-{month}-{day} the year is {year}'.format(year=2020, month=10, day=22)

and relatedly, you can unpack a dict with the same keys:

date = {'year': 2020, 'month': 10, 'day': 22}
'Today is {year}-{month}-{day} the year is {year}'.format(**date)

You can read more clever uses of identifiers here: https://docs.python.org/3.4/library/string.html#format-string-syntax

Format specifiers

There's an entire format specifier mini language: https://docs.python.org/3.4/library/string.html#formatspec

The general form is:
[[fill]align][sign][#][0][width][,][.precision][type]
  • fill is the character to use to fill padded spaces
  • align is the instruction on how to align the string (left, center, right)
  • sign, the + or - sign, only makes sense for numbers
  • # indicates an alternate form for conversion
  • 0 - used for sign aware zero padding for numbers
  • width - the width in characters of the field
  • , - use of the thousand separator
  • .precision - the number of digits after the decimal place
  • type - one of these special types: "b", "c", "d", "e", "E", "f", "F", "g", "G", "n", "o", "s", "x", "X", "%"
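
Putting those pieces together, here are a couple of hand-worked examples (my own, not from the documentation):

print('{:*>+12,.2f}'.format(1234.5678))  # fill '*', right align, show sign, width 12,
                                         # thousands separator, 2 decimal places: ***+1,234.57
print('{:08.3f}'.format(3.14159))        # sign-aware zero padding to width 8: 0003.142
print('{:^10d}'.format(42))              # the integer 42 centered in a field of width 10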
The meaning of the type character depends on whether you're formatting a string, an integer, or a float.

String presentation types:
  • 's' - String format. This is the default type for strings and may be omitted.
  • None - The same as 's'.

Integer presentation types:
  • 'b' - Binary format. Outputs the number in base 2.
  • 'c' - Character. Converts the integer to the corresponding unicode character before printing.
  • 'd' - Decimal integer. Outputs the number in base 10.
  • 'o' - Octal format. Outputs the number in base 8.
  • 'x' - Hex format. Outputs the number in base 16, using lower-case letters for the digits above 9.
  • 'X' - Hex format. Outputs the number in base 16, using upper-case letters for the digits above 9.
  • 'n' - Number. The same as 'd', except that it uses the current locale setting to insert the appropriate number separator characters.
  • None - The same as 'd'.

Float presentation types:
  • 'e' - Exponent notation. Prints the number in scientific notation using the letter 'e' to indicate the exponent. The default precision is 6.
  • 'E' - Exponent notation. Same as 'e' except it uses an upper-case 'E' as the separator character.
  • 'f' - Fixed point. Displays the number as a fixed-point number. The default precision is 6.
  • 'F' - Fixed point. Same as 'f', but converts nan to NAN and inf to INF.
  • 'g' - General format. For a given precision p >= 1, this rounds the number to p significant digits and then formats the result in either fixed-point format or in scientific notation, depending on its magnitude. The precise rules: suppose the result formatted with presentation type 'e' and precision p-1 would have exponent exp. If -4 <= exp < p, the number is formatted with presentation type 'f' and precision p-1-exp; otherwise it is formatted with presentation type 'e' and precision p-1. In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it. Positive and negative infinity, positive and negative zero, and nans are formatted as inf, -inf, 0, -0 and nan respectively, regardless of the precision. A precision of 0 is treated as equivalent to a precision of 1. The default precision is 6.
  • 'G' - General format. Same as 'g' except it switches to 'E' if the number gets too large. The representations of infinity and NaN are uppercased, too.
  • 'n' - Number. The same as 'g', except that it uses the current locale setting to insert the appropriate number separator characters.
  • '%' - Percentage. Multiplies the number by 100 and displays it in fixed ('f') format, followed by a percent sign.
  • None - Similar to 'g', except that fixed-point notation, when used, has at least one digit past the decimal point. The default precision is as high as needed to represent the particular value. The overall effect is to match the output of str() as altered by the other format modifiers.

f-strings

These are a way of simplifying Python string formatting and really should be your preferred way of outputting strings. Very usefully, they allow you to embed expressions. Here are a couple of examples.

number = 10
print(f"The number is {number}")
The number is 10
print(f"The expression is {number + 100}")
The expression is 110

Examples

points, total = 19, 22

'{:<30}'.format('left aligned')       # pad 'left aligned' with spaces to a width of 30
'{:*^30}'.format('centered')          # center 'centered' in a field of width 30, filling with '*'
"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)   # 'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
'{:,}'.format(1234567890)             # '1,234,567,890'
'Correct answers: {:.2%}'.format(points/total)   # 'Correct answers: 86.36%'


Monday, August 7, 2023

How to run webinars

Why do you care about running webinars?

For sales and marketing people, the answer is obvious, to generate leads and do your job. For technical people, things are a little murkier, and as a result, technical people sometimes make avoidable mistakes when they give webinars.

In this blog post, I'll explain how and why a technical person should approach and run webinars. At the end, I'll link to a longer report where I go through the entire process from start to finish.

My experiences

I've run webinars in big companies and small companies and I've had my share of problems. I've faced visual and audio issues, planning issues, marketing issues and on and on. I've learned from what went wrong so I can advise you on what to do.  Here's my summary advice: make sure you understand the whole process end-to-end so you can step in to fix any shortcomings.

What value do you bring?

Why should anyone come to your webinar? 

The marketing department may have asked you to do a webinar, but frankly, they're not going to answer this question for you. If it isn't clear why anyone should attend your webinar, then you're not going to get a good audience. Webinars are not free to attend: they cost your attendees their time, which is extremely valuable. To justify spending someone's time, here are some questions you should ask:

  • who should attend?
  • what will they learn?
  • what will they take away?

Before you do anything, you need to be clear on these points.

Let's take an example of where engineers fall down: webinars for new minor releases. The marketing team wants a webinar on the new release with the goal of increasing leads. The problem is, the new release is a minor one, really only of interest to existing customers. Unfortunately, the engineering team will only commit resources to a release webinar, so that's what gets scheduled. This is a common situation, and the irreconcilable conflict of goals and resources will lead to the webinar failing. In this case, the engineers and the marketing team need to discuss what's really needed; perhaps there are two webinars, one focused on the new functionality for existing customers and another on the product overall for prospects. It needs an honest discussion in the company.

I go into this in a lot more detail in my report.

Is the marketing in order?

In almost all cases, the goal of a webinar is to generate sales leads. Usual measures of success are leads generated or sales contributions. To be successful then, the marketing behind the webinar must be effective. This means:

  • a clear and unambiguous value proposition
  • a compelling summary
  • a clearly defined market demographic (e.g. the job titles and organizations you want to reach)
  • an effective recruitment campaign (registration page, social media outreach, email etc.)
  • a compelling call to action at the end of the webinar (e.g. register for more content)

If some or all of these steps are missing, the webinar will be a disappointment.

These steps are usually under the control of the marketing department, but I've done webinars where some or all of these steps were missing and the results weren't good. Even if you're a completely technical person, you need to ensure that the marketing for your webinar is effective.

Does the webinar have a good story?

This means the webinar must tell a compelling story and have a consistent narrative with a beginning, middle, and end. It should finish with a clear and unambiguous call to action.

A good test of whether you have a good story is the 30 second summary. Summarize your webinar in a 30 second pitch. Does it sound good? If not, try again.

Is the audio-visual setup good enough?

Some of this is obvious, but some of it isn't. Audio filtering can clean up some background noises, but not others, for example, you can't filter out echoes. Here's my checklist:

  • Good quality microphone plus a good pop filter - the pop filter is extremely important. 
  • Record your webinar in an acoustically quiet environment. This means few background noises and as much sound-deadening material as possible. A bedroom is a good place to record a webinar because all the soft furnishings help deaden noise.
  • Make sure your demos work end-to-end. If at all possible, pre-record them and play the recording out during the webinar (but be careful about the technology).

Duration and Q&A

Don't do more than 25 minutes and stick to your schedule. Don't overrun. Leave your audience wanting more, which means you can offer more material as a follow-up (and an excuse for more interaction and selling).

Don't leave Q&A to chance. Have a set of canned questions and answers ready to go if your audience is slow to ask questions or the questions are ones you don't want to answer.

The complete guide

This is a small taster of what you have to do to make a webinar successful. I've expanded a lot on my thoughts and written a comprehensive guide, covering everything from microphone selection to landing pages. You can get my guide by clicking on the link below.



Thursday, August 3, 2023

Using ChatGPT for real to interpret text

What's real and what isn't with ChatGPT?

There's a huge amount of hype surrounding ChatGPT and I've heard all kinds of "game changing" stories around it. But what's real and what's not?

In this blog post, I'm going to show you one of the real things ChatGPT can do: extract meaning from text. I'll show you how well it performs, discuss some of its shortcomings, and highlight important considerations for using it in business. I'm going to do it with real code and real data.

We're going to use ChatGPT to extract meaning from news articles, specifically, two articles on the Women's World Cup.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons. I for one, welcome our new robot overlords...

The Women's World Cup

At the time of writing, the Women's World Cup is in full swing and England have just beaten China 6-1. There were plenty of news stories about it, so I took just two and tried to extract structured, factual data from the articles.

Here are the two articles:

Here is the data I wanted to pull out of the text:
  • The sport being played
  • The competition
  • The names of the teams
  • Who won
  • The score
  • The attendance
I wanted it in a structured format, in this case, JSON.

Obviously, you could read the articles and just extract the information, but the value of ChatGPT is doing this at scale, to scan thousands or millions of articles to search for key pieces of information. Up until now, this has been done by paying people in the developing world to read articles and extract data. ChatGPT offers the prospect of slashing the cost of this kind of work and making it widely available.

Let's see it in action.

Getting started

This example is all in Python and I'm assuming you have a good grasp of the language.

Download the OpenAI library:

pip install openai

Register for OpenAI and get an API key. At the time of writing, you get $5 in free credits and this tutorial won't consume much of that $5.

You'll need to set your API key in your code. To get going, we'll just paste it into our Python file:

import openai
openai.api_key = "YOUR_KEY"

You should note that OpenAI will rescind any keys they find on the public internet. My use of the key in code is very sloppy from a security point of view. Only do it to get started.

Some ChatGPT basics

We're going to focus on just one part of ChatGPT, the ChatCompletion API. Because there's some complexity here, I'm going to go through some of the background before diving into the code.

To set the certainty of its answers, ChatGPT has a concept of "temperature". This is a parameter that sets how "sure" the answer is; the lower the number the more sure the answer. A more certain answer comes at the price of creativity, so for some applications, you might want to choose a higher temperature (for example, you might want a higher temperature for a chatbot). The temperature range is 0 to 1, and we'll use 0 for this example because we want highly reliable analysis.

There are several ChatGPT models each with a different pricing structure. As you might expect, the larger and more recent models are more expensive, so for this tutorial, I'm going to use an older and cheaper model, "gpt-3.5-turbo", that works well enough to show what ChatGPT can do.

ChatGPT works on a model of "roles" and "messages". Roles are the actors in a chat; for a chatbot there will be a "user" role, which is the human entering text, an "assistant" role which is the chat response, and a "system" role controlling the assistant. Messages are the text from the user or the assistant or a "briefing" for the system. For a chatbot, we need multiple messages, but to extract meaning from text, we just need one. To analyze the World Cup articles, we only need the user role.

To get an answer, we need to pose a question or give ChatGPT an instruction on what to do. That's part of the "content" we set in the messages parameter. The content must contain the text we want to analyze and instructions on what we want returned. This is a bigger topic and I'm going to dive into it next.

Prompt engineering part 1

Setting the prompt correctly is the core of working with ChatGPT and it's a bit of an art, which is why it's been called prompt engineering. You have to write your prompt very carefully to get the results you expect.

Oddly, ChatGPT doesn't separate the text from the query; they're all bundled together in the same prompt. This means you have to clearly tell ChatGPT what you want to analyze and how you want it analyzed.

Let's start with a simple example, let's imagine you want to know how many times the letter "e" occurs in the text "The kind old elephant." Here's how you might write the prompt:

f"""In the following text, how often does the letter e occur:

"The kind old elephant"

"""

This gives us the correct answer (3). We'll come back to this prompt later because it shows some of the pitfalls of working with ChatGPT. In general, we need to be crystal clear about the text we want the system to analyze.

Let's say we wanted the result in JSON, here's how we might write the prompt:

f"""

In the following text, how often does the letter e occur, write your answer as JSON:

"The kind old elephant"

"""

Which gives us {"e": 3}

We can ask more complex questions about some text, but we need to lay out the query very carefully and distinguish between the text and the questions. Here's an example.

prompt = f"""

In the text indicated by three back ticks answer the \

following questions and output your answer as JSON \

using the key names indicated by the word "key_name" \

1) how often does the letter e occur key_name = "letter" \

2) what animal is referred to key_name = "animal" \

```The kind old elephant```

"""

Using ChatGPT

Let's put what we've learned together and build a ChatGPT query to ask questions about the Women's World Cup. Here's the code using the BBC article.

world = """

Lauren James produced a sensational individual 

performance as England entertained to sweep aside 

China and book their place in the last 16 of the 

Women's World Cup as group winners.


It was a display worthy of their status as European 

champions and James once again lit the stage alight 

in Adelaide with two sensational goals and three assists.


The 13,497 in attendance were treated to a masterclass 

from Chelsea's James, who announced her arrival at the 

World Cup with the match-winner against Denmark on Friday.


She helped England get off to the perfect start when 

she teed up Alessia Russo for the opener, and 

later slipped the ball through to Lauren Hemp to 

coolly place it into the bottom corner.


It was largely one-way traffic as England dominated 

and overwhelmed, James striking it first time into 

the corner from the edge of the box to make it 3-0 

before another stunning finish was ruled out by video 

assistant referee (VAR) for offside in the build-up.

China knew they were heading out of the tournament 

unless they responded, so they came out with more 

aggression in the second half, unnerving England 

slightly when Shuang Wang scored from the penalty 

spot after VAR picked up a handball by defender 

Lucy Bronze.


But James was not done yet - she volleyed Jess Carter's 

deep cross past helpless goalkeeper Yu Zhu for 

England's fourth before substitute Chloe Kelly and 

striker Rachel Daly joined the party.


England, who had quietly gone about their business 

in the group stages, will have raised eyebrows with 

this performance before their last-16 match against 

Nigeria on Monday, which will be shown live on 

BBC One at 08:30 BST.


China are out of the competition after Denmark beat 

Haiti to finish in second place in Group D.


England prove worth without Walsh


Manager Sarina Wiegman kept everyone guessing when 

she named her starting XI, with England fans 

anxiously waiting to see how they would set up 

without injured midfielder Keira Walsh.

Wiegman's response was to unleash England's attacking 

talent on a China side who struggled to match them 

in physicality, intensity and sharpness.


James oozed magic and unpredictability, Hemp used her 

pace to test China's defence and captain Millie Bright 

was ferocious in her tackling, winning the ball back 

on countless occasions.


After nudging past Haiti and Denmark with fairly 

underwhelming 1-0 wins, England were keen to impose 

themselves from the start. Although China had chances 

in the second half, they were always second best.


Goalkeeper Mary Earps will be disappointed not to keep 

a clean sheet, but she made two smart saves to deny 

Chen Qiaozhu.


While England are yet to meet a side ranked inside 

the world's top 10 at the tournament, this will help 

quieten doubts that they might struggle without the 

instrumental Walsh.


"We're really growing into the tournament now," said 

captain Bright. "We got a lot of criticism in the first 

two games but we were not concerned at all.


"It's unbelievable to be in the same team as 

[the youngsters]. It feels ridiculous and I'm quite 

proud. Players feeling like they can express themselves 

on the pitch is what we want."


James given standing ovation


The name on everyone's lips following England's win 

over Denmark was 'Lauren James', and those leaving 

Adelaide on Tuesday evening will struggle to forget 

her performance against China any time soon.


She punished China for the space they allowed her on 

the edge of the box in the first half and could have 

had a hat-trick were it not for the intervention of VAR.

Greeted on the touchline by a grinning Wiegman, 

James was substituted with time to spare in the second 

half and went off to a standing ovation from large 

sections of the stadium.


"She's special - a very special player for us and 

for women's football in general," said Kelly. "She's 

a special talent and the future is bright."


She became only the third player on record (since 2011) 

to be directly involved in five goals in a Women's 

World Cup game.


With competition for attacking places in England's 

starting XI extremely high, James has proven she is 

far too good to leave out of the side and is quickly 

becoming a star at this tournament at the age of 21.

"""

prompt = f"""

In the text indicated by three back ticks answer the \

following questions and output your answer as JSON \

using the key names indicated by the word key_name" \

1) What sport was being played? key_name="sport" \

2) What competition was it? key_name="competition" \

3) What teams were playing? key_name = "teams" \

4) Which team won? key_name = "winner" \

5) What was the final score? key_name = "score" \

6) How many people attended the match? key_name = "attendance" \

```{world}```

"""

messages = [{"role": "user", "content": prompt}]

response = (openai

            .ChatCompletion

            .create(model=model,

                    messages=messages,

                    temperature=0)

           )

print(response.choices[0].message["content"])


Here are the results this code produces:

{
  "sport": "Football",
  "competition": "Women's World Cup",
  "teams": "England and China",
  "winner": "England",
  "score": "England 5 - China 1",
  "attendance": 13497
}

This is mostly right, but not quite. The score was actually 6-1. Even worse, the results are very sensitive to the text layout; changing line breaks changes the score.

I ran the same query, but with the Guardian article instead and here's what I got:

{
  "sport": "football",
  "competition": "World Cup",
  "teams": "England and China",
  "winner": "England",
  "score": "6-1",
  "attendance": null
}

With a better prompt, it might be possible to get better consistency and remove some of the formatting inconsistencies. By analyzing multiple articles on the same event, it may be possible to increase the accuracy still further.

Hallucinations

Sometimes ChatGPT gets it very wrong and supplies wildly inaccurate answers. We've already seen a little of that with its analysis of the World Cup game: it wrongly inferred a score of 5-1 when it should have been 6-1. But ChatGPT can get it wrong in much worse ways.

I ran the queries above with text from the BBC and The Guardian. What if I ran the query with no text at all? Here's what I get when there's no text at all to analyze.

{
  "sport": "football",
  "competition": "World Cup",
  "teams": ["France", "Croatia"],
  "winner": "France",
  "score": "4-2",
  "attendance": "80,000"
}

Which is completely made up, hence the term hallucination.

Prompt engineering part 2

Let's go back to my elephant example from earlier and write it this way:

prompt = f"""

In the following text, "The kind old elephant", 

how often does the letter e occur

"""


model="gpt-3.5-turbo"

messages = [{"role": "user", "content": prompt}]


response = (openai

            .ChatCompletion

            .create(model=model,

                    messages=messages,

                    temperature=0)

           )

print(response.choices[0].message["content"])

Here's what the code returns:

In the phrase "The kind old elephant," the letter "e" occurs 4 times.

Which is clearly wrong.

In this case, the problem is the placement of the text to be analyzed. Moving the text to the end of the prompt and being more explicit about what should be returned helps. Even simply adding the phrase "Give your answer as JSON" to the prompt fixes the issue.

This is why the precise form of the prompt you use is critical and why it may take several iterations to get it right.
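
For example, one reworded version of the prompt that follows this advice might look like the following (my rewording; the exact phrasing is a matter of taste):

prompt = f"""
Answer the following question and give your answer as JSON: \
how often does the letter e occur in the text indicated by three back ticks? \
```The kind old elephant```
"""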

What does all this mean?

The promise of ChatGPT

It is possible to analyze text and extract information from it. This is huge and transformative for business. Here are just a few of the things that are possible:

  • Press clippings automation.
  • Extraction of information from bills of lading.
  • Automated analysis of SEC filings.
  • Automated analysis of company formation documents.
  • Entity extraction.

We haven't even touched on some of the many other things ChatGPT can do, for example:

  • Language translation.
  • Summarization.
  • Report writing.

How to deliver on that promise

As I've shown in this blog post, the art is in prompt engineering. To get it right, you need to invest a good deal of time in getting your prompts just right and you need to test out your prompts on a wide range of inputs. The good news is, this isn't rocket science.

The skills you need

The biggest change ChatGPT introduces is to skill levels. Previously, doing this kind of analysis required a good grasp of theory and the underlying libraries, and it took quite a lot of effort to build a system to analyze text. Not any more; the skill level needed has dropped precipitously: previously, you needed a Ph.D., now you don't. Now it's all about formulating a good prompt, and that's something a good analyst can do really well.

The bottom line

ChatGPT, and LLMs in general, are transformative. Any business that relies on information must know how to use them.

Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics; the Central Limit Theorem (CLT). These errors weren't trivial either, they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes. 

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval which is based on the standard deviation. The standard deviation (or confidence interval) is a measure of how close we think our sampling mean is to the true (population) mean.

In practice, we use standard formulas for the mean and standard deviation. These are available as built-in functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=1}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no. 

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.
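
If you want to try the experiment yourself, here's a minimal sketch of the sampling process described above (the population sizes and distribution parameters are my own choices):

import numpy as np

rng = np.random.default_rng(0)

# Three very different populations: uniform, Poisson, and power law
populations = {
    "uniform": rng.uniform(0, 1, 100_000),
    "Poisson": rng.poisson(3, 100_000),
    "power law": rng.pareto(3, 100_000),
}

sample_size, repeats = 100, 10_000
for name, population in populations.items():
    # Take `repeats` samples of size `sample_size` and record each sample's mean
    means = rng.choice(population, size=(repeats, sample_size)).mean(axis=1)
    # A histogram of `means` looks normal for all three populations
    print(f"{name}: mean of sample means={means.mean():.3f}, spread={means.std():.3f}")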

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. Instead, I'll show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below. 

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40. 

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution; I'm just using a uniform distribution because it's easier to show the results.

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter, in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on mean sales with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using this analysis to compare rep performance.
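
Here's a small illustration with made-up numbers (the deal counts below are hypothetical) showing why the median and IQR are the safer summary for this kind of data:

import numpy as np

# Hypothetical quarterly deal counts for a sales team; one rep had an outlier quarter
deals = np.array([3, 4, 2, 5, 3, 4, 2, 3, 28])

print("mean:", deals.mean())          # dragged upwards by the outlier
print("std:", deals.std(ddof=1))      # also inflated by the outlier
print("median:", np.median(deals))    # robust to the outlier
q1, q3 = np.percentile(deals, [25, 75])
print("IQR:", q3 - q1)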

Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient. 

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data. 

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed. 

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't. 

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

  • The mode. This is the value that occurs most often.
  • The median. Rank order the data and this is the middle value.
  • The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions. 

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means. 

  • Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
  • All the samples come from the same probability distribution.
Actually, the "identically distributed" part isn't strictly required: versions of the CLT apply even if the samples come from different distributions. However, the independence requirement still holds.

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out distributions like the Cauchy distribution. The samples must also be iid, as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

  • as \( n \to \infty \), the sample mean converges to the population mean
  • as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

A different version of the CLT

There's another version of the CLT that's not well known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples we take their product, then the products tend to a log-normal distribution rather than a normal distribution. In other words, where we have a quantity created from the product of samples, we should expect it to follow a log-normal distribution.
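
Here's a quick sketch (my own illustration, with arbitrary parameters) showing the product version in action:

import numpy as np

rng = np.random.default_rng(1)

# Multiply together samples drawn from a positive, non-normal distribution;
# the distribution of the products tends towards log-normal
products = rng.uniform(0.5, 1.5, size=(10_000, 30)).prod(axis=1)

# The log of a log-normal variable is normal, so log(products) should look
# roughly normal; compare a histogram of products with one of np.log(products)
logs = np.log(products)
print(f"log of products: mean={logs.mean():.3f}, std={logs.std():.3f}")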

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply. 

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or multiplies) data will tend to produce a normal (or log-normal) distribution.