Monday, August 7, 2023

How to run webinars

Why do you care about running webinars?

For sales and marketing people, the answer is obvious, to generate leads and do your job. For technical people, things are a little murkier, and as a result, technical people sometimes make avoidable mistakes when they give webinars.

In this blog post, I'll explain how and why a technical person should approach and run webinars. At the end, I'll link to a longer report where I go through the entire process from start to finish.

My experiences

I've run webinars in big companies and small companies and I've had my share of problems. I've faced visual and audio issues, planning issues, marketing issues and on and on. I've learned from what went wrong so I can advise you on what to do.  Here's my summary advice: make sure you understand the whole process end-to-end so you can step in to fix any shortcomings.

What value do you bring?

Why should anyone come to your webinar? 

The marketing department may have asked you to do a webinar, but frankly, they're not going to answer this question for you. If it isn't clear why anyone should attend your webinar, then you're not going to get a good audience. Webinars are not free to attend: they cost your attendees their time, which is extremely valuable. To justify spending someone's time, here are some questions you should ask:

  • who should attend?
  • what will they learn?
  • what will they take away?

Before you do anything, you need to be clear on these points.

Let's take an example of where engineers fall down: webinars for new minor releases. The marketing team wants a webinar on the new release with the goal of increasing leads. The problem is, the new release is a minor one, really only of interest to existing customers. Unfortunately, the engineering team will only commit resources to a release webinar, so that's what gets scheduled. This is a common siuation and the irreconcilable conflict of goals and resources and will lead to the webinar failing. In this case, the engineers and the marketing team need to discuss what's really needed, perhaps there are two webinars, one focused on the new functionality for existing customers and a new webinar on the product overall for prospects. It needs an honest discussion in the company.

I go into this in a lot more detail in my report.

Is the marketing in order?

In almost all cases, the goal of a webinar is to generate sales leads. Usual measures of success are leads generated or sales contributions. To be successful then, the marketing behind the webinar must be effective. This means:

  • a clear and unambiguous value proposition
  • a compelling summary
  • a clearly defined market demographic (e.g. the job titles and organizations you want to reach)
  • an effective recruitment campaign (registration page, social media outreach, email etc.)
  • a compelling call to action at the end of the webinar (e.g. register for more content)

If some or all of these steps are missing, the webinar will be a disappointment.

These steps are usually under the control of the marketing department, but I've done webinars where some or all of these steps were missing and the results were'n't good. Even if you're a completely technical person, you need to ensure that the marketing for your webinar is effective.

Does the webinar have a good story?

This means the webinar must tell a compelling story and have a consistent narrative with a begining, middle, and end. It should finish with a clear and unambiguous call to action.

A good test of whether you have a good story is the 30 second summary. Summarize your webinar in a 30 second pitch. Does it sound good? If not, try again.

Is the audio-visual setup good enough?

Some of this is obvious, but some of it isn't. Audio filtering can clean up some background noises, but not others, for example, you can't filter out echoes. Here's my checklist:

  • Good quality microphone plus a good pop filter - the pop filter is extremely important. 
  • Record your webinar in an acoustically quiet environment. This means few background noises and as much sound deadening material as possible. A bedroom is good place to record a webinar because all the soft furnishings help deaden noise.
  • Make sure your demos work end-to-end. If at all possible, pre-record them and play the recording out during the webinar (but be careful about the technology).

Duration and Q&A

Don't do more than 25 minutes and stick to your schedule. Don't overrun. Leave your audience wanting more, which means you can offer more material as a follow-up (and excuse for more interaction and selling).

Don't leave Q&A to chance. Have a set of canned questons and answers ready to go if your audience is slow to ask questions or the questions are ones you don't want to answer.

The complete guide

This is a small taster of what you have to do to make a webinar succesful. I've expanded a lot on my thoughts and written a comprehensive guide, covering everything from microphone selection to landing pages. You can get my guide by clicking on the link below.



Thursday, August 3, 2023

Using ChatGPT for real to interpret text

What's real and what isn't with ChatGPT?

There's a huge amount of hype surrounding ChatGPT and I've heard all kinds of "game changing" stories around it. But what's real and what's not?

In this blog post, I'm going to show you one of the real things ChatGPT can do: extract meaning from text. I'll show you how well it performs, discuss some of its shortcomings, and highlight important considerations for using it in business. I'm going to do it with real code and real data.

We're going to use ChatGPT to extract meaning from news articles, specifically, two articles on the Women's World Cup.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons. I for one, welcome our new robot overlords...

The Women's World Cup

At the time of writing, the Women's World Cup is in full swing and England have just beaten China 6-1. There were plenty of news stories about it, so I took just two and tried to extract structured, factual data from the articles.

Here are the two articles:

Here is the data I wanted to pull out of the text:
  • The sport being played
  • The competition
  • The names of the teams
  • Who won
  • The score
  • The attendance
I wanted it in a structured format, in this case, JSON.

Obviously, you could read the articles and just extract the information, but the value of ChatGPT is doing this at scale, to scan thousands or millions of articles to search for key pieces of information. Up until now, this has been done by paying people in the developing world to read articles and extract data. ChatGPT offers the prospect of slashing the cost of this kind of work and making it widely available.

Let's see it in action.

Getting started

This example is all in Python and I'm assuming you have a good grasp of the language.

Download the OpenAI library:

pip install openai

Register for OpenAI and get an API key. At the time of writing, you get $5 in free credits and this tutorial won't consume much of that $5.

You'll need to set your API key in your code. To get going, we'll just paste it into our Python file:

import openai
openai.api_key = "YOUR_KEY"

You should note that OpenAI will rescind any keys they find on the public internet. My use of the key in code is very sloppy from a security point of view. Only do it to get started.

Some ChatGPT basics

We're going to focus on just one part of ChatGPT, the ChatCompletion API. Because there's some complexity here, I'm going to go through some of the background before diving into the code.

To set the certainty of its answers, ChatGPT has a concept of "temperature". This is a parameter that sets how "sure" the answer is; the lower the number the more sure the answer. A more certain answer comes at the price of creativity, so for some applications, you might want to choose a higher temperature (for example, you might want a higher temperature for a chatbot). The temperature range is 0 to 1, and we'll use 0 for this example because we want highly reliable analysis.

There are several ChatGPT models each with a different pricing structure. As you might expect, the larger and more recent models are more expensive, so for this tutorial, I'm going to use an older and cheaper model, "gpt-3.5-turbo", that works well enough to show what ChatGPT can do.

ChatGPT works on a model of "roles" and "messages". Roles are the actors in a chat; for a chatbot there will be a "user" role, which is the human entering text, an "assistant" role which is the chat response, and a "system" role controlling the assistant. Messages are the text from the user or the assistant or a "briefing" for the system. For a chatbot, we need multiple messages, but to extract meaning from text, we just need one. To analyze the World Cup articles, we only need the user role.

To get an answer, we need to pose a question or give ChatGPT an instruction on what to do. That's part of the "content" we set in the messages parameter. The content must contain the text we want to analyze and instructions on what we want returned. This is a bigger topic and I'm going to dive into it next.

Prompt engineering part 1

Setting the prompt correctly is the core of ChatGBP and it's a bit of an art, which is why it's been called prompt engineering. You have to very carefully write your prompt to get the results you expect.

Oddly, ChatGPT doesn't separate the text from the query; they're all bundled together in the same prompt. This means you have to clearly tell ChatGPT what you want to analyze and how you want it analyzed.

Let's start with a simple example, let's imagine you want to know how many times the letter "e" occurs in the text "The kind old elephant." Here's how you might write the prompt:

f"""In the following text, how often does the letter e occur:

"The kind old elephant"

"""

This gives us the correct answer (3). We'll come back to this prompt later because it shows some of the pitfalls of working with ChatGPT. In general, we need to be crystal clear about the text we want the system to analyze.

Let's say we wanted the result in JSON, here's how we might write the prompt:

f"""

In the following text, how often does the letter e occur, write your answer as JSON:

"The kind old elephant"

"""

Which gives us {"e": 3}

We can ask more complex questions about some text, but we need to very carefully layout the query and distinguish between text and questions. Here's an example.

prompt = f"""

In the text indicated by three back ticks answer the \

following questions and output your answer as JSON \

using the key names indicated by the word "key_name" \

1) how often does the letter e occur key_name = "letter" \

2) what animal is referred to key_name = "animal" \

```The kind old elephant```

"""

Using ChatGPT

Let's put what we've learned together and build a ChatGPT query to ask questions about the Women's World Cup. Here's the code using the BBC article.

world = """

Lauren James produced a sensational individual 

performance as England entertained to sweep aside 

China and book their place in the last 16 of the 

Women's World Cup as group winners.


It was a display worthy of their status as European 

champions and James once again lit the stage alight 

in Adelaide with two sensational goals and three assists.


The 13,497 in attendance were treated to a masterclass 

from Chelsea's James, who announced her arrival at the 

World Cup with the match-winner against Denmark on Friday.


She helped England get off to the perfect start when 

she teed up Alessia Russo for the opener, and 

later slipped the ball through to Lauren Hemp to 

coolly place it into the bottom corner.


It was largely one-way traffic as England dominated 

and overwhelmed, James striking it first time into 

the corner from the edge of the box to make it 3-0 

before another stunning finish was ruled out by video 

assistant referee (VAR) for offside in the build-up.

China knew they were heading out of the tournament 

unless they responded, so they came out with more 

aggression in the second half, unnerving England 

slightly when Shuang Wang scored from the penalty 

spot after VAR picked up a handball by defender 

Lucy Bronze.


But James was not done yet - she volleyed Jess Carter's 

deep cross past helpless goalkeeper Yu Zhu for 

England's fourth before substitute Chloe Kelly and 

striker Rachel Daly joined the party.


England, who had quietly gone about their business 

in the group stages, will have raised eyebrows with 

this performance before their last-16 match against 

Nigeria on Monday, which will be shown live on 

BBC One at 08:30 BST.


China are out of the competition after Denmark beat 

Haiti to finish in second place in Group D.


England prove worth without Walsh


Manager Sarina Wiegman kept everyone guessing when 

she named her starting XI, with England fans 

anxiously waiting to see how they would set up 

without injured midfielder Keira Walsh.

Wiegman's response was to unleash England's attacking 

talent on a China side who struggled to match them 

in physicality, intensity and sharpness.


James oozed magic and unpredictability, Hemp used her 

pace to test China's defence and captain Millie Bright 

was ferocious in her tackling, winning the ball back 

on countless occasions.


After nudging past Haiti and Denmark with fairly 

underwhelming 1-0 wins, England were keen to impose 

themselves from the start. Although China had chances 

in the second half, they were always second best.


Goalkeeper Mary Earps will be disappointed not to keep 

a clean sheet, but she made two smart saves to deny 

Chen Qiaozhu.


While England are yet to meet a side ranked inside 

the world's top 10 at the tournament, this will help 

quieten doubts that they might struggle without the 

instrumental Walsh.


"We're really growing into the tournament now," said 

captain Bright. "We got a lot of criticism in the first 

two games but we were not concerned at all.


"It's unbelievable to be in the same team as 

[the youngsters]. It feels ridiculous and I'm quite 

proud. Players feeling like they can express themselves 

on the pitch is what we want."


James given standing ovation


The name on everyone's lips following England's win 

over Denmark was 'Lauren James', and those leaving 

Adelaide on Tuesday evening will struggle to forget 

her performance against China any time soon.


She punished China for the space they allowed her on 

the edge of the box in the first half and could have 

had a hat-trick were it not for the intervention of VAR.

Greeted on the touchline by a grinning Wiegman, 

James was substituted with time to spare in the second 

half and went off to a standing ovation from large 

sections of the stadium.


"She's special - a very special player for us and 

for women's football in general," said Kelly. "She's 

a special talent and the future is bright."


She became only the third player on record (since 2011) 

to be directly involved in five goals in a Women's 

World Cup game.


With competition for attacking places in England's 

starting XI extremely high, James has proven she is 

far too good to leave out of the side and is quickly 

becoming a star at this tournament at the age of 21.

"""

prompt = f"""

In the text indicated by three back ticks answer the \

following questions and output your answer as JSON \

using the key names indicated by the word key_name" \

1) What sport was being played? key_name="sport" \

2) What competition was it? key_name="competition" \

3) What teams were playing? key_name = "teams" \

4) Which team won? key_name = "winner" \

5) What was the final score? key_name = "score" \

6) How many people attended the match? key_name = "attendance" \

```{world}```

"""

messages = [{"role": "user", "content": prompt}]

response = (openai

            .ChatCompletion

            .create(model=model,

                    messages=messages,

                    temperature=0)

           )

print(response.choices[0].message["content"])


Here are the results this code produces:

{

  "sport": "Football",

  "competition": "Women's World Cup",

  "teams": "England and China",

  "winner": "England",

  "score": "England 5 - China 1",

  "attendance": 13497

}

This is mostly right, but not quite. The score was actually 6-1. Even worse, the results are very sensitive to the text layout; changing line breaks changes the score.

I ran the same query, but with the Guardian article instead and here's what I got:

{

  "sport": "football",

  "competition": "World Cup",

  "teams": "England and China",

  "winner": "England",

  "score": "6-1",

  "attendance": null

}

With a better prompt, it might be possible to get better consistency and remove some of the formatting inconsistencies. By analyzing multiple articles on the same event, it may be possible to increase the accuracy still further.

Hallucinations

Sometimes ChatGPT gets it very wrong and supplies wildly wrong answers. We've seen a little of that with its analysis of the World Cup game, it wrongly inferred a score of 5-1 when it should have been 6-1. But ChatGPT can get it wrong in much worse ways.

I ran the queries above with text from the BBC and The Guardian. What if I ran the query with no text at all? Here's what I get when there's no text at all to analyze.

{

  "sport": "football",

  "competition": "World Cup",

  "teams": ["France", "Croatia"],

  "winner": "France",

  "score": "4-2",

  "attendance": "80,000"

}

Which is completely made up, hence the term hallucination.

Prompt engineering part 2

Let's go back to my elephant example from earlier and write it this way:

prompt = f"""

In the following text, "The kind old elephant", 

how often does the letter e occur

"""


model="gpt-3.5-turbo"

messages = [{"role": "user", "content": prompt}]


response = (openai

            .ChatCompletion

            .create(model=model,

                    messages=messages,

                    temperature=0)

           )

print(response.choices[0].message["content"])

Here's what the code returns:

In the phrase "The kind old elephant," the letter "e" occurs 4 times.

Which is clearly wrong.

In this case, the problem is the placement of the text to be analyzed. Moving the text to the end of the prompt and being more explicit about what should be returned helps. Even simply adding the phrase "Give your answer as JSON" to the prompt fixes the issue.

This is why the precise form of the prompt you use is critical and why it may take several iterations to get it right.

What does all this mean?

The promise of ChatGPT

It is possible to analyze text and extract information from it. This is huge and transformative for business. Here are just a few of the things that are possible:

  • Press clippings automation.
  • Extraction of information from bills of lading.
  • Automated analysis of SEC filings.
  • Automated analysis of company formation documents.
  • Entity extraction.

We haven't even touched on some of the many other things ChatGPT can do, for example:

  • Language translation.
  • Summarization.
  • Report writing.

How to deliver on that promise

As I've shown in this blog post, the art is in prompt engineering. To get it right, you need to invest a good deal of time in getting your prompts just right and you need to test out your prompts on a wide range of inputs. The good news is, this isn't rocket science.

The skills you need

The biggest change ChatGPT introduces is skill levels. Previously, doing this kind of analysis required a good grasp of theory and underlying libraries. It took quite a lot of effort to build a system to analyze text. Not any more; the skill level has just dropped precipitously; previously, you needed a Ph.D., now you don't. Now it's all about formulating a good prompt and that's something a good analyst can do really well.

The bottom line

ChatGPT, and LLMs in general, are transformative. Any business that relies on information must know how to use them.

Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics; the Central Limit Theorem (CLT). These errors weren't trivial either, they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes. 

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval which is based on the standard deviation. The standard deviation (or confidence interval) is a measure of how close we think our sampling mean is to the true (population) mean.

In practice, we use standard formula for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=0}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=0}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no. 

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below. 

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40. 

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution, I'm just using a uniform distribution because it's easier to show the results 

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter, in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on mean sales with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using this analysis to compare rep performance.

Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient. 

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data. 

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed. 

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't. 

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

  • The mode. This is the value that occurs most often.
  • The median. Rank order the data and this is the middle value.
  • The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions. 

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means. 

  • Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
  • All the samples come from the same probability distribution.
Actually, this isn't true. The CLT applies even if the distributions are not the same. However, the independence requirement still holds,

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must be iid as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

  • as \( n \to \infty \), the sample mean converges to the population mean
  • as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution. 

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply. 

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or products) data will lead to a normal distribution (or log-normal).

Tuesday, July 25, 2023

ChatGPT and code generation: be careful

I've heard bold pronouncements that Large Language Models (LLMs), and ChatGPT in particular, will greatly speed up software development with all kinds of consequences. Most of these pronouncements seem to come from 'armchair generals' who are a long way from writing code. I'm going to chime in with my real-world experiences and give you a more realistic view.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons

I've used ChatGPT to generate Python code to solve some small-scale problems. These are things like using an API or doing some simple statistical analysis or chart plotting. Recently, I've branched out to more complex problems, which is where its limitations become more obvious.

In my experience, ChatGPT is excellent for generating code for small problems. It might not solve the problem completely, but it will automate most of the boring pieces and give you a good platform to get going. The code it generates is good with some exceptions. It doesn't generate doc strings for functions, it's light on comments, and it doesn't always follow PEP8 layout, but it does lay out its code clearly and it uses functions well. The supporting documentation it creates is great, in fact, it's much better than the documentation most humans produce. 

For larger problems, it falls down, sometimes badly. I gave it a brief to create code to demonstrate the Central Limit Theorem (CLT) using Bokeh charts with several underlying distributions. Part of the brief it did well and it clearly understood how to demonstrate the CLT, but there were problems I had to fix. It generated code for an out-of-date version of Bokeh which required some digging and coding to fix; this could have been cured by simply adding comments about the versions of libraries it was using. It also chose some wrong variable names (it used the reverse of what I would have chosen). More importantly, it did some weird and wrong things with the data at the end of the process, I spotted its mistake in a few minutes and spent 30 minutes rewriting code to correct it. I had similar problems with other longer briefs I gave ChatGPT.

Obviously, the problems I encountered could have been due to incomplete or ambiguous briefs. A solution might have been to spend time refining my brief until it gave me the code I wanted, but that may have taken some time. Which would have been faster, writing new detailed briefs or fixing code that was only a bit wrong? 

More worryingly,  I spotted what was wrong because I knew the output I expected. What if this had been a new problem where I didn't know what the result should look like?

After playing around with ChatGPT for a while, here are my takeaways:

  • ChatGPT code generation is about the level of a good junior programmer.
  • You should use it as a productivity boost to automate the boring bits of coding, a jump start.
  • Never trust the code and always check what it's doing. Don't use it when you don't know what the result should look like.

Obviously, this is ChatGPT today and the technology isn't standing still. I would expect future versions to improve on commenting etc. What will be harder is the brief. The problem here isn't the LLM, it's with the person writing the brief. English is a very imperfect language for detailed specifications which means we're stuck with ambiguities. I might write what I think is the perfect brief, only to find out I've been imprecise or ambiguous. Technology change is unlikely to fix this problem in the short term.

Of course, other industries have gone through similar disruptive changes in the past. The advent of CAD/CAM didn't mean the end of factory work, it raised productivity at the expense of requiring a higher skill set. The workers with the higher skillset gained, and those with a lesser skillset lost out.

In my view, here's how things are likely to evolve. LLMs will become standard tools to aid data scientists and software developers. They'll be productivity boosters that will require a high skill set to use. The people most negatively impacted will be junior staff or the less skilled, the people who gain the most will be those with experience and a high skill level.

Thursday, May 18, 2023

Isolated track vocals; hear who really can sing

Hear just the singer

Modern signal processing and machine learning can do some incredible things, one of which is to take a song and isolate just the singer's voice. It's called isolated track vocals and it sounds a bit like an a cappella version of a song. It's a bit weird sometimes, but it lets you hear who's a great singer and who just isn't. Here are some notable vocals I thought you might like.

Freddie Mercury - Queen - We Are The Champions

This man could sing. There are lots of Queen songs that have gone through the isolated track vocals process, but I've just chosen one for you to listen to. As you might expect, Freddie Mercury is outstanding.

Nirvana - Smells Like Teen Spirit


The Beatles - We Can Work It Out


The Clash - London Calling

The singing here isn't as good as on some of the other songs, but there's 100% commitment and passion.


Listen to more

You can hear more isolated track vocals on YouTube or SoundCloud, just search for 'isolated track vocals'.

Monday, May 15, 2023

The bad boy of bar charts: William Playfair

A spy, a scoundrel, and a scholar

William Playfair was all three. He led an extraordinary life at the heart of many of the great events of the 18th and 19th centuries, mostly in morally dubious roles. Among all the intrigue, scandal, and indebtedness, he found time to invent the bar chart and pie chart and make pioneering use of line charts. As we'll see, he was quite a character.

Playfair the scoundrel

Playfair's lifetime (1759-1823) contained some momentous events:

  • The development of the steam engine
  • The French revolution
  • American independence and the establishment of a new US government
  • The introduction of paper money

and in different ways, some of them fraudulent, Playfair played a role.

He was born in 1759 in Dundee, Scotland, and due to his father's death, he was apprenticed to the inventor of the threshing machine at age 13. From there, he went to work for one of the leading producers of steam engines, James Watt. So far, this is standard "pull yourself up by your bootstraps with family connections" stuff. Things started to go awry when he moved to London in 1782 where he set up a silversmith company and was granted several patents. The business failed with some hints of impropriety, which was a taste of things to come.

In 1787, he moved to Paris where he sold steam engines and was a dealmaking middleman. This meant he knew the leading figures of French society. He was present at the storming of the Bastille in 1789 and may have had some mid-level command role there. During the revolution, he continued to do deals and work with the French elite, but he made enemies along the way. As the reign of terror got going, Playfair fled the country.

Before fleeing, Playfair had a hand in the Scioto Company, a company formed to sell land to settlers in the Ohio Valley in the new United States. The idea of setting up in a new land was of course attractive to the French elite who could see how the revolution was going.  The trouble was, the land was in territory controlled by Native Americans and it was also undeveloped and remote. In other words, completely unsuited for the French Bourgeoisie who were buying the land for a fresh start. The scheme even ended up entangling George Washington. It all ended badly and the US government had to step in to clean up the mess. This is considered to be the first major scandal in US history.

By 1793, Playfair was back in London where he helped formed a security bank, similar to institutions he'd been involved with in France.  Of course, it failed with allegations of fraud.

Playfair had always been a good writer and good at explaining data. He'd produced several books and pamphlets, and by the mid-1790s, he was trying to earn a living at it. But things didn't go too well, and he ended up imprisoned for debt in the notorious Fleet Prison (released in 1802). He tried to write his way out of debt, and notably, some of his most influential books were written while in prison.

There were no official government spying agencies at the time, but the British government quite happily paid for freelancers to do it, which may be an early example of "plausible deniability". Playfair was one such freelance secret agent. He discovered the secrets of the breakthrough French semaphore system while living in Frankfurt and handed them over to the British government in the mid-1790s. He was also the mastermind behind an audacious scheme to bring down the French government through massive counterfeiting and inflation.  The idea was simple, counterfeit French "paper money" and flood the country with high-quality fakes, stoking inflation and bringing down the currency and hence the government. The scheme may have worked as the currency collapsed and Napoleon took power in a coup in 1799, though Napoleon was worse for the British government than what had existed before.

By 1816, Playfair was broke again. What better way to get money quickly than a spot of blackmail targeted against Lord Archibald Douglas, the wealthiest man in Scotland? If you can dispute his parentage (and therefore his rights to his fortune), you can make a killing. Like many of Playfair's other schemes, this one failed too.

Bar charts and pie charts

Playfair invented the bar chart in his 1786 book, "Commercial and Political Atlas". He wanted to show Scottish imports and exports but didn't have enough data for a time series plot. All he had was imports and exports from different countries and he wanted to display the information in a way that would help his book sell. Here it is, the first bar chart. It shows imports and exports to and from Scotland by country.


This was such a new concept that Playfair had to provide instructions on how to read it.

Playfair's landmark book was "The Statistical Breviary, Shewing on a Principle Entirely New, The Resources of Every State and Kingdom in Europe, Illustrated with Stained Copper-Plate Charts Representing the Physical Powers of Each Distinct Nation with Ease and Perspicuity", which was a statistical economic review of Europe. This book had what may be the first pie chart.


This chart shows how much of the Turkish Empire was geographically European and how much African. Playfair repeated the same type of visualization in 1805's "Statistical Account of the United States of America", but this time in color:


He was an early pioneer of line charts too, as this famous economic chart of England's balance of payments deficits and surpluses shows (again, from 1786's "Commercial and Political Atlas").


Playfair on TV

To my knowledge, there's never been a TV depiction of Playfair, which seems a shame. His life has most of the ingredients for a costume drama mini-series.  There would be British Lords and Ladies in period costumes, French aristocrats in all their finery, political intrigue and terror, the guillotine, espionage, fraud on an epic scale (even allowing George Washington to make an appearance), counterfeiting, stream engines and rolling mills (as things to be sold and as things to make counterfeit money), prison, and of course, writing. It could be a kind of Bridgerton for nerds.

Reading more

Surprisingly, William Playfair is a bit niche and there's not that much about him and his works.

The best source of information is "PLAYFAIR: The True Story of the British Secret Agent Who Changed How We See the World" by Bruce Berkowitz. The book digs into Playfair's wild history and is the best source of information on the Scioto Company scandal and counterfeiting.

Here are some other sources you might find useful (note that most of them reference Bruce Berkowitz's book).

Tuesday, May 9, 2023

The Coronation and visions of the future

The 1950s version of the future

I watched the Coronation last weekend and it reminded me of some childhood experiences I had in Britain. I remember finding and reading old books from around the time of the previous Coronation that talked about the future. I read these books decades after they were published and it was obvious their predictions were way off course. The books I read were from a British perspective, which was similar in some ways to the American vision, but more black and white.

Hovercraft everywhere

Hovercraft are a British invention and first saw use in the UK in the late 1950s. Of course, the books all had (black and white) photos of hovercraft racing across the waves. The prose was breathless and it was plain the authors felt the future was both air-cushioned and British-led. Uniformly, the authors predicted widespread and worldwide adoption. 

(The National Archives UK, No restrictions, via Wikimedia Commons)

By the time I read these books, the problems of hovercraft were becoming apparent; hovercraft as a commercial means of travel were in full retreat. Some of the limitations of hovercraft were well-known in the late 1950s, but none of the books mentioned them, and even as a child, I felt disappointed in the writers' naive optimism. 

Transport

In the future, if we weren't traveling in hovercraft, then we were traveling in cars; no one ever seems to use public transport of any kind. There didn't seem to be a British version of a future car, it all seemed to have been imported from America, including the images. Maybe the writers were already justly cynical about the future of the British car industry.

The Conquest of Space

All the books forecasted a space-faring future and all of them had people living on the moon and in space by the year 2000. In this case, the vision wasn't imported, there was a definite British spin on the space future and the images were home-grown too. There was a belief that Britain would have its own space program, including lunar settlements, space stations, and solar system exploration. All the text and images assumed that these would be British missions; other countries might have programs too, but Britain would have its own, independent, space activities.

The home

Like American versions of the future, British versions of the future were deeply conservative. The family would be a husband and wife and two children, with very traditional gender roles. The husband would travel to work in his futuristic car, the wife would prepare meals in a futuristic kitchen, and the children would play board games in front of a giant TV screen. The kitchen was full of futuristic gadgets to prepare meals in minutes, but the interfaces for these gadgets were always knobs, dials, and switches, and they were always metal with an enamel finish, no plastics in the future. The TV doubled as a videophone and there were illustrations of the family talking to their relatives in one of the white British ex-colonies.

The future clothes were largely 1950s clothes, the "clothing is silver in the future" idea was a cliche even then.

Society and politics

Oddly, there were very few predictions about society changing and the writers all missed what should have been obvious trends.

Immigration into the UK had happened for centuries with group after group arriving and settling. If the waves of immigration were large enough, they had an impact on British culture, including food. All of the writers seemed to assume that there would be no mass immigration and no changes in diet as a result (in the future, everyone ate 1950s food). Although it would be hard to guess what the immigrant groups would be, it should have been obvious that there would be immigration and that it would change Britain. This is an unforgivable miss and shows the futurists were naive. 

None of the writers really dealt with the ongoing consequences of the end of Empire. After independence, many countries continued to do business with Britain, but as colonial ties weakened, they started to do business elsewhere, and as a result, British exports dropped and so did employment. The writers had a sunny optimism that things would continue as before. None of them predicted that the ex-colonies could rise and challenge Britain in any way. The same assumption of British superiority ran through the writing.

Of course, the main assumption behind all of the writing, including the fiction, was that the future was heterosexual, middle-class, and white. The class system was very much intact and very much in its 1950s form. People knew their place and the social hierarchy was solid. Even as a child, I thought this was a suffocating view of the future.

Fiction: Arthur C. Clarke and Dan Dare

I have mixed feelings about the British science fiction of the time. It was naively optimistic, but that was part of its charm and appeal.

The leading science fiction comic strip was "Dan Dare: Pilot of the Future", about a future British space pilot who had adventures battling nefarious aliens across the galaxy. Dan was always the good guy and the aliens were always bad, once again, very black and white. Dan's role was to preserve order for humanity (or Britain). It was Britain as a galactic policeman and force for good. Even today, Dan Dare has a following.

Arthur C. Clarke produced a lot of fiction that was very rooted in British culture of the time and once again was very conservative about society. However, he was sunnily optimistic that somehow things would work out for the best. Ingenuity and bravery would always save the day. 

Optimism and conservatism

The two threads running through the different books I read were optimism and conservatism. The optimism was naive but exciting; the authors all believed the future would be a much better place. The conservatism was constraining though and meant they missed big changes they should have seen.

Perhaps optimism and conservatism were a reflection of the times; Britain was still a global power with interests around the world, it had just emerged victorious from World War II but paid a heavy price. The writers were living in a country that was in a relatively strong position relative to others, even other European nations. The rise of Japan and South Korea was still in the future and China was just emerging from its civil war. Maybe British people wanted to believe in a utopian British future and were willing to buy and keep optimistic books that told them comforting things.

What it says

Everyone is wrong about the future, but how they're wrong tells us something about the attitudes and beliefs of the time. These British books of the 1950s forecasted a technologically advanced world with Britain at its core; the world they painted was one where 1950s British values and cultural norms would remain globally dominant. It almost feels as if the writers deliberately built a future in which those values could triumph. 

And what of today?

There's a new King and there will be new books forecasting the future. There's plenty written now about how technology may advance and more writing on how society may change. The difference from the 1950s is the lack of consensus on what society's future may be. I see two opposite trends in fiction and in futurology: pessimism and optimism.

The fiction of choice for pessimism is dystopian. The world as we know it comes to an end through war or zombies or a virus, leaving people fighting for survival. The dominant themes are self-reliance and distrust of strangers; people who are different from you are the enemy.

The fiction of choice for optimism is Star Trek or Dr. Who. The future is fundamentally a decent place with the occasional existential crisis. People work together and strangers are mostly good people who could be your friends. 

Perhaps this split says a lot about today's society. We create futures where the values we believe in can thrive.