Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics: the Central Limit Theorem (CLT). These errors weren't trivial either; they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes.

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling; for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency", which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval, which is based on the standard deviation. The standard deviation measures the spread of the data; the confidence interval built from it is a measure of how close we think our sampling mean is to the true (population) mean (how confident we are).

In practice, we use standard formulas for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=1}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no.

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.
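If you want to try the sampling experiment yourself, here's a minimal sketch using only Python's standard library. It isn't the code behind the charts: the distribution parameters, the `poisson` helper, and the cut to 2,000 repeats (to keep pure Python quick) are all assumptions made for illustration. Instead of plotting, it checks how many sample means fall within 1 and 2 standard deviations of their center, which should be roughly 68% and 95% if they're close to normal.

```python
import math
import random
import statistics


def poisson(rng, lam=3.0):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


# Three very different populations; the parameters are illustrative assumptions.
POPULATIONS = {
    "uniform": lambda rng: rng.random(),
    "Poisson": poisson,
    "power law": lambda rng: rng.paretovariate(5.0),
}


def sample_means(draw, sample_size=100, repeats=2000, seed=0):
    """Return `repeats` sample means, each computed from `sample_size` draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(repeats)
    ]


def fraction_within(values, k):
    """Fraction of values within k standard deviations of their mean."""
    centre = statistics.fmean(values)
    width = statistics.pstdev(values)
    return sum(abs(v - centre) <= k * width for v in values) / len(values)


results = {name: sample_means(draw) for name, draw in POPULATIONS.items()}

for name, means in results.items():
    # A normal distribution has about 68% of its values within 1 standard
    # deviation of the mean and about 95% within 2.
    print(f"{name:10s} within 1 sd: {fraction_within(means, 1):.2f}, "
          f"within 2 sd: {fraction_within(means, 2):.2f}")
```

The three populations look nothing alike, but the fractions printed for their sample means should all land close to the normal-distribution values.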

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below.

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40.

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution; I'm just using a uniform distribution because it's easier to show the results.
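The narrowing can be checked numerically. The sketch below is an illustration under stated assumptions (a uniform population on 0 to 1, 5,000 repeats, and a hypothetical helper called `spread_of_means`). It compares the observed spread of the sample means with the textbook value of the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

# Standard deviation of a uniform distribution on [0, 1).
POPULATION_SD = math.sqrt(1 / 12)


def spread_of_means(sample_size, repeats=5000, seed=1):
    """Standard deviation of `repeats` sample means of uniform draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.random() for _ in range(sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)


spreads = {n: spread_of_means(n) for n in (2, 3, 10, 20, 30, 40)}

for n, observed in spreads.items():
    expected = POPULATION_SD / math.sqrt(n)
    print(f"n={n:2d}  observed width: {observed:.4f}  sigma/sqrt(n): {expected:.4f}")
```

The observed and expected columns should track each other closely, and both shrink as the sample size grows.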

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows. Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem: either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter; in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on mean sales with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using this analysis to compare rep performance.
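Here's a small sketch of the difference using Python's standard `statistics` module. The deal values are made up for illustration (one rep, eight deals in a quarter, the last one unusually large); they're not real sales data.

```python
import statistics

# Hypothetical deal values (in $k) for one rep in one quarter.
# The final deal is an outlier.
deals = [12, 15, 9, 14, 11, 13, 10, 250]

mean = statistics.fmean(deals)
sd = statistics.stdev(deals)

median = statistics.median(deals)
q1, _, q3 = statistics.quantiles(deals, n=4)  # quartile cut points
iqr = q3 - q1

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
print(f"median = {median:.2f}, IQR = {iqr:.2f}")
```

One large deal drags the mean and standard deviation far away from the typical deal, while the median and IQR barely notice it.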

Implications for statistical tests

A hundred years ago, there were very few large-scale tests; for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?
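For reference, here's a minimal sketch of a two-sample z-test using only the standard library. The function name `two_sample_z_test` and the simulated A/B data are assumptions made for illustration; a production analysis would more likely use a statistics package.

```python
import math
import random
import statistics


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means between two large samples.

    Returns (z, p_value). Relies on the CLT, so it assumes both samples
    are large enough for their means to be approximately normal.
    """
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B test: skewed order values, with a small true uplift
# in the variant (mean 52 against a control mean of 50).
rng = random.Random(42)
control = [rng.expovariate(1 / 50.0) for _ in range(20000)]
variant = [rng.expovariate(1 / 52.0) for _ in range(20000)]

z, p_value = two_sample_z_test(variant, control)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The order values themselves are far from normal, but with 20,000 samples per arm the means are, which is all the z-test needs.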

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient.

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian.

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data.
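One way to look for a trend is to track the skewness of the sample means as the sample size grows; a normal distribution has zero skewness. The sketch below is illustrative only: the exponential population, the sample sizes, and the `skewness` helper are all assumptions.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    centre = statistics.fmean(values)
    width = statistics.pstdev(values)
    return statistics.fmean(((v - centre) / width) ** 3 for v in values)


skew_by_size = {}
for n in (5, 30, 200):
    rng = random.Random(n)
    # 4,000 sample means, each from n draws of a skewed (exponential) population.
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(4000)
    ]
    skew_by_size[n] = skewness(means)

for n, skew in skew_by_size.items():
    print(f"n={n:3d}  skewness of sample means: {skew:.2f}")
```

If the skewness is still clearly falling as you pass a sample size of 30, the distribution of the means hasn't finished converging and you need more data.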

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed.

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't.

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

The mode. This is the value that occurs most often.
The median. Rank order the data and this is the middle value.
The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean. In Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions.
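To make the three measures concrete, here's a small sketch using Python's standard `statistics` module. Both data sets are made up for illustration.

```python
import statistics

# Hypothetical data: one symmetrical set and one skewed set.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 16]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        f"{name:9s} mode = {statistics.mode(data)}, "
        f"median = {statistics.median(data)}, "
        f"mean = {statistics.fmean(data)}"
    )
```

For the symmetrical data all three measures are 3. For the skewed data they separate: the mode is 1, the median is 2, and the mean is 4.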

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means.

Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
All the samples come from the same probability distribution.

Actually, the "identically distributed" part isn't strictly true. The CLT applies even if the distributions are not the same. However, the independence requirement still holds.

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must also be independent, as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

as \( n \to \infty \), the sample mean converges to the population mean
as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).
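For readers who like symbols, the second statement is usually written as below. This is a sketch of the classical (Lindeberg-Lévy) form, and the notation is introduced only for this illustration: \( X_1, \dots, X_n \) are the iid samples, \( \bar{X}_n \) is their mean, and \( \mu \) and \( \sigma^2 \) are the population mean and (finite) variance.

\[ \sqrt{n}\,\left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}\left( 0, \sigma^2 \right) \quad \text{as } n \to \infty \]

In practical terms, for large \( n \) the sample mean is approximately normally distributed with mean \( \mu \) and standard deviation \( \sigma / \sqrt{n} \), which is why the distribution of the means gets narrower as the sample size grows.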

A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution.
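The reason is that the log of a product is the sum of the logs, so the usual CLT applies to the logs. Here's a minimal sketch of that idea; the choice of 50 factors drawn uniformly between 0.5 and 1.5, the 3,000 repeats, and the `skewness` helper are assumptions made for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    centre = statistics.fmean(values)
    width = statistics.pstdev(values)
    return statistics.fmean(((v - centre) / width) ** 3 for v in values)


rng = random.Random(7)

# Each value is the product of 50 independent, positive random factors.
products = [
    math.prod(rng.uniform(0.5, 1.5) for _ in range(50))
    for _ in range(3000)
]

# log(product) = sum of log(factors), so the logs should look normal.
log_products = [math.log(p) for p in products]

print(f"skewness of products:      {skewness(products):.2f}")
print(f"skewness of log(products): {skewness(log_products):.2f}")
```

The products themselves are heavily skewed, but their logs are close to symmetrical, which is what a log-normal distribution looks like.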

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply.

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or multiplies) data will tend towards a normal (or log-normal) distribution.

Tuesday, July 25, 2023

ChatGPT and code generation: be careful

I've heard bold pronouncements that Large Language Models (LLMs), and ChatGPT in particular, will greatly speed up software development with all kinds of consequences. Most of these pronouncements seem to come from 'armchair generals' who are a long way from writing code. I'm going to chime in with my real-world experiences and give you a more realistic view.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons

I've used ChatGPT to generate Python code to solve some small-scale problems. These are things like using an API or doing some simple statistical analysis or chart plotting. Recently, I've branched out to more complex problems, which is where its limitations become more obvious.

In my experience, ChatGPT is excellent for generating code for small problems. It might not solve the problem completely, but it will automate most of the boring pieces and give you a good platform to get going. The code it generates is good with some exceptions. It doesn't generate docstrings for functions, it's light on comments, and it doesn't always follow PEP8 layout, but it does lay out its code clearly and it uses functions well. The supporting documentation it creates is great; in fact, it's much better than the documentation most humans produce.

For larger problems, it falls down, sometimes badly. I gave it a brief to create code to demonstrate the Central Limit Theorem (CLT) using Bokeh charts with several underlying distributions. Part of the brief it did well and it clearly understood how to demonstrate the CLT, but there were problems I had to fix. It generated code for an out-of-date version of Bokeh which required some digging and coding to fix; this could have been cured by simply adding comments about the versions of libraries it was using. It also chose some wrong variable names (it used the reverse of what I would have chosen). More importantly, it did some weird and wrong things with the data at the end of the process. I spotted its mistake in a few minutes and spent 30 minutes rewriting code to correct it. I had similar problems with other longer briefs I gave ChatGPT.

Obviously, the problems I encountered could have been due to incomplete or ambiguous briefs. A solution might have been to spend time refining my brief until it gave me the code I wanted, but that may have taken some time. Which would have been faster, writing new detailed briefs or fixing code that was only a bit wrong?

More worryingly, I spotted what was wrong because I knew the output I expected. What if this had been a new problem where I didn't know what the result should look like?

After playing around with ChatGPT for a while, here are my takeaways:

ChatGPT code generation is about the level of a good junior programmer.
You should use it as a productivity boost to automate the boring bits of coding, a jump start.
Never trust the code and always check what it's doing. Don't use it when you don't know what the result should look like.

Obviously, this is ChatGPT today and the technology isn't standing still. I would expect future versions to improve on commenting etc. What will be harder is the brief. The problem here isn't the LLM, it's with the person writing the brief. English is a very imperfect language for detailed specifications which means we're stuck with ambiguities. I might write what I think is the perfect brief, only to find out I've been imprecise or ambiguous. Technology change is unlikely to fix this problem in the short term.

Of course, other industries have gone through similar disruptive changes in the past. The advent of CAD/CAM didn't mean the end of factory work; it raised productivity at the expense of requiring a higher skill set. The workers with the higher skill set gained, and those with a lesser skill set lost out.

In my view, here's how things are likely to evolve. LLMs will become standard tools to aid data scientists and software developers. They'll be productivity boosters that will require a high skill set to use. The people most negatively impacted will be junior staff or the less skilled; the people who gain the most will be those with experience and a high skill level.

Thursday, May 18, 2023

Isolated track vocals; hear who really can sing

Hear just the singer

Modern signal processing and machine learning can do some incredible things, one of which is to take a song and isolate just the singer's voice. It's called isolated track vocals and it sounds a bit like an a cappella version of a song. It's a bit weird sometimes, but it lets you hear who's a great singer and who just isn't. Here are some notable vocals I thought you might like.

Freddie Mercury - Queen - We Are The Champions

This man could sing. There are lots of Queen songs that have gone through the isolated track vocals process, but I've just chosen one for you to listen to. As you might expect, Freddie Mercury is outstanding.

Nirvana - Smells Like Teen Spirit

The Beatles - We Can Work It Out

The Clash - London Calling

The singing here isn't as good as on some of the other songs, but there's 100% commitment and passion.

Listen to more

You can hear more isolated track vocals on YouTube or SoundCloud; just search for 'isolated track vocals'.

Monday, May 15, 2023

The bad boy of bar charts: William Playfair

A spy, a scoundrel, and a scholar

William Playfair was all three. He led an extraordinary life at the heart of many of the great events of the 18th and 19th centuries, mostly in morally dubious roles. Among all the intrigue, scandal, and indebtedness, he found time to invent the bar and pie charts, and make pioneering use of line charts. As we'll see, he was quite a character.

Playfair the scoundrel

Playfair's lifetime (1759-1823) contained some momentous events:

The development of the steam engine
The French revolution
American independence and the establishment of a new US government
The introduction of paper money

and in different ways, some of them fraudulent, Playfair played a role.

He was born in 1759 in Dundee, Scotland, and due to his father's death, he was apprenticed to the inventor of the threshing machine at age 13. From there, he went to work for one of the leading producers of steam engines, James Watt. So far, this is standard "pull yourself up by your bootstraps with family connections" stuff. Things started to go awry when he moved to London in 1782 where he set up a silversmith company and was granted several patents. The business failed with some hints of impropriety, which was a taste of things to come.

In 1787, he moved to Paris where he sold steam engines and was a dealmaking middleman. This meant he knew the leading figures of French society. He was present at the storming of the Bastille in 1789 and may have had some mid-level command role there. During the revolution, he continued to do deals and work with the French elite, but he made enemies along the way. As the reign of terror got going, Playfair fled the country.

Before fleeing, Playfair had a hand in the Scioto Company, a company formed to sell land to settlers in the Ohio Valley in the new United States. The idea of setting up in a new land was of course attractive to the French elite who could see how the revolution was going. The trouble was, the land was in territory controlled by Native Americans and it was also undeveloped and remote. In other words, it was completely unsuited to the French bourgeoisie who were buying the land for a fresh start. The scheme even ended up entangling George Washington. It all ended badly and the US government had to step in to clean up the mess. This is considered to be the first major scandal in US history.

By 1793, Playfair was back in London where he helped form a security bank, similar to institutions he'd been involved with in France. Of course, it failed with allegations of fraud.

Playfair had always been a good writer and good at explaining data. He'd produced several books and pamphlets, and by the mid-1790s, he was trying to earn a living at it. But things didn't go too well, and he ended up imprisoned for debt in the notorious Fleet Prison (released in 1802). He tried to write his way out of debt, and notably, some of his most influential books were written while in prison.

There were no official government spying agencies at the time, but the British government quite happily paid for freelancers to do it, which may be an early example of "plausible deniability". Playfair was one such freelance secret agent. He discovered the secrets of the breakthrough French semaphore system while living in Frankfurt and handed them over to the British government in the mid-1790s. He was also the mastermind behind an audacious scheme to bring down the French government through massive counterfeiting and inflation. The idea was simple, counterfeit French "paper money" and flood the country with high-quality fakes, stoking inflation and bringing down the currency and hence the government. The scheme may have worked as the currency collapsed and Napoleon took power in a coup in 1799, though Napoleon was worse for the British government than what had existed before.

By 1816, Playfair was broke again. What better way to get money quickly than a spot of blackmail targeted against Lord Archibald Douglas, the wealthiest man in Scotland? If you can dispute his parentage (and therefore his rights to his fortune), you can make a killing. Like many of Playfair's other schemes, this one failed too.

Bar charts and pie charts

Playfair invented the bar chart in his 1786 book, "Commercial and Political Atlas". He wanted to show Scottish imports and exports but didn't have enough data for a time series plot. All he had was imports and exports from different countries and he wanted to display the information in a way that would help his book sell. Here it is, the first bar chart. It shows imports and exports to and from Scotland by country.

This was such a new concept that Playfair had to provide instructions on how to read it.

Playfair's landmark book was "The Statistical Breviary, Shewing on a Principle Entirely New, The Resources of Every State and Kingdom in Europe, Illustrated with Stained Copper-Plate Charts Representing the Physical Powers of Each Distinct Nation with Ease and Perspicuity", which was a statistical economic review of Europe. This book had what may be the first pie chart.

This chart shows how much of the Turkish Empire was geographically European and how much African. Playfair repeated the same type of visualization in 1805's "Statistical Account of the United States of America", but this time in color:

He was an early pioneer of line charts too, as this famous economic chart of England's balance of payments deficits and surpluses shows (again, from 1786's "Commercial and Political Atlas").

Playfair on TV

To my knowledge, there's never been a TV depiction of Playfair, which seems a shame. His life has most of the ingredients for a costume drama mini-series. There would be British Lords and Ladies in period costumes, French aristocrats in all their finery, political intrigue and terror, the guillotine, espionage, fraud on an epic scale (even allowing George Washington to make an appearance), counterfeiting, steam engines and rolling mills (as things to be sold and as things to make counterfeit money), prison, and of course, writing. It could be a kind of Bridgerton for nerds.

Reading more

Surprisingly, William Playfair is a bit niche and there's not that much about him and his works.

The best source of information is "PLAYFAIR: The True Story of the British Secret Agent Who Changed How We See the World" by Bruce Berkowitz. The book digs into Playfair's wild history and is the best source of information on the Scioto Company scandal and counterfeiting.

Here are some other sources you might find useful (note that most of them reference Bruce Berkowitz's book).

The 1950s version of the future

I watched the Coronation last weekend and it reminded me of some childhood experiences I had in Britain. I remember finding and reading old books from around the time of the previous Coronation that talked about the future. I read these books decades after they were published and it was obvious their predictions were way off course, but why they were so wrong is informative. Of course, the books I read were from a British perspective. This was similar in some ways to the American vision of the time, but more black and white.

Hovercraft everywhere

Hovercraft are a British invention and first saw use in the UK in the late 1950s. Of course, the books all had (black and white) photos of hovercraft racing across the waves. The prose was breathless and it was plain the authors felt the future was both air-cushioned and British-led. Uniformly, the authors predicted widespread and worldwide adoption.

(The National Archives UK, No restrictions, via Wikimedia Commons)

By the time I read these books, the problems of hovercraft were becoming apparent; hovercraft as a commercial means of travel were in full retreat. Some of the limitations of hovercraft were well-known in the late 1950s, but none of the books mentioned them. Even as a child, I felt disappointed in the writers' naive optimism; they should have been able to see the flaws in the technology.

Transport

In the future, if we weren't traveling in hovercraft, then we were traveling in cars; no one ever seemed to use public transport of any kind. There didn't seem to be a British version of a future car; it all seemed to have been imported from America, including the images. Maybe the writers were already justly cynical about the future of the British car industry.

The Conquest of Space

All the books forecasted a space-faring future and all of them had people living on the moon and in space by the year 2000. In this case, the vision wasn't imported; there was a definite British spin on the space future and the images were home-grown too. There was a belief that Britain would have its own space program, including lunar settlements, space stations, and solar system exploration. All the text and images assumed that these would be British missions; other countries might have programs too, but Britain would have its own, independent, space activities.

The home

Like American versions of the future, British versions of the future were deeply conservative. The family would be a husband and wife and two children, with very traditional gender roles. The husband would travel to work in his futuristic car, the wife would prepare meals in a futuristic kitchen, and the children would play board games in front of a giant TV screen. The kitchen was full of futuristic gadgets to prepare meals in minutes, but the interfaces for these gadgets were always knobs, dials, and switches, and the gadgets were always metal with an enamel finish, no plastics in the future. The TV doubled as a videophone and there were illustrations of the family talking to their relatives in one of the white British ex-colonies.

The future clothes were largely 1950s clothes; the "clothing is silver in the future" idea was a cliche even then.

Society and politics

Oddly, there were very few predictions about society changing and the writers all missed what should have been obvious trends.

Immigration into the UK had happened for centuries with group after group arriving and settling. If the waves of immigration were large enough, they had an impact on British culture, including food. All of the writers seemed to assume that there would be no mass immigration and no changes in diet as a result (in the future, everyone ate 1950s food). Although it would be hard to guess what the immigrant groups would be, it should have been obvious that there would be immigration and that it would change Britain. This is an unforgivable miss and shows the futurists were naive.

None of the writers really dealt with the ongoing consequences of the end of Empire. After independence, many countries continued to do business with Britain, but as colonial ties weakened, they started to do business elsewhere, and as a result, British exports dropped and so did employment. The writers had a sunny optimism that things would continue as before. None of them predicted that the ex-colonies could rise and challenge Britain in any way. The same assumption of British superiority ran through the writing.

Of course, the main assumption behind all of the writing, including the fiction, was that the future was heterosexual, middle-class, and white. The class system was very much intact and very much in its 1950s form. People knew their place and the social hierarchy was solid. Even as a child, I thought this was a suffocating view of the future.

Fiction: Arthur C. Clarke and Dan Dare

I have mixed feelings about the British science fiction of the time. It was naively optimistic, but that was part of its charm and appeal.

The leading science fiction comic strip was "Dan Dare: Pilot of the Future", about a future British space pilot who had adventures battling nefarious aliens across the galaxy. Dan was always the good guy and the aliens were always bad, once again, very black and white. Dan's role was to preserve order for humanity (or Britain). It was Britain as a galactic policeman and force for good. Even today, Dan Dare has a following.

Arthur C. Clarke produced a lot of fiction that was very rooted in British culture of the time and once again was very conservative about society. However, he was sunnily optimistic that somehow things would work out for the best. Ingenuity and bravery would always save the day.

Optimism and conservatism

The two threads running through the different books I read were optimism and conservatism. The optimism was naive but exciting; the authors all believed the future would be a much better place. The conservatism was constraining though and meant they missed big changes they should have seen.

Perhaps optimism and conservatism were a reflection of the times. Britain was still a global power with interests around the world, and it had just emerged victorious from World War II, though at a heavy price. The writers were living in a country that was in a strong position relative to others, even other European nations. The rise of Japan and South Korea was still in the future and China was just emerging from its civil war. Maybe British people wanted to believe in a utopian British future and were willing to buy and keep optimistic books that told them comforting things.

What it says

Everyone is wrong about the future, but how they're wrong tells us something about the attitudes and beliefs of the time. These British books of the 1950s forecasted a technologically advanced world with Britain at its core; the world they painted was one where 1950s British values and cultural norms would remain globally dominant. It almost feels as if the writers deliberately built a future in which those values could triumph.

And what of today?

There's a new King and there will be new books forecasting the future. There's plenty written now about how technology may advance and more writing on how society may change. The difference from the 1950s is the lack of consensus on what society's future may be. I see two opposite trends in fiction and in futurology: pessimism and optimism.

The fiction of choice for pessimism is dystopian. The world as we know it comes to an end through war or zombies or a virus, leaving people fighting for survival. The dominant themes are self-reliance and distrust of strangers; people who are different from you are the enemy.

The fiction of choice for optimism is Star Trek or Dr. Who. The future is fundamentally a decent place with the occasional existential crisis. People work together and strangers are mostly good people who could be your friends.

Perhaps this split says a lot about today's society. We create futures where the values we believe in can thrive.

Monday, May 1, 2023

Coworking spaces: the challenge for company loyalty

Coworking spaces are something new

Over the last year, I've spent a couple of weeks working in coworking spaces in London and New York. After spending the last week at a coworking space, I've come away thinking that these spaces represent a profound change for workers and a challenge to how companies relate to their remote staff.

Of course, the rental office space market isn't new; it goes back decades in different countries around the world. What is new is the price, flexibility, and type of workspace. The lowest price tier offers you space in an open-plan office with good wifi, coffee, and maybe other facilities thrown in. The low price, high density, and open-plan nature of the office are what's driving the change.

(A coworking space.)

My experiences

Three things stood out for me in my coworking experience: diversity, energy, and business focus.

I was surprised at the diversity of people I met; they were a much more diverse crowd than at any company I've been a part of. By diversity, I mean many things. Obviously, there was racial and national origin diversity; the people I met were from many different countries with a range of racial backgrounds. But there was also diversity of job roles: I met artists, digital marketers, sales reps, planners, and more. The stereotype is that coworking spaces are full of coders, but that hasn't been my experience. The types of business were wildly different too, everything from infrastructure to car leasing, to contract marketing, to diversity hiring. I heard some really engaging stories that made me think, more so than in my day-to-day life outside of coworking spaces.

The energy was high at all times. Everyone seemed to have a sense of purpose and focus on what they were doing plus the drive to work at it. That's probably a selection bias as these spaces tend to be full of young companies and people working for themselves, but even so, it was good to experience.

Despite the wide range of businesses, everyone was focused on their customers and what they needed to do to sell to them. Everyone was keenly aware of the need to make money and the mantra "everyone is in sales" seemed very true for them.

Notably, not all career stages and ages were equally represented. I saw very few people at the start of their careers; the youngest tended to be a few years out of college and on their second or third job. At the other end, I saw very few people who looked to be in their 50s and no one who looked close to retirement. On the whole, people tended to be in their late 20s or early 30s.

Where things get interesting is the events and services these coworking spaces provide. Many spaces offer a barista and some serve beer and wine after 5pm. I've seen wine tastings and other social events. Some places have one-off business services like professional headshots. These are exactly the types of services and events companies offer to their on-site staff, and this is where the challenge comes.

The coworking challenge

All companies try to promote loyalty, which requires staff proximity and communication. Loyalty helps with productivity, goal alignment, and stability; a loyal workforce will stay with a company during tough times. Social programs, 1:1 meetings, and group meetings all help with proximity, and newsletters and Slack, etc. help with communications, but these things are much harder with a remote workforce.

Look at what happens in a coworking space. You get proximity because others share the space with you and the coworking space runs social events to encourage mixing (and loyalty). You get communications too; many coworking spaces send out email newsletters, and so on.

Now imagine you're a remote employee working out of a coworking space. Imagine it's 3pm on a Thursday and your company is running a social event over Zoom. At the same time, your coworking space is offering an in-person social event, with beer and wine, for all the people you meet every day in the office. Which event would you go to?

What about lunches? Some companies offer to pay for remote employees' lunches on special occasions, but the employee has to order their lunch and submit an expense claim (effort). By contrast, if a coworking space offers a free lunch, all the employee has to do is turn up and eat. Which would you prefer?

As a remote employee, would you be more loyal to your coworking space or your employer?

What this means

There is a form of loyalty competition between the company a worker works for and the coworking space the worker uses. The coworking space has the upper hand in the way the loyalty game is mostly played today. But there are other ways to generate loyalty, for example, promotions and pay rises, training and staff development, conferences, and so on; things which add lasting value to an employee.

Companies need to realize that the remote experience is different, especially if someone is in a coworking space. If companies want loyal staff, they have to offer something meaningful because coworking spaces are using loyalty levers too and they have the decisive physical advantage.

Sunday, April 16, 2023

You can be too clever with visualizations

Front page of the paper

I saw this visualization on Friday, April 14th 2023, on the front page of the New York Times. It was an illustration for an article about increasing the use of electricity to tackle climate change in the United States. Unfortunately, the visualization is at best confusing. You can read the full article here: https://www.nytimes.com/interactive/2023/04/14/climate/electric-car-heater-everything.html

(New York Times 14th April, 2023)

The message

The message the article was trying to convey was a simple one: it was a modeling exercise for a more electrified US with lower energy consumption. The animations in the article made clear where electricity use would need to grow and roughly by how much.

Why the presentation is bad

The visualization is a sort of pie chart. In most cases, pie charts are very bad data visualizations, and this article compounds the problem by using a non-standard form.

Just looking at the charts, can you tell me what the percentages are for Transportation, Industrial, etc. now and in the "electrified future"? Can you tell me what's growing?

The article makes plain the modeling work is for reduced energy consumption. Looking at the two charts, can you tell me what the reduction is and over what timescale it occurs?

I could go on, but it's easy to see for yourself what's wrong. Look at the charts and tell me what you take away from them.

The article contains animations that make the message clearer, but even so, it took me a lot of work to figure out what was going on. This takes us to the major visualization sin here: the level of effort to understand what's going on is too high.

What's the takeaway?

You can get too clever with visualizations. Just because you can, doesn't mean you should. Keep things simple and easy to understand.