Saturday, March 26, 2022

Plagiarism and blog posts

Imitation is not the sincerest form of flattery

Prior to the pandemic, I wrote a thought piece on data science. It compared the work of data science to building Lego models and called back to some of my childhood memories of building Lego models with my brothers. I deliberately wrote it to have a slightly dreamy and nostalgic quality. I was very pleased with the finished piece and I referenced it from my LinkedIn profile. You can read it here: https://www.truefit.com/blog/Data-is-the-New-Lego.

The other day, I was thinking about this piece and did a Google search on it. I found someone had plagiarized it. They'd taken the whole article and replaced a few sentences with their 'own' work. They'd even used the same type of images I did. It was pretty much a word-for-word copy (to be clear: it's blindingly obvious this is a direct copy of my work). Of course, they didn't acknowledge my piece at all. What was truly galling was a comment someone had made calling the piece insightful. The plagiarist replied that they were glad the commenter liked it.

(Hariadhi, CC BY-SA 3.0, via Wikimedia Commons)

The plagiarist has several other pieces on Medium. I have no idea if they copied the other pieces too. They're studying data science and on their profile, they say they want to tell stories with data. Perhaps the biggest story they're telling is that they cheat and take credit for other people's work.

The borders of originality

In this case, the copying was a blatant lift of my work, but other cases are more difficult. There's a nuanced question of what's plagiarism and what's not. For example, many people have written stories about time machines since H.G. Wells; are they all guilty of plagiarism?

For me, the line is the story arc and ideas. If you're telling the same story as someone else and using the same ideas, you're on very thin ice. If you're using the same metaphors, similes, or allegories, then you've crossed the line. If you must tell the same story as someone else (and you really shouldn't), at least use your own imagery.

What have I done?

On the person's Medium post, I have called out their plagiarism and I've reported the piece as violating Medium's terms and conditions. It was posted in the "Towards Data Science" publication so I complained to them too. The Towards Data Science team removed the author from their publication and reported the plagiarism to Medium. I reported the author for plagiarism to Medium again.

It also set me thinking about the interview process. I've looked at people's GitHub pages and their portfolios. Up to now, it didn't occur to me that people might blatantly cheat. After this experience, I'm going to step up my checks.

Wednesday, March 9, 2022

What brown M&Ms can tell you about a company

Small things reveal deeper truths

I was reading an old story on the internet and it struck me that there's something I could learn from it about diagnosing company culture. I'll tell you the story and show you how small things can be very revealing.

The Van Halen story

Here's a quote from David Lee Roth’s autobiography, Crazy from the Heat, that tells the story. 

"Van Halen was the first band to take huge productions into tertiary, third-level markets. We’d pull up with nine eighteen-wheeler trucks, full of gear, where the standard was three trucks, max. And there were many, many technical errors — whether it was the girders couldn’t support the weight, or the flooring would sink in, or the doors weren’t big enough to move the gear through. The contract rider read like a version of the Chinese Yellow Pages because there was so much equipment, and so many human beings to make it function. So just as a little test, in the technical aspect of the rider, it would say “Article 148: There will be fifteen amperage voltage sockets at twenty-foot spaces, evenly, providing nineteen amperes . . .” This kind of thing. And article number 126, in the middle of nowhere, was: “There will be no brown M&M’s in the backstage area, upon pain of forfeiture of the show, with full compensation.”

So, when I would walk backstage, if I saw a brown M&M in that bowl . . . well, line-check the entire production. Guaranteed you’re going to arrive at a technical error. They didn’t read the contract. Guaranteed you’d run into a problem. Sometimes it would threaten to just destroy the whole show. Something like, literally, life-threatening."

In other words, the no brown M&Ms clause was a simple compliance check that the venue had read the contract and taken it seriously. It was an easy test of much deeper problems.

(This would fail the test - there are brown M&Ms! Evan-Amos, Public domain, via Wikimedia Commons)

Tells

The brown M&Ms story shows that something simple can be used to uncover a fundamental and harder-to-check problem. The same idea appears in poker - the idea that players have "tells" that reveal something about their hands. It occurred to me that over the years, I'd seen something similar in business: companies making sweeping statements about culture while small actions gave the game away. Unlike the Van Halen story, the tells are usually unintentional, but nonetheless, they're there. Here are some examples.

Our onboarding is the best, but we won't pay you

Years ago, I worked for a company that made a big deal of how great its onboarding was; the CEO and other executives claimed it was "industry-leading" and praised the process. 

When I was onboarded, the company messed up its payroll and didn't pay me for a while, well past the legal deadline. I asked when it would be resolved and was told I should "manage my finances better". I later learned this was a common experience: many new employees weren't paid on time, and "manage your finances better" was the stock response. In one extreme case, I know someone who wasn't paid for over two months.

As it turned out, this was a brown M&Ms case. It indicated profound issues at the company and in particular with the executive team; they were too remote from what was going on and they really weren't interested in hearing anything except praise. It took me and others a long time to discover these issues. The brown M&Ms should have warned us very early that something was quite broken. 

I'm too important to talk to you

At another company, a new C-level executive joined the organization and there was a long announcement about how great they were and how they exhibited the company values, one of which was being people-centric. I reported to the new person's organization. 

One day, early on in their tenure, the new C-level person visited the office I was working at. They walked straight by me and my team without stopping to say hello. During the week they were with us, they didn't meet or talk with any of us. They even managed to avoid being in the break room at the same time as the little people (and people tried very hard to meet the new executive). On that visit, the new C-level person didn't meet or say hello to anyone below vice-president level. Later on, they gave a talk to their organization that included a discussion of the necessity of connecting with people and how it was important to them.

I didn't see many of their other actions, but this was very definitely a brown M&M moment for me. I saw trouble ahead and left the company not long after, and I wasn't the only one.

Candies: going, going, gone

My last example is actually about candy. 

I worked for a company that provided candy and snacks. It was very proud that what it provided was top quality, and I agreed; it really did provide great treats. The company presented top-quality candy and snacks as a way of showing how much it valued its employees; we were told that we got the best because we were valued. 

You can probably guess what happened next. The snack and candy brands went from well-known brands to own-label brands, while the company insisted that nothing had changed. After a few months of own-label brands, the candy and snacks stopped altogether, and the company never said a word. A number of other things happened too, including worse terms and conditions for new employees (less leave etc.), more restrictions on travel, and fewer corporate lunches, but these were harder to see. The company started valuing employees less and the treats and candies were only the most visible of several actions that took place at the same time; they were the canary in the coal mine.

What can you do?

Small issues can give you a clue that things are deeply broken in hard-to-detect ways. You should be on the lookout for brown M&M moments that give you advance warning of problems.

As an employee, these moments provide insight into what the company really is. If the M&M moment is serious enough, it's time to think about employment elsewhere, even if you've just started.

As an executive, you need to be aware that you're treated differently from other people. You might not experience the brown M&M moment yourself, but people in your organization might. Listen to people carefully and hear these moments; use them to diagnose deeper issues in your organization and fix the root cause. Be aware that this is one of the few moments in your life you might get to be like David Lee Roth.

Saturday, February 26, 2022

W.E.B. Du Bois - data scientist

Changing the world through charts

Two of the key skills of a data scientist are informing and persuading others through data. I'm going to show you how one man, and his team, used novel visualizations to illustrate the lives of African-Americans in the United States at the start of the 20th century. Even though they created their visualizations by hand, these visualizations still have something to teach us over 100 years later. The team's lack of computers freed them to try different forms of data visualization; sometimes their experimentation was successful, sometimes less so, but all of their charts have something to say, and there's a lesson here on communication for today's data scientists.

I'm going to talk about W.E.B. Du Bois and the astounding charts his team created for the 1900 Paris exhibition.

(W.E.B. Du Bois in 1904 and one of his 1900 data visualizations.)

Who was W.E.B. Du Bois?

My summary isn't going to do his amazing life justice, so I urge you to read any of the short biographies of him available online.

To set the scene, here's just a very brief list of some of the things he did. Frankly, summarizing his life in a few lines is ridiculous.

  • Born 1868, Great Barrington, Massachusetts
  • Graduate of Fisk University and Harvard - the first African-American to gain a Ph.D. from Harvard
  • Conducted ground-breaking sociological work in Philadelphia, Virginia, Alabama, and Georgia
  • His son died in 1899 because no white doctor would treat him and black doctors were unavailable
  • Was the primary organizer of "The Exhibit of American Negroes" at the Exposition Universelle held in Paris between April and November 1900
  • NAACP director and editor of the NAACP magazine The Crisis
  • Debated Lothrop Stoddard, a "scientific racist", in 1929 and thoroughly bested him.
  • Opposed US involvement in World War I and II.
  • Life-long peace activist and campaigner, which led to the FBI investigating him in the 1950s as a suspected communist. They withheld his passport for 8 years.
  • Died in Ghana in 1963.

Visualizing Black America at the start of the twentieth century

In 1897, Du Bois was a history professor at Atlanta University. His former classmate and friend, Thomas Junius Calloway, asked him to produce a study of African-Americans for the 1900 Paris world fair, the "Exposition Universelle". With the help of a large team of Atlanta University students and alumni, Du Bois gathered statistics on African-American life over the years and produced a series of infographics to bring the data to life. Most of the names of the people who worked on the project are unknown, and it's a mystery who originated the form of the plots, but the driving force behind the project was undoubtedly Du Bois. Here are some of my favorite infographics from the Paris exhibition.

The chart below shows where African-Americans lived in Georgia in 1890. There are four categories: 

  • Red - country and villages
  • Yellow - cities 2,500-5,000
  • Blue - cities 5,000-10,000
  • Green - cities over 10,000

The length of each line is proportional to the population, and the chart obviously shows that the huge majority of the population lived in the country and villages. I find the chart striking for three reasons: it doesn't follow any modern charting conventions, it clearly represents the data, and it's visually arresting. My criticism is that the design makes it hard to visually quantify the differences; for example, how many more people lived in the country and villages than in cities of 5,000-10,000? If I were drawing a chart with the same data today, I might use an area chart; it would quantify things better, but it would be far less visually interesting.


The next infographic is two choropleth charts that show the African-American population of Georgia counties in 1870 and 1880. Remember that the US civil war ended in 1865, and with the Union victory came freedom for the slaves. As you might expect, there was a significant movement of the now-free people. Looking at the charts in detail raises several questions, for example, why did some areas see a growth in the African-American population while other areas did not? Why did the highest populated areas remain the highest populated? The role of any good visualization is to prompt meaningful questions.

This infographic shows the income and expenditure of 150 African-American families in Georgia. The income bands are on the left-hand side, and the bar chart breaks down the families' expenses by category:

  • Black - rent
  • Purple - food
  • Pink - clothes
  • Dark blue - direct taxes
  • Light blue - other expenses and savings

There are several notable observations from this chart: the disappearance of rent above a certain income level, the rise in other expenses and savings with rising income, and the declining fraction spent on clothing. There's a lot on this chart and it's worthy of greater study; Du Bois' team crammed a great deal of meaning into a single page. For me, the way the key is configured at the top of the chart doesn't quite work, but I'm willing to give the team a pass on this because it was created in the 19th century. A chart like this wouldn't look out of place in a 2022 report - which of itself is startling.

My final example is a comparison of the occupations of African-Americans and the white population in Georgia. It's a sort-of pie chart, with the upper quadrant showing African Americans and the bottom quadrant showing the white population. Occupations are color-coded:

  • Red - agriculture, fishery, and mining
  • Yellow - domestic and personal service
  • Blue - manufacturing and mechanical industries
  • Grey - trade and transportation
  • Brown - professions

The fraction of the population in these employment categories is written on the slices, though it's hard to read because the contrast isn't great. Notably, the order of the occupations is reversed from the top to the bottom quadrant, which has the effect of making the sizes of the slices easier to compare - this can't be an accident. I'm not a fan of pie charts, but I do like this presentation.

Influences on later European movements - or not?

Two things struck me about Du Bois' charts: how modern they looked and how similar they were to later art movements like the Italian Futurists and Bauhaus. 

At first glance, his charts look to me like they'd been made in the 1960s. The hand lettering and coloring are obviously pre-computerization, but everything else about them suggests modernity, from the typography to the choice of colors to the layout. The experimentation with form is striking and is another reason why this work looks very 1960s to me; perhaps the use of computers to visualize data has constrained us too much. Remember, Du Bois's mission was to explain and convince, and he chose his charts and their layout to do so, hence the experimentation with form. It's quite astonishing how far ahead of his time he was.

Italian Futurism started in 1909 and largely fizzled out at the end of the second world war due to its close association with fascism. The movement emphasized the abstract representation of dynamism and technology, among other things. Many Futurist paintings used a restricted color palette and have obvious similarities with Du Bois' charts; here are just a few examples (below). I couldn't find any reliable articles examining the links between Du Bois' work and Futurism.

Numbers In Love - Giacomo Balla
Image from WikiArt
Music - Luigi Russolo
Image from WikiArt

The Bauhaus design school (1919-1933) sought to bring modernity and artistry into mass production and had a profound and lasting effect on the design of everyday things, even into the present day. Bauhaus designs tend to be minimal ("less is more") and focus on functionality ("form follows function") but can look a little stark. I searched, but I couldn't find any scholarly study of the links between Du Bois and the Bauhaus; however, the fact that the Paris exposition charts and the Bauhaus work use a common visual language is striking. Here's just one example, a poster for the Bauhaus school from 1923.

(Joost Schmidt, Public domain, via Wikimedia Commons)

Du Bois' place in data visualization

I've read a number of books on data visualization. Most of them include Nightingale's coxcomb plots and Playfair's bar and pie charts, but none of them included Du Bois' charts. Du Bois didn't originate any new chart types, which is maybe why the books ignore him, but his charts are worth studying for their experimentation with form, their use of color, and most important of all, their ability to communicate meaning clearly. Ultimately, of course, that's the only purpose of data visualization.

Reading more

W. E. B. Du Bois's Data Portraits: Visualizing Black America, Whitney Battle-Baptiste, Britt Rusert. This is the book that brought these superb visualizations to a broader audience. It includes a number of full-color plates showing the infographics in their full glory.

The Library of Congress has many more infographics from the Paris exhibition, along with photographs. Take a look for yourself here: https://www.loc.gov/collections/african-american-photographs-1900-paris-exposition/?c=150&sp=1&st=list - but note the charts are towards the end of the list. I took all the charts in this article from the Library of Congress site.

"W.E.B. Du Bois’ Visionary Infographics Come Together for the First Time in Full Color" article in the Smithsonian magazine that reviews the Battle-Baptiste book (above).

"W. E. B. Du Bois' Hand-Drawn Infographics of African-American Life (1900)" article in Public Domain Review that reviews the Battle-Baptiste book (above).

Friday, February 18, 2022

RCT bingo!

A vocabulary of causal inference testing

I was having a clear-out and I came across a printout of some notes I made a while back. It was a list of terms used in causal inference testing. At the time, I used it as a checklist or dictionary to ensure I knew what I was talking about - a kind of RCT bingo if you like.

(Myriam Thomas, CC BY-SA 4.0, via Wikimedia Commons)

I thought I would post it here in case anyone wants to play the same game. Do you know what all these terms mean? Are there key terms I've missed off my list?

  • ATE - Average Treatment Effect
  • CATE - Conditional Average Treatment Effect
  • Counterfactual
  • DAG - Directed Acyclic Graph
  • Dynamic Treatment Effect
  • Epsilon greedy
  • Estimands
  • External and internal validity
  • Heterogeneity (treatment effect heterogeneity) 
  • Homophily
  • Instrumental Variable (IV)
  • LATE - Local Average Treatment Effect
  • Logit model
  • RCT - Randomized Control Trial
  • Regret
  • Salience
  • Spillover
  • Stationary effect (and its opposite, non-stationary effect)
  • Surrogate
  • SUTVA - Stable Unit Treatment Value Assumption
  • Thompson sampling
  • Wald estimator

Monday, January 17, 2022

Cultural add or fit?

What does cultural fit mean?

At a superficial level, company culture can be presented as free food and drink, table tennis and foosball, and of course company parties. More realistically, it means how people interact with each other, what behavior is encouraged, and crucially what behavior isn't tolerated.  At the most fundamental level, it means who gets hired, fired, or promoted. 

Cultural fit means how well someone can function within a company or team. At best, it means their personality and the company's way of operating are aligned so the person thrives within the company, performs well, and stays a long time. In this case, everyone's a winner.

For a long time, managers have hired for cultural fit because of the benefits of getting it right.

The unintended consequences

Although cultural fit seems like a good thing to hire for, it has several downsides. 

Hiring for cultural fit over the long term means that you can get groupthink. In some situations that's fine, for example, mature or slowly moving industries benefit from a consistent approach over time. But during periods of rapid change, it can be bad because the team doesn't have the diversity of thought to effectively respond to threats; the old ways don't work anymore but the team still fights yesterday's battles.

For poorly performing teams, hiring for cultural fit can mean more of the same, which can be disastrous on two levels: it cements the existing problems and blocks new ways of working.

(Monks in a monastery are a great example of cultural fit. But not everyone wants to join a monastery. Abraham Sobkowski OFM, CC BY-SA 3.0, via Wikimedia Commons)

Cultural add

In contrast to cultural fit that focuses on conformity, cultural add focuses on what new and different things an employee can bring to the team. 

Cultural add is not (entirely) about racial diversity; in fact, I would argue it's a serious error to view cultural add solely in racial terms. I've worked with teams composed of individuals from different races, but they all went to the same universities and all had the same middle-class backgrounds. The team looked and sounded diverse but their thinking was strikingly uniform.

Here are some areas of cultural add you might think about:

  • Someone who followed a non-traditional path to get to where they got. This can mean:
    • Military experience
    • Non-university education
    • They transitioned from one discipline to another (e.g. someone who initially majored in law now working in machine learning).
  • Single parents. Many young tech teams are full of young single people. A single parent has a radically different set of experiences. They may well bring a much-needed focus on work-life balance.
  • Older candidates. Their experience in different markets and different companies may be just what you need.
  • Working-class backgrounds. Most people in tech come from middle-class backgrounds (regardless of country of origin). Someone whose parents were very blue-collar may well offer quite a different perspective.

I'm not saying anything new when I say a good hiring process considers the strengths and weaknesses of a team before the hiring process starts. For example, if a team is weak on communication with others, a desirable feature of a new hire is good communication skills. Cultural add takes this one stage further and actively looks for candidates who bring something new to the table, even when that new thing isn't well-defined.

Square pegs in round holes

The cultural add risk is the same as any hiring risk: you get someone who can't work with the team or who can't perform. Even with cultural add, you still need to recruit someone the team can work with. Cultural add can't be the number one hiring criterion, but it should be a key one.

What all this means in practice

We can boil this down to some don'ts and dos.

Don't:

  • Hire people who went to the same small group of universities.
  • Assume racial diversity = cultural add.
  • Add people who are exactly the same as the current team.
  • Rely on employee referrals (people tend to know people who are like them).

Do:
  • Look for people with non-traditional backgrounds.
  • Be aware of the hiring channels you use and try and reach out beyond the usual channels. 
  • Look for what new thing or characteristic the candidate brings. This means thinking about the interview questions you ask to find the new thing.
  • Think about your hiring process and how the process itself filters candidates. If you have a ten-stage process, or a long take-home test, or you do multiple group interviews, this can cause candidates to drop out - maybe even the candidates you most want to attract.

Cultural add goes beyond the hiring process, you have to think about how a person is welcomed. I've seen teams unintentionally (and intentionally) freeze people out because they were a bit different. If you really want to make cultural add work, management has to commit to making it work post-hire. 

An old joke

Two men become monks and join a monastery. One of the men is admitted because he's a cultural fit, the other because he's a cultural add. 

After dinner one evening, the monks are silent for a while, then one monk says "23" and the other monks smile. After a few minutes, another monk very loudly says "82", and the monks laugh. This goes on for a while to the confusion of the two newcomers. The abbot whispers to them: "We've been together so long, we know each other's jokes, so we've numbered them to save time". The new monks decide to join in.

The cultural fit monk says "82" and there's polite laughter - they've just heard the same joke again. The cultural add monk thinks for a moment and says "189". There's a pause as the monks look at one another in wonder, then they burst into side-splitting laughter. Some of the monks are crying with laughter and one looks like he might need oxygen. The laughter goes on for a good ten minutes. The abbot turns to the cultural add monk and says: "They've never heard that one before!".

If you want more of the same, go for cultural fit, if you want something new, go for cultural add.

Friday, January 7, 2022

Prediction, distinction, and interpretation: the three parts of data science

What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, others lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role.

I would argue data science generally is about three distinct areas:

  • Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
  • Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results. 
  • Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests; it's also about evaluating models.

Here's what you need to know:

  • Confidence intervals.
  • Sample size calculations. This is crucial and often overlooked, even by experienced data scientists. If your data set is too small, you're going to get junk results, so you need to know what too small is. In the real world, increasing the sample size is often not an option, and you need to know why.
  • Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
  • α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
  • The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
  • Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when. 
  • Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.
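
As an illustration of the sample size point, here's a minimal sketch of a pre-test sample size calculation for a two-proportion A/B test. The baseline (10%) and target (12%) conversion rates are made-up numbers for illustration, not from any real test.

```python
# A minimal sketch of a sample size calculation for a two-proportion A/B test.
# The conversion rates used below are invented for illustration.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate n per arm to detect a change from p1 to p2
    with a two-sided z-test at significance alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a 10% -> 12% lift needs roughly 4,000 samples per arm;
# anything much smaller than that will give junk results.
print(sample_size_per_arm(0.10, 0.12))
```

Note how quickly the required sample size falls as the effect size grows; this is why knowing your expected effect before the test is so important.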

Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

  • The basics of machine learning models, including:
    • Generalized linear modeling
    • Random forests (including knowing why they are often frowned upon)
    • k-nearest neighbors/k-means clustering
    • Support Vector Machines
    • Gradient boosting.
  • Cross-validation, regularization, and their limitations.
  • Variable importance and principal component analysis.
  • Loss functions, including RMSE.
  • The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.
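
To make the last bullet concrete, here's a toy sketch of the metrics derived from a confusion matrix; the counts are invented for illustration.

```python
# A toy sketch of the metrics behind a confusion matrix.
# The counts used below are invented for illustration.
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity), and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# On an imbalanced data set, high accuracy can hide poor recall:
# 95% accuracy here, but the model misses half the positive cases.
m = classification_metrics(tp=5, fp=0, fn=5, tn=90)
print(m)
```

This is exactly why accuracy alone is never enough to evaluate a classifier.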

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

Interpretation - what's going on?

The main technique here is often data visualization. Statistical summaries are great, but they can mislead. Charts give a fuller picture.

Here are some techniques all data scientists should know:

  • Heatmaps
  • Violin plots
  • Scatter plots
  • Bar charts
  • Regression and curve fitting.

They should also know why pie charts in all their forms are bad. 

A good knowledge of how charts work is very helpful too (the psychology of visualization).

What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology), which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Worse still, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.

Friday, December 31, 2021

COVID and the base rate fallacy


Should we be concerned that vaccinated people are getting COVID?

I’ve spoken to people who’re worried that the COVID vaccines aren’t effective because some vaccinated people catch COVID and are hospitalized. Let’s look at the claim and see if it stands up to analysis.


Marc Rummy’s diagram

Marc Rummy created this diagram to explain what’s going on with COVID hospitalizations. He’s made it free to share, which is fantastic.

In this diagram, the majority of the population is vaccinated (91%). The hospitalization rate for the unvaccinated is 50% but for the vaccinated, it’s 10%. If the total population is 110, this leads to 5 unvaccinated people hospitalized and 10 vaccinated people hospitalized - in other words, 2/3 of those in hospital with COVID have been vaccinated. 

Explaining the result

Let’s imagine we just looked at hospitalizations: 5 unvaccinated and 10 vaccinated. This makes it look like vaccinations aren’t working – after all, the majority of people in hospital are vaccinated. You can almost hear ignorant journalists writing their headlines now (“Questions were raised about vaccine effectiveness when the health minister revealed the majority of patients hospitalized had been vaccinated.”). But you can also see anti-vaxxers seizing on these numbers to try and make a point about not getting vaccinated.

The numbers look the way they do because the great majority of people are vaccinated.

Let’s look at three different scenarios with the same population of 110 people and the same hospitalization rates for vaccinated and unvaccinated:

  • 0% vaccinated – 55 people hospitalized
  • 91% vaccinated – 15 people hospitalized
  • 100% vaccinated – 11 people hospitalized

Clearly, vaccinations reduce the number of hospitalizations. The anti-vaccine argument seems to be that if a vaccine doesn't reduce the risk to zero, it doesn't work - a strikingly weak and ignorant argument.
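
The arithmetic behind these three scenarios is easy to check. A short sketch using the population and hospitalization rates from the example above (the function name is my own):

```python
# Expected hospitalizations for a population at a given vaccination rate,
# using the example's rates: 50% for unvaccinated, 10% for vaccinated.
def hospitalized(population, vaccinated_fraction, unvax_rate=0.5, vax_rate=0.1):
    vaccinated = population * vaccinated_fraction
    unvaccinated = population - vaccinated
    return unvaccinated * unvax_rate + vaccinated * vax_rate

# The three scenarios: 0%, 91% (100 of 110), and 100% vaccinated.
for frac in (0.0, 100 / 110, 1.0):
    print(f"{frac:.0%} vaccinated -> {hospitalized(110, frac):.0f} hospitalized")
```

Same rates throughout, yet the hospitalization count falls from 55 to 11 as vaccination coverage rises - which is the whole point the raw "2/3 of hospitalized patients are vaccinated" figure obscures.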

In this example, vaccination doesn’t reduce the risk of hospitalization to zero, it reduces it by a factor of 5. In the real world, vaccination reduces the risk of infection by 5x and the risk of death due to COVID by 13x (https://www.nytimes.com/interactive/2021/us/covid-cases.html). The majority of people hospitalized now appear to be unvaccinated even though vaccination rates are only just above 60% in most countries (https://www.nytimes.com/interactive/2021/world/covid-cases.html, https://www.masslive.com/coronavirus/2021/09/breakthrough-covid-cases-in-massachusetts-up-to-about-40-while-unvaccinated-people-dominate-hospitalizations.html).

The bottom line is very simple: if you want to reduce your risk of hospitalization and protect your family and community, get vaccinated.

The base rate fallacy

The mistake the anti-vaxxers and some journalists are making is a very common one, it’s called the base rate fallacy (https://thedecisionlab.com/biases/base-rate-fallacy/). There are lots of definitions online, so I’ll just attempt a summary here: “the base rate fallacy is where someone draws an incorrect conclusion because they didn’t take into account the base rate in the general population. It’s especially a problem for conditional probability problems.”

Let’s use another example from a previous blog post:

“Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

  • If you have the disease, the test will give a positive result 99% of the time (sensitivity).
  • If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease?” 

The answer is 50%, not 99%.

The reason is the base rate: 99% of the town’s population does not have the disease. Of the 100 people with the disease, 99 test positive; of the 9,900 people without it, 1% - another 99 people - also test positive. So half of all positive results are false positives.
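
The 50% answer falls straight out of counting. A few lines are enough to reproduce it, using only the numbers from the example above:

```python
# Counting true and false positives in the disease-testing example.
population = 10_000
diseased = population // 100             # 1% prevalence -> 100 people
healthy = population - diseased          # 9,900 people without the disease

true_positives = diseased * 99 // 100    # 99% sensitivity -> 99 people
false_positives = healthy // 100         # 1% false positive rate -> 99 people

p_disease_given_positive = true_positives / (true_positives + false_positives)
print(p_disease_given_positive)          # 0.5
```

This is just Bayes' theorem done with whole people instead of probabilities, which is usually the easiest way to see through a base rate problem.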

What’s to be done?

Conditional probability (for example, the COVID hospitalization data) is screwy and can sometimes seem counter to common sense. The general level of statistical (and probability) knowledge in the population is poor. This leaves people trying to make sense of the data around them but without the tools to do it, so no wonder they’re confused.

It’s probably time that all schoolchildren are taught some basic statistics. This should include some counter-intuitive results (for example, the disease example above). Even if very few schoolchildren grow up to analyze data, it would be beneficial for society if more people understood that interpreting data can be hard and that sometimes surprising results occur – but that doesn’t make them suspicious or wrong.

More importantly, journalists need to do a much better job of telling the truth and explaining the data instead of chasing cheap clicks.