Showing posts with label data visualization. Show all posts
Showing posts with label data visualization. Show all posts

Saturday, February 26, 2022

W.E.B. Du Bois - data scientist

Changing the world through charts

Two of the key skills of a data scientist are informing and persuading others through data. I'm going to show you how one man, and his team, used novel visualizations to illustrate the lives of African-Americans in the United States at the start of the 20th century. Even though they created their visualizations by hand, these visualizations still have something to teach us over 100 years later. The team's lack of computers freed them to try different forms of data visualizations; sometimes their experimentation was successful, sometimes less so, but they all have something to say and there's a lesson here on communication for today's data scientists.

I'm going to talk about W.E.B. Du Bois and the astounding charts his team created for the 1900 Paris exhibition.

(W.E.B. Du Bois in 1904 and one of his 1900 data visualizations.)

Who was W.E.B. Du Bois?

My summary isn't going to do his amazing life justice so I urge you to read any of these short descriptions of who he was and what he did:
To set the scene here's just a very brief list of some of the things he did. Frankly, summarizing his life in a few lines is ridiculous.
  • Born 1868, Great Barrington, Massachusetts
  • Graduate of Fisk University and Harvard - the first African-American to gain a Ph.D. from Harvard
  • Conducted ground-breaking sociological work in Philadelphia, Virginia, Alabama, and Georgia
  • His son died in 1899 because no white doctor would treat him and black doctors were unavailable
  • Was the primary organizer of "The Exhibit of American Negroes" at the Exposition Universelle held in Paris between April and November 1900
  • NAACP director and editor of the NAACP magazine The Crisis
  • Debated Lothrop Stoddard, a "scientific racist" in 1929 and thoroughly bested him.
  • Opposed US involvement in World War I and II.
  • Life-long peace activist and campaigner, which led to the FBI investigating him in the 1950s as a suspected communist. They withheld his passport for 8 years.
  • Died in Ghana in 1963.

Visualizing Black America at the start of the twentieth century

In 1897, Du Bois was a history professor at Atlanta University. His former classmate and friend, Thomas Junius Calloway, asked him to produce a study of African-Americans for the 1900 Paris world fair, the "Exposition Universelle". With the help of a large team of Atlanta University students and alumni, Du Bois gathered statistics on African-American life over the years and produced a series of infographics to bring the data to life. Most of the names of the people who worked on the project are unknown, and it's a mystery who originated the form of the plots, but the driving force behind the project was undoubtedly Du Bois. Here are some of my favorite infographics from the Paris exhibition.

The chart below shows where African-Americans lived in Georgia in 1890. There are four categories: 

  • Red - country and villages
  • Yellow - cities 2,500-5,000
  • Blue - cities 5,000-10,000
  • Green - cities over 10,000
the length of the lines is proportional to the population and obviously, the chart shows the huge majority of the population lived in the country and villages. I find the chart striking for three reasons: it doesn't follow any of the modern charting conventions, it clearly represents the data, and it's visually very striking. My criticism is that the design makes it hard to visually quantify the differences, for example, how many more people live in the country and villages compared to cities 5,000-10,000? If I were drawing a chart with the same data today, I might use an area chart to represent the same data; it would quantify things better, but it would be far less visually interesting.

The next infographic is two choropleth charts that show the African-American population of Georgia counties in 1870 and 1880. Remember that the US civil war ended in 1865, and with the Union victory came freedom for the slaves. As you might expect, there was a significant movement of the now-free people. Looking at the charts in detail raises several questions, for example, why did some areas see a growth in the African-American population while other areas did not? Why did the highest populated areas remain the highest populated? The role of any good visualization is to prompt meaningful questions.

This infographic shows the income and expenditure of 150 African-American families in Georgia. The income bands are on the left-hand side, and the bar chart breaks down the families' expenses by category:

  • Black - rent
  • Purple - food
  • Pink - clothes
  • Dark blue - direct taxes
  • Light blue - other expenses and savings
There are several notable observations from this chart: the disappearance of rent above a certain income level, the rise in other expenses and savings with rising income, and the declining fraction spent on clothing. There's a lot on this chart and it's worthy of greater study; Du Bois' team crammed a great deal of meaning into a single page. For me, the way the key is configured at the top of the chart doesn't quite work, but I'm willing to give the team a pass on this because it was created in the 19th century. A chart like this wouldn't look out of place in a 2022 report - which of itself is startling.

My final example is a comparison of the occupations of African-Americans and the white population in Georgia. It's a sort-of pie chart, with the upper quadrant showing African Americans and the bottom quadrant showing the white population. Occupations are color-coded:

  • Red - agriculture, fishery, and mining
  • Yellow - domestic and personal service
  • Blue - manufacturing and mechanical industries
  • Grey - trade and transportation
  • Brown - professions
The fraction of the population in these employment categories is written on the slices, though it's hard to read because the contrast isn't great. Notably, the order of the occupations is reversed from the top to the bottom quadrant, which has the effect of making the sizes of the slices easier to compare - this can't be an accident. I'm not a fan of pie charts, but I do like this presentation.

Influences on later European movements - or not?

Two things struck me about Du Bois' charts: how modern they looked and how similar they were to later art movements like the Italian Futurists and Bauhaus. 

At first glance, his charts look to me like they'd been made in the 1960s. The typography and coloring were obviously pre-computerization, but everything else about them suggests modernity, from the typography to the choice of colors to the layout. The experimentation with form is striking and is another reason why this looks very 1960s to me; perhaps the use of computers to visualize data has constrained us too much. Remember, Du Bois's mission was to explain and convince and he chose his charts and their layout to do so, hence the experimentation with form. It's quite astonishing how far ahead of his time he was.  

Italian Futurism started in 1909 and largely fizzled out at the end of the second world war due to its close association with fascism. The movement emphasized the abstract representation of dynamism and technology among other things. Many futurist paintings used a restricted color palette and have obvious similarities with Du Bois' charts, here are just a few examples (below). I couldn't find any reliable articles that examined the links between Du Bois' work and futurism.

Numbers In Love - Giacomo Balla
Image from WikiArt
Music - Luigi Russolo
Image from WikiArt

The Bauhaus design school (1919-1933) sought to bring modernity and artistry into mass production and had a profound and lasting effect on the design of everyday things, even into the present day. Bauhaus designs tend to be minimal ("less is more") and focus on functionality ("form follows function") but can look a little stark. I searched, but I couldn't find any scholarly study of the links between Du Bois and Bauhaus, however, the fact the Paris exposition charts and the Bauhaus work use a common visual language is striking. Here's just one example, a poster for the Bauhaus school from 1923.

(Joost Schmidt, Public domain, via Wikimedia Commons)

Du Bois' place in data visualization

I've read a number of books on data visualization. Most of them include Nightingale's coxcomb plots and Playfair's bar and pie charts, but none of them included Du Bois charts.  Du Bois didn't originate any new chart types, which is maybe why the books ignore him, but his charts are worth studying because of their experimentation with form, their use of color, and most important of all, their ability to communicate meaning clearly. Ultimately, of course, this is the only purpose of data visualization.

Reading more

W. E. B. Du Bois's Data Portraits: Visualizing Black America, Whitney Battle-Baptiste, Britt Rusert. This is the book that brought these superb visualizations to a broader audience. It includes a number of full-color plates showing the infographics in their full glory.

The Library of Congress has many more infographics from the Paris exhibition, it also has photos too. Take a look at it for yourself here - but note the charts are towards the end of the list. I took all my charts in this article from the Library of Congress site. 

"W.E.B. Du Bois’ Visionary Infographics Come Together for the First Time in Full Color" article in the Smithsonian magazine that reviews the Battle-Baptiste book (above).

"W. E. B. Du Bois' Hand-Drawn Infographics of African-American Life (1900)" article in Public Domain Review that reviews the Battle-Baptiste book (above).

Friday, January 7, 2022

Prediction, distinction, and interpretation: the three parts of data science

What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, while other lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role. 

I would argue data science generally is about three distinct areas:

  • Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
  • Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results. 
  • Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests, it's also about evaluating models too.

Here's what you need to know:

  • Confidence intervals.
  • Sample size calculations. This is crucial and often overlooked by experienced data scientists. If your data set is too small, you're going to get junk results so you need to know what too small is. In the real world. increasing the sample size is often not an option and you need to know why.
  • Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
  • α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
  • The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
  • Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when. 
  • Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.

Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

  • The basics of machine learning models, including:
    • Generalized linear modeling
    • Random forests (including knowing why they are often frowned upon)
    • k-nearest neighbors/k-means clustering
    • Support Vector Machines
    • Gradient boosting.
  • Cross-validation, regularization, and their limitations.
  • Variable importance and principal component analysis.
  • Loss functions, including RMSE.
  • The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

Interpretation - what's going on?

The main techniques here are often data visualization. Statistical summaries are great, but they can often mislead. Charts give a fuller picture. 

Here are some techniques all data scientists should know:

  • Heatmaps
  • Violin plots
  • Scatter plots and curve fitting
  • Bar charts
  • Regression and curve fitting.

They should also know why pie charts in all their forms are bad. 

A good knowledge of how charts work is very helpful too (the psychology of visualization).

What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology) which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Even worse, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.

Monday, November 29, 2021

What's a violin plot and how to make one?

What's a violin plot?

Over the last few years, violin plots and their derivatives have become a lot more common; they're a 'sort of' smoothed histogram. In this blog post, I'm going to explain what they are and why you might use them.

To give you an idea of what they look like, here's a violin plot of attendance at English Premier League (EPL) matches during the 2018-2019 season. The width of the plot indicates the relative proportion of matches with that attendance; we can see attendance peaks around 27,000, 55,000, and 75,000, but no matches had zero attendance. 

Violin plots get their name because they sort of resemble a violin (you'll have to allow some creative license here).

As we'll see, violin plots avoid the problems of box and whisker plots and the problems of histograms. The cost is greatly increased computation time, but for a modern computer system, violin plots are calculated and plotted in the blink of an eye. Despite their advantages, the computational cost is why these plots have only recently become popular.

Summary statistics - the mean etc.

We can use summary statistics to give a numerical summary of the data. The most obvious statistics are the mean and standard deviation, which are 38,181 and 16,709 respectively for the EPL 2018 attendance data. But the mean can be distorted by outliers and the standard deviation implies a symmetrical distribution. These statistics don't give us any insight into how attendance was distributed.

The median is a better measure of central tendency in the presence of outliers, and quartiles work fine for asymmetric distributions. For this data, the median is 31,948 and the upper and lower quartiles are 25,034 and 53,283. Note the median is a good deal lower than the mean and the upper and lower quartiles are not evenly spaced, suggesting a skewed distribution. The quartiles give us an indication of how skewed the data is.

So we should be just fine with median and quartiles - or are there problems with these numbers too?

Box and whisker plots

Box and whisker plots were introduced by Tukey in the early 1970s and evolved since then, currently, there are several slight variations. Here's a box and whisker plot for the EPL data for four seasons in the most common format; the median, upper and lower quartiles, and the lowest and highest values are all indicated. We have a nice visual representation of our distribution.

The box and whisker plot encodes a lot of data and strongly indicates if the distribution is skewed. Surely this is good enough?

The problem with boxplots

Unfortunately, box and whisker plots can hide important features of the data. This article from Autodesk Research gives a great example of the problems. I've copied one of their key figures here.

("Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing", Justin Matejka, George Fitzmaurice, ACM SIGCHI Conference on Human Factors in Computing Systems, 2017)

The animation shows the problem nicely; different distributions can give the same box and whisker plots. This doesn't matter most of the time, but when it does matter, it can matter a lot, and of course, the problem is hidden.

Histograms and distributions

If boxplots don't work, what about histograms? Surely they won't suffer from the same problem? It's true they don't, but they suffer from another problem; bin count, meaning the number of bins.

Let's look at the EPL data again, this time as a histogram.

Here's the histogram with a bin count of 9.

Here it is with a bin count of 37.

And finally, with a bin count of 80.

Which of these bin counts is better? The answer is, it depends on what you want to show. In some cases, you want a large number of bins, in other cases, a small number of bins. As you can appreciate, this isn't helpful, especially if you're at the start of your career and you don't have a lot of experience to call on. Even later in your career, it's not helpful when you need to move quickly.

An evolution of the standard histogram is using unequal bin sizes. Under some circumstances, this gives a better representation of the distribution, but it adds another layer of complexity; what should the bin sizes be? The answer again is, it depends on what you want to do and your experience.

Can we do better?

Enter the violin plot

The violin plot does away with bin counts by using probability density instead. It's a plot of the probability density function (pdf) that could have given rise to the measured data. 

In a handful of cases, we could estimate the pdf using parametric means when the underlying data follows a well-known distribution. Unfortunately, in most real-world cases, we don't know what distribution the data follows, so we have to use a non-parametric method, the most popular being kernel density estimation (kde). This is almost always done using a Gaussian estimator, though other estimators are possible (in my experience, the Gaussian calculation gives the best result anyway, but see this discussion on StackOverflow). The key parameter is the bandwidth, though most kde algorithms attempt to size their bandwidth automatically. From the kde, the probability density is calculated (a trivial calculation).

Here's the violin plot for the EPL 2018 data.

It turns out, violin plots don't have the problems of box plots as you can see from this animation from Autodesk Research. The raw data changes, the box plot doesn't, but the violin plot changes dramatically. This is because all of the data is represented in the violin plot.

("Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing", Justin Matejka, George Fitzmaurice, ACM SIGCHI Conference on Human Factors in Computing Systems, 2017)

Variations on a theme, types of violin plot, ridgeline plots, and mirror density plots

The original form of the violin plot included median and quartile data and you often see violin plots presented like this - a kind of hybrid of violin and box and whisker. This is how the Seaborn plotting package draws Violin plots (though this chart isn't a Seaborn visualization).

Now, let's say we want to compare the EPL attendance data over several years. One way of doing it is to show violin plots next to each other, like this.

Another way is to show two plots back-to-back, like this. This presentation is often called a mirror density plot.

We can show the results from multiple seasons with another form of violin plot called the ridgeline plot. In this form of plot, the charts are deliberately overlapped, letting you compare the shape of the distributions better. They're called ridgeline plots because they sort of look like a mountain ridgeline.

This plot shows the most obvious feature of the data, the 2019-2020 season was very different from the seasons that went before; a substantial number of matches were held with zero attendees.

When should you use violin plots?

If you need to summarize a distribution and represent all of the data, a violin plot is a good choice. Depending on what you're trying to do, it may be a better choice than a histogram and it's always a better choice than box and whisker plots.

Aesthetically, a violin plot or one of its variations is very appealing. You shouldn't underestimate the importance of clear communication and visual appeal. If you think your results are important, surely you should present them in the best way possible - and that may well be a violin plot.

Reading more

"Statistical computing and graphics, violin plots: a box plot-density trace synergism", Jerry Hintze, Ray Nelson, American Statistical Association, 1998. This paper from Hintze and Nelson is one of the earlier papers to describe violin plots and how they're defined, it goes into some detail on the process for creating them. 

"Violin plots explained" is a great blog post that provides a nice explanation of how violin plots work.

"Violin Plots 101: Visualizing Distribution and Probability Density" is another good blog post proving some explanation on how to build violin plots.

Monday, July 26, 2021

Reconstructing an unlabelled chart

What were the numbers?

Often in business, we're presented with charts where the y-axis is unlabeled because the presenter wants to conceal the numbers. Are there ways of reconstructing the labels and figuring out what the data is? Surprisingly, yes there are.

Given a chart like this:

you can often figure out what the chart values should be.

The great Evan Miller posted on this topic several years ago ("How To Read an Unlabeled Sales Chart"). He discussed two methods:

  • Greatest common divisor (gcd)
  • Poisson distribution

In this blog post, I'm going to take his gcd work a step further and present code and a process for reconstructing numbers under certain circumstances. In another blog post, I'll explain the Poisson method.

The process I'm going to describe here will only work:

  • Where the underlying data is integers
  • Where there's 'enough' range in the underlying data.
  • Where the maximum underlying data is less than about 200.
  • Where the y-axis includes zero. 

The results

Let's start with some results and the process.

I generated this chart without axes labels, the goal being to recreate the underlying data. I measured screen y-coordinates of the top and bottom plot borders (187 and 677) and I measured the y coordinates of the top of each of the bars. Using the process and code I describe below, I was able to correctly recreate the underlying data values, which were \([33, 30, 32, 23, 32, 26, 18, 59, 47]\).

How plotting packages work

To understand the method, we need to understand how a plotting package will render a set of integers on a chart.

Let's take the list of numbers \([1, 2, 3, 5, 7, 11, 13, 17, 19, 23]\) and call them \(y_o\). 

When a plotting package renders \(y_o\) on the screen, it will put them into a chart with screen x-y coordinates. It's helpful to think about the chart on the screen as a viewport with x and y screen dimensions. Because we only care about the y dimensions, that's what I'll talk about. On the screen, the viewport might go from 963 pixels to 30 pixels on the y-axis, a total range of 933 y-pixels.

Here's how the numbers \(y_o\) might appear on the screen and how they map to the viewport y-coordinates. Note the origin is top left, not bottom right. I'll "correct" for the different origin.

The plotting package will translate the numbers \(y_o\) to a set of screen coordinates I'll call \(y_s\). Assuming our viewport starts from 0, we have:

\[y_s = my_o\]

Let's just look at the longest bar that corresponds to the number 23. My measurements of the start and end are 563 and 27, which gives a length of 536. \(m\) in this case is 536/23, or 23.3.

There are three things to bear in mind:

  • The set of numbers \(y_o\) are integers
  • The set of numbers \(y_s\) are integers - we can't have half a pixel for example.
  • The scalar \(m\) is a real number

Integer only solutions for \(m\) 

In Evan Miller's original post, he only considered integer values of \(m\). If we restrict ourselves to integers, then most of the time:

\[m = gcd(y_s)\]

where gcd is the greatest common divisor.

To see how this works, let's take:

\[y_o = [1 , 2,  3]\]


\[m = 8\]

These numbers give us:

\[y_s = [8, 16, 24]\]

To find the gcd in Python:

np.gcd.reduce([8, 16, 24])

which gives \(m = 8\), which is correct.

If we could guarantee \(m\) was an integer, we'd have an answer; we'd be able to reconstruct the original data just using the gcd function. But we can't do that in practice for three reasons:
  1. \(m\) isn't always an integer.
  2. There are measurement errors which mean there will be some uncertainty in our \(y_s\) values.
  3. It's possible the original data set \(y_o\) has a gcd which is not 1.

In practice, we gather screen coordinates using a manual process which will introduce errors. At most, we're likely to be off by a few pixels for each measurement, however, even the smallest error will mean the gcd method won't work. For example, if the value on the screen should be 500 but we might incorrectly measure it as 499, this small error means the method fails (there is a way around this failure that will work for small measurement errors.)

If our original data set has a gcd greater than 1, the method won't work. Let's say our data was:

\[y_o = [2, 4, 6] \]



we would have:

\[y_s = [16, 32, 48]\]

which has a gcd of 16, which is an incorrect estimate of \(m\). In practice, the odds of the original data set \(y_o\) having a gcd > 1 are low.

The real killer for this approach is the fact that \(m\) is highly likely in practice to be a real number.

Real solutions for \(m\)

The only way I've found for solving for \(m\) is to try different values for \(m\) to see what succeeds. To get this to work, we have to constrain \(m\) because otherwise there would be an infinite number of values to try. Here's how I constrain \(m\):

  • I limit the steps for different \(m\) values to 0.01.
  • I start my m values from just over 1 and I stop at a maximum \(m\) value. My maximum \(m\) value I get from assuming the smallest value I measure on the screen corresponds to a data value of 1, for example, if the smallest measurement is 24 pixels, the smallest possible original data is 1, so the maximum value for \(m\) is 24. 

Now we've constrained \(m\), how do we evaluate \(y_s = my_o\)? First off, we define an error function. We want our estimates of the original data \(y_o\) to be integers, so the further away we are from an integer, the worse the error. For the \(i\)th element of our estimate of \(y_o\), the error estimate is:

\[\frac{y_{si}}{m_{estimate}} -  \frac{y_{si}}{m_{estimate}}\]

we're choosing the least square error, which means minimizing:

\[ \frac{1}{n} \sum  \left ( round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}} \right )^2 \]

in code, this comes out as:

sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

Our goal is to try different values of \(m\) and choose the solution that yields the lowest error estimate.

The solution in practice

Before I show you how this works, there are two practicalities. The first is that \(m=1\) is always a solution and will always give a zero error, but it's probably not the right solution, so we're going to ignore \(m=1\). Secondly, there will be an error in our measurements due to human error. I'm going to assume the maximum error is 3 pixels for any measurement. To calculate a length, we take a measurement of the start and end of the bar (if it's a bar chart), which means our maximum uncertainty is 2*3. That's why I set my maximum \(m\) to be min(y) + 2*MAX_ERROR.

To show you how this works, I'll talk you through an example.

The first step is measurement. We need to measure the screen y-coordinates of the plot borders and the top of the bars (or the position of the points on a scatter chart). If the plot doesn't have borders, just measure the position of the bottom of the bars and the coordinate of the highest bar. Here are some measurements I took.

Here are the measurements of the top of the bars (_y_measured): \([482, 500, 489, 541, 489, 523, 571, 329, 399]\)

Here are the start and stop coordinates of the plot borders (_start, _stop):  \(677, 187\)

To convert these to lengths, the code is just: [_start - _y_m for _y_m in _y_measured]

The length of the screen from the top to the bottom is: _start - _stop = \(490\)

This gives us measured length (y_measured): \([195, 177, 188, 136, 188, 154, 106, 348, 278]\)

Now we run this code:


STEP = 0.01


def mse(y, div):

    """Means square error calculation."""

    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

def find_divider(y):

    """Return the non-integer that minimizes the error function."""

    error_list = []  

    for _div in np.arange(1 + STEP, 

                          min(y) + 2*MAX_ERROR, 


        error_list.append({"divider": _div, 

                           "error":mse(y, _div)})

    df_error = pd.DataFrame(error_list)

    df_error.plot(x='divider', y='error', kind='scatter')

    _slice = df_error[df_error['error'] == df_error['error'].min()]

    divider = _slice['divider'].to_list()[0]

    error = _slice['error'].to_list()[0]

    if error > ERROR_THRESHOLD:

        raise ValueError('The estimated error is {0} which is '

                          'too large for a reliable result.'.format(error))

    return divider

def find_estimate(y, y_extent):

    """Make an estimate of the underlying data."""

    if (max(y_measured) - min(y_measured))/y_extent < 0.1:

        raise ValueError('Too little range in the data to make an estimate.')  

    m = find_divider(y)

    return [round(_e/m) for _e in y_measured], m

estimate, m = find_estimate(y_measured, y_extent)

This gives us this output:

Original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Measured y values: [195, 177, 188, 136, 188, 154, 106, 348, 278]

Divider (m) estimate: 5.900000000000004

Estimated original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Which is correct.

Limitations of this approach

Here's when it won't work:

  • If there's little variation in the numbers on the chart, then measurement errors tend to overwhelm the calculations and the results aren't good.
  • In a similar vein, if the numbers are all close to the top or the bottom of the chart, measurement errors lead to poor results.
  • \(m < 1\), which as the maximum y viewport range is usually in the range 500-900 pixels, it won't work for numbers greater than about 500.
  • I've found in practice that if \(m < 3\) the results can be unreliable. Arbitrarily, I call any error greater than 0.01 too high to protect against poor results. Maybe, I should limit the results to \(m > 3\).

I'm not entirely convinced my error function is correct; I'd like an error function that better discriminates between values. I tried a couple of alternatives, but they didn't give good results. Perhaps you can do better.

Notice that the error function is 'denser' closer to 1, suggesting I should use a variable step size or a different algorithm. It might be that the closer you get to 1, the more errors and the effects of rounding overwhelm the calculation. I've played around with smaller step sizes and not had much luck.

Future work

If the data is Poisson distributed, there's an easier approach you can take. In a future blog post, I'll talk you through it.

Where to get the code

I've put the code on my Github page here:

Monday, June 21, 2021

Unknown Pleasures: pulsars, pop, and plotting

The echoes of history

Sometimes, there are weird connections between very different cultural areas and we see the echoes of history playing out. I'm going to tell you how pulsars, Nobel Prizes, an iconic album cover, Nazi atrocities, and software chart plotting all came to be connected.

The discovery of pulsars

In 1967, Jocelyn Bell was working on her Ph.D. as a post-graduate researcher at the Mullard Radio Astronomy Observatory, near Cambridge in the UK. She had helped build a new radio telescope and now she was operating it. On November 28, 1967, she saw a strikingly unusual and regular signal, which the team nicknamed "little green men". The signal turned out to be a pulsar, a type of star new to science.

This was an outstanding discovery that shook up astronomy. The team published a paper in Nature, but that wasn't the end of it. In 1974, the Nobel committee awarded the Nobel Physics Prize to the team. To everyone on the team except Jocelyn Bell

Over the years, there's been a lot of controversy over the decision, with many people thinking she was robbed of her share of the prize, either because she was a Ph.D. student or because she was a woman. Bell herself has been very gracious about the whole thing; she is indeed a very classy lady.

The pulsar and early computer graphics

In the late 1960s, a group of Ph.D. students from Cornell University were analyzing data from the pulsar Bell discovered. Among them was Harold Craft, who used early computer systems to visualize the results. Here's what he said to the Scientific American in 2015: "I found that it was just too confusing. So then, I wrote the program so that I would block out when a hill here was high enough, then the stuff behind it would stay hidden."

Here are three pages from Craft's Ph.D. thesis. Take a close look at the center plot. If Craft had made every line visible, it would have been very difficult to see what was going on. Craft re-imagined the data as if he were looking at it at an angle, for example, as if it were a mountain range ridgeline he was looking down on.  With a mountain ridgeline, the taller peaks hide what's behind them. It was a simple idea, but very effective.


The center plot is very striking. So striking in fact, that it found its way into the Cambridge Encyclopaedia of Astronomy (1977 edition):

(Cambridge Encyclopedia of Astronomy, 1977 edition, via Tim O'Riley)

Joy Division

England in the 1970s was not a happy place, especially in the de-industrialized north. Four young men in Manchester had formed a band and recorded an album. The story goes that one of them, Bernard Sumner, was working in central Manchester and took a break in the city library. He came across the pulsar image in the encyclopedia and liked it a lot.

The band needed an image for their debut album, so they selected this one. They gave it to a recently graduated designer called Peter Saville, with the instructions it was to be a black on white image. Saville felt the image would look better white on black, so he designed this cover.

This is the iconic Unknown Pleasures album from Joy Division.  

The starkness of the cover, without the band's name or the album's name, set it apart. The album itself was critically acclaimed, but it never rose high in the charts at the time. However, over time, the iconic status of the band and the album cover grew. In 1980, the lead singer, Ian Curtis, committed suicide. The remaining band members formed a new band, New Order, that went on to massive international fame.

By the 21st century, versions of the album cover were on beach towels, shoes, and tattoos.

Joy plots

In 2017, Claus Wilke created a new charting library for R, ggjoy.  His package enabled developers to create plots like the famous Unknown Pleasures album cover. In honor of the album cover, he called these plots joy plots.

Ridgeline plots

This story has a final twist to it. Although joy plots sound great, there's a problem.

Joy Division took their name from a real Nazi atrocity fictionalized in a book called House of Dolls. In some of their concentration camps, the Nazis forced women into prostitution. The camp brothels were called Joy Division

The name joy plots was meant to be fun and a callback to an iconic data visualization, but there's little joy in evil. Given this history, Wilke renamed his package ggridges and the plots ridgeline plots. 

Here's an example of the great visualizations you can produce with it. 

If you search around online, you can find people who've re-created the pulsar image using ggridges.

It's not just R programmers who are playing with Unknown Pleasures, Python programmers have got into the act too. Nicolas P. Rougier created a great animation based on the pulsar data set using the venerable Matplotlib plotting package - you can see the animation here.

If you liked this post

You might like these ones:

Monday, March 8, 2021

A masterclass in information visualization: the tube map

Going underground

The London Underground tube map is a master class in information visualization. It's been described in detail in many, many places, so I'm just going to give you a summary of why it's so special and what we can learn from it. Some of the lessons are about good visual design principles, some are about the limitations of design, but some of them are about wealth and poverty and the unintended consequences of abstraction.

(London Underground map.)

The problem

Starting in 1863, the underground train system in London grew in a haphazard fashion, with different railway companies building different lines and no sense of creating a coherent system. 

Despite the disorder, when it was first built it was viewed as a marvel and had a cultural impact beyond just transport; Conan Doyle wove it into Sherlock Holmes stories, H.G. Wells created science fiction involving it, and Virginia Woolf and others wrote of it too.

After various financial problems, the system was unified under government control. The government authority running it wanted to promote its use to reduce street-level congestion but the problem was, there were many different lines that only served part of the capital. Making it easy to use the system was hard.

Here's an early map of the system so you can see the problem.

1908 tube map

(1908 tube map. Image source: Wikimedia Commons.)

The map's hard to read and it's hard to follow. It's visually very cluttered and there are lots of distracting details; it's not clear why some things are marked on the map at all (why is ARMY & NAVY AND AUXILLARY STORES marked so prominently?). The font is hard to read, the text orientation is inconsistent, and the contrast of station names with the background isn't high enough.

The problem gets even worse when you zoom out to look at the entire system. Bear in mind, stations in central London are close together but they get further apart as you go into the suburbs. Here's an early map of the entire system, do you think you could navigate it?

(1931 whole system tube map.)

Of course, printing technology of the time was more limited than it is now, which made information representation harder.

Design ideas in culture

To understand how the tube map as we know it was created, we have to understand a little of the design culture of the time (the early 1930s).

Electrical engineering was starting as a discipline and engineers were creating circuit diagrams for new electrical devices. These circuit diagrams showed the connection between electrical components, not how they were laid out on a circuit board. Circuit diagrams are examples of topological maps.

(Example circuit diagram. Show electrical connections between components, not how they're laid out on a circuit board. Image source: Wikimedia Commons, License: Public domain.)

The Bauhaus school in Germany was emphasizing art and design in mass-produced items, bring high-quality design aesthetics into everyday goods. Ludwig Mies van der Rohe, the last director of the Bauhaus school, used a key aphorism that summarized much of their design philosophy: "less is more".

(Bauhaus kitchen design 1928 - they invented much of the modern design world. Image source: Wikimedia Commons, License: Public domain)

The modern art movement was in full swing, with the principles of abstraction coming very much to the fore. Artists were abstracting from reality in an attempt to represent an underlying truth about their subjects or about the world.

(Piet Mondrian, Composition 10. Image source: Wikimedia Commons, License: Public Domain.)

To put it simply, the early 1930s were a heyday of design that created much of our modern visual design language.

Harry Beck's solution - form follows function

In 1931, Harry Beck, a draughtsman for London Underground, proposed a new underground map. Beck's map was clearly based on circuit diagrams: it removed unnecessary detail to focus on what was necessary. In Beck's view, what was necessary for the tube was just the stations and the lines, plus a single underlying geographical detail, the river Thames.

Here's his original map. There's a lot here that's very, very different from the early geographical maps.

The design grammar of the tube map

The modern tube map is a much more complex beast, but it still retains the ideas Harry Beck created. For simplicity, I'm going to use the modern tube map to explain Beck's design innovations. There is one underlying and unifying idea behind everything I'm going to describe: consistency.

Topological not geographical.This is the key abstraction and it was key to the success of Beck's original map. On the ground, tube lines snake around and follow paths determined by geography and the urban landscape. This makes the relationship between tube lines confusing. Beck redrew the tube lines as straight lines without attempting to preserve the geographic relations of tube lines to one another. He made the stations more or less equidistant from each other, whereas, on the ground, the distance between stations varies widely. 

The two images below show the tube map and a geographical representation of the same map. Note how the tube map substantially distorts the underlying geography.

(The tube map. Image source: BBC.)

(A geographical view of the same system. Image source: Wikimedia Commons.)

Removal of almost all underlying geographical features. The only geographical feature on tube maps is the river Thames. Some versions of the tube map removed it, but the public wanted it put back in, so it's been a consistent feature for years now.

(The river Thames, in blue, is the only geographic feature on the map.)

A single consistent font.  Station names are written with the same orientation. Using the same font and the same text orientation makes reading the map easier. The tube has its own font, New Johnston, to give a sense of corporate identity.

(Same text orientation, same font everywhere.)

High contrast. This is something that's become easier with modern printing technology and good quality white paper. But there are problems. The tube uses a system of fare zones which are often added to the map (you can see them in the first two maps in this section, they're the gray and white bands). Although this is important information if you're paying for your tube ticket, it does add visual clutter. Because of the number of stations on the system, many modern maps add a grid so you can locate stations. Gridlines are another cluttering feature.

Consistent symbols. The map uses a small set of symbols consistently. The symbol for a station is a 'tick' (for example, Goodge Street or Russell Square). The symbol for a station that connects two or more lines is a circle (for example, Warren Street or Holborn).

Graphical rules. Angles and curves are consistent throughout the map, with few exceptions - clearly, the map was constructed using a consistent set of layout rules. For example, tube lines are shown as horizontal, vertical, or 45-degree lines in almost all cases.

The challenge for the future

The demand for mass transit in London has been growing for very many years which means London Underground is likely to have more development over time (new lines, new stations). This poses challenges for map makers.

The latest underground maps are much more complicated than Harry Beck's original, newer maps incorporate the south London tram system, some overground trains, and of course the new Elizabeth Line. At some point, a system becomes so complex that even an abstract simplification becomes too complex. Perhaps we'll need a map for the map.

A trap for the unwary

The tube map is topological, not geographical. On the map, tube stations are roughly the same distance apart, something that's very much not the case on the ground.

Let's imagine you had to go from Warren Street to Great Portland Street. How would you do it? Maybe you would get the Victoria Line southbound to Oxford Circus, change to the Bakerloo Line northbound, change again at Baker Street, and get the Circle Line eastbound to Great Portland Street. That's a lot of changes and trains. Why not just walk from Warren Street to Great Portland Street? They're less than 500m apart and you can do the walk in less than 5 minutes. The tube map misleads people into doing stuff like this all the time.

Let's imagine it's a lovely spring day and you're traveling to Chesham on the Metropolitan Line. If Great Portland Street and Warren Street are only 482m apart, then it must be a nice walk between Chalfont & Latimer and Chesham, especially as they're out in the leafy suburbs. Is this a good idea? Maybe not. These stations are 6.19km apart.

Abstractions are great, but you need to understand that's exactly what they are and how they can mislead you.

Using the map to represent data

The tube map is an icon, not just of the tube system, but for London itself. Because of its iconic status, researchers have used it as a vehicle to represent different data about the city.

James Cheshire of University College London mapped life expectancy data to tube stations, the idea being, you can spot health disparities between different parts of the city. He produced a great map you can visit at Here's a screenshot of part of his map.

You go from a life expectancy of 78 at Stockwell to 89 at Green Park, but the two stations are just 4 stops apart. His map shows how disparities occur across very short distances.

Mark Green of the University of Sheffield had a similar idea, but this time using a more generic deprivation score. Here's his take on deprivation and the tube map, the bigger circles representing higher deprivation.

Once again, we see the same thing, big differences in deprivation over short distances.

What the tube map hides

Let me show you a geographical layout of the modern tube system courtesy of Wikimedia. Do you spot what's odd about it?

(Geographical arrangement of tube lines. Image source: Wikimedia Commons, License: Creative Commons.)

Look at the tube system in southeast London. What tube system? There are no tube trains in southeast London. North London has lots of tube trains, southwest London has some, and southeast London has none at all. What part of London do you think is the poorest?

The tube map was never designed to indicate wealth and poverty, but it does that. It clearly shows which parts of London were wealthy enough to warrant underground construction and which were not. Of course, not every area in London has a tube station, even outside the southeast of London. Cricklewood (population 80,000) in northwest London doesn't have a tube station and is nowhere to be seen on the tube map. 

The tube map leaves off underserved areas entirely, it's as if southeast London (and Cricklewood and other places) don't exist. An abstraction meant to aid the user makes whole communities invisible.

Now look back at the previous section and the use of the tube map to indicate poverty and inequality in London. If the tube map is an iconic representation of London, what does that say about the areas that aren't even on the map? Perhaps it's a case of 'out of sight, out of mind'.

This is a clear reminder that information design is a deeply human endeavor. A value-neutral expression of information doesn't exist, and maybe we shouldn't expect it to.

Takeaways for the data scientist

As data scientists, we have to visualize data, not just for our fellow data scientists, but more importantly for the businesses we serve. We have to make it easy to understand and easy to interpret data. The London Underground tube map shows how ideas from outside science (circuit diagrams, Bauhaus, modernism) can help - information representation is, after all, a human endeavor. But the map shows the limits to abstraction and how we can be unintentionally led stray. 

The map also shows the hidden effects of wealth inequality and the power of exclusion - what we do does not exist in a cultural vacuum, which is true for both the tube map and the charts we produce too.

Monday, January 25, 2021

3D plotting: how hard can it be?

Why aren't 2D plots good enough?

Most data visualization problems involve some form of two-dimensional plotting, for example plotting sales by month. Over the last two hundred years, analysts have developed several different types of 2D plots, including scatter charts, line charts, and bar charts, so we have all the chart types we need for 2D data. But what happens if we have a 3D dataset? 

The dataset I'm looking at is English Premier League (EPL) results. I want to know how the full-time scores are distributed, for example, are there more 1-1 results than 2-1 results? I have three numbers, the full-time home goals (FTHG), the full-time away goals (FTAG). and the number of games that had that score. How can I present this 3D data in a meaningful way? 

(You can't rely on 3D glasses to visualize 3D data. Image source: Wikimedia Commons, License: Creative Commons, Author: Oliver Olschewski)

Just the text

The easiest way to view the data is to create a table, so here it is. The columns are the away goals, the rows are the home goals, and the cell values are the number of matches with that result, so 778 is the number of matches with a score of 0-1.

This presentation is easy to do, and relatively easy to interpret. I can see 1-1 is the most popular score, followed by 1-0. You can also see that some scores just don't occur (9-9) and results with more than a handful of goals are very uncommon.

This is OK for a smallish dataset like this, but if there are hundreds of rows and/or columns, it's not really viable. So what can we do?


A heatmap is a 2D map where the 3rd dimension is represented as color. The more intense (or lighter) the color, the higher the value. For this kind of plot to work, you do have to be careful about your color map. Usually, it's best to choose the intensity of just one color (e.g. shades of blue). In a few cases, multiple colors can work (colors for political parties), but those are the exceptions. 

Here's the same data plotted as a heatmap using the Brewer color palette "RdPu" (red-purple).

The plot does clearly show the structure. It's obvious there's a diagonal line beyond which no results occur. It's also obvious which scores are the most common. On the other hand, it's hard to get a sense of how quickly the frequency falls off because the human eye just isn't that sensitive to variations in color, but we could probably play around with the color scale to make the most important color variation occur over the range we're interested in. 

This is an easy plot to make because it's part of R's ggplot package. Here's my code:

plt_goal_heatmap <- goal_distribution %>% 
  ggplot(aes(FTHG, FTAG, fill=Matches)) + 
  geom_tile() +   
  scale_fill_distiller(palette = "RdPu") +
  ggtitle("Home/Away goal heatmap")

Perspective scatter plot

Another alternative is the perspective plot, which in R, you can create using the 'persp' function. This is a surface plot as you can see below.
You can change your perspective on the plot and view it from other angles, but even from this perspective, it's easy to see the very rapid falloff in frequency as the scores increase. 

However, I found this plot harder to use than the simple heatmap, and I found changing my viewing angle was awkward and time-consuming.

Here's my code in case it's useful to you:

persp(x = seq(0, max(goal_distribution$FTHG)), 
      y = seq(0, max(goal_distribution$FTAG)), 
      z = as.matrix(
            goal_distribution, FTAG, Matches, fill=0)[,-1])), 
      xlab = "FTHG", ylab = "FTAG", zlab = "Matches", 
      main = "Distribution of matches by score",
      theta = 60, phi = 20, 
      expand = 1, 
      col = "lightblue")

3D scatter plot

We can go one stage further and create a 3D scatter chart. On this chart, I've plotted the x, y, and z values and color coded them so you get a sense of the magnitude of the z values. I've also connected the points to the axis (the zero plane if you like) to emphasize the data structure a bit more.

As with the persp function,  you can change your perspective on the plot and view it from another angle.

The downside with this approach is it requires the 'plot3D' library in R and it requires you to install a new graphics server (XQuartz). It's a chunk of work to get to a visualization. The function to draw the plot is 'scatter3D'. Here's my code:

          xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
          phi = 5, 
          theta = 40,
          bty = "g",  
          type = "h", 
          pch = 19,
          main="Distribution of matches by score",
          cex = 0.5)

What's my choice?

My goal was to understand the distribution of goals in the EPL, so what presentations of the data were most useful to me?

The simple table worked well and was the most informative, followed by the heatmap. I found both persp and scatter3D to be awkward to use and both consumed way more time than they were worth. The nice thing about the heatmap is that it's available as part of the wonderful ggplot library.

Bottom line: keep it simple.