
Monday, July 26, 2021

Reconstructing an unlabelled chart

What were the numbers?

Often in business, we're presented with charts where the y-axis is unlabeled because the presenter wants to conceal the numbers. Are there ways of reconstructing the labels and figuring out what the data is? Surprisingly, yes there are.

Given a chart like this:

you can often figure out what the chart values should be.

The great Evan Miller posted on this topic several years ago ("How To Read an Unlabeled Sales Chart"). He discussed two methods:

  • Greatest common divisor (gcd)
  • Poisson distribution

In this blog post, I'm going to take his gcd work a step further and present code and a process for reconstructing numbers under certain circumstances. In another blog post, I'll explain the Poisson method.

The process I'm going to describe here will only work:

  • Where the underlying data is integers
  • Where there's 'enough' range in the underlying data.
  • Where the maximum underlying data is less than about 200.
  • Where the y-axis includes zero. 

The results

Let's start with some results and the process.

I generated this chart without axis labels, the goal being to recreate the underlying data. I measured the screen y-coordinates of the top and bottom plot borders (187 and 677) and I measured the y-coordinates of the top of each of the bars. Using the process and code I describe below, I was able to correctly recreate the underlying data values, which were \([33, 30, 32, 23, 32, 26, 18, 59, 47]\).

How plotting packages work

To understand the method, we need to understand how a plotting package will render a set of integers on a chart.

Let's take the list of numbers \([1, 2, 3, 5, 7, 11, 13, 17, 19, 23]\) and call them \(y_o\). 

When a plotting package renders \(y_o\) on the screen, it will put them into a chart with screen x-y coordinates. It's helpful to think about the chart on the screen as a viewport with x and y screen dimensions. Because we only care about the y dimensions, that's what I'll talk about. On the screen, the viewport might go from 963 pixels to 30 pixels on the y-axis, a total range of 933 y-pixels.

Here's how the numbers \(y_o\) might appear on the screen and how they map to the viewport y-coordinates. Note the origin is top left, not bottom left. I'll "correct" for the different origin.

The plotting package will translate the numbers \(y_o\) to a set of screen coordinates I'll call \(y_s\). Assuming our viewport starts from 0, we have:

\[y_s = my_o\]

Let's just look at the longest bar that corresponds to the number 23. My measurements of the start and end are 563 and 27, which gives a length of 536. \(m\) in this case is 536/23, or 23.3.
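
In code (variable names are mine, values from the measurements above), that's just:

bar_start, bar_end = 563, 27   # measured screen y-coordinates of the bar's base and top
length = bar_start - bar_end   # 536 pixels
m = length / 23                # roughly 23.3 pixels per data unit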

There are three things to bear in mind:

  • The set of numbers \(y_o\) are integers
  • The set of numbers \(y_s\) are integers - we can't have half a pixel for example.
  • The scalar \(m\) is a real number

Integer only solutions for \(m\) 

In Evan Miller's original post, he only considered integer values of \(m\). If we restrict ourselves to integers, then most of the time:

\[m = gcd(y_s)\]

where gcd is the greatest common divisor.

To see how this works, let's take:

\[y_o = [1 , 2,  3]\]

and

\[m = 8\]

These numbers give us:

\[y_s = [8, 16, 24]\]

To find the gcd in Python:

import numpy as np
np.gcd.reduce([8, 16, 24])

which gives \(m = 8\), which is correct.

If we could guarantee \(m\) was an integer, we'd have an answer; we'd be able to reconstruct the original data just using the gcd function. But we can't do that in practice for three reasons:
  1. \(m\) isn't always an integer.
  2. There are measurement errors which mean there will be some uncertainty in our \(y_s\) values.
  3. It's possible the original data set \(y_o\) has a gcd which is not 1.

In practice, we gather screen coordinates using a manual process that introduces errors. At most, we're likely to be off by a few pixels for each measurement; however, even the smallest error will mean the gcd method won't work. For example, if the value on the screen should be 500 but we measure it as 499, that small error makes the method fail (there is a way around this failure that works for small measurement errors).
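
To see how fragile the gcd approach is, here's a toy example (the numbers are illustrative and not from the chart above):

import numpy as np

np.gcd.reduce([100, 200, 500])  # 100 - the scale factor is recovered exactly
np.gcd.reduce([100, 200, 499])  # 1 - one mis-measured pixel and the method fails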

If our original data set has a gcd greater than 1, the method won't work. Let's say our data was:

\[y_o = [2, 4, 6] \]

and:

\[m=8\]

we would have:

\[y_s = [16, 32, 48]\]

which has a gcd of 16, which is an incorrect estimate of \(m\). In practice, the odds of the original data set \(y_o\) having a gcd > 1 are low.

The real killer for this approach is the fact that \(m\) is highly likely in practice to be a real number.

Real solutions for \(m\)

The only way I've found for solving for \(m\) is to try different values for \(m\) to see what succeeds. To get this to work, we have to constrain \(m\) because otherwise there would be an infinite number of values to try. Here's how I constrain \(m\):

  • I limit the steps for different \(m\) values to 0.01.
  • I start my m values from just over 1 and I stop at a maximum \(m\) value. My maximum \(m\) value I get from assuming the smallest value I measure on the screen corresponds to a data value of 1, for example, if the smallest measurement is 24 pixels, the smallest possible original data is 1, so the maximum value for \(m\) is 24. 

Now we've constrained \(m\), how do we evaluate \(y_s = my_o\)? First off, we define an error function. We want our estimates of the original data \(y_o\) to be integers, so the further away we are from an integer, the worse the error. For the \(i\)th element of our estimate of \(y_o\), the error estimate is:

\[ round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}} \]

We're choosing the mean squared error, which means minimizing:

\[ \frac{1}{n} \sum  \left ( round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}} \right )^2 \]

in code, this comes out as:

sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

Our goal is to try different values of \(m\) and choose the solution that yields the lowest error estimate.

The solution in practice

Before I show you how this works, there are two practicalities. The first is that \(m=1\) is always a solution and will always give a zero error, but it's probably not the right solution, so we're going to ignore \(m=1\). Secondly, there will be an error in our measurements due to human error. I'm going to assume the maximum error is 3 pixels for any measurement. To calculate a length, we take a measurement of the start and end of the bar (if it's a bar chart), which means our maximum uncertainty is 2*3. That's why I set my maximum \(m\) to be min(y) + 2*MAX_ERROR.

To show you how this works, I'll talk you through an example.

The first step is measurement. We need to measure the screen y-coordinates of the plot borders and the top of the bars (or the position of the points on a scatter chart). If the plot doesn't have borders, just measure the position of the bottom of the bars and the coordinate of the highest bar. Here are some measurements I took.

Here are the measurements of the top of the bars (_y_measured): \([482, 500, 489, 541, 489, 523, 571, 329, 399]\)

Here are the start and stop coordinates of the plot borders (_start, _stop):  \(677, 187\)

To convert these to lengths, the code is just: [_start - _y_m for _y_m in _y_measured]

The length of the screen from the top to the bottom is: _start - _stop = \(490\)

This gives us measured length (y_measured): \([195, 177, 188, 136, 188, 154, 106, 348, 278]\)
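
Putting those measurements into Python (the values are the ones listed above; y_extent is the full plot height in pixels):

# Screen measurements taken from the chart
_y_measured = [482, 500, 489, 541, 489, 523, 571, 329, 399]  # tops of the bars
_start, _stop = 677, 187  # bottom and top plot borders

# Convert bar-top coordinates to bar lengths in pixels
y_measured = [_start - _y_m for _y_m in _y_measured]  # [195, 177, 188, 136, 188, 154, 106, 348, 278]
y_extent = _start - _stop  # 490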

Now we run this code:

import numpy as np
import pandas as pd

MAX_ERROR = 3
STEP = 0.01
ERROR_THRESHOLD = 0.01


def mse(y, div):
    """Mean square error calculation."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


def find_divider(y):
    """Return the non-integer divider that minimizes the error function."""
    error_list = []
    for _div in np.arange(1 + STEP,
                          min(y) + 2*MAX_ERROR,
                          STEP):
        error_list.append({"divider": _div,
                           "error": mse(y, _div)})
    df_error = pd.DataFrame(error_list)
    df_error.plot(x='divider', y='error', kind='scatter')
    _slice = df_error[df_error['error'] == df_error['error'].min()]
    divider = _slice['divider'].to_list()[0]
    error = _slice['error'].to_list()[0]
    if error > ERROR_THRESHOLD:
        raise ValueError('The estimated error is {0} which is '
                         'too large for a reliable result.'.format(error))
    return divider


def find_estimate(y, y_extent):
    """Make an estimate of the underlying data."""
    if (max(y) - min(y))/y_extent < 0.1:
        raise ValueError('Too little range in the data to make an estimate.')
    m = find_divider(y)
    return [round(_e/m) for _e in y], m


estimate, m = find_estimate(y_measured, y_extent)

This gives us this output:

Original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Measured y values: [195, 177, 188, 136, 188, 154, 106, 348, 278]

Divider (m) estimate: 5.900000000000004

Estimated original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Which is correct.

Limitations of this approach

Here's when it won't work:

  • If there's little variation in the numbers on the chart, then measurement errors tend to overwhelm the calculations and the results aren't good.
  • In a similar vein, if the numbers are all close to the top or the bottom of the chart, measurement errors lead to poor results.
  • If \(m < 1\), the method fails; because the maximum y viewport range is usually 500-900 pixels, this means it won't work for numbers greater than about 500.
  • I've found in practice that if \(m < 3\) the results can be unreliable. Somewhat arbitrarily, I treat any error greater than 0.01 as too high to protect against poor results. Maybe I should limit the results to \(m > 3\).

I'm not entirely convinced my error function is correct; I'd like an error function that better discriminates between values. I tried a couple of alternatives, but they didn't give good results. Perhaps you can do better.

Notice that the error function is 'denser' closer to 1, suggesting I should use a variable step size or a different algorithm. It might be that the closer you get to 1, the more errors and the effects of rounding overwhelm the calculation. I've played around with smaller step sizes and not had much luck.

Future work

If the data is Poisson distributed, there's an easier approach you can take. In a future blog post, I'll talk you through it.

Where to get the code

I've put the code on my Github page here: https://github.com/MikeWoodward/CodeExamples/blob/master/UnlabeledChart/approxrealgcd.py

Monday, June 21, 2021

Unknown Pleasures: pulsars, pop, and plotting

The echoes of history

Sometimes, there are weird connections between very different cultural areas and we see the echoes of history playing out. I'm going to tell you how pulsars, Nobel Prizes, an iconic album cover, Nazi atrocities, and software chart plotting all came to be connected.

The discovery of pulsars

In 1967, Jocelyn Bell was working on her Ph.D. as a post-graduate researcher at the Mullard Radio Astronomy Observatory, near Cambridge in the UK. She had helped build a new radio telescope and now she was operating it. On November 28, 1967, she saw a strikingly unusual and regular signal, which the team nicknamed "little green men". The signal turned out to be a pulsar, a type of star new to science.

This was an outstanding discovery that shook up astronomy. The team published a paper in Nature, but that wasn't the end of it. In 1974, the Nobel committee awarded the Nobel Physics Prize to the team. To everyone on the team except Jocelyn Bell.

Over the years, there's been a lot of controversy over the decision, with many people thinking she was robbed of her share of the prize, either because she was a Ph.D. student or because she was a woman. Bell herself has been very gracious about the whole thing; she is indeed a very classy lady.

The pulsar and early computer graphics

In the late 1960s, a group of Ph.D. students from Cornell University were analyzing data from the pulsar Bell discovered. Among them was Harold Craft, who used early computer systems to visualize the results. Here's what he said to the Scientific American in 2015: "I found that it was just too confusing. So then, I wrote the program so that I would block out when a hill here was high enough, then the stuff behind it would stay hidden."

Here are three pages from Craft's Ph.D. thesis. Take a close look at the center plot. If Craft had made every line visible, it would have been very difficult to see what was going on. Craft re-imagined the data as if he were looking at it at an angle, for example, as if it were a mountain range ridgeline he was looking down on.  With a mountain ridgeline, the taller peaks hide what's behind them. It was a simple idea, but very effective.

(Credit: JEN CHRISTIANSEN/HAROLD D. CRAFT)

The center plot is very striking. So striking in fact, that it found its way into the Cambridge Encyclopaedia of Astronomy (1977 edition):

(Cambridge Encyclopedia of Astronomy, 1977 edition, via Tim O'Riley)

Joy Division

England in the 1970s was not a happy place, especially in the de-industrialized north. Four young men in Manchester had formed a band and recorded an album. The story goes that one of them, Bernard Sumner, was working in central Manchester and took a break in the city library. He came across the pulsar image in the encyclopedia and liked it a lot.

The band needed an image for their debut album, so they selected this one. They gave it to a recently graduated designer called Peter Saville, with the instruction that it was to be a black-on-white image. Saville felt the image would look better white on black, so he designed this cover.

This is the iconic Unknown Pleasures album from Joy Division.  

The starkness of the cover, without the band's name or the album's name, set it apart. The album itself was critically acclaimed, but it never rose high in the charts at the time. However, over time, the iconic status of the band and the album cover grew. In 1980, the lead singer, Ian Curtis, committed suicide. The remaining band members formed a new band, New Order, that went on to massive international fame.

By the 21st century, versions of the album cover were on beach towels, shoes, and tattoos.

Joy plots

In 2017, Claus Wilke created a new charting library for R, ggjoy.  His package enabled developers to create plots like the famous Unknown Pleasures album cover. In honor of the album cover, he called these plots joy plots.

Ridgeline plots

This story has a final twist to it. Although joy plots sound great, there's a problem.

Joy Division took their name from a real Nazi atrocity fictionalized in a book called House of Dolls. In some of their concentration camps, the Nazis forced women into prostitution. The camp brothels were called Joy Division.

The name joy plots was meant to be fun and a callback to an iconic data visualization, but there's little joy in evil. Given this history, Wilke renamed his package ggridges and the plots ridgeline plots. 

Here's an example of the great visualizations you can produce with it. 

If you search around online, you can find people who've re-created the pulsar image using ggridges.

It's not just R programmers who are playing with Unknown Pleasures; Python programmers have got in on the act too. Nicolas P. Rougier created a great animation based on the pulsar data set using the venerable Matplotlib plotting package - you can see the animation here.


Monday, March 8, 2021

A masterclass in information visualization: the tube map

Going underground

The London Underground tube map is a master class in information visualization. It's been described in detail in many, many places, so I'm just going to give you a summary of why it's so special and what we can learn from it. Some of the lessons are about good visual design principles, some are about the limitations of design, but some of them are about wealth and poverty and the unintended consequences of abstraction.

(London Underground map.)

The problem

Starting in 1863, the underground train system in London grew in a haphazard fashion, with different railway companies building different lines and no sense of creating a coherent system. 

Despite the disorder, when it was first built it was viewed as a marvel and had a cultural impact beyond just transport; Conan Doyle wove it into Sherlock Holmes stories, H.G. Wells created science fiction involving it, and Virginia Woolf and others wrote of it too.

After various financial problems, the system was unified under government control. The government authority running it wanted to promote its use to reduce street-level congestion, but the problem was that there were many different lines, each serving only part of the capital. Making it easy to use the system was hard.

Here's an early map of the system so you can see the problem.

1908 tube map

(1908 tube map. Image source: Wikimedia Commons.)

The map's hard to read and it's hard to follow. It's visually very cluttered and there are lots of distracting details; it's not clear why some things are marked on the map at all (why is ARMY & NAVY AND AUXILLARY STORES marked so prominently?). The font is hard to read, the text orientation is inconsistent, and the contrast of station names with the background isn't high enough.

The problem gets even worse when you zoom out to look at the entire system. Bear in mind, stations in central London are close together but they get further apart as you go into the suburbs. Here's an early map of the entire system; do you think you could navigate it?

(1931 whole system tube map.)

Of course, printing technology of the time was more limited than it is now, which made information representation harder.

Design ideas in culture

To understand how the tube map as we know it was created, we have to understand a little of the design culture of the time (the early 1930s).

Electrical engineering was starting as a discipline and engineers were creating circuit diagrams for new electrical devices. These circuit diagrams showed the connection between electrical components, not how they were laid out on a circuit board. Circuit diagrams are examples of topological maps.

(Example circuit diagram. Shows electrical connections between components, not how they're laid out on a circuit board. Image source: Wikimedia Commons, License: Public domain.)

The Bauhaus school in Germany was emphasizing art and design in mass-produced items, bringing high-quality design aesthetics into everyday goods. Ludwig Mies van der Rohe, the last director of the Bauhaus school, used a key aphorism that summarized much of their design philosophy: "less is more".

(Bauhaus kitchen design 1928 - they invented much of the modern design world. Image source: Wikimedia Commons, License: Public domain)

The modern art movement was in full swing, with the principles of abstraction coming very much to the fore. Artists were abstracting from reality in an attempt to represent an underlying truth about their subjects or about the world.

(Piet Mondrian, Composition 10. Image source: Wikimedia Commons, License: Public Domain.)

To put it simply, the early 1930s were a heyday of design that created much of our modern visual design language.

Harry Beck's solution - form follows function

In 1931, Harry Beck, a draughtsman for London Underground, proposed a new underground map. Beck's map was clearly based on circuit diagrams: it removed unnecessary detail to focus on what was necessary. In Beck's view, what was necessary for the tube was just the stations and the lines, plus a single underlying geographical detail, the river Thames.

Here's his original map. There's a lot here that's very, very different from the early geographical maps.

The design grammar of the tube map

The modern tube map is a much more complex beast, but it still retains the ideas Harry Beck created. For simplicity, I'm going to use the modern tube map to explain Beck's design innovations. There is one underlying and unifying idea behind everything I'm going to describe: consistency.

Topological, not geographical. This is the key abstraction, and it was central to the success of Beck's original map. On the ground, tube lines snake around and follow paths determined by geography and the urban landscape. This makes the relationship between tube lines confusing. Beck redrew the tube lines as straight lines without attempting to preserve the geographic relations of tube lines to one another. He made the stations more or less equidistant from each other, whereas, on the ground, the distance between stations varies widely.

The two images below show the tube map and a geographical representation of the same map. Note how the tube map substantially distorts the underlying geography.

(The tube map. Image source: BBC.)

(A geographical view of the same system. Image source: Wikimedia Commons.)

Removal of almost all underlying geographical features. The only geographical feature on tube maps is the river Thames. Some versions of the tube map removed it, but the public wanted it put back in, so it's been a consistent feature for years now.

(The river Thames, in blue, is the only geographic feature on the map.)

A single consistent font.  Station names are written with the same orientation. Using the same font and the same text orientation makes reading the map easier. The tube has its own font, New Johnston, to give a sense of corporate identity.

(Same text orientation, same font everywhere.)

High contrast. This is something that's become easier with modern printing technology and good quality white paper. But there are problems. The tube uses a system of fare zones which are often added to the map (you can see them in the first two maps in this section, they're the gray and white bands). Although this is important information if you're paying for your tube ticket, it does add visual clutter. Because of the number of stations on the system, many modern maps add a grid so you can locate stations. Gridlines are another cluttering feature.

Consistent symbols. The map uses a small set of symbols consistently. The symbol for a station is a 'tick' (for example, Goodge Street or Russell Square). The symbol for a station that connects two or more lines is a circle (for example, Warren Street or Holborn).

Graphical rules. Angles and curves are consistent throughout the map, with few exceptions - clearly, the map was constructed using a consistent set of layout rules. For example, tube lines are shown as horizontal, vertical, or 45-degree lines in almost all cases.

The challenge for the future

The demand for mass transit in London has been growing for very many years which means London Underground is likely to have more development over time (new lines, new stations). This poses challenges for map makers.

The latest underground maps are much more complicated than Harry Beck's original; newer maps incorporate the south London tram system, some overground trains, and of course the new Elizabeth Line. At some point, a system becomes so complex that even an abstract simplification becomes too complex. Perhaps we'll need a map for the map.

A trap for the unwary

The tube map is topological, not geographical. On the map, tube stations are roughly the same distance apart, something that's very much not the case on the ground.

Let's imagine you had to go from Warren Street to Great Portland Street. How would you do it? Maybe you would get the Victoria Line southbound to Oxford Circus, change to the Bakerloo Line northbound, change again at Baker Street, and get the Circle Line eastbound to Great Portland Street. That's a lot of changes and trains. Why not just walk from Warren Street to Great Portland Street? They're less than 500m apart and you can do the walk in less than 5 minutes. The tube map misleads people into doing stuff like this all the time.

Let's imagine it's a lovely spring day and you're traveling to Chesham on the Metropolitan Line. If Great Portland Street and Warren Street are only 482m apart, then it must be a nice walk between Chalfont & Latimer and Chesham, especially as they're out in the leafy suburbs. Is this a good idea? Maybe not. These stations are 6.19km apart.

Abstractions are great, but you need to understand that's exactly what they are and how they can mislead you.

Using the map to represent data

The tube map is an icon, not just of the tube system, but for London itself. Because of its iconic status, researchers have used it as a vehicle to represent different data about the city.

James Cheshire of University College London mapped life expectancy data to tube stations, the idea being, you can spot health disparities between different parts of the city. He produced a great map you can visit at tubecreature.com. Here's a screenshot of part of his map.


You go from a life expectancy of 78 at Stockwell to 89 at Green Park, but the two stations are just 4 stops apart. His map shows how disparities occur across very short distances.

Mark Green of the University of Sheffield had a similar idea, but this time using a more generic deprivation score. Here's his take on deprivation and the tube map, the bigger circles representing higher deprivation.

Once again, we see the same thing, big differences in deprivation over short distances.

What the tube map hides

Let me show you a geographical layout of the modern tube system courtesy of Wikimedia. Do you spot what's odd about it?

(Geographical arrangement of tube lines. Image source: Wikimedia Commons, License: Creative Commons.)

Look at the tube system in southeast London. What tube system? There are no tube trains in southeast London. North London has lots of tube trains, southwest London has some, and southeast London has none at all. What part of London do you think is the poorest?

The tube map was never designed to indicate wealth and poverty, but it does that. It clearly shows which parts of London were wealthy enough to warrant underground construction and which were not. Of course, not every area in London has a tube station, even outside the southeast of London. Cricklewood (population 80,000) in northwest London doesn't have a tube station and is nowhere to be seen on the tube map. 

The tube map leaves off underserved areas entirely; it's as if southeast London (and Cricklewood and other places) don't exist. An abstraction meant to aid the user makes whole communities invisible.

Now look back at the previous section and the use of the tube map to indicate poverty and inequality in London. If the tube map is an iconic representation of London, what does that say about the areas that aren't even on the map? Perhaps it's a case of 'out of sight, out of mind'.

This is a clear reminder that information design is a deeply human endeavor. A value-neutral expression of information doesn't exist, and maybe we shouldn't expect it to.

Takeaways for the data scientist

As data scientists, we have to visualize data, not just for our fellow data scientists, but more importantly for the businesses we serve. We have to make it easy to understand and easy to interpret data. The London Underground tube map shows how ideas from outside science (circuit diagrams, Bauhaus, modernism) can help - information representation is, after all, a human endeavor. But the map also shows the limits of abstraction and how we can be unintentionally led astray.

The map also shows the hidden effects of wealth inequality and the power of exclusion - what we do does not exist in a cultural vacuum, which is true for the tube map and for the charts we produce.

Monday, January 25, 2021

3D plotting: how hard can it be?

Why aren't 2D plots good enough?

Most data visualization problems involve some form of two-dimensional plotting, for example plotting sales by month. Over the last two hundred years, analysts have developed several different types of 2D plots, including scatter charts, line charts, and bar charts, so we have all the chart types we need for 2D data. But what happens if we have a 3D dataset? 

The dataset I'm looking at is English Premier League (EPL) results. I want to know how the full-time scores are distributed, for example, are there more 1-1 results than 2-1 results? I have three numbers: the full-time home goals (FTHG), the full-time away goals (FTAG), and the number of games that had that score. How can I present this 3D data in a meaningful way?

(You can't rely on 3D glasses to visualize 3D data. Image source: Wikimedia Commons, License: Creative Commons, Author: Oliver Olschewski)

Just the text

The easiest way to view the data is to create a table, so here it is. The columns are the away goals, the rows are the home goals, and the cell values are the number of matches with that result, so 778 is the number of matches with a score of 0-1.


This presentation is easy to do, and relatively easy to interpret. I can see 1-1 is the most popular score, followed by 1-0. You can also see that some scores just don't occur (9-9) and results with more than a handful of goals are very uncommon.

This is OK for a smallish dataset like this, but if there are hundreds of rows and/or columns, it's not really viable. So what can we do?

Heatmaps

A heatmap is a 2D map where the 3rd dimension is represented as color. The more intense (or lighter) the color, the higher the value. For this kind of plot to work, you do have to be careful about your color map. Usually, it's best to choose the intensity of just one color (e.g. shades of blue). In a few cases, multiple colors can work (colors for political parties), but those are the exceptions. 

Here's the same data plotted as a heatmap using the Brewer color palette "RdPu" (red-purple).

The plot does clearly show the structure. It's obvious there's a diagonal line beyond which no results occur. It's also obvious which scores are the most common. On the other hand, it's hard to get a sense of how quickly the frequency falls off because the human eye just isn't that sensitive to variations in color, but we could probably play around with the color scale to make the most important color variation occur over the range we're interested in. 

This is an easy plot to make because it's part of R's ggplot package. Here's my code:

plt_goal_heatmap <- goal_distribution %>% 
  ggplot(aes(FTHG, FTAG, fill=Matches)) + 
  geom_tile() +   
  scale_fill_distiller(palette = "RdPu") +
  ggtitle("Home/Away goal heatmap")

Perspective scatter plot

Another alternative is the perspective plot, which, in R, you can create using the 'persp' function. This is a surface plot, as you can see below.

You can change your perspective on the plot and view it from other angles, but even from this perspective, it's easy to see the very rapid falloff in frequency as the scores increase.

However, I found this plot harder to use than the simple heatmap, and I found changing my viewing angle was awkward and time-consuming.

Here's my code in case it's useful to you:

persp(x = seq(0, max(goal_distribution$FTHG)), 
      y = seq(0, max(goal_distribution$FTAG)), 
      z = as.matrix(
        unname(
          spread(
            goal_distribution, FTAG, Matches, fill=0)[,-1])), 
      xlab = "FTHG", ylab = "FTAG", zlab = "Matches", 
      main = "Distribution of matches by score",
      theta = 60, phi = 20, 
      expand = 1, 
      col = "lightblue")

3D scatter plot

We can go one stage further and create a 3D scatter chart. On this chart, I've plotted the x, y, and z values and color coded them so you get a sense of the magnitude of the z values. I've also connected the points to the axis (the zero plane if you like) to emphasize the data structure a bit more.


As with the persp function,  you can change your perspective on the plot and view it from another angle.

The downside with this approach is it requires the 'plot3D' library in R and it requires you to install a new graphics server (XQuartz). It's a chunk of work to get to a visualization. The function to draw the plot is 'scatter3D'. Here's my code:

scatter3D(x=goal_distribution$FTHG, 
          y=goal_distribution$FTAG, 
          z=goal_distribution$Matches, 
          xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
          phi = 5, 
          theta = 40,
          bty = "g",  
          type = "h", 
          pch = 19,
          main="Distribution of matches by score",
          cex = 0.5)

What's my choice?

My goal was to understand the distribution of goals in the EPL, so what presentations of the data were most useful to me?

The simple table worked well and was the most informative, followed by the heatmap. I found both persp and scatter3D to be awkward to use and both consumed way more time than they were worth. The nice thing about the heatmap is that it's available as part of the wonderful ggplot library.

Bottom line: keep it simple.

Tuesday, October 6, 2020

Faster Python BI app development through code generation

Back to the future: design like it's 1999

Back in 1999, you could build Visual Basic apps by dragging and dropping visual components (widgets) onto a canvas. The Visual Basic IDE handled all the code generation, leaving you with the task of wiring up your new GUI to your business data. It wasn't just Visual Basic, though; you could do the same thing with Visual C++ and other Microsoft languages. The generated code wasn't the prettiest, but it worked, and it meant you could get the job done quickly.

(Microsoft Visual Basic. Image credit: Microsoft.)

Roll forward twenty years. Python is now very popular and people are writing all kinds of software using it, including software that needs UIs. Of course, the UI front-end is now the browser, which is another change. Sadly, nothing like the UI building capabilities of the Microsoft Visual Studio IDE exists for Python; you can't build Python applications by dragging and dropping widgets onto a canvas.

Obviously, BI tools like Tableau and Qlik fulfill some of the need to quickly build visualization tools; they've inherited the UI building crown from Microsoft. Unfortunately, they run out of steam when the analysis is complex; they have limited statistical capabilities and they're not general-purpose programming environments.

If your apps are 'simple', obviously, Tableau or Qlik are the way to go. But what happens if your apps involve more complex analysis, or if you have data scientists who know Python but not Tableau?

What would it take to make a Visual Basic or Tableau-like app builder for Python? Could we build something like it?

Start with the end in mind

The end goal is to have a drag and drop interface that looks something like this.

(draw.io. Image credit: draw.io.)

On the left-hand side of the screenshot, there's a library of widgets the user can drag and drop onto a canvas. 

Ideally, we'd like to be able to design a multi-tabbed application and move widgets onto each tab from a library. We'd do all the visualization layout on the GUI editor and maybe set up some of the properties for the widgets from the UI too. For example, we might set up the table column names, or give a chart a title and axis titles. When we're done designing, we could press a button and generate outline code that would create an application with the (dummy) UI we want.

A step further would be to import existing Python code into the UI editor and move widgets from tab to tab, or add new widgets, or delete unwanted widgets.

Conceptually, all the technology to do this exists right now, just not in one place. Unfortunately, it would take considerable effort to produce something like it. 

If we can't go all the way, can we at least go part of the way?

A journey of a thousand miles begins with a single step

A first step is code generation from a specification. The idea is simple: you define your UI in a specification file that software uses to generate code. 

For this first simple step (and the end goal), there are two things to bear in mind:

  • Almost all UI-based applications can be constructed using a Model-View-Controller architecture (pattern) or something that looks like it.
  • Python widgets are similar and follow well-known rules. For example, the widgets in Bokeh follow an API; a button follows certain rules, a dropdown menu follows certain rules and so on.

Given that there are big patterns and small patterns here, we could use a specification file to generate code for almost all UI-based applications.

I've created software that does this, and I'm going to tell you about it.

JSON and the argonauts

Here's an overview of how my code generation software works.

  • The Model-View-Controller code exists as a series of templates, with key features added at code generation time.
  • The application is specified in a JSON file. The JSON file contains details of each tab in the application, along with details of the widgets on the tab. The JSON file must follow certain rules; for example, no duplicate names.
  • Most of the rules for code generation are in a JSON schema file that contains details for each Bokeh widget. For example, the JSON schema has rules for how to implement a button, including how to create a callback function for a button.

Here's how it works in practice.

  1. The user creates a specification file in JSON. The JSON file has details of:
    • The overall project (name, copyright, author etc.)
    • Overall data for each tab (e.g. name of each tab and a description of what it does).
    • For each tab, there's a specification for each widget, giving its name, its argument, and a comment on what it does.
  2. The system checks the user's JSON specification file for consistency (well-formed JSON etc.)
  3. Using a JSON schema file that contains the rules for constructing Bokeh widgets, the system generates code for each Bokeh widget in the specification.
    • For each widget that could have a callback, the system generates the callback code.
    • For complex widgets like DataTable and FileInput, the system generates skeleton example code that shows how to implement the widget. In the DataTable case, it sets up a dummy data source and table columns.
  4. The system then adds the generated code to the Model-View-Controller templates and generates code for the entire project.
    • The generated code is PEP8 compliant by design.

The generated code is runnable, so you can test out how the UI looks.
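
As an aside, the consistency check in step 2 can be sketched with the jsonschema package (a minimal sketch under assumed file names, not the actual tool's code):

import json
from jsonschema import validate, ValidationError

# Hypothetical file names, for illustration only
with open("app_spec.json") as spec_file, open("widget_schema.json") as schema_file:
    spec = json.load(spec_file)      # json.load raises if the file isn't well-formed JSON
    schema = json.load(schema_file)

try:
    validate(instance=spec, schema=schema)  # check the spec against the widget rules
except ValidationError as error:
    print("Specification error:", error.message)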

Here's an excerpt from the JSON schema defining the rules for building widgets:

"allOf":[
  {
    "$comment":"███ Button ███",
    "if":{
      "properties":{
        "type":{
          "const":"Button"
        }
      }
    },
    "then":{
      "properties":{
        "name":{
          "$ref":"#/definitions/string_template_short"
        },
        "description":{
          "$ref":"#/definitions/string_template_long"
        },
        "type":{
          "$ref":"#/definitions/string_template_short"
        },
        "arguments":{
          "type":"object",
          "additionalProperties":false,
          "required":[
            "label"
          ],
          "properties":{
            "label":{
              "type":"string"
            },
            "sizing_mode":{
              "type":"string",
              "default":"stretch_width"
            },
            "button_type":{
              "type":"string",
              "default":"success"
            }
          }
        },

Here's an excerpt from the JSON file defining an application's UI:

{
  "name":"Manage data",
  "description":"Panel to manage data sources.",
  "widgets":[
    {
      "name":"ECV year allocations",
      "description":"Displays the Electoral College Vote allocations by year.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Electoral College Vote allocations by year in system",
        "value":"No allocations in system"
      }
    },
    {
      "name":"Election results",
      "description":"Displays the election result years in the system.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Presidential Election results in system",
        "value":"No allocations in system"
      }
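
To make the generation idea concrete, here's a minimal, hypothetical sketch (not the actual tool, which uses the JSON schema and Model-View-Controller templates described above). It takes a widget entry like the ones in the excerpt above and emits a line of Bokeh code from a string template:

# Hypothetical sketch only - the real generator is schema-driven and more complete
WIDGET_TEMPLATES = {
    "TextInput": ("{var} = TextInput(title={title!r}, "
                  "value={value!r}, disabled={disabled})"),
}


def generate_widget(spec):
    """Return a line of Bokeh code for a single widget specification."""
    template = WIDGET_TEMPLATES[spec["type"]]
    variable_name = spec["name"].lower().replace(" ", "_")
    return template.format(var=variable_name,
                           title=spec["arguments"]["title"],
                           value=spec["arguments"]["value"],
                           disabled=spec.get("disabled", False))


widget_spec = {
    "name": "Election results",
    "type": "TextInput",
    "disabled": True,
    "arguments": {"title": "Presidential Election results in system",
                  "value": "No allocations in system"},
}

print(generate_widget(widget_spec))
# election_results = TextInput(title='Presidential Election results in system', value='No allocations in system', disabled=True)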

What this means in practice

Using this software, I can very rapidly prototype BI-like applications. The main task left is wiring up the widgets to the business data in the Model part of the Model-View-Controller architecture. This approach reduces the tedious part of UI development but doesn't entirely eliminate it. It also helps with widgets like DataTable that require a chunk of code to get them working - this software generates most of that code for you.

How things could be better

The software works, but not as well as it could:

  • It doesn't do layout. Laying out Bokeh widgets is a major nuisance and a time suck. 
  • The stubs for Bokeh DataTable are too short - ideally, the generated code should contain more detail which would help reduce the need to write code.
  • The Model-View-Controller architecture needs some clean up.

The roadmap

I have a long shopping list of improvements:
  • Better Model-View-Controller
  • Robust exception handling in the generated code
  • Better stubs for Bokeh widgets like DataTable
  • Automatic Sphinx documentation
  • Layout automation

Is it worth it?

Yes and no.

For straightforward apps, it will still be several times faster to write apps in Tableau or Qlik. But if the app requires more statistical firepower, or complex analysis, or linkage to other systems, then Python wins and this approach is worth taking. If you have access to Python developers, but not Tableau developers, then once again, this approach wins.

Over the longer term, regardless of my efforts, I can clearly see Python tools evolving to the state where they can compete with Qlik and Tableau for speed of application development.

Maybe in five years' time, we'll have all of the functionality we had 25 years ago. What's old is new again.