Showing posts with label data visualization. Show all posts

Thursday, February 27, 2020

Pie charts are lie charts

There are lots of chart types, but if you want to lie or mislead people, the best chart to use is the pie chart. I’m going to show you how to distort reality with pie charts, not so you can be a liar, but so you know never to use pie charts and to choose more honest visualizations.

Let's start with the one positive thing I know about pie charts: they're called camembert charts in France and cake charts in Germany. On balance, I prefer the French term, but we're probably stuck with the English term. Unlike camembert, pie charts often leave a bad taste in my mouth and I'll show you why.

(Camembert cheese - image credit: Coyau, Wikipedia - license : Creative Commons)

Take a look at the pie chart below. Can you put the six slices in order from largest to smallest? What percentages do you think the slices represent?

Here’s how I’ve misled you:

Offset the slices from the 12 o’clock position to make size comparison harder. I've robbed you of the convenient 'clock face' frame of reference.
Not put the slices in order (largest to smallest). Humans are bad at judging the relative sizes of areas and by playing with the order, I'm making it even harder.
Not labeled the slices. This ought to be standard practice, but shockingly often isn't.

The actual percentages are:

Gray	20.9
Green	17.5
Light blue	16.8
Dark blue	16.1
Yellow	15.4
Orange	13.3

How close were you? How good was my attempt to deceive you?

Let’s use a bar chart to represent the same data.

Simple, clear, unambiguous.

I've read guidance that suggests you should only use a pie chart if you're showing two quantities that are obviously unequal. This gives the so-called pac-man pie charts. Even here, I think there are better representations, and our old-friend the bar chart would work better (albeit less interestingly).

Now let’s look at the king of deceptive practices, the 3d pie chart. This one is great because you can thoroughly mislead while still claiming to be honest. I’m going to work through a short deceptive example.

Let’s imagine there are four political parties standing in an election. The percentage results are below.

Dog	36
Cat	28
Mouse	21
Bird	15

You work for Bird, which unfortunately got the lowest share of the vote. Your job is to deceive the electorate into thinking Bird did much better than they did.

You can obscure the result by showing it as a pie chart without number labels. You can even mute the opposition colors to fool the eye. But you can go one better. You can create a 3d pie chart with shifted perspective and 'point explosion' using the data I gave above like so.

Here's what I did to create the chart:

Took the data above as my starting point and created a pie chart.
Rotated the chart so my slice was at the bottom.
Made the pie chart 3d.
Changed the perspective to emphasize my party.
Used 'point explosion' to pull my slice out of the main body of the chart to emphasize it even more.
Used shading.

This now makes it look like Bird was a serious contender in the election. The fraction of the chart area taken up with the Bird party’s color is completely disproportionate to their voter share. But you can claim honesty because the slice is still the correct proportion if the chart was viewed from above. If challenged, you can turn it into a technical/academic debate about data visualization that will turn off most people and make your opponents sound like they’re nit-picking.

You don’t have to go this far to mislead with a pie chart. All you have to do is increase the cognitive burden to interpret a chart. Some, maybe even all, of your audience might not spot what you’re trying to hide because they’re in a hurry. You can mislead some of your audience all of the time.

I want to be clear, I'm telling you about these deceptive practices so you can avoid them. There are good reasons why honest analysts don’t use pie charts. In fact, I would go one stage further; if you see a pie chart, be on your guard against dishonesty. As one of my colleagues used to say, ‘friends don’t let friends use pie charts’.

Tuesday, January 28, 2020

Future directions for Python visualization software

The Python charting ecosystem is highly fragmented and still lags behind R, it also lacks some of the features of paid-for BI tools like Tableau or Qlik. However, things are slowly changing and the situation may be much better in a few years' time.

(Image credit: Alvesgaspar - Wikimedia Commons - license)

Theoretically, the ‘grammar of graphics’ approach has been a substantial influence on visualization software. The concept was introduced in 1999 by Leland Wilkinson in a landmark book and gained widespread attention through Hadley Wickham’s development of ggplot2 The core idea is that a visualization can be represented as different layers within a framework, with rules governing the relationship between layers.

Bokeh was influenced by the 'grammar of graphics' concept as were other Python charting libraries. The Vega project seeks to take the idea of the grammar of graphics further and creates a grammar to specify visualizations independent of the visualization backend module. Building on Vega, the Altair project is a visualization library that offers a different approach from Bokeh to build charts. It’s clear that the grammar of graphics approach has become central to Python charting software.

If the legion of charting libraries is a negative, the fact that they are (mostly) built on the same ideas offers some hope for the future. There’s a movement to convergence by providing an abstraction layer above the individual libraries like Bokeh or Matplotlib. In the Python world, there’s precedence for this; the database API provides an abstraction layer above the various Python database libraries. Currently, the Panel project and HoloViews are offering abstraction layers for visualization, though there are discussions of a more unified approach.

My take is, the Python world is suffering from having a confusing array of charting library choices which splits the available open-source development efforts across too many projects, and of course, it confuses users. The effort to provide higher-level abstractions is a good idea and will probably result in fewer underlying charting libraries, however, stable and reliable abstraction libraries are probably a few years off. If you have to produce results today, you’re left with choosing a library now.

The big gap between Python and BI tools like Tableau and Qlik is the ease of deployment and speed of development. BI tools reduce the skill level to build apps, deploy them to servers, and manage tasks like access control. Projects like Holoviews may evolve to make chart building easier, but there are still no good, easy, and automated deployment solutions. However, some of the component parts for easier deployment exist, for example, Docker, and it’s not hard to imagine the open-source community moving its attention to deployment and management once the various widget and charting issues of visualization have been solved.

Will the Python ecosystem evolve to be as good as R’s and be good enough to take on BI tools? Probably, but not for a few years. In my view, this evolution will happen slowly and in public (e.g. talks at PyCon, SciPy etc.). The good news for developers is, there will be plenty of time to adapt to these changes.

Saturday, January 25, 2020

How to lie with statistics

I recently re-read Darrell Huff's classic text from 1954, 'How to lie with statistics'. In case you haven't read it, the book takes a number of deceitful statistical tricks of the trade and explains how they work and how to defend yourself from being hoodwinked. My overwhelming thought was 'plus ça change'; the more things change, the more they remain the same. The statistical tricks people used to mislead 50 years ago are still being used today.

(Image credit: Wikipedia)

Huff discusses surveys and how very common methodology flaws can produce completely misleading results. His discussion of sampling methodologies, and the problems with them, are clear and unfortunately, still relevant. Making your sample representative is a perennial problem as the polling for the 2016 Presidential election showed. Huff's bias comments especially rang true to me, given my prior experiences as a market research street interviewer. In my experience, even people with a very good statistical education aren't aware of survey flaws and sources of bias.

The chapter on averages still holds up. Huff shows how the mean can be distorted and why the median might be a better choice. I've interviewed people with Master's degrees in statistics who couldn't explain why the median might be a better choice of average than the mean, so I guess there's still a need for the lesson.

One area where I think things have moved in the right direction is the decreasing use of some types of misleading charts. Huff discusses the use of images to convey quantitative information. He shows a chart where steel production was represented by images of a blast furnace (see below). The increase in production was 50%, but because the height and width were both increased, the area consumed by the images increases by 150%, giving the overall impression of a 150% increase in production¹. I used to see a lot of these types of image-based charts, but their use has declined over the years. It would be nice to think Huff had some effect.

(Image credit: How to lie with statistics)

Staying with charts, his discussion about selecting axis ranges to mislead still holds true and there are numerous examples of people using this technique to mislead every day. I might write a blog post about this at some point.

He has chapters on the post hoc fallacy (confusing correlation and causation) and has a nice explanation of how percentages are regularly mishandled. His discussion of general statistical deceitfulness is clear and still relevant.

Unfortunately, the book hasn't aged very well in other aspects. 2020 readers will find his language sexist, the jokey drawings of a smoking baby are jarring, and his roundabout discussion of the Kinsey Reports feels odd. Even the writing style is out of date.

Huff himself is tainted; he was funded by the tobacco industry to speak out against smoking as a cause of cancer. He even wrote a follow-up book, How to lie with smoking statistics to debunk anti-smoking data. Unfortunately, his source of authority was the widespread success of How to lie with statistics. How to lie with smoking statistics isn't available commercially anymore, but you can read about it on Alex Reinhart's page.

Despite all its flaws, I recommend you read this book. It's a quick read and it'll give you a grounding in many of the problems of statistical analysis. If you're a business person, I strongly recommend it - its lessons about cautiously interpreting analysis still hold.

Correction

[1] Colin Warwick pointed out an error in my original text. My original text stated the height and width of the second chart increased by 50%. That's not quite what Huff said. I've corrected my post.

Wednesday, January 22, 2020

The Python plotting ecosystem

Python’s advance as a data processing language had been hampered by its lack of good quality chart visualization libraries, especially compared to R with its ggplot2 and Shiny packages. By any measure, ggplot2 is a superb package that can produce stunning, publication-quality charts. R’s advance has also been helped by Shiny, a package that enables users to build web apps from R, in effect allowing developers to create Business Intelligence (BI) apps. Beyond the analytics world, the D3 visualization library in JavaScript has had an impact on more than the JavaScript community; it provides an outstanding example of what you can do with graphics in the browser (if you get time check out some of the great D3 visualization examples). Compared to D3, ggplot2, and Shiny, Python’s visualization options still lag behind, though things have evolved in the last few years.

(An example Bokeh application. Multi-tabbed, widgets, chart.)

Matplotlib is the granddaddy of chart visualization in Python, it offers most of the functionality you might want and is available with almost every Python distribution. Unfortunately, its longevity is also its problem. Matplotlib was originally based on MATLAB’s charting features, which were in turn developed in the 1980’s. Matplotlib's longevity has left it with an awkward interface and some substantially out-of-date defaults. In recent years, the Matplotlib team has updated some of their visual defaults and offered new templates that make Matplotlib charts less old-fashioned, however, the library is still oriented towards non-interactive charts and its interface still leaves much to be desired.

Seaborn sits on top of Matplotlib and provides a much more up-to-date interface and visualization defaults. If all you need is a non-interactive plot, Seaborn may well be a good option; you can produce high-quality plots in a rational way and there are many good tutorials out there.

Plotly provides static chart visualizations too, but goes a step further and offers interactivity and the ability to build apps. There are some great examples of Plotly visualizations and apps on the web. However, Plotly is a paid-for solution; you can do most of what you want with the free tier, but you may run into cases where you need to purchase additional services or features.

Altair is another plotting library for Python based on the 'grammar of graphics’ concept and the Vega project. Altair has some good features, but in my view, it isn’t as complete as Bokeh for business analytics.

Bokeh is an ambitious attempt to offer D3 and ggplot2-like charts plus interactivity, with visualizations rendered in a browser, all in an open-source and free project. Interactivity here means having tools to zoom into a chart or move around in the chart, and it means the ability to create (browser-based) apps with widgets (like dropdown menus) similar to Shiny. It’s possible to create chart-based applications and deploy them via a web server, all within the Bokeh framework. The downside is, the library is under active development and therefore still changing; some applications I developed a year ago no longer work properly with the latest versions. Having said all that, Bokeh is robust enough for commercial use today, which is why I’ve chosen it for most of my visualization work.

Holoviews sits on top of Bokeh (and other plotting engines) and offers a higher-level interface to build charts using less coding.

It’s very apparent that the browser is becoming the default visualization vehicle for Python. This means I need to mention Flask, a web framework for Python. Although Bokeh has a lot of functionality for building apps, if you want to build a web application that has forms and other typical features of web applications, you should use Flask and embed your Bokeh charts within it.

If you’re confused by the various plotting options, you’re not alone. Making sense of the Python visualization ecosystem can be very hard, and it can be even harder to choose a visualization library. I looked at the various options and chose Bokeh because I work in business and it offered a more complete and reliable solution for my business needs. In a future blog post, I'll give my view of where things are going for Python visualization.

Saturday, January 11, 2020

Choropleth maps - pretty but misleading

Pretty, but misleading

You see choropleth maps everywhere: on websites, on the TV news, and in applications. They’re very pretty and they appeal to our sense of geography, but they can be horribly misleading. I’m going to show you why that’s the case and show you ways designers have sought to get around their problems.

What's a choropleth map?

A choropleth map is a geographic map with regions colored according to some criteria. A great example is election maps used in US Presidential elections. Each of the states is colored according to the party that won the state. Here’s an example result from a US Presidential election. Can you see what the problem is?

(Image Credit: adapted from Wikipedia.)

Looking at the map, who do you think won the election (I’m deliberately not telling you which election)? Do you think this election was a close one?

The trouble is, the US population density varies considerably from state to state, as does the number of Electoral College votes. In 2020, Rhode Island will have 4 Electoral College votes compared to Montana’s 3, but Montana is 120 times larger on the map. If you just glance at most US Presidential election choropleth maps, it looks like the Republican candidate won, even when he didn’t. That's because geographically large rural states are mostly Republican and have relatively low populations and hence fewer Electoral College votes, so the election choropleth map looks mostly red. Our natural tendency is to assume more ink = more important, but the choropleth map breaks this relationship giving a misleading representation.

By the way, the map I showed you is from the 1976 election, in which Jimmy Carter won 267 Electoral College votes to Gerard Ford's 240 (24 states to 27). Did you get who won from the map? Did you get the size of the victory?

Let’s take another example, the 2014 Scottish Independence referendum. Here’s the result by council area (local government area), pink for remain, green for independence. In this case, the remainers won, but what do you think the margin was? Was it close?

(Image credit: adapted from Wikipedia entry: https://commons.wikimedia.org/wiki/File:Scottish_independence_referendum_results.svg)

Despite the choropleth map’s overwhelming remain coloring, the actual result was 55%-45%. It looks like an overwhelming remain victory because the Scottish population is concentrated in a few areas. As in the US, there are large rural areas with few people that take up large amounts of chart space, exaggerating their level of importance.

Cartograms

Can we somehow represent data using some kind of map in a more proportional way? Cartograms distort the underlying geography to better represent some underlying variable. There’s a great example on the Geographical.co.uk website for the UK 2019 general election.

(Image credit: geographical.co.uk - taken from the article: http://geographical.co.uk/geopolitics/geopolitics/item/3557-ukge2019)

The chart on the left is a straight-up choropleth map of the UK by parliamentary constituency - note that constituencies have roughly equal populations but can have very different geographical areas. Blue is a Conservative parliamentary seat, yellow is SNP, orange is Liberal Democrat and red is Labour. The result was a decisive Conservative victory, but the choropleth map makes it look like a wipeout, which it wasn't. The middle chart makes each parliamentary constituency the same size on the page (actually hexagons for reasons I won't go into). Rural areas are shrunk and urban areas (especially London) are greatly expanded. The cartogram shows a much fairer, and in my view, more reasonable representation of the result. The chart on the right is a 'gridded population cartogram' that sizes each constituency by the number of people living in it. There are several of these maps on the web, but frankly, I've never been able to make much sense of them.

Cartograms in the house of mirrors

Mark Newman at the University of Michigan has a site presenting the 2016 US Presidential election as cartograms which is worth a look. Here's his map scaling the states to their Electoral College votes. Because the proportion of red and blue ink follows the Electoral College votes, the cartogram gives a fairer representation of the result (I find this representation easier to understand than the 2019 UK election result in the third chart above).

(Image credit: Mark Newman, University of Michigan.)

An alternative approach is to use hexagons to represent the result:

(Image credit: ESRI UK & Ireland: https://communityhub.esriuk.com/geoxchange/2016/11/1/us-election-2016-battle-of-the-maps)

The hexagon representation has become much more popular in recent years, leading to designers calling this kind of chart a hexagram. As you might have guessed, on the whole, I prefer this type of cartogram.

All this is great in theory, but in practice there are problems. Hexagrams are great, but they're still unfamiliar to many users and might require explanation. They can also distort geography, reducing the display's usefulness, for example, can you easily identify Utah on the 2016 US Presidential hexagram above? Most packages and modules that display data don't yet come with out-of-the-box cartograms, meaning developers have to create something from scratch, which takes a lot more effort.

What I do

Here's my approach: use choropleth maps for planning, and use hexagrams or other charts to represent quantitative results. Planning usually involves some sense of geography, for example, territory allocation in sales, in this case, a choropleth map can be useful because of its close ties to the underlying geography. To represent quantitative information (like election results), I prefer bar charts or other traditional charts. If you want something that's more geographical, I recommend some form of hexagram, but with the warning that you might have to build it yourself which can be very time-consuming.

Finding out more

Danny Dorling has written extensively about cartograms and I recommend his website: http://www.dannydorling.org/
The WorldMapper website presents lots of examples of cartograms using social and political data: https://worldmapper.org/