Tuesday, January 28, 2020
Theoretically, the ‘grammar of graphics’ approach has been a substantial influence on visualization software. The concept was introduced in 1999 by Leland Wilkinson in a landmark book and gained widespread attention through Hadley Wickham’s development of ggplot2. The core idea is that a visualization can be represented as different layers within a framework, with rules governing the relationship between layers. In turn, Bokeh was influenced by the ‘grammar of graphics’ concept, as were other Python charting libraries. The Vega project seeks to take the idea further by creating a grammar that specifies visualizations independently of the visualization backend. Building on Vega, the Altair project is a visualization library that offers a different approach from Bokeh to building charts. It’s clear that the grammar of graphics approach has become central to Python charting software.
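To make the idea of a backend-independent grammar concrete, here is a minimal sketch of a chart described as data rather than as drawing commands. It is a plain Python dictionary written in the style of a Vega-Lite specification (Vega-Lite is the higher-level grammar in the Vega project). The field names body_mass and brain_mass and the data values are made up for illustration.

```python
import json

# Illustrative data only; the values and field names are assumptions.
rows = [
    {"body_mass": 1.0, "brain_mass": 0.01},
    {"body_mass": 10.0, "brain_mass": 0.08},
    {"body_mass": 100.0, "brain_mass": 0.5},
]

# A chart described declaratively: the data, a mark (the geometry to draw),
# and encodings that map data fields to visual channels.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "body_mass", "type": "quantitative", "scale": {"type": "log"}},
        "y": {"field": "brain_mass", "type": "quantitative", "scale": {"type": "log"}},
    },
}

# The specification is just JSON, so nothing here depends on a plotting backend.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Swapping "point" for another mark, or adding another encoding channel such as color, changes the chart without touching any rendering code, which is the appeal of the approach.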
If the legion of charting libraries is a negative, the fact that they are (mostly) built on the same ideas offers some hope for the future. There’s a movement toward convergence by providing an abstraction layer above individual libraries like Bokeh or Matplotlib. In the Python world, there’s a precedent for this: the database API, which provides an abstraction layer above the various Python database libraries. Currently, the Panel project and HoloViews offer abstraction layers for visualization, though there are discussions of a more unified approach.
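For readers who haven't met the database API, here is a minimal sketch of what that kind of abstraction buys you. The helper count_rows is a hypothetical name of my own; it uses only the cursor calls defined by the DB-API (PEP 249), so in principle it should work with a connection from any compliant driver. I've used sqlite3 because it ships with Python.

```python
import sqlite3


def count_rows(connection, table_name):
    """Count rows using only calls defined by the DB-API (PEP 249)."""
    cursor = connection.cursor()
    try:
        # table_name is trusted here; real code should not build SQL from user input.
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        (count,) = cursor.fetchone()
        return count
    finally:
        cursor.close()


# Set up a throwaway in-memory database to try the helper on.
connection = sqlite3.connect(":memory:")
# connection.execute and executemany are sqlite3 shortcuts, not part of PEP 249.
connection.execute("CREATE TABLE charts (name TEXT)")
connection.executemany(
    "INSERT INTO charts VALUES (?)", [("bar",), ("line",), ("scatter",)]
)
connection.commit()

row_count = count_rows(connection, "charts")
print(row_count)  # 3
connection.close()
```

The visualization abstraction layers are aiming for something similar: code written against the common layer, with the underlying library swappable.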
My take is that the Python world is suffering from a confusing array of charting libraries, which splits the available open-source development effort across too many projects and, of course, confuses users. The effort to provide higher-level abstractions is a good idea and will probably result in fewer underlying charting libraries; however, stable and reliable abstraction libraries are probably a few years off. If you have to produce results today, you have to choose a library now.
The big gap between Python and BI tools like Tableau and Qlik is ease of deployment and speed of development. BI tools reduce the skill level needed to build apps, to deploy them to servers, and to manage tasks like access control. Projects like HoloViews may evolve to make chart building easier, but there are still no good, easy, automated deployment solutions. However, some of the component parts for easier deployment exist (Docker, for example), and it’s not hard to imagine the open-source community moving its attention to deployment and management once the various widget and charting issues of visualization have been solved.
Will the Python ecosystem evolve to be as good as R’s and be good enough to take on BI tools? Probably, but not for a few years. In my view, this evolution will happen slowly and in public (e.g. talks at PyCon, SciPy etc.). The good news for developers is, there will be plenty of time to adapt to these changes.
Saturday, January 25, 2020
I remember finding his wonderful book, "The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail", on the shelf at City University in London and devouring it on the way home on the tube. I only recommend a handful of business books and Christensen's book is number one on my list.
Christensen was fascinated by technology disruption, especially how industry leaders are caught out by change and why they don't respond in time. The two big examples in his book are steam shovels and disk drives.
The hydraulic shovel was a cheap and cheerful innovation that didn't seem to threaten the incumbent steam shovel makers. If anything, it was good for their profitability because hydraulic technology addressed the low end of the market where margins were poor, leaving the high-end more profitable market niches to steam shovels. Unfortunately for the steam shovel makers, the hydraulic shovel makers kept on innovating and pushed their technology into more and more niches. In the end, the steam shovel makers were relegated to just a few small niches and it was too late to respond.
The disk drive industry is Christensen's most powerful example. There have been successive waves of innovation in this industry as drives increased in capacity and reduced in size. The leaders in one wave were mostly not the leaders in subsequent waves. Christensen's insight was that the same factors were at work in the disk drive industry as in the steam shovel industry. Incumbents wanted to maximize profitability and saw new technologies coming in at the bottom end of the market where margins were low. They dismissed these innovations as low-end technologies that didn't compete with their more profitable business. They didn't respond in time as the disruptive technologies increased their capabilities and started to compete with them.
Based on these examples, Christensen teases out why incumbents didn't respond in time and what companies can do to not be caught out by these kinds of disruptive innovations.
Company culture is obviously a factor here, and Christensen poked into this more with a very accessible 2000 paper where he and Overdorf discussed the factors that can lead to a company culture that ignores disruptive innovation.
I never met Christensen, but I heard a lot of good things about him as a person. My condolences to his family.
"The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail" - Clayton Christensen
"Meeting the Challenge of Disruptive Change", Clayton M. Christensen and Michael Overdorf, Harvard Business Review, https://hbr.org/2000/03/meeting-the-challenge-of-disruptive-change
Huff discusses surveys and how very common methodology flaws can produce completely misleading results. His discussion of sampling methodologies and their problems is clear and, unfortunately, still relevant. Making your sample representative is a perennial problem, as the 2016 Presidential election showed. Years ago, I was a market researcher conducting interviews on the street and Huff's comments on bias rang very true with me - I faced these problems on a daily basis. In my experience, even people with a very good statistical education aren't aware of survey flaws and sources of bias.
The chapter on averages still holds up. Huff shows how the mean can be distorted and why the median might be a better choice. I've interviewed people with Master's degrees in statistics who couldn't explain why the median might be a better choice of average than the mean, so I guess there's still a need for the lesson.
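Here's a minimal sketch of the point, using Python's standard statistics module. The salary figures are hypothetical: nine staff on modest pay and one highly paid owner.

```python
from statistics import mean, median

# Hypothetical salaries: nine staff on modest pay and one highly paid owner.
salaries = [30_000] * 9 + [500_000]

mean_salary = mean(salaries)
median_salary = median(salaries)

# The single large value drags the mean far above what most people earn;
# the median is unaffected by it.
print(mean_salary, median_salary)
```

The mean comes out at 77,000, a salary nobody in the group earns and more than double what nine out of ten of them earn, while the median stays at 30,000.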
One area where I think things have moved in the right direction is the decreasing use of some types of misleading charts. Huff discusses the use of images to convey quantitative information. He shows a chart where steel production was represented by images of a blast furnace (see below). The increase in production was 50%, but because the height and width were both increased, the area consumed by the images increased by 150%, giving the overall impression of a 150% increase in production [1]. I used to see a lot of these types of image-based charts, but their use has declined over the years. It would be nice to think Huff had some effect.
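The arithmetic behind this kind of distortion is easy to check: area grows with the product of the height and width scale factors. This is a minimal sketch with illustrative scale factors of my own; they are not the figures from Huff's chart.

```python
def area_increase_percent(height_scale, width_scale):
    """Percentage increase in picture area when each dimension is scaled."""
    return (height_scale * width_scale - 1) * 100


# An honest bar: only the height grows by 50%.
bar_only = area_increase_percent(1.5, 1.0)

# A scaled picture: height and width both grow by 50%.
scaled_picture = area_increase_percent(1.5, 1.5)

print(bar_only, scaled_picture)  # 50.0 125.0
```

With a plain bar the ink grows in step with the data. Scale a picture in both directions and the ink grows much faster than the number it is supposed to represent.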
Staying with charts, his discussion about selecting axis ranges to mislead still holds true and there are numerous examples of people using this technique to mislead every day. I might write a blog post about this at some point.
He has chapters on the post hoc fallacy (confusing correlation and causation) and has a nice explanation of how percentages are regularly mishandled. His discussion of general statistical deceitfulness is clear and still relevant.
Unfortunately, the book hasn't aged very well in other aspects. 2020 readers will find his language sexist, the jokey drawings of a smoking baby are jarring, and his roundabout discussion of the Kinsey Reports feels odd. Even the writing style is out of date.
Huff himself is tainted; he was funded by the tobacco industry to speak out against smoking as a cause of cancer. He even wrote a follow-up book, How to Lie with Smoking Statistics, to debunk anti-smoking data. Unfortunately, his source of authority was the widespread success of this book. How to Lie with Smoking Statistics isn't available commercially anymore, but you can read about it on Alex Reinhart's page.
Despite all its flaws, I recommend you read this book. It's a quick read and it'll give you a grounding in many of the problems of statistical analysis. If you're a business person, I strongly recommend it - its lessons about cautiously interpreting analysis still hold.
This is a flawed book by a flawed author but it still has a lot of value. I couldn't help thinking that the time is probably right for a new popular book on how people are lying and misleading you using charts and statistics.
[1] Colin Warwick pointed out an error in my original text. My original text stated the height and width of the second chart increased by 50%. That's not quite what Huff said. I've corrected my post.
Wednesday, January 22, 2020
Matplotlib is the granddaddy of chart visualization in Python. It offers most of the functionality you might want and is available with almost every Python distribution. Unfortunately, its longevity is also its problem. Matplotlib was originally based on MATLAB’s charting features, which were in turn developed in the 1980s. Matplotlib's longevity has left it with an awkward interface and some substantially out-of-date defaults. In recent years, the Matplotlib team has updated some of their visual defaults and offered new templates that make Matplotlib charts less old-fashioned; however, the library is still oriented towards non-interactive charts and its interface still leaves much to be desired.
Seaborn sits on top of Matplotlib and provides a much more up-to-date interface and visualization defaults. If all you need is a non-interactive plot, Seaborn may well be a good option; you can produce high-quality plots in a rational way and there are many good tutorials out there.
Plotly provides static chart visualizations too, but goes a step further and offers interactivity and the ability to build apps. There are some great examples of Plotly visualizations and apps on the web. However, Plotly is a paid-for solution; you can do most of what you want with the free tier, but you may run into cases where you need to purchase additional services or features.
Altair is another plotting library for Python based on the ‘grammar of graphics’ concept and the Vega project. Altair has some good features, but in my view, it isn’t as complete as Bokeh for business analytics.
Bokeh is an ambitious attempt to offer D3 and ggplot2-like charts plus interactivity, with visualizations rendered in a browser, all in an open-source and free project. Interactivity here means having tools to zoom into a chart or move around in the chart, and it means the ability to create (browser-based) apps with widgets (like dropdown menus) similar to Shiny. It’s possible to create chart-based applications and deploy them via a web server, all within the Bokeh framework. The downside is, the library is under active development and therefore still changing; some applications I developed a year ago no longer work properly with the latest versions. Having said all that, Bokeh is robust enough for commercial use today, which is why I’ve chosen it for most of my visualization work.
HoloViews sits on top of Bokeh (and other plotting engines) and offers a higher-level interface for building charts with less code.
It’s very apparent that the browser is becoming the default visualization vehicle for Python. This means I need to mention Flask, a web framework for Python. Although Bokeh has a lot of functionality for building apps, if you want to build a web application that has forms and other typical features of web applications, you should use Flask and embed your Bokeh charts within it.
If you’re confused by the various plotting options, you’re not alone. Making sense of the Python visualization ecosystem can be very hard, and it can be even harder to choose a visualization library. I looked at the various options and chose Bokeh because I work in business and it offered a more complete and reliable solution for my business needs. In a future blog post, I'll give my view of where things are going for Python visualization.
Saturday, January 18, 2020
Over the last few years, I’ve done a lot of hiring across many disciplines: analytics, data science, product management, engineering, sales, and HR. I’ve learned a lot about what makes a good candidate and what makes a bad candidate. Because I’m writing this blog for people interested in analytics and data science, I'm going to share some of the things that I think are likely to improve your chances of getting hired for technical positions.
Hiring is risky
The key thing to remember is that hiring is a tremendously risky process for the employer. It’s very painful to unwind a poor hiring decision, so for the most part, the interview team is not inclined to take risks. You have to satisfy the technical requirements for the job, but the social requirements too. The interview team will be deciding whether or not you’re a fit for the team - can they work with you? There are all kinds of clues they use to decide this and I’ll cover some of them here.
Candidates make amazing blunders with resumes. I’ve seen odd layouts, poor wording, and incredibly long resumes (15+ pages in one case). Here are some simple rules:
- Length: one page if you’re junior, two pages (at most) if you’re senior.
- Layout: single-column layouts - keep it simple.
- Keywords: your resume should use every relevant keyword as many times as it makes sense. For example, if you have machine learning experience, use the term. Resumes are often keyword screened and if you don’t have the keywords you’ll be ruled out by a machine.
- Contact details: name, city, phone number, email. I always give local candidates preference, but I have to know you’re local.
Your resume gives clues to how well prepared you are (back to the risk thing). A bad resume indicates you haven’t taken advice, or you don’t care, or you’re naive - none of which are good. There are plenty of good resources out there for building resumes. Northeastern University does an incredible job preparing its candidates for work, including some great coaching on resume building. They have an excellent website on resumes with lots of strong guidance.
One great piece of advice I’ve heard is to customize your resume for the employer or industry you’re targeting. Some candidates are considering different employment areas but they have a single resume they’re trying to use for everything. You should have a different resume focusing on different areas for each industry you're targeting. If you have time, you should tweak your resume for each employer. Remember that customization is as much about what you leave out as what you leave in. For example, if you have wet bench experience but you’re applying for computing positions, you should shorten (or remove) your wet bench sections and increase the length of your software sections. The logic here is simple: you have limited space, so why tell an employer about something irrelevant to them? For me, there’s a minor exception - I do like candidates with something unusual about them, but a single resume line is usually enough (e.g. ‘wet bench qualifications’, ‘EMT qualified’).
I love it when candidates have a GitHub page they put on their resume. If they pass the screening interview, I check out their page and what they’ve done. You do need to be careful though; I’ve seen some bad code that’s put me off a candidate. GitHub is especially great if you’re trying to do some kind of career transition into analytics or data science from some other field. If you’re transitioning, you can’t talk about what you’ve done in your current role as proof of your capabilities, but you can talk about the GitHub projects you’ve created in your own time. In fact, creating a project in your own time to display your work shows a tremendous amount of commitment. If you have projects to put up on GitHub, do so; it’s a great place to demonstrate your talent.
Be prepared - and turn your camera on
If you haven’t interviewed in a while, it’s a good idea to reach out to your connections and ask for a practice interview. You could also ask your friends for a review of your resume. Of course, you should remember that if people help you, in turn, you should help people.
For heaven’s sake, be technically prepared for the interview. Nowadays, many interviews are conducted via a computer video call (e.g. Skype, Zoom, etc.). There’s almost always software to download and install. Make sure you have the software installed and running before the call. I interviewed someone for a management position who took 20 minutes to download the software and get into the interview. Not good when you’re interviewing for a position that requires experience and forward planning!
For video interviews, I have two pieces of advice: turn your camera on and consider where you do the interview. It’s a video interview for a reason and it looks odd if you don’t turn your camera on. I was once told that the reason why a candidate didn’t turn their camera on was that they’d had an unusual hair treatment just before the call. Your hair is your business, but why not schedule the call for some other time? You also need to consider your background; what will the interviewer see? I was once interviewed by someone from a hotel bedroom with their underwear strewn everywhere in the frame - it didn’t create a professional impression. One candidate I interviewed had their laptop on their knees for the interview; every time they moved the entire video frame heaved like a ship in a storm and by the end of the interview, I felt seasick. Try to avoid distracting locations and distracting items in the frame.
Preparation also means understanding who will interview you and what the interview will cover. I’ve interviewed candidates who were surprised to be asked technical questions when the interview briefing clearly said that would happen. Of course, you must look up everyone on LinkedIn beforehand and know their roles - you might even get insight into the questions they might ask.
Examples and questions
A few years ago I did a course on behavior-based interviewing. There were lots of great pieces in the course but it can be boiled down to one simple idea: give examples for everything you claim. For example, if you claim to be a good planner, give examples of how you planned well; if you claim to know Python, point to examples (e.g. GitHub); and so on. The idea is you’re providing proof - doing is better than saying.
Make sure you have plenty of questions for each interviewer. It shows you’re prepared and engaged, and of course, you might learn something useful. It’s also expected. If you can, get every interviewer’s email address; you’ll need it later.
At the end
When it’s all over, send a thank you email to everyone who interviewed you. For any kind of customer-facing role, this is expected and it’s increasingly expected for technical roles too.
If you don’t get the job, there’s one last thing you can do. If you got on well with the interview team, ask for feedback. Not every interview team will do it, but some will and you can learn a lot from them about why you didn’t get the job.
Bear in mind that a lot of what I’ve said is about reducing risk for the employer in choosing you. Being prepared for the interview (software download, video call background, interview questions, etc.) shows you take it all seriously and gives clues to what you’ll be like as an employee. Asking questions at the interview and thanking everyone shows you know about social conventions and could be a good fit for the team.
Thursday, January 16, 2020
Correlation is not causation
Because they’ve misunderstood one of the main rules of statistical evidence, I’ve seen people make serious business mistakes and damage their careers. The rule is simple but subtle: correlation is not causation. I’m going to explain what this means and show you cases where it’s obviously true, and some cases where it’s less obvious. Let’s start with some definitions.
Clearly, causation means one thing causes another. For example, prolonged exposure to ultraviolet light causes sunburn, the Vibrio cholerae bacteria causes cholera, and recessions cause bankruptcies.
What is correlation?
Correlation occurs when two things vary in the same way. For example, lung cancer rates vary with the level of smoking, commuting times vary with the state of the economy, and health and longevity are correlated with income and wealth. The relationship usually becomes clear when we plot the data out, but it’s very rarely perfect. To give you a sense of what I mean, I’ve taken the relationship between brain mass and body mass in mammals and plotted the data below; each dot is a different type of mammal [Rogel-Salazar].
The straight line on the chart is a fit to the data. As you can see, there’s a relationship between brain and body mass but the dots are spread. We measure how well two things are correlated with something called the correlation coefficient, r. The closer r is to 1 (or -1), the better the correlation (this is a gross simplification). I typically look for r to be 0.8 or above (or -0.8 or below). For the brain and body data above, r is 0.89, so the correlation is ‘good’.
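If you want to see what goes into r, here is a minimal sketch of the standard (Pearson) correlation coefficient in plain Python. The function name pearson_r and the sample data are my own, made up for illustration; this is not the mammal data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, leaving out the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data: y rises with x, but not perfectly.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
r = pearson_r(x, y)
print(round(r, 3))
```

A perfectly straight rising line gives r = 1, a perfectly straight falling line gives r = -1, and a shapeless cloud of points gives a value near 0.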
For causation to exist, to say that A causes B, we must be able to observe the correlation between A and B. If sunscreen is effective at reducing sunburn we should observe increased sunscreen use leading to reduced sunburn. However, we need more than correlation to prove causation (I’m skipping over details to keep it simple).
Correlation does not imply causation
Here’s the important bit: correlation does not imply causation. Just because two things are correlated does not imply that one causes the other. Two things could be very well correlated and there could be no causal relationship between them at all. There could be a confounding factor that causes both variables to move in the same way. In my view, misunderstanding this is the single biggest problem in data analysis.
The excellent website Spurious Correlations shows the problem in a fun way; I’ve adapted an example from the website to illustrate my point. Here are two variables I've shown varying with time.
Imagine one of the variables was sales revenue and the other was the number of hours of sales effort. The correlation between them is very high (r=0.998). Would you say the amount of sales effort causes the sales revenue? If sales revenue was important to you, would you invest in more sales hours? If I presented this evidence to you in an executive meeting, what would you say?
Actually, I lied to you. The red line is US spending on science, space, and technology and the black line is suicides by hanging, strangulation and suffocation. How can these things be related to each other? Because there’s some other variable or variables both of them depend on, or frankly, just by chance. Think for a minute about what happens as an economy grows: all kinds of expenditure go up; sales of expensive wine go up, and people spend more on their houses. Does that mean sales of expensive wine cause people to spend more on houses?
(On the spurious correlations website there are a whole bunch of other examples, including: divorce rates in Maine correlated with per capita consumption of margarine, total revenue generated by arcades is correlated with the age of Miss America, and letters in the winning word of the Scripps National Spelling Bee are correlated with number of people killed by venomous spiders.)
The chart below shows the relationship between stork pairs and human births for several European locations 1980-1990 [Matthews]. Note r is high at 0.85.
My other (possibly apocryphal) example concerns lice. In Europe in the Middle Ages, lice were considered beneficial (especially for children) because sick people didn’t have as many lice [Zinsser]. Technically, this type of causation mistake is known as the post hoc ergo propter hoc fallacy if you want to look it up.
The causation/correlation problem often rears its ugly head in sales and marketing. Here are two examples I’ve seen, with the details disguised to protect the guilty.
I’ve seen a business analyst present the results of detailed sales data modeling and make recommendations for change based on the correlation/causation confusion. The sales data set was huge and they’d found a large number of correlations in the data (with good r values). They concluded that these correlations were causation, for example, in area X sales scaled with the number of sales reps and they concluded that more reps = more sales. They made a series of recommendations based on their findings. Unfortunately, most of the relationships they found were spurious and most of their recommendations and forecasts were later found to be wrong. The problem was, there were other factors at play that they hadn’t accounted for. It doesn’t matter how complicated the model or how many hours someone has put in, the same rule applies; correlation does not imply causation.
The biggest career blunder I saw was a marketing person claiming that visits to the company website were driving all company revenue. I remember them talking about the correlation and making the causation claim to get more resources for their group. Unfortunately, later on, revenue went down for reasons (genuinely) unrelated to the website. The website wasn’t driving all revenue - it was just one of a number of factors, including the economy and the product. However, their claim to be driving all revenue wasn’t forgotten by the executive team and the marketing person paid the career price.
Here’s what I think you should take away from all this. Just because two things appear to be correlated doesn’t mean there’s causation. In business, we have to make decisions on the basis of limited evidence and that’s OK. What’s not OK is to believe there’s evidence when there isn’t - specifically to infer causation from correlation. Statistics and experience teach us humility. The UK Highway Code has some good advice here: a green light doesn’t mean go, it means ‘proceed with caution’.
[Matthews] ‘Storks Deliver Babies (p=0.008)’, Robert Matthews, Teaching Statistics. Volume 22, Number 2, Summer 2000
[Rogel-Salazar] Rogel-Salazar, Jesus (2015): Mammals Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1565651.v1
[Zinsser] ‘Rats, lice, and history’, Hans Zinsser, Transaction Publishers, London, 2008
Saturday, January 11, 2020
Pretty, but misleading
You see choropleth maps everywhere: on websites, on the TV news, and in applications. They’re very pretty, they appeal to our sense of geography, but they can be horribly misleading. I’m going to show you why that’s the case and show you ways designers have sought to get around their problems.
What's a choropleth map?
A choropleth map is a geographic map with regions colored according to some criteria. A great example is election maps used in US Presidential elections. Each of the states is colored according to the party that won the state. Here’s an example result from a US Presidential election. Can you see what the problem is?
Looking at the map, who do you think won the election (I’m deliberately not telling you which election)? Do you think this election was a close one?
The trouble is, the US population density varies considerably from state to state, as does the number of Electoral College votes. In 2020, Rhode Island will have 4 Electoral College votes compared to Montana’s 3, but Montana is 120 times larger on the map. If you just glance at most US Presidential election choropleth maps, it looks like the Republican candidate won, even when he didn’t. The reason is, the geographically large rural states are mostly Republican but have few Electoral College votes because their populations are relatively low. So the election choropleth map looks mostly red. Our natural tendency is to assume more ink = more important, but the choropleth map breaks this relationship giving a misleading representation.
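To put numbers on the mismatch, here is a minimal sketch comparing each state's share of the votes with its share of the ink. It uses only the figures quoted above: 4 and 3 Electoral College votes, and a map area for Montana 120 times that of Rhode Island. Treating those two states as the whole map is a simplification for illustration.

```python
# Figures from the text: Electoral College votes, and relative map area
# with Rhode Island as 1 unit and Montana 120 times larger.
states = {
    "Rhode Island": {"votes": 4, "map_area": 1},
    "Montana": {"votes": 3, "map_area": 120},
}

total_votes = sum(s["votes"] for s in states.values())
total_area = sum(s["map_area"] for s in states.values())

shares = {}
for name, s in states.items():
    shares[name] = {
        "vote_share": s["votes"] / total_votes,
        "ink_share": s["map_area"] / total_area,
    }
    print(
        f"{name}: {shares[name]['vote_share']:.0%} of votes, "
        f"{shares[name]['ink_share']:.0%} of ink"
    )
```

Between just these two states, Montana takes up about 99% of the ink but has the smaller share of the votes. A choropleth map repeats that distortion for every state.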
By the way, the map I showed you is from the 1976 election, in which Jimmy Carter won 297 Electoral College votes to Gerald Ford's 240 (24 states to 27). Did you get who won from the map? Did you get the size of the victory?
Let’s take another example, the 2014 Scottish Independence referendum. Here’s the result by council area (local government area), pink for remain, green for independence. In this case, the remainers won, but what do you think the margin was? Was it close?
Despite the choropleth map’s overwhelming remain coloring, the actual result was 55%-45%. It looks like an overwhelming remain victory because the Scottish population is concentrated in a few areas. As in the US, there are large rural areas with few people that take up large amounts of chart space, exaggerating their level of importance.
Can we somehow represent data using some kind of map in a more proportional way? Cartograms distort the underlying geography to better represent some underlying variable. There’s a great example on the Geographical.co.uk website for the UK 2019 general election.
Cartograms in the house of mirrors
Mark Newman at the University of Michigan has a site presenting the 2016 US Presidential election as cartograms which is worth a look. Here's his map scaling the states to their Electoral College votes. Because the proportion of red and blue ink follows the Electoral College votes, the cartogram gives a fairer representation of the result (I find this representation easier to understand than the 2019 UK election result in the third chart above).
An alternative approach is to use hexagons to represent the result:
The hexagon representation has become much more popular in recent years, leading to designers calling this kind of chart a hexagram. As you might have guessed, on the whole, I prefer this type of cartogram.
All this is great in theory, but in practice there are problems. Hexagrams are great, but they're still unfamiliar to many users and might require explanation. They can also distort geography, reducing the display's usefulness; for example, can you easily identify Utah on the 2016 US Presidential hexagram above? Most packages and modules that display data don't yet come with out-of-the-box cartograms, meaning developers have to create something from scratch, which takes more effort.
What I do
Here's my approach: use choropleth maps for planning and hexagrams or other charts to represent quantitative results. Planning usually involves some sense of geography, for example, territory allocation in sales. In this case, a choropleth map can be useful because of its close ties to the underlying geography. To represent quantitative information (like election results), I prefer bar charts or other traditional charts. If you want something that's more geographical, I recommend some form of hexagram, but with the warning that you might have to build it yourself, which can be very time-consuming.
Finding out more
Danny Dorling has written extensively about cartograms and I recommend his website: http://www.dannydorling.org/
The WorldMapper website presents lots of examples of cartograms using social and political data: https://worldmapper.org/