Monday, October 19, 2020

Stylish Pandas in the frame

The data can't be right, it's so ugly

Despite what many technical people want to believe, well-presented data is more convincing than badly presented data. Unfortunately, the default way Pandas outputs dataframes as tables is ugly. I'm going to show you how to make Pandas dataframes (tables) very pretty and hopefully more convincing.

(A very attractive panda. Image source: Wikimedia Commons. Author: Christian Mehlführer. License: Creative Commons.)

Ugly Betty

My dataset is the results of the 2019 UK general election: the number of MPs and votes per party. Here's my Pandas dataframe (I've called it parliament for some reason):

                           party  MPs     votes  MPs frac  votes frac
0                   Conservative  365  13966565  0.561538    0.452447
1                         Labour  202  10269076  0.310769    0.332667
2           Scottish Nationalist   48   1242380  0.073846    0.040247
3              Liberal Democrats   11   3696423  0.016923    0.119746
4           Democratic Unionists    8    244128  0.012308    0.007909
5                      Sinn Fein    7    181853  0.010769    0.005891
6                    Plaid Cymru    4    153265  0.006154    0.004965
7   Social Democratic and Labour    2    118737  0.003077    0.003846
8                          Green    1    835579  0.001538    0.027069
9                       Alliance    1    134115  0.001538    0.004345
10                       Speaker    1     26831  0.001538    0.000869
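
If you want to follow along, here's a minimal sketch that reconstructs the dataframe; the fractions are each party's share of the totals in the table above.

import pandas as pd

parties = ["Conservative", "Labour", "Scottish Nationalist",
           "Liberal Democrats", "Democratic Unionists", "Sinn Fein",
           "Plaid Cymru", "Social Democratic and Labour", "Green",
           "Alliance", "Speaker"]
mps = [365, 202, 48, 11, 8, 7, 4, 2, 1, 1, 1]
votes = [13966565, 10269076, 1242380, 3696423, 244128, 181853,
         153265, 118737, 835579, 134115, 26831]

parliament = pd.DataFrame({"party": parties, "MPs": mps, "votes": votes})
# Each party's share of the seats and votes listed above
parliament["MPs frac"] = parliament["MPs"] / parliament["MPs"].sum()
parliament["votes frac"] = parliament["votes"] / parliament["votes"].sum()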

If we output the dataframe to HTML using parliament.to_html(), here's what we get by default. It looks amateurish. Let's make it nicer.

    party                         MPs     votes  MPs frac  votes frac
0   Conservative                  365  13966565  0.561538  0.452447
1   Labour                        202  10269076  0.310769  0.332667
2   Scottish Nationalist           48   1242380  0.073846  0.040247
3   Liberal Democrats              11   3696423  0.016923  0.119746
4   Democratic Unionists            8    244128  0.012308  0.007909
5   Sinn Fein                       7    181853  0.010769  0.005891
6   Plaid Cymru                     4    153265  0.006154  0.004965
7   Social Democratic and Labour    2    118737  0.003077  0.003846
8   Green                           1    835579  0.001538  0.027069
9   Alliance                        1    134115  0.001538  0.004345
10  Speaker                         1     26831  0.001538  0.000869

Adding style

Pandas dataframes have a style property we can use to customize the appearance of the dataframe and its HTML rendering. The style property returns a Styler object, which we use to change the way the data is rendered as an HTML table. I'm going to add style and show you what the rendered HTML looks like.

Precision, thousands, and hiding the index

The fractions of votes and MPs display with six decimal places, which is Pandas' display default. Let's change the fractional numbers to three decimal places, introduce thousand separators for the number of votes, and hide the index. Here's the code to do it:

parliament.style.format(
    {"MPs frac":"{:.3f}",
     "votes frac":"{:.3f}",
     "votes": "{:,}"}
    ).hide_index().render()

In this case, format takes a dict argument: the dict keys are the dataframe column names and the dict values are the formatting instructions. Most Python string formatters work with this method, but some don't; for example, the alignment formatters don't work. Here's what the rest of the code means:

  • {:.3f} rounds the floating point numbers to three decimal places
  • {:,} introduces the thousand separator
  • hide_index hides the index
  • render renders the table as HTML - it returns a string of HTML text

The arguments to format don't have to be a dict, but using a dict makes it easier if you're changing several columns at once.
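
For example, if you're applying the same formatter to several columns, you can pass a single format string and restrict it with subset - a minimal sketch:

parliament.style.format("{:.3f}", subset=["MPs frac", "votes frac"])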

Here's the HTML output from the dict-based code. It's a big improvement, but not quite what we want.

party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001

Column alignment and spacing

Let's right-align the columns and add a bit more spacing between columns.

(parliament.style
     .format(
         {"MPs frac": "{:.3f}",
          "votes frac": "{:.3f}",
          "votes": "{:,}"})
     .set_properties(**{'text-align': 'right',
                        'padding': '0 15px'})
     .hide_index()
     .render())

The set_properties method sets CSS properties on the table's data cells; here, it right-aligns the text and adds horizontal padding.
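
If you only want to style some columns, set_properties also takes a subset argument - a quick sketch:

parliament.style.set_properties(subset=['votes'], **{'text-align': 'right'})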

Here's the output from the alignment and spacing code:


party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001

Colors

The political parties have colors, so it would be nice to show each party's color as a background to its name; that means changing the background colors of the party column. It might also be nice to highlight the maximum value in each column in a light gray. Maybe we can get really clever and add a bar chart representing the number of seats won. Here's the code to do all that:

styles = [dict(selector='.col1', 
               props=[('width', '50px')])]

def colors(value):
    partymap = {'Conservative': 'lightblue',
                'Labour': 'salmon',
                'Scottish Nationalist' : 'yellow',
                'Liberal Democrats': 'orange',
                'Democratic Unionists': 'orange',
                'Sinn Fein': 'lightgreen' ,
                'Plaid Cymru': 'lightgreen',
                'Social Democratic and Labour': 'salmon',
                'Green' : 'lightgreen',
                'Alliance': 'orange',
                'Speaker': 'lightgray'}
    return """background-color: {0}""".format(
        partymap[value])

(parliament.style
     .format(
         {"MPs frac": "{:.3f}",
          "votes frac": "{:.3f}",
          "votes": "{:,}"})
     .set_properties(**{'text-align': 'right',
                        'padding': '0 15px'})
     .bar(subset=['MPs'], color='lightgray')
     .set_table_styles(styles)
     .applymap(colors, subset=['party'])
     .highlight_max(color='#F8F9F9')
     .hide_index()
     .render())

Here's what this code does:

  • bar takes the data in the column and draws a bar chart behind it. It uses the full width of the column and expands the column if necessary, hence my need to restrict the column width to get the table to fit on the Blogger page correctly.
  • The party background colors are created by the applymap method, using the colors function applied to just the party column via the subset argument.
  • The maximum value in each column is highlighted with the built-in highlight_max method, here in a very light gray.
  • The set_table_styles method restricts the width of the MPs column so the page renders on Blogger; it uses a CSS selector to do it, and you could use the same approach for fine-grained formatting with CSS.
  • The subset argument restricts formatting to just the specified columns.

Here's what the final results look like:

party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001

Commentary

It's nice that Pandas has this functionality, and it's nice that it's as extensive as it is, but there's a problem: the way style is implemented is inconsistent and hard to understand. For example, some but not all of the string formatters work, and there are two methods that do very similar things (set_table_styles and set_properties). In practice, getting good results takes more time and effort than it should, and the code looks ungainly too. It is what it is for now.

Next steps

You can do some other clever things with style, like heatmaps or conditional table formatting. You can really make your data output stand out, but be careful: you can go overboard! To find out more, read the Pandas dataframe style documentation.
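
As a quick taste, here's a minimal heatmap sketch using the built-in background_gradient method (the column and colormap choices are just examples):

parliament.style.background_gradient(subset=['votes frac'], cmap='Greens')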

Monday, October 12, 2020

Fundamentally wrong? Using economic data as an election predictor

What were you thinking?

Think back to the last time you voted. Why did you vote the way you did? Here are some popular reasons - how many apply to you?

  • The country's going in the wrong direction, we need something new.
  • My kind of people vote for party X, or my kind of people never vote for party Y.
  • I'm a lifelong party X voter.
  • Candidate X or party X is best suited to running the country right now.
  • Candidate Y or party Y will ruin the country.
  • Candidate X or party X is the best for defense/the economy/my children's education, and that's what's important to me right now.

(Ballot drop box. Image Source: Wikimedia Commons. Author: Paul Sableman. License: Creative Commons.)

Using fundamentals to forecast elections

In political science circles, there's been a movement to use economic data to forecast election results. The idea is that homo economicus is a rational being whose voting behavior depends on his or her economic conditions. If the economy is going well, then incumbents (or incumbent parties) are reelected; if things are going badly, then challengers are elected instead. If this assertion is true, then people will respond rationally and predictably to changing economic circumstances, and if we understand how the economy is changing, we can forecast who will win elections.

Building models based on fundamentals follows a straightforward process:

  1. Choose an economic indicator (e.g. inflation, unemployment, GDP) and see how well it forecasts elections.
  2. Get it wrong for an election.
  3. Add another economic indicator so the model correctly predicts the election it got wrong.
  4. Get it wrong for an election.
  5. Either re-adjust the model weights or go to 3.

These models can get very sophisticated. In the United States, some of the models include state-level data and make state-level forecasts of results.

What happens in practice

Two University of Colorado professors, Berry and Bickers, followed this approach to forecast the 2012 presidential election. They very carefully analyzed elections back to 1980 using state-level economic data. Their model was detailed and thorough, and they helpfully included various statistical metrics to guide the reader in understanding the model's uncertainties. Their forecast was very clear: Romney would win 330 electoral college votes - a very strong victory. As a result, they became darlings of the Republican party.

Unfortunately for them, things didn't work out that way. The actual result was 332 electoral college votes for Obama and 206 for Romney, an almost complete reversal of their forecast.

In a subsequent follow-up (much shorter than their original paper), the professors argued in essence that although the economy had performed poorly, voters didn't blame Obama for it. In other words, the state of the economy was not a useful indicator for the 2012 election, even considering state-level effects.

This kind of failure is very common for fundamentals. While Nate Silver was at the New York Times, he published a long piece on why and how these models fail. To cut to the chase, there is no evidence voters are homo economicus when it comes to voting. All kinds of factors affect how someone votes, not just economic ones. There are cultural, social class, educational, and many other factors at work.

Why these models fail - post hoc ergo propter hoc and spurious correlations

The post hoc fallacy is to assume that because X follows Y, Y must cause X. In election terms, the fundamentalists assume that an improving or declining economy leads to certain types of election results. However, as we've said, there are many factors that affect voting. Take George Bush's approval rating: in the aftermath of 9/11 it peaked at around 88%, and he won re-election in 2004. Factors other than the economy were clearly at work.

A related phenomenon is spurious correlations which I've blogged about before. Spurious correlations occur when two unrelated phenomena show the same trend and are correlated, but one does not cause the other. Tyler Vigen has a great website that shows many spurious correlations.

Let's imagine you're a political science researcher. You have access to large amounts of economic data and you can direct your graduate students to find more. You can trawl through your data set to find economic or other indicators that correlate with election results. To build your model, you weight each factor differently; for example, inflation might have a weighting of 0.7 and unemployment 0.9. You could even have time-varying weights. You can then test your model against existing election results and publish your forecast for the next election cycle. This process is almost guaranteed to find spurious correlations and produce models that don't forecast very accurately. The sketch below shows why.
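
Here's a toy illustration of the trawling problem: generate enough random 'indicators' and some will correlate strongly with a fixed set of election results by chance alone. All the data here is random and made up.

import numpy as np

rng = np.random.default_rng(0)

# Ten past elections: 1 = incumbent win, 0 = incumbent loss (made-up data)
elections = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])

# 1,000 candidate "economic indicators" - all pure noise
indicators = rng.normal(size=(1000, len(elections)))

# Correlate every indicator with the results and keep the strongest
correlations = [abs(np.corrcoef(ind, elections)[0, 1]) for ind in indicators]
print(f"Strongest |correlation| found: {max(correlations):.2f}")

# The best of the 1,000 correlates strongly with the election results,
# yet it's pure noise - a spurious correlation.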

Forecasting using odd data happens elsewhere too, though usually more entertainingly. Paul the Octopus had a good track record of forecasting the 2010 World Cup and other football results - Wikipedia says he had an 85.7% success rate. How was he so successful? Probably dumb luck. Bear in mind, many animals have been used for forecasting, and we only hear about the successful ones.



(Paul the Octopus at work. Image source: Wikimedia Commons. License: Creative Commons.)

To put it simply, models built with economic data alone are highly susceptible to error because there is no evidence voters consider economic factors in the way that proponents of these models suggest. 

All models are wrong - some are useful

The statistician George Box is supposed to have said, "all models are wrong, some are useful". The idea is simple: the simplifications involved in model building often reduce fidelity, but some models still produce useful (actionable) results. All election forecast models are just that, forecast models that may be right or wrong. The question is, how useful are they?

Let's imagine that a fundamentals model was an accurate forecaster. We would have to accept that campaigns have little or no effect on the outcome. But this is clearly at odds with reality. The polling data indicates that the 2016 US presidential election changed course in the closing weeks of the campaign. Perhaps most famously, the same thing happened in 1948. One of the key issues in the 2004 US presidential election was the 'war on terror'. This isn't an economic effect, and it's not at all clear how it could be reduced to a number.

In other words, election results depend on more than economic effects and may depend on factors that are hard to quantify.

To attempt to quantify these effects, we could turn to opinion polls. In 2004, we could have asked voters about their view of the war on terror and we could have factored that into a fundamentals model. But why not just ask them how they intend to vote?


(Paul the Octopus died and was memorialized by a statue. How many other forecasters will get statues? Image Source: Wikimedia Commons. Author: Christophe95. License: Creative Commons.)

Where I stand

I'm reluctant to throw the baby out with the bathwater. I think fundamentals may have some effect, but it's heavily moderated by other factors and by what happens during the campaign. Their best use might be to give politicians some idea of the factors that could be important in a campaign. But as the UK Highway Code says of the green traffic light, it doesn't mean go, it means "proceed with caution".


Wednesday, October 7, 2020

Opinion polling blog posts

Why a 'greatest hits' polling blog post?

Over the past few months, I've blogged about elections and opinion polling several times. On October 8th, 2020, I gave a talk at PyData Boston on forecasting US presidential elections, and I thought I would bring these blog posts together into one convenient place so the people at the talk could more easily find them.

(Mexican bird men dancing on a pole. I subtitled my talk on opinion polls 'poll dancing' - and I'm sure I disappointed my audience as a result. Image credit: Wikimedia Commons. License: Creative Commons. Author: Juan Felipe Rios.)

Polling

Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.

President Hilary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.

Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.

Sampling the goods: how opinion polls are made - my experiences working for an opinion polling company as a street interviewer.

Probability theory

Who will win the election? Election victory probabilities from opinion polls - a quick derivation of a key formula and an explanation of why random sampling alone underestimates the uncertainty.

US democracy

These blog posts provided some background on US presidential elections.

The Electoral College for beginners - the post explains how the electoral college works and how it came to be.

Finding electoral fraud - the democracy data deficit - the post looks at the evidence (or the lack of it) for vote fraud and suggests a way citizen-analysts can contribute to American democracy.

Silkworm - lessons learned from a BI app in Python

Faster Python BI app development through code generation - how I generated the code for the Silkworm project and why I did it.

Tuesday, October 6, 2020

Faster Python BI app development through code generation

Back to the future: design like it's 1999

Back in 1999, you could build Visual Basic apps by dragging and dropping visual components (widgets) onto a canvas. The Visual Basic IDE handled all the code generation, leaving you with the task of wiring up your new GUI to your business data. It wasn't just Visual Basic, though: you could do the same thing with Visual C++ and Microsoft's other language products. The generated code wasn't the prettiest, but it worked, and it meant you could get the job done quickly.

(Microsoft Visual Basic. Image credit: Microsoft.)

Roll forward twenty years. Python is now very popular and people are writing all kinds of software using it, including software that needs UIs. Of course, the UI front-end is now the browser, which is another change. Sadly, nothing like the UI-building capabilities of the Microsoft Visual Studio IDE exists for Python; you can't build Python applications by dragging and dropping widgets onto a canvas.

Obviously, BI tools like Tableau and Qlik fulfill some of the need to quickly build visualization tools; they've inherited the UI-building crown from Microsoft. Unfortunately, they run out of steam when the analysis is complex: they have limited statistical capabilities and they're not general-purpose programming environments.

If your apps are simple, Tableau or Qlik are the way to go. But what happens if your apps involve more complex analysis, or if you have data scientists who know Python but not Tableau?

What would it take to make a Visual Basic or Tableau-like app builder for Python? Could we build something like it?

Start with the end in mind

The end goal is to have a drag-and-drop interface that looks something like this.

(draw.io. Image credit: draw.io.)

On the left-hand side of the screenshot, there's a library of widgets the user can drag and drop onto a canvas. 

Ideally, we'd like to be able to design a multi-tabbed application and move widgets onto each tab from a library. We'd do all the visualization layout on the GUI editor and maybe set up some of the properties for the widgets from the UI too. For example, we might set up the table column names, or give a chart a title and axis titles. When we're done designing, we could press a button and generate outline code that would create an application with the (dummy) UI we want.

A step further would be to import existing Python code into the UI editor and move widgets from tab to tab, add new widgets, or delete unwanted widgets.

Conceptually, all the technology to do this exists right now, just not in one place. Unfortunately, it would take considerable effort to produce something like it. 

If we can't go all the way, can we at least go part of the way?

A journey of a thousand miles begins with a single step

A first step is code generation from a specification. The idea is simple: you define your UI in a specification file that software uses to generate code. 

For this first simple step (and the end goal), there are two things to bear in mind:

  • Almost all UI-based applications can be constructed using a Model-View-Controller architecture (pattern) or something that looks like it.
  • Python widgets are similar and follow well-known rules. For example, the widgets in Bokeh follow a consistent API: a button follows certain rules, a dropdown menu follows others, and so on (the button rules are sketched below).

Given that there are big patterns and small patterns here, we could use a specification file to generate code for almost all UI-based applications.
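
To make the 'small patterns' concrete, here's a minimal sketch of the Bokeh button pattern: a constructor call plus a callback wired to it. The label and handler are my own examples, not output from any generator.

from bokeh.models import Button

# Construct the widget - every Button takes the same kinds of arguments
button = Button(label="Load data", button_type="success",
                sizing_mode="stretch_width")

# Every Button gets a callback wired up the same way
def callback_load_data():
    print("button clicked")

button.on_click(callback_load_data)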

I've created software that does this, and I'm going to tell you about it.

JSON and the argonauts

Here's an overview of how my code generation software works.

  • The Model-View-Controller code exists as a series of templates, with key features added at code generation time.
  • The application is specified in a JSON file. The JSON file contains details of each tab in the application, along with details of the widgets on the tab. The JSON file must follow certain rules; for example, no duplicate names.
  • Most of the rules for code generation are in a JSON schema file that contains details for each Bokeh widget. For example, the JSON schema has rules for how to implement a button, including how to create a callback function for a button.

Here's how it works in practice.

  1. The user creates a specification file in JSON. The JSON file has details of:
    • The overall project (name, copyright, author etc.)
    • Overall data for each tab (e.g. name of each tab and a description of what it does).
    • For each tab, there's a specification for each widget, giving its name, its argument, and a comment on what it does.
  2. The system checks the user's JSON specification file for consistency (well-formed JSON etc.)
  3. Using a JSON schema file that contains the rules for constructing Bokeh widgets, the system generates code for each Bokeh widget in the specification (steps 2 and 3 are sketched in code after this list).
    • For each widget that could have a callback, the system generates the callback code.
    • For complex widgets like DataTable and FileInput, the system generates skeleton example code that shows how to implement the widget. In the DataTable case, it sets up a dummy data source and table columns.
  4. The system then adds the generated code to the Model-View-Controller templates and generates code for the entire project.
    • The generated code is PEP8 compliant by design.
    • The generated code is runnable, so you can test out how the UI looks.
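
Here's a minimal sketch of the checking in steps 2 and 3 using the jsonschema library (the file names are hypothetical and error handling is trimmed):

import json
from jsonschema import ValidationError, validate

# Step 2: load the user's specification - json.load raises if it's malformed
with open("app_spec.json") as f:
    spec = json.load(f)

# Step 3: validate the specification against the widget construction rules
with open("widget_schema.json") as f:
    schema = json.load(f)

try:
    validate(instance=spec, schema=schema)
except ValidationError as error:
    print(f"Specification problem: {error.message}")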

Here's an excerpt from the JSON schema defining the rules for building widgets:

            "allOf":[

                    {

                      "$comment":"███ Button ███",

                      "if":{

                        "properties":{

                          "type":{

                            "const":"Button"

                          }

                        }

                      },

                      "then":{

                        "properties":{

                          "name":{

                            "$ref":"#/definitions/string_template_short"

                          },

                          "description":{

                            "$ref":"#/definitions/string_template_long"

                          },

                          "type":{

                            "$ref":"#/definitions/string_template_short"

                          },

                          "arguments":{

                            "type":"object",

                            "additionalProperties":false,

                            "required":[

                              "label"

                            ],

                            "properties":{

                              "label":{

                                "type":"string"

                              },

                              "sizing_mode":{

                                "type":"string",

                                "default":"stretch_width"

                              },

                              "button_type":{

                                "type":"string",

                                "default":"success"

                              }

                            }

                          },

Here's an excerpt from the JSON file defining an application's UI:

{
  "name":"Manage data",
  "description":"Panel to manage data sources.",
  "widgets":[
    {
      "name":"ECV year allocations",
      "description":"Displays the Electoral College Vote allocations by year.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Electoral College Vote allocations by year in system",
        "value":"No allocations in system"
      }
    },
    {
      "name":"Election results",
      "description":"Displays the election result years in the system.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Presidential Election results in system",
        "value":"No allocations in system"
      }

What this means in practice

Using this software, I can very rapidly prototype BI-like applications. The main task left is wiring up the widgets to the business data in the Model part of the Model-View-Controller architecture. This approach reduces the tedious part of UI development but doesn't entirely eliminate it. It also helps with widgets like DataTable that require a chunk of code to get them working - this software generates most of that code for you.
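
For illustration, here's the shape of the skeleton code generated for a DataTable: a dummy data source wired to table columns. The field names are made-up examples, not the generator's actual output.

from bokeh.models import ColumnDataSource, DataTable, TableColumn

# Dummy data source - wire up real business data in the Model
source = ColumnDataSource(data={"party": [], "votes": []})

# Table columns matching the data source fields
columns = [
    TableColumn(field="party", title="party"),
    TableColumn(field="votes", title="votes"),
]

table = DataTable(source=source, columns=columns)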

How things could be better

The software works, but not as well as it could:

  • It doesn't do layout. Laying out Bokeh widgets is a major nuisance and a time suck. 
  • The stubs for the Bokeh DataTable are too short - ideally, the generated code should contain more detail, which would further reduce the need to write code by hand.
  • The Model-View-Controller architecture needs some cleanup.

The roadmap

I have a long shopping list of improvements:

  • Better Model-View-Controller
  • Robust exception handling in the generated code
  • Better stubs for Bokeh widgets like DataTable
  • Automatic Sphinx documentation
  • Layout automation

Is it worth it?

Yes and no.

For straightforward apps, it will still be several times faster to write apps in Tableau or Qlik. But if the app requires more statistical firepower, or complex analysis, or linkage to other systems, then Python wins and this approach is worth taking. If you have access to Python developers, but not Tableau developers, then once again, this approach wins.

Over the longer term, regardless of my efforts, I can clearly see Python tools evolving to the state where they can compete with Qlik and Tableau for speed of application development.

Maybe in five years' time, we'll have all of the functionality we had 25 years ago. What's old is new again.