Showing posts with label Bokeh. Show all posts

Tuesday, October 6, 2020

Faster Python BI app development through code generation

Back to the future: design like it's 1999

Back in 1999, you could build Visual Basic apps by dragging and dropping visual components (widgets) onto a canvas. The Visual Basic IDE handled all the code generation, leaving you with the task of wiring up your new GUI to your business data. It wasn't just Visual Basic though, you could do the same thing with Visual C++ and other Microsoft versions of languages. The generated code wasn't the prettiest, but it worked, and it meant you could get the job done quickly.

(Microsoft Visual Basic. Image credit: Microsoft.)

Roll forward twenty years. Python is now very popular and people are writing all kinds of software using it, including software that needs UIs. Of course, the UI front-end is now the browser, which is another change. Sadly, nothing like the UI-building capabilities of the Microsoft Visual Studio IDE exists for Python; you can't build Python applications by dragging and dropping widgets onto a canvas.

Obviously, BI tools like Tableau and Qlik fulfill some of the need to quickly build visualization tools; they've inherited the UI building crown from Microsoft. Unfortunately, they run out of steam when the analysis is complex; they have limited statistical capabilities and they're not good as general-purpose programming languages.

If your apps are 'simple', obviously, Tableau or Qlik are the way to go. But what happens if your apps involve more complex analysis, or if you have data scientists who know Python but not Tableau?

What would it take to make a Visual Basic or Tableau-like app builder for Python? Could we build something like it?

Start with the end in mind

The end goal is to have a drag-and-drop interface that looks something like this.

(draw.io. Image credit: draw.io.)

On the left-hand side of the screenshot, there's a library of widgets the user can drag and drop onto a canvas.

Ideally, we'd like to be able to design a multi-tabbed application and move widgets onto each tab from a library. We'd do all the visualization layout on the GUI editor and maybe set up some of the properties for the widgets from the UI too. For example, we might set up the table column names, or give a chart a title and axis titles. When we're done designing, we could press a button and generate outline code that would create an application with the (dummy) UI we want.

A step further would be to import existing Python code into the UI editor and move widgets from tab to tab, add new widgets, or delete unwanted widgets.

Conceptually, all the technology to do this exists right now, just not in one place. Unfortunately, it would take considerable effort to produce something like it.

If we can't go all the way, can we at least go part of the way?

A journey of a thousand miles begins with a single step

A first step is code generation from a specification. The idea is simple: you define your UI in a specification file that software uses to generate code.

For this first simple step (and the end goal), there are two things to bear in mind:

Almost all UI-based applications can be constructed using a Model-View-Controller architecture (pattern) or something that looks like it.
Python widgets are similar and follow well-known rules. For example, the widgets in Bokeh follow an API; a button follows certain rules, a dropdown menu follows certain rules and so on.

Given that there are big patterns and small patterns here, we could use a specification file to generate code for almost all UI-based applications.

I've created software that does this, and I'm going to tell you about it.

JSON and the argonauts

Here's an overview of how my code generation software works.

The Model-View-Controller code exists as a series of templates, with key features added at code generation time.
The application is specified in a JSON file. The JSON file contains details of each tab in the application, along with details of the widgets on the tab. The JSON file must follow certain rules; for example, no duplicate names.
Most of the rules for code generation are in a JSON schema file that contains details for each Bokeh widget. For example, the JSON schema has rules for how to implement a button, including how to create a callback function for a button.

Here's how it works in practice.

The user creates a specification file in JSON. The JSON file has details of:

The overall project (name, copyright, author etc.)
Overall data for each tab (e.g. name of each tab and a description of what it does).
For each tab, there's a specification for each widget, giving its name, its argument, and a comment on what it does.

The system checks the user's JSON specification file for consistency (well-formed JSON etc.)
Using a JSON schema file that contains the rules for constructing Bokeh widgets, the system generates code for each Bokeh widget in the specification.

For each widget that could have a callback, the system generates the callback code.
For complex widgets like DataTable and FileInput, the system generates skeleton example code that shows how to implement the widget. In the DataTable case, it sets up a dummy data source and table columns.

The system then adds the generated code to the Model-View-Controller templates and generates code for the entire project.

The generated code is PEP8 compliant by design.

The generated code is runnable, so you can test out how the UI looks.

Here's an excerpt from the JSON schema defining the rules for building widgets:

"allOf":[

{

"$comment":"███ Button ███",

"if":{

"properties":{

"type":{

"const":"Button"

}

"then":{

"properties":{

"name":{

"$ref":"#/definitions/string_template_short"

"description":{

"$ref":"#/definitions/string_template_long"

"type":{

"$ref":"#/definitions/string_template_short"

"arguments":{

"type":"object",

"additionalProperties":false,

"required":[

"label"

"properties":{

"label":{

"type":"string"

"sizing_mode":{

"type":"string",

"default":"stretch_width"

"button_type":{

"type":"string",

"default":"success"

}

Here's an excerpt from the JSON file defining an application's UI:

{

"name":"Manage data",

"description":"Panel to manage data sources.",

"widgets":[

{

"name":"ECV year allocations",

"description":"Displays the Electoral College Vote allocations by year.",

"type":"TextInput",

"disabled":true,

"arguments":{

"title":"Electoral College Vote allocations by year in system",

"value":"No allocations in system"

}

{

"name":"Election results",

"description":"Displays the election result years in the system.",

"type":"TextInput",

"disabled":true,

"arguments":{

"title":"Presidential Election results in system",

"value":"No allocations in system"

}

What this means in practice

Using this software, I can very rapidly prototype BI-like applications. The main task left is wiring up the widgets to the business data in the Model part of the Model-View-Controller architecture. This approach reduces the tedious part of UI development but doesn't entirely eliminate it. It also helps with widgets like DataTable that require a chunk of code to get them working - this software generates most of that code for you.

How things could be better

The software works, but not as well as it could:

It doesn't do layout. Laying out Bokeh widgets is a major nuisance and a time suck.
The stubs for Bokeh DataTable are too short - ideally, the generated code should contain more detail which would help reduce the need to write code.
The Model-View-Controller architecture needs some cleanup.

The roadmap

I have a long shopping list of improvements:

Better Model-View-Controller
Robust exception handling in the generated code
Better stubs for Bokeh widgets like DataTable
Automatic Sphinx documentation
Layout automation

Is it worth it?

Yes and no.

For straightforward apps, it will still be several times faster to write apps in Tableau or Qlik. But if the app requires more statistical firepower, or complex analysis, or linkage to other systems, then Python wins and this approach is worth taking. If you have access to Python developers, but not Tableau developers, then once again, this approach wins.

Over the longer term, regardless of my efforts, I can clearly see Python tools evolving to the state where they can compete with Qlik and Tableau for speed of application development.

Maybe in five years' time, we'll have all of the functionality we had 25 years ago. What's old is new again.

Wednesday, March 18, 2020

Contributing to open-source software

I’ve been using open-source software packages for several years and always felt like a bit of a freeloader; I took, but I never gave back. My excuse was, I didn’t have time to dig into the codebase and familiarize myself with the project's ways of working. But recently, I found easier ways to contribute, and I have been.

(Image credit: Old Book Illustrations)

The first way I found is raising bugs. I’ve pushed open-source software quite hard and found bugs in Pandas and Bokeh. Both of these projects have Github pages and both of them have pages to report bugs. If you’re going to report a bug, here are some rules to follow:

Make sure you’re using the most up-to-date version of the software.
Make sure your bug hasn’t been raised before.
Provide a simple example to duplicate the bug.
Follow the rules for reporting bugs - especially with regard to formatting your report, the heading you use, and any tags.

The open-source community has quite rightly been criticized for occasional toxic behavior, some of which has come from software users. I’ve seen people raise bugs and been quite forceful in their criticisms of the software they’re freely using. Ultimately, open-source software is a volunteer effort and people don’t volunteer to face some of the nastiness I’ve seen. The onus is on you to remain courteous and professional, and part of that is taking the short amount of time to follow the rules. A little kindness and consideration goes a long way.

For reference, here are some bug reports I’ve raised:

Pandas: 20027
Bokeh: 8009, 8042

The second way to contribute is by suggesting new functionality. This is a little harder because it takes more consideration to make sure what you’re suggesting is relevant and hasn’t been suggested before. Once again, I strongly advocate that you find out what the rules are for requesting new functionality. If possible, I suggest you include a mock-up of what you’re suggesting.

For reference, here are some suggestions I’ve raised:

Pandas: 21551
Bokeh: 9144

The final way of contributing is to build a project that uses open-source technology, share it via Github (or the alternatives), and notify the community of your project. Bokeh has a nice showcase section on its Discourse server where you can see what people have built. Seeing what others have built is a great way to get inspiration for your own projects.

For reference, here’s a showcase project I made available for Bokeh.

On the whole, I’ve been very pleased with the response of the developer communities to my meager contributions. Most of my errors or suggestions have been implemented within a few months, which contrasts with my experience with paid-for software where there often isn’t a forum to view bugs or make suggestions.

If you’re a user of open-source software, I urge you to contribute in any way you can. We’re all in this together.

Tuesday, January 28, 2020

Future directions for Python visualization software

The Python charting ecosystem is highly fragmented and still lags behind R, it also lacks some of the features of paid-for BI tools like Tableau or Qlik. However, things are slowly changing and the situation may be much better in a few years' time.

(Image credit: Alvesgaspar - Wikimedia Commons - license)

Theoretically, the ‘grammar of graphics’ approach has been a substantial influence on visualization software. The concept was introduced in 1999 by Leland Wilkinson in a landmark book and gained widespread attention through Hadley Wickham’s development of ggplot2 The core idea is that a visualization can be represented as different layers within a framework, with rules governing the relationship between layers.

Bokeh was influenced by the 'grammar of graphics' concept as were other Python charting libraries. The Vega project seeks to take the idea of the grammar of graphics further and creates a grammar to specify visualizations independent of the visualization backend module. Building on Vega, the Altair project is a visualization library that offers a different approach from Bokeh to build charts. It’s clear that the grammar of graphics approach has become central to Python charting software.

If the legion of charting libraries is a negative, the fact that they are (mostly) built on the same ideas offers some hope for the future. There’s a movement to convergence by providing an abstraction layer above the individual libraries like Bokeh or Matplotlib. In the Python world, there’s precedence for this; the database API provides an abstraction layer above the various Python database libraries. Currently, the Panel project and HoloViews are offering abstraction layers for visualization, though there are discussions of a more unified approach.

My take is, the Python world is suffering from having a confusing array of charting library choices which splits the available open-source development efforts across too many projects, and of course, it confuses users. The effort to provide higher-level abstractions is a good idea and will probably result in fewer underlying charting libraries, however, stable and reliable abstraction libraries are probably a few years off. If you have to produce results today, you’re left with choosing a library now.

The big gap between Python and BI tools like Tableau and Qlik is the ease of deployment and speed of development. BI tools reduce the skill level to build apps, deploy them to servers, and manage tasks like access control. Projects like Holoviews may evolve to make chart building easier, but there are still no good, easy, and automated deployment solutions. However, some of the component parts for easier deployment exist, for example, Docker, and it’s not hard to imagine the open-source community moving its attention to deployment and management once the various widget and charting issues of visualization have been solved.

Will the Python ecosystem evolve to be as good as R’s and be good enough to take on BI tools? Probably, but not for a few years. In my view, this evolution will happen slowly and in public (e.g. talks at PyCon, SciPy etc.). The good news for developers is, there will be plenty of time to adapt to these changes.

Wednesday, January 22, 2020

The Python plotting ecosystem

Python’s advance as a data processing language had been hampered by its lack of good quality chart visualization libraries, especially compared to R with its ggplot2 and Shiny packages. By any measure, ggplot2 is a superb package that can produce stunning, publication-quality charts. R’s advance has also been helped by Shiny, a package that enables users to build web apps from R, in effect allowing developers to create Business Intelligence (BI) apps. Beyond the analytics world, the D3 visualization library in JavaScript has had an impact on more than the JavaScript community; it provides an outstanding example of what you can do with graphics in the browser (if you get time check out some of the great D3 visualization examples). Compared to D3, ggplot2, and Shiny, Python’s visualization options still lag behind, though things have evolved in the last few years.

(An example Bokeh application. Multi-tabbed, widgets, chart.)

Matplotlib is the granddaddy of chart visualization in Python, it offers most of the functionality you might want and is available with almost every Python distribution. Unfortunately, its longevity is also its problem. Matplotlib was originally based on MATLAB’s charting features, which were in turn developed in the 1980’s. Matplotlib's longevity has left it with an awkward interface and some substantially out-of-date defaults. In recent years, the Matplotlib team has updated some of their visual defaults and offered new templates that make Matplotlib charts less old-fashioned, however, the library is still oriented towards non-interactive charts and its interface still leaves much to be desired.

Seaborn sits on top of Matplotlib and provides a much more up-to-date interface and visualization defaults. If all you need is a non-interactive plot, Seaborn may well be a good option; you can produce high-quality plots in a rational way and there are many good tutorials out there.

Plotly provides static chart visualizations too, but goes a step further and offers interactivity and the ability to build apps. There are some great examples of Plotly visualizations and apps on the web. However, Plotly is a paid-for solution; you can do most of what you want with the free tier, but you may run into cases where you need to purchase additional services or features.

Altair is another plotting library for Python based on the 'grammar of graphics’ concept and the Vega project. Altair has some good features, but in my view, it isn’t as complete as Bokeh for business analytics.

Bokeh is an ambitious attempt to offer D3 and ggplot2-like charts plus interactivity, with visualizations rendered in a browser, all in an open-source and free project. Interactivity here means having tools to zoom into a chart or move around in the chart, and it means the ability to create (browser-based) apps with widgets (like dropdown menus) similar to Shiny. It’s possible to create chart-based applications and deploy them via a web server, all within the Bokeh framework. The downside is, the library is under active development and therefore still changing; some applications I developed a year ago no longer work properly with the latest versions. Having said all that, Bokeh is robust enough for commercial use today, which is why I’ve chosen it for most of my visualization work.

Holoviews sits on top of Bokeh (and other plotting engines) and offers a higher-level interface to build charts using less coding.

It’s very apparent that the browser is becoming the default visualization vehicle for Python. This means I need to mention Flask, a web framework for Python. Although Bokeh has a lot of functionality for building apps, if you want to build a web application that has forms and other typical features of web applications, you should use Flask and embed your Bokeh charts within it.

If you’re confused by the various plotting options, you’re not alone. Making sense of the Python visualization ecosystem can be very hard, and it can be even harder to choose a visualization library. I looked at the various options and chose Bokeh because I work in business and it offered a more complete and reliable solution for my business needs. In a future blog post, I'll give my view of where things are going for Python visualization.