Wednesday, February 5, 2020

Reverse engineering a sensor

A few years ago, I was working at an excellent company (BitSight). The only problem was, I felt the office was either too hot or too cold. I bought a temperature/humidity sensor and reverse engineered the control system to build my own UI to control the device. Here's the story of how I did it. If you want to learn how to reverse engineer a sensor interface, this story might be useful to you.

(Image credit: Uni-Trend)

I bought a UT330B temperature and humidity sensor from AliExpress for $35. This is a battery-powered temperature and humidity data logger that can record 60,000 measurements of temperature and humidity. The device has a USB port to download the recordings to a computer and it came with control software to read the device data via USB. The price was right, the accuracy was good, but there was a problem. The control software for the device worked on Windows only. My work and home computers were all Macs. So what could I do?

My great colleague, Philip Gladstone, suggested the way forward. He said, why not reverse engineer the commands and data going back and forth over the USB port so you can write your own control software? So this is what we did. (Philip got the project started, did most of the USB monitoring, and helped with deciphering commands.)

The first step was monitoring the USB connection. We got hold of a Windows machine and ran some USB monitoring software on it. This software reported the bytes sent back and forth to the UT330B. The next step was to load up and run the UT330B control software. Here's the big step; we used the control software to get the device to do things and monitored the USB traffic. From that traffic, we were able to reverse engineer the commands and responses.

The first thing we found was that every command to the device had a particular form:
[two byte header] [two byte command] [two byte modbus CRC]
e.g
0xab 0xcd 0x03 0x18 0xb1 0x05
(this is the command to delete data from the device)
we were able to infer how the command bytes related to actual commands (e.g. 0x03 0x18 means delete data). In some cases, we found pressing one button on the control software resulted in several commands to the device.

The next step was figuring out the CRC. Given the data we had and our knowledge that the CRC was two bytes, I set about looking for CRC schemes that would give us the observed data. I found the UT330B used a modbus CRC check.

Now I could replicate every command to the UT330B.

Figuring out the device data format came next. The device data was what the UT330B sent back in response to commands. Some of this was straightforward, e.g. dates and times, but some of it was more complex. To get a full range of device temperatures, I put the UT330B in a freezer (to get below 0c) and held it over a toaster to get over 100c. Humidity data was a little harder, I managed to get very high humidities by holding the device in the output of a kettle but I had to wait for very dry days to get low humidities. It turns out, data was coded using two's complement (this caused me some problems with negative temperatures until I realized what was going on).

So now I knew every command and how to interpret the results.

Writing some Python code was the next order of business. To talk to the UT330B via the USB port, we needed to use the PySerial library. The UT330B used a Silicon Labs chip, so we also needed to download and install the correct software. With some trial and error, I was able to write a Python class that sent commands to the UT330B and read the response. Now I was in complete control of my device and I didn't need the Windows command software.

A class is fine, but the original command software had a nice UI where you could plot charts of temperature and humidity. I wanted something like that in Python, so I used the Bokeh library to build an app. Because I wanted a full-blown app that was written properly, I put some time into figuring out the software architecture. I decided on a variant of the model-view-controller architecture and built my app with a multi-tab display, widgets, and of course, a figure.

Here are some screenshots of what I ended up with.

The complete project is on Github here.

So what did I learn from all of this?

It's possible to reverse engineer (some) USB-controlled devices using the appropriate software, but it takes some effort and you have to have enough experience to do it.
You can control hardware using Python.
Bokeh is a great library for writing GUIs that are focused on charts.
If I were ever to release a USB hardware device, I would publish the interface specification (e.g. 0x03 0x18 means delete data) and provide example (Python or C) code to control the device. I wouldn't limit my sales to Windows users only.

Saturday, February 1, 2020

Company culture and the hall of mirrors

Company culture: real and distorted

When I was a child, my parents took me to a funfair and we walked through the hall of mirrors. The mirrors distorted reality, making us look taller or smaller, fatter or thinner than we really were. I've read a number of popular business books that extoll the virtues of company culture, but I couldn't help feeling they were a hall of mirrors; distorting what's really there to achieve higher book sales.

(Reading about company culture in popular business books is like entering the house of mirrors. Image credit: Wikimedia Commons, License: Creative Commons, Author: SJu)

I was fortunate enough to spend time researching the academic literature on company culture to understand what's really there, and the results surprised me. I took a course at the Harvard Extension School on Organizational Behavior which was taught by the excellent Carmine Gibaldi. Part of this course was a research project and I chose to look at the relationship between company culture and financial performance.

What books claim vs. evidence

Given the long list of popular business books making very definitive claims about company culture, you might expect the research literature to show some clear results. The reality is, the literature doesn't. In fact, when I investigated the support for claims made in books, I couldn't find much in the way of corroborating evidence. Going back to the popular books, I found no references in many cases, in other cases I found evidence based on the author's work alone, and in other cases, the evidence was just unverified anecdotes. What I did find in the research literature was a more complex story than I expected.

Company culture and financial performance

Despite what the popular business books said, it wasn't true that a strong company culture of itself led to financial success, in fact, many companies with strong cultures went bankrupt. It wasn't true that there were a magic set of values that led to company success either, different successful companies had different values and different cultures. It wasn't true that company values are immutable, even a large company like IBM could and did change its values to survive.

What I found is that there are mechanisms by which a strong and unified company culture can lead to better performance; coordination and control, goal alignment, and enhanced employee effort. For these reasons, under stable market conditions, a strong culture does appear to offer competitive advantages. However, under changing market conditions, a strong culture may be a disadvantage because strong cultures don't have the variability necessary to adapt to change. There also seemed to be a link between culture and industry, with some cultures better suited to some industries.

I read about DEC, Arthur Anderson, and Enron. All companies with strong cultures and all companies that had been held up at the time as examples of the benefits of a strong culture. DEC's culture had been studied, extensively written about, and emulated by others. Enron had several glowing case studies written about them by adoring business schools. In all these cases, company culture may well have been a factor in their demise. Would you take DEC or Enron as your role model now? Are the companies you're viewing as your role models now going to be around in five or ten years?

Where's the beef?

Am I falling into the trap of not providing you with references for my assertions? No, because I'm going to go one stage further. I'm going to give you a link to what I wrote that includes all my references and all my arguments in more detail. Here's the report I wrote which lays out my findings - you can judge for yourself.

Popular books are mostly wrong

During my research, I read some detailed critiques of popular business books. I must admit, I was shocked at the extent to which the Emperor really has no clothes and the extent of the post hoc fallacy and survivor bias in the popular books. Many of the great companies touted in these books have fallen from grace, disproving many of the assertions the books made. If you make statements that company culture leads to long-lasting success, and your champions fail while maintaining their values, it kind of looks like your arguments aren't great. Given the lack of research support for many business books' claims, this should come as no surprise.

What to do?

Here's my advice: any popular business book that purports to reveal a new underlying truth about business success is probably wrong. Before you start believing the claims, or extolling the book's virtues, or implementing policy changes based on what you've read, check that the book stands up to scrutiny and ideally, that there's independent corroboration.

Tuesday, January 28, 2020

Future directions for Python visualization software

The Python charting ecosystem is highly fragmented and still lags behind R, it also lacks some of the features of paid-for BI tools like Tableau or Qlik. However, things are slowly changing and the situation may be much better in a few years' time.

(Image credit: Alvesgaspar - Wikimedia Commons - license)

Theoretically, the ‘grammar of graphics’ approach has been a substantial influence on visualization software. The concept was introduced in 1999 by Leland Wilkinson in a landmark book and gained widespread attention through Hadley Wickham’s development of ggplot2 The core idea is that a visualization can be represented as different layers within a framework, with rules governing the relationship between layers.

Bokeh was influenced by the 'grammar of graphics' concept as were other Python charting libraries. The Vega project seeks to take the idea of the grammar of graphics further and creates a grammar to specify visualizations independent of the visualization backend module. Building on Vega, the Altair project is a visualization library that offers a different approach from Bokeh to build charts. It’s clear that the grammar of graphics approach has become central to Python charting software.

If the legion of charting libraries is a negative, the fact that they are (mostly) built on the same ideas offers some hope for the future. There’s a movement to convergence by providing an abstraction layer above the individual libraries like Bokeh or Matplotlib. In the Python world, there’s precedence for this; the database API provides an abstraction layer above the various Python database libraries. Currently, the Panel project and HoloViews are offering abstraction layers for visualization, though there are discussions of a more unified approach.

My take is, the Python world is suffering from having a confusing array of charting library choices which splits the available open-source development efforts across too many projects, and of course, it confuses users. The effort to provide higher-level abstractions is a good idea and will probably result in fewer underlying charting libraries, however, stable and reliable abstraction libraries are probably a few years off. If you have to produce results today, you’re left with choosing a library now.

The big gap between Python and BI tools like Tableau and Qlik is the ease of deployment and speed of development. BI tools reduce the skill level to build apps, deploy them to servers, and manage tasks like access control. Projects like Holoviews may evolve to make chart building easier, but there are still no good, easy, and automated deployment solutions. However, some of the component parts for easier deployment exist, for example, Docker, and it’s not hard to imagine the open-source community moving its attention to deployment and management once the various widget and charting issues of visualization have been solved.

Will the Python ecosystem evolve to be as good as R’s and be good enough to take on BI tools? Probably, but not for a few years. In my view, this evolution will happen slowly and in public (e.g. talks at PyCon, SciPy etc.). The good news for developers is, there will be plenty of time to adapt to these changes.

Saturday, January 25, 2020

Clayton Christensen, RIP

I was saddened today to read of the death of Clayton Christensen. For me, he was one of the great business thinkers of the last 100 years.

(Image credit: Remy Steinegger - Wikimedia Commons - Creative Commons License)

I remember finding his wonderful book, "The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail", on the shelf at City University in London and devouring it on the way home on the tube. I only recommend a handful of business books and Christensen's book is number one on my list.
Christensen was fascinated by technology disruption, especially how industry leaders are caught out by change and why they don't respond in time. The two big examples in his book are steam shovels and disk drives.

The hydraulic shovel was a cheap and cheerful innovation that didn't seem to threaten the incumbent steam shovel makers. If anything, it was good for their profitability because hydraulic technology addressed the low end of the market where margins were poor, leaving the high-end more profitable market niches to steam shovels. Unfortunately for the steam shovel makers, the hydraulic shovel makers kept on innovating and pushed their technology into more and more niches. In the end, the steam shovel makers were relegated to just a few small niches and it was too late to respond.

The disk drive industry is Christensen's most powerful example. There have been successive waves of innovation in this industry as drives increased in capacity and reduced in size. The leaders in one wave were mostly not the leaders in subsequent waves. Christensen's insight was, the same factors were at work in the disk drive industry as in the steam shovel industry. Incumbents wanted to maximize profitability and saw new technologies coming in at the bottom end of the market where margins were low. Innovations were dismissed as being low-end technologies not competing with their more profitable business. They didn't respond in time as the disruptive technologies increased their capabilities and started to compete with them.

Based on these examples, Christensen teases out why incumbents didn't respond in time and what companies can do to not be caught out by these kinds of disruptive innovations.

Company culture is obviously a factor here, and Christensen poked into this more with a very accessible 2000 paper where he and Overdorf discussed the factors that can lead to a company culture that ignores disruptive innovation.

I never met Christensen, but I heard a lot of good things about him as a person. My condolences to his family.

Further reading

"The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail" - Clayton Christensen

"Meeting the Challenge of Disruptive Change", Clayton M. Christensen and Michael Overdorf, Harvard Business Review, https://hbr.org/2000/03/meeting-the-challenge-of-disruptive-change

How to lie with statistics

I recently re-read Darrell Huff's classic text from 1954, 'How to lie with statistics'. In case you haven't read it, the book takes a number of deceitful statistical tricks of the trade and explains how they work and how to defend yourself from being hoodwinked. My overwhelming thought was 'plus ça change'; the more things change, the more they remain the same. The statistical tricks people used to mislead 50 years ago are still being used today.

(Image credit: Wikipedia)

Huff discusses surveys and how very common methodology flaws can produce completely misleading results. His discussion of sampling methodologies and the problems with them are clear and unfortunately, still relevant. Making your sample representative is a perennial problem as the polling for the 2016 Presidential election showed. Years ago, I was a market researcher conducting interviews on the street and Huff's bias comments rang very true with me - I faced these problems on a daily basis. In my experience, even people with a very good statistical education aren't aware of survey flaws and sources of bias.

The chapter on averages still holds up. Huff shows how the mean can be distorted and why the median might be a better choice. I've interviewed people with Master's degrees in statistics who couldn't explain why the median might be a better choice of average than the mean, so I guess there's still a need for the lesson.

One area where I think things have moved in the right direction is the decreasing use of some types of misleading charts. Huff discusses the use of images to convey quantitative information. He shows a chart where steel production was represented by images of a blast furnace (see below). The increase in production was 50%, but because the height and width were both increased, the area consumed by the images increases by 150%, giving the overall impression of a 150% increase in production¹. I used to see a lot of these types of image-based charts, but their use has declined over the years. It would be nice to think Huff had some effect.

(Image credit: How to lie with statistics)

Staying with charts, his discussion about selecting axis ranges to mislead still holds true and there are numerous examples of people using this technique to mislead every day. I might write a blog post about this at some point.

He has chapters on the post hoc fallacy (confusing correlation and causation) and has a nice explanation of how percentages are regularly mishandled. His discussion of general statistical deceitfulness is clear and still relevant.

Unfortunately, the book hasn't aged very well in other aspects. 2020 readers will find his language sexist, the jokey drawings of a smoking baby are jarring, and his roundabout discussion of the Kinsey Reports feels odd. Even the writing style is out of date.

Huff himself is tainted; he was funded by the tobacco industry to speak out against smoking as a cause of cancer. He even wrote a follow-up book, How to lie with smoking statistics to debunk anti-smoking data. Unfortunately, his source of authority was the widespread success of How to lie with statistics. How to lie with smoking statistics isn't available commercially anymore, but you can read about it on Alex Reinhart's page.

Despite all its flaws, I recommend you read this book. It's a quick read and it'll give you a grounding in many of the problems of statistical analysis. If you're a business person, I strongly recommend it - its lessons about cautiously interpreting analysis still hold.

This is a flawed book by a flawed author but it still has a lot of value. I couldn't help thinking that the time is probably right for a new popular book on how people are lying and misleading you using charts and statistics.

Correction

[1] Colin Warwick pointed out an error in my original text. My original text stated the height and width of the second chart increased by 50%. That's not quite what Huff said. I've corrected my post.

Wednesday, January 22, 2020

The Python plotting ecosystem

Python’s advance as a data processing language had been hampered by its lack of good quality chart visualization libraries, especially compared to R with its ggplot2 and Shiny packages. By any measure, ggplot2 is a superb package that can produce stunning, publication-quality charts. R’s advance has also been helped by Shiny, a package that enables users to build web apps from R, in effect allowing developers to create Business Intelligence (BI) apps. Beyond the analytics world, the D3 visualization library in JavaScript has had an impact on more than the JavaScript community; it provides an outstanding example of what you can do with graphics in the browser (if you get time check out some of the great D3 visualization examples). Compared to D3, ggplot2, and Shiny, Python’s visualization options still lag behind, though things have evolved in the last few years.

(An example Bokeh application. Multi-tabbed, widgets, chart.)

Matplotlib is the granddaddy of chart visualization in Python, it offers most of the functionality you might want and is available with almost every Python distribution. Unfortunately, its longevity is also its problem. Matplotlib was originally based on MATLAB’s charting features, which were in turn developed in the 1980’s. Matplotlib's longevity has left it with an awkward interface and some substantially out-of-date defaults. In recent years, the Matplotlib team has updated some of their visual defaults and offered new templates that make Matplotlib charts less old-fashioned, however, the library is still oriented towards non-interactive charts and its interface still leaves much to be desired.

Seaborn sits on top of Matplotlib and provides a much more up-to-date interface and visualization defaults. If all you need is a non-interactive plot, Seaborn may well be a good option; you can produce high-quality plots in a rational way and there are many good tutorials out there.

Plotly provides static chart visualizations too, but goes a step further and offers interactivity and the ability to build apps. There are some great examples of Plotly visualizations and apps on the web. However, Plotly is a paid-for solution; you can do most of what you want with the free tier, but you may run into cases where you need to purchase additional services or features.

Altair is another plotting library for Python based on the 'grammar of graphics’ concept and the Vega project. Altair has some good features, but in my view, it isn’t as complete as Bokeh for business analytics.

Bokeh is an ambitious attempt to offer D3 and ggplot2-like charts plus interactivity, with visualizations rendered in a browser, all in an open-source and free project. Interactivity here means having tools to zoom into a chart or move around in the chart, and it means the ability to create (browser-based) apps with widgets (like dropdown menus) similar to Shiny. It’s possible to create chart-based applications and deploy them via a web server, all within the Bokeh framework. The downside is, the library is under active development and therefore still changing; some applications I developed a year ago no longer work properly with the latest versions. Having said all that, Bokeh is robust enough for commercial use today, which is why I’ve chosen it for most of my visualization work.

Holoviews sits on top of Bokeh (and other plotting engines) and offers a higher-level interface to build charts using less coding.

It’s very apparent that the browser is becoming the default visualization vehicle for Python. This means I need to mention Flask, a web framework for Python. Although Bokeh has a lot of functionality for building apps, if you want to build a web application that has forms and other typical features of web applications, you should use Flask and embed your Bokeh charts within it.

If you’re confused by the various plotting options, you’re not alone. Making sense of the Python visualization ecosystem can be very hard, and it can be even harder to choose a visualization library. I looked at the various options and chose Bokeh because I work in business and it offered a more complete and reliable solution for my business needs. In a future blog post, I'll give my view of where things are going for Python visualization.

Saturday, January 18, 2020

How to be a good candidate for analytics and data science jobs

Over the last few years, I’ve done a lot of hiring across many disciplines: analytics, data science, product management, engineering, sales, and HR. I’ve learned a lot about what makes a good candidate and what makes a bad candidate. Because I’m writing this blog for people interested in analytics and data science, I'm going to share some of the things that I think are likely to improve your chances of getting hired for technical positions.

(Image credit: Grey Geezer at Wikimedia Commons - license. Image unchanged.)

Hiring is risky

The key thing to remember is hiring is a tremendously risky process for the employer. It’s very painful to unwind a poor hiring decision, so for the most part, the interview team is not inclined to take risks. You have to satisfy the technical requirements for the job, but also the social requirements too. The interview team will be deciding whether or not you’re a fit for the team - can they work with you? There are all kinds of clues they use to decide this and I’ll cover some of them here.

Resume blunders

Candidates make amazing blunders with resumes. I’ve seen odd layouts, poor wording, and incredibly long resumes (15+ pages in one case). Here are some simple rules:

Length: one page if you’re junior, two pages (at most) if you’re senior.
Layout: single-column layouts - keep it simple.
Keywords: your resume should use every relevant keyword as many times as it makes sense. For example, if you have machine learning experience, use the term. Resumes are often keyword screened and if you don’t have the keywords you’ll be ruled out by an algorithm.
Contact details: name, city, phone number, email. I always give local candidates preference, but I have to know you’re local.

Your resume gives clues to how well-prepared you are (back to the risk thing), a bad resume indicates you haven’t taken advice, or you don’t care, or you’re naive, none of which are good. There are plenty of good resources out there for building resumes. Northeastern University does an incredible job preparing its candidates for work, including some great coaching on resume building. They have an excellent website on resumes with lots of strong guidance.

One great piece of advice I’ve heard is to customize your resume for the employer or industry you’re targeting. Some candidates are considering different employment areas but they have a single resume they’re trying to use for everything. You should have a different resume focusing on different areas for each industry you're targeting. If you have time, you should tweak your resume for each employer. Remember that customization is as much about what you leave out as what you leave in. For example, if you have wet bench experience but you’re applying for computing positions, you should shorten (or remove) your wet bench sections and increase the length of your software sections. The logic here is simple, you have limited space, so why tell an employer about something irrelevant to them? For me, there’s a minor exception - I do like candidates with something unusual about them, but a single resume line is usually enough (e.g. ‘wet bench qualifications’, ‘EMT qualified’).

Github!

I love it when candidates have a Github page they put on their resume. If they pass the screening interview, I check out their page and what they’ve done. You do need to be careful though, I’ve seen some bad code that’s put me off a candidate. Github is especially great if you’re trying to do some kind of career transition into analytics or data science from some other field. If you’re transitioning, you can’t talk about what you’ve done in your current role as proof of your capabilities, but you can talk about the Github projects you’ve created in your own time. In fact, creating a project in your own time to display your work shows a tremendous amount of commitment. If you have projects to put up on Github, do so, it’s a great place to demonstrate your talent.

Be prepared - and turn your camera on

If you haven’t interviewed in a while, it’s a good idea to reach out to your connections and ask for a practice interview. You could also ask your friends for a review of your resume. Of course, you should remember that if people help you, in turn, you should help people.

For heaven’s sake, be technically prepared for the interview. Nowadays, many interviews are conducted via a computer video call (e.g. Skype, Zoom, etc.). There’s almost always software to download and install. Make sure you have the software installed and running before the call. I interviewed someone for a management position who took 20 minutes to download the software and get into the interview. Not good when you’re interviewing for a position that requires experience and forward planning!

For video interviews, I have two pieces of advice: turn your camera on and consider where you do the interview. It’s a video interview for a reason and it looks odd if you don’t turn your camera on. I was once told that the reason why a candidate didn’t turn their camera on was that they’d had an unusual hair treatment just before the call. Your hair is your business, but why not schedule the call for some other time? You also need to consider your background; what will the interviewer see? I was once interviewed by someone from a hotel bedroom with their underwear strewn everywhere in the frame - it didn’t create a professional impression. One candidate I interviewed had their laptop on their knees for the interview; every time they moved the entire video frame heaved like a ship in a storm and by the end of the interview I felt seasick. Try to avoid distracting locations and distracting items in the frame.

Preparation also means understanding who will interview you and what the interview will cover. I’ve interviewed candidates who were surprised to be asked technical questions when the interview briefing clearly said that would happen.

Of course, you must look up everyone on LinkedIn beforehand and know their roles - you might even get insight into the questions they might ask.

Examples and questions

A few years ago I did a course on behavior-based interviewing. There were lots of great pieces in the course but it can be boiled down to one simple idea: give examples for everything you claim. For example, if you claim to be a good planner, give examples of how you planned well, if you claim to know Python, point to examples (e.g. Github), and so on. The idea is you’re providing proof - doing is better than saying.

Make sure you have plenty of questions for each interviewer. It shows you’re prepared and engaged, and of course, you might learn something useful. It’s also expected. If you can, get every interviewer’s email address, you’ll need it later.

At the end

When it’s all over, send a thank you email to everyone who interviewed you. For any kind of customer-facing role, this is expected and it’s increasingly expected for technical roles too.

If you don’t get the job, there’s one last thing you can do. If you got on well with the interview team, ask for feedback. Not every interview team will do it, but some will and you can learn a lot from them about why you didn’t get the job.

Final thoughts

Bear in mind that a lot of what I’ve said is about reducing risk for the employer in choosing you. Being prepared for the interview (software download, video call background, interview questions, etc.) shows you take it all seriously and gives clues to what you’ll be like as an employee. Asking questions at the interview and thanking everyone shows you know about social conventions and could be a good fit for the team.

Good luck!