Why data is often bad and what you can do about it
In this blog post, I'm going to talk through what you can do to improve data quality. This covers three areas: automated issue detection, some techniques you can use to find errors, and most importantly, the people processes behind them.
Before we get going, it's important to note that bad data has a number of causes, some of which aren't under a company's control, e.g. ingesting 3rd party data. This means automated fixes aren't always possible. My belief is, even if you can't fix errors quickly, you need to know about them because your customers will.
Automate error detection
Automated error detection is the key to fixing data issues at a reasonable cost. The idea is that an automated system checks incoming data for a range of problems and flags errors or concerns. You can adapt these systems to give you error counts over time; the goal is to measure progress on reducing data issues, for example by producing a daily data quality score.
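To make that concrete, here's a minimal sketch of one way a daily score could be computed; the check names and the scoring scheme (the fraction of checks that passed) are illustrative assumptions, not a prescription.

```python
# A minimal sketch of turning check results into a daily quality score.
# The check names and scoring scheme are hypothetical placeholders.
from dataclasses import dataclass
from datetime import date

@dataclass
class CheckResult:
    check_name: str
    passed: bool

def daily_quality_score(results: list[CheckResult]) -> float:
    """Fraction of checks that passed today -- one simple scoring scheme."""
    if not results:
        return 1.0
    return sum(r.passed for r in results) / len(results)

# Example: 2 of 3 checks passed on a given day -> score of ~0.67
results = [
    CheckResult("sales_volume_in_range", True),
    CheckResult("no_unmatched_returns", False),
    CheckResult("conversion_rate_valid", True),
]
print(date.today(), daily_quality_score(results))
```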
The obvious objection is: if you can spot the errors, you should fix them in your data ingestion process, so there's no need for an error detection system. Sadly, in the real world, things aren't so simple:
- If your data ingestion team was doing this already, there would be no data issues. The fact that there are errors tells you that you need to do something new.
- Ingestion systems focus on stability and handling known errors. Very rarely do they report on errors they can't fix. Frankly, for most dev teams, finding new data errors isn't a priority.
- The lead time to add new data quality checks to ingestion systems can be weeks or months. I've seen people add new checks to standalone automated error checking systems in a day.
If possible, an error detection system should integrate with a company's error ticket system, for example, automatically creating and assigning Jira tickets. This has some consequences as we'll see.
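To illustrate, here's a hedged sketch of what that integration could look like against the Jira Cloud REST API (v2); the base URL, project key, and credentials are placeholders, and your own ticketing system may need a different approach.

```python
# A sketch of creating a Jira ticket for a detected data issue.
# Assumes the Jira Cloud REST API v2; URL, project key, and credentials are placeholders.
import requests

def create_data_issue_ticket(summary: str, description: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "DATAQ"},          # hypothetical project key
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Bug"},
        }
    }
    resp = requests.post(
        "https://your-company.atlassian.net/rest/api/2/issue",
        json=payload,
        auth=("bot@your-company.com", "API_TOKEN"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "DATAQ-123"
```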
The people process
We can introduce as much error detection automation as we wish, but ultimately it's down to people to fix data issues. The biggest problem that occurs in practice is the split of responsibilities. In a company, it's often true that one team creates an automated system to find errors ('spotter' team) while another team is responsible for fixing them ('fixer' team). This sets up the potential for conflict right from the start. To win, you have to manage the issues.
- The 'spotter' team is creating more work for the 'fixer' team. The more productive the 'spotter' team is, the harder the 'fixer' team has to work.
- The 'spotter' team has to create meaningful error messages that the 'fixer' team has to be able to interpret. Frankly, technical people are often very bad at writing understandable error messages.
- For reasons we'll go into later, sometimes automated systems produce a tsunami of error messages, flooding the 'fixer' team.
- The 'fixer' team may have to work out of hours and resolve issues quickly, whereas the 'spotter' team works office hours and is under much less stress.
- The 'fixer' team bears the consequences of any 'spotter' team failures.
- Goals (meaning, OKRs etc.) aren't aligned. One team may have an incentive to reduce errors, while a team they are reliant on does not.
I could go on, but I think you get the point.
Here's how I've approached this problem.
- I've made sure I know the complete process for resolving data issues. This means knowing who is responsible for fixing errors, how they do it, and the level of effort. It's important to know any external constraints, for example, if data is externally sourced it may take some time to resolve issues.
- I make the 'spotter' team and the 'fixer' team sit down together to discuss the project and make sure that they understand each other's goals. To be clear, it's not enough to get the managers talking, the people doing the work have to talk. Managers sometimes have other goals that get in the way.
- The 'spotter' team must realize that the 'fixer' team is their customer. That means error messages must be in plain English (and must give easily understood steps for resolution) and the system mustn't flood the 'fixer' team with errors. More generally, the 'spotter' team must adapt their system to the needs of the 'fixer' team.
- Everyone must understand that there will be teething problems.
- Where possible, I've aligned incentives (e.g. objectives, OKRs) to make sure everyone is focused on the end goal. If you can't align incentives, this may well sink your project.
As I've hinted, the biggest impediment to success is company culture as represented by people issues. I've run into situations where managers (and teams) have been completely resistant to error detection (even when the company has known data quality issues) and/or resistant to specific checks. I'm going to be frank: if you can't get the 'fixer' team to buy in, the project won't work.
Simplicity, plain English, error levels, and flooding
It's important to start automated error detection with easy errors, like absent or impossible data. For example, a conversion rate is a number between 0 and 1, so a conversion rate of 1.2 is an error.
For each check, make sure the text of the error message uses the simplest language you can. Ideally, bring in the 'fixer' team to work on the text. Error messages should clearly explain the problem, give enough information to locate it, and where possible, give next steps.
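As a sketch, here's what one of these simple checks might look like; the field names and the wording of the messages are purely illustrative and would ideally be written with the 'fixer' team.

```python
# A minimal sketch of a simple range check with a plain-English error message.
# Field names and message text are illustrative only.
def check_conversion_rate(record_id: str, conversion_rate: float | None) -> str | None:
    """Return an error message if the conversion rate is missing or impossible, else None."""
    if conversion_rate is None:
        return (f"Record {record_id}: the conversion rate is missing. "
                "Next step: check whether the upstream feed delivered this field.")
    if not 0.0 <= conversion_rate <= 1.0:
        return (f"Record {record_id}: the conversion rate is {conversion_rate}, "
                "but it must be between 0 and 1. "
                "Next step: check the source system for a unit or calculation error.")
    return None
```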
You should prioritize issues found by the error detection system using an easy-to-understand scheme, for example:
- "FATAL" means an error will cause the system to fail in some way
- "ERROR" means the results of the system will be affected negatively
- "WARNING" means something isn't right and needs further investigation.
- "INFO" means this is FYI and you don't need to take action.
In reality, INFO-type messages will be ignored and you may even receive complaints for generating them. However, they're often a good way to introduce new checks: a new check might start at the "INFO" level while you make sure the 'fixer' team knows how to handle it, then be promoted to "WARNING" to give the team time to adjust, and finally become "ERROR" or "FATAL". The actual process you use is up to you, but you get the point.
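One way to support this kind of promotion (a sketch only; the check names and the idea of keeping severity in configuration are assumptions) is to keep each check's severity outside the check itself, so promoting it is a one-line change.

```python
# A sketch of severity levels kept in configuration so checks can be promoted
# from INFO -> WARNING -> ERROR/FATAL over time. Check names are illustrative.
from enum import IntEnum

class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4

# Severity lives in config, not in the check itself, so promoting a check
# is a one-line change.
CHECK_SEVERITY = {
    "conversion_rate_valid": Severity.ERROR,
    "return_rate_spike": Severity.WARNING,    # recently added, still being tuned
    "new_supplier_feed_gaps": Severity.INFO,  # brand new check, FYI only for now
}
```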
Sometimes, a single problem can trigger multiple error tests. For example, imagine we're dealing with e-commerce data and we have a feed of sales and returns. The sales feed suffers a partial failure. Here's what might happen:
- A sales volume error might be triggered.
- There will be some returns unmatched with sales, so this may trigger another error.
- The return rate figure might spike (because we're missing sales, not missing returns).
- ...
So one failure might cause multiple error messages. This can flood the 'fixer' team with error messages without providing any helpful context. There are two things you have to do as a 'spotter' team:
- You can't flood the 'fixer' team with error messages. It's unhelpful and causes confusion; they won't know where to start to fix the problem. You need to figure out how to meaningfully throttle messages.
- You have to provide higher level diagnosis and fixes. If multiple tests are triggered it may be because there's one cause. Where you can, consolidate messages and indicate next steps (e.g. "I'm seeing sales volume failures, unmatched return failures, and a return rate spike. This may be caused by missing sales data. The next step is to investigate if all sales data is present. Here are details of the failures I've found...")
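Here's a sketch of one way that consolidation could work; the rule that maps these particular low-level failures onto a "missing sales data" diagnosis is an assumption you'd tailor to your own checks.

```python
# A sketch of consolidating related low-level failures into one diagnosis.
# The specific rule (missing sales data) mirrors the e-commerce example above.
def consolidate(triggered_checks: set[str]) -> list[str]:
    messages = []
    missing_sales_signature = {"sales_volume_low", "unmatched_returns", "return_rate_spike"}
    if missing_sales_signature <= triggered_checks:
        messages.append(
            "I'm seeing sales volume failures, unmatched return failures, and a "
            "return rate spike. This may be caused by missing sales data. "
            "Next step: investigate whether all sales data is present."
        )
        triggered_checks = triggered_checks - missing_sales_signature
    # Anything left over is reported individually.
    messages.extend(f"Check failed: {name}" for name in sorted(triggered_checks))
    return messages
```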
These requirements have implications for how an automated error detection system is built.
I'm going to turn now to some math and some checks you should consider.
Floating checks
Let's say you're checking sales data from an automotive retailer. You know that sales go up and down over time, and these fluctuations can play out over months or years. You want to detect sudden upward or downward spikes.
The simplest way of detecting anomalies like this is to use maximum and minimum threshold checks. The problem is, with business changes over time, you can end up falsely triggering errors or missing failures.
Let's imagine that sales are currently $1 million a day and you set an upper error detection threshold of $10 million and a lower threshold of $0.25 million. If you see sales numbers above $10 million or below $0.25 million, you flag an error. As the company grows, it may reach $10 million naturally, falsely triggering an alert. On the flip side, if sales are usually $10 million, a drop to $2 million should trigger an alert, but with a $0.25 million threshold, it won't. The solution is to use floating minimum and maximum values: for any day, we look at the previous 10 days, work out a mean, set thresholds based on that mean (e.g. 2× the mean, 0.5× the mean), and then compare the day's sales to these floating thresholds. In reality, the process is more involved, but you get the point.
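Here's a sketch of that idea using pandas; the 10-day window and the 2×/0.5× multipliers are just the illustrative values from above.

```python
# A sketch of floating min/max thresholds based on a trailing 10-day mean.
import pandas as pd

def flag_sales_anomalies(daily_sales: pd.Series) -> pd.DataFrame:
    """daily_sales: indexed by date, one value per day."""
    # Trailing mean of the previous 10 days (shift(1) excludes the current day).
    trailing_mean = daily_sales.shift(1).rolling(window=10, min_periods=10).mean()
    upper = 2.0 * trailing_mean
    lower = 0.5 * trailing_mean
    return pd.DataFrame({
        "sales": daily_sales,
        "upper_threshold": upper,
        "lower_threshold": lower,
        "flagged": (daily_sales > upper) | (daily_sales < lower),
    })
```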
In practice, most error checks will use some form of floating check.
Deviations from expected distributions
This is a more advanced topic, but you can sometimes use statistics to find when something is wrong. A couple of examples will help.
Let's imagine you're a large online retailer. You have occasional problems with your system falsely duplicating order data, for example, a customer buys a pen but sometimes the data shows it as two pens. The problem with deduplicating this data is that some customers will really buy two pens. Given this, how might you detect the presence of duplicate data in your system?
The answer lies in analyzing the distribution of your data.
Often, with this kind of data there's an expected distribution, let's say it's a Poisson distribution. In the absence of duplication, your order size distribution might look like this.
With 100% order duplication, it looks like this. Although the distributions look the same, if you look more closely you'll see there are no odd values and the maximum value is twice what it was with no duplication.
With 25% order duplication, it looks like this. Note the characteristic zig-zag pattern.
The nice thing is, you don't even need to know what the "real" distribution should look like. All you need to detect is the zig-zag pattern introduced by duplication, or even the absence of odd values. In fact, you can even attempt to quantify how much duplication is present.
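Here's a sketch of one very simple signal you could compute; a production version would be more careful, but the odd/even imbalance is the heart of it. The baseline value in the usage comment is hypothetical.

```python
# A sketch: if orders are being duplicated, odd order sizes become rarer than
# they should be, so the odd/even balance of the distribution shifts.
from collections import Counter

def odd_even_ratio(order_sizes: list[int]) -> float:
    """Fraction of orders with an odd item count. A sudden drop versus the
    historical baseline suggests duplicated orders."""
    counts = Counter(n % 2 for n in order_sizes)
    total = counts[0] + counts[1]
    return counts[1] / total if total else 0.0

# Usage sketch: alert if today's ratio is far below the historical baseline.
# baseline = 0.43  # hypothetical value measured on known-good data
# if odd_even_ratio(todays_order_sizes) < 0.5 * baseline:
#     raise_warning("Possible order duplication: odd order sizes have collapsed.")
```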
Sometimes, you can use expected distributions more directly. Let's say you're maintaining data on company size as measured by the number of employees. You have data on the number of companies with different numbers of employees. Theoretically, this should follow a power law distribution. When you plot the distribution, you see something like this (comparing theoretical (line) and actual (dots)).
This plot tells you that you have some potential anomalies at 10, 100, 1,000 etc., with a huge outlier at 100,000. It's often the case that data is based on estimates and that people use round numbers as estimates. The anomalies at 10, 100, 1,000 might be perfectly OK (you don't need to alert on everything you find), but the count of companies with 100,000 employees seems way off. This kind of extreme discrepancy from an expected distribution may well be worth alerting on.
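As a sketch, you could fit a power law (a straight line on a log-log plot) and flag points sitting far above the fit; the 10× threshold is an arbitrary illustrative choice.

```python
# A sketch: fit a power law (a straight line in log-log space) to the
# company-size distribution and flag counts far above the fitted value.
import numpy as np

def flag_distribution_outliers(sizes: np.ndarray, counts: np.ndarray, factor: float = 10.0):
    """sizes: number of employees; counts: number of companies of that size."""
    log_x, log_y = np.log10(sizes), np.log10(counts)
    slope, intercept = np.polyfit(log_x, log_y, 1)       # power-law fit
    expected = 10 ** (intercept + slope * log_x)
    ratio = counts / expected
    return sizes[ratio > factor], ratio[ratio > factor]  # e.g. the spike at 100,000

# outlier_sizes, how_far_off = flag_distribution_outliers(sizes, counts)
```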
Anomaly detection
It will probably come as no surprise to you to hear that data scientists have applied machine learning to spot anomalies, creating a sub-discipline of "anomaly detection" techniques. The most commonly used method is something called an "isolation forest" which is available from the popular scikit-learn library.
I'm going to suggest some caution here. This approach can take some time to develop and deploy: the model has to be trained and you have to be extremely careful about false positives. You also have to consider the action you want the 'fixer' team to take; it can't be "go look into this". For example, imagine a model that flags something as an anomaly. Without an explanation of why it's an anomaly, it's very difficult for a 'fixer' team to know what to do.
My suggestion is, detect obvious errors first, then develop a machine-learning based anomaly detector and see what it finds. It might be that you only run the anomaly detector on data that you think is clean.
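For completeness, here's a minimal sketch using scikit-learn's IsolationForest; the features and the contamination rate are assumptions you'd have to tune, and as discussed, you still need to translate any flags into something the 'fixer' team can act on.

```python
# A minimal sketch of anomaly detection with scikit-learn's IsolationForest.
# Feature choice and the contamination rate are assumptions to be tuned.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical features: e.g. daily sales, return rate, average order size.
X_train = rng.normal(size=(1000, 3))          # data believed to be clean
X_today = rng.normal(size=(50, 3))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(X_train)

labels = model.predict(X_today)               # -1 = anomaly, 1 = normal
anomalies = np.where(labels == -1)[0]
print(f"{len(anomalies)} of {len(X_today)} records flagged for review")
```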
Putting it all together
You can use error detection systems to find data errors in your system. Because these systems aren't tied to your production system, you can move very fast and add new data checks very quickly. You can also use error detection systems to create data quality metrics.
The main problem you'll face is people issues. These can be severe, so plan accordingly. Make sure goals are aligned and communication and trust are excellent.
Get started with an MVP using simple checks. Debug the people and technical process. Make sure people are resolving issues.
Add new checks as the system becomes more accepted. Make sure your system never produces a tsunami of tickets, and consolidate your findings where you can.
Statistical analysis can reveal errors that other forms of error check can't. Consider using these methods later on in the process.
Use advanced data science methods, like an isolation forest, sparingly and only when the rest of the system is up and running.