Saturday, July 26, 2025

Police and sharks: how not knowing how your data is processed can lead you badly astray

Inland shark attacks and police raids

I saw a post on LinkedIn that purported to show the frequency of shark attacks in the United States. There were little red dots along the coastline in all the areas you might expect. But there was one dot way inland, roughly in South Dakota, over 1,000 miles from the nearest ocean. People on LinkedIn were chiming in with explanations, but I knew why that dot was there. It was a warning: you need to know how your data has been processed if you're to understand it properly. I also knew why the same problem had led to a couple enduring the living nightmare of repeated police raids for no reason. Let me tell you a couple of related stories about data processing problems and what they mean for analysis, and, of course, about shark attacks in South Dakota.

(Canva)

Latitude and longitude of IP addresses

Company X (I'm withholding the real name of the company) collected and sold data on the geographic location of IP addresses. For example, it might determine that an IP address is located in Arlington, Virginia or Lyon, France. The company used a number of methods to determine location, but despite its best efforts, there were some IP addresses it could only resolve to the country level or to a rough geographic area.

The company's location data gave the latitude and longitude of the IP's geographic location. That's great for IPs it could nail down to a specific place, like Moscow, Russia, but what about IPs it could only locate at a coarser level? In those cases, it gave the geographical center of the region or country. For IPs it could only locate to the US, it assigned the latitude and longitude of the geographical center of the US (in South Dakota) or of the geographical center of the contiguous US (in Kansas). That's where the trouble started.

(Geographical center of the contiguous US - Wikipedia.)

Police raids

There's a lot of criminal activity on the internet. The police can sometimes track the IP addresses criminals use; if they can get a latitude and longitude for an IP address, they can locate the criminals. It sounds good, doesn't it? It's a staple of TV shows, where the bad guy is tracked to a specific address. But, as you've probably guessed, there's a problem.

An innocent couple bought a farm near the geographic center of the contiguous US. This happened to correspond to the latitude and longitude Company X used for IP addresses it could only resolve to the contiguous US. 

Various law enforcement agencies had the IP addresses criminals were using, and Company X's data seemed to tell them where those criminals were. Of course, some of these IPs were only resolved at the country level, so the latitude and longitude pointed to the couple's farm. Law enforcement raided the farm numerous times (see, for example, https://www.bbc.com/news/technology-37048521), making the couple's life a misery.

After the couple took legal action, Company X solved the problem by moving the center latitude and longitude to the middle of a lake and taking other steps you can read about online. They reached a settlement with the couple.

Huge IP clusters in the middle of nowhere

I have some personal experience of this geo-IP problem. I was analyzing geo-IP data for the UK, and it showed a huge concentration of IP addresses in the north of England, in a forest. I knew this was wrong. After some digging, I found it was the same issue: the data supplier was geolocating IP addresses it couldn't precisely locate to the geographical center of the UK. I spotted the problem and changed my data processing to properly handle this kind of low-resolution data. I was running the project, so making changes was easy. Had I not known the UK, though, I might have thought there was a data center there, or theorized some other explanation.

Inland shark attacks

The internet shark attack data was geo data too. You might expect the location of a shark attack to be known for sure, but sometimes data is partially or badly entered. If a US shark attack is incompletely entered and the location isn't specified, then the most precise location available is the US itself (country-level location, nothing more), and the latitude and longitude of the attack becomes the geographical center of the US, which is in South Dakota.

(LinkedIn image - I couldn't find the original image or the copyright holder)

I've replicated the LinkedIn image above. Note that the original poster(s) didn't link back to the source of the chart or the data set, so it's impossible to do any checking. The data set is supposedly shark attacks in the US, but other countries are shown too (except Canada, for some reason). The "USA" chart leaves off Alaska and Hawaii, and I'm sure there are shark attacks in Hawaii. I did my own sleuthing, but I couldn't find the origin of the chart.

There were a couple of notable features about the LinkedIn discussion:

  • No one challenged the poster on the origin, consistency, or validity of the data set and chart.
  • The discussion about causes was very ill-informed, with almost no one suggesting data collection issues as the cause.

The minute I saw this chart, I knew data collection and entry was the likely cause. The only surprise for me was the lack of more inland dots: for example, there should have been dots at the center of New York state and other states, representing attacks where only the state is known.
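As an illustration, here's a minimal sketch (in Python, with pandas) of the kind of check that would catch this: default "centroid" locations show up as exact coordinate duplicates, because every under-specified record gets the same latitude and longitude. The file and column names are assumptions, not from the original data set.

```python
import pandas as pd

df = pd.read_csv("shark_attacks.csv")  # hypothetical file name

# Round away floating-point noise, then count exact coordinate pairs.
coords = df[["latitude", "longitude"]].round(4)
counts = coords.value_counts()

# Coordinate pairs shared by many records are candidates for default
# centroids (e.g., the geographical center of the US in South Dakota).
print(counts[counts > 10])
```

If the geographical center of the US tops the list, you're looking at a data entry artifact, not a hotspot of inland shark attacks.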

COVID data

A similar sort of data collection issue happened during COVID. In many places, coroners' offices are closed over the weekend, which means a death that happened on a Friday night might be registered as happening on Monday. There can also be reporting delays and other lags (see https://ourworldindata.org/excess-mortality-covid). The net effect was that COVID death figures updated unevenly and were later corrected. People who don't normally watch or handle data didn't understand what was going on, and conspiracy theorists seized on the unevenness as evidence of malfeasance.
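Here's a minimal sketch of how you might spot this kind of registration artifact: tally deaths by day of the week. If weekend registrations are rolled into Monday, Monday will be over-represented. The file and column names are assumptions.

```python
import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("covid_deaths.csv", parse_dates=["date"])

# Total deaths by day of week; a spike on Monday suggests weekend deaths
# are being registered on Monday, not an actual Monday effect.
print(df.groupby(df["date"].dt.day_name())["deaths"].sum())
```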

Malware data

It's not just sharks, police raids, and COVID. I was analyzing some malware data and noticed some oddities: attacks were bunched around certain times in a very weird pattern, and I had trouble matching the timestamps of attacks with other data sources covering the same attacks. It turned out the data collection process reassigned the timestamp: it was a processing timestamp, not an observation timestamp, which is why the times were so weird; different attack data were being processed at the same time. The bottom line for me was that I couldn't trust the malware data because of the way it had been prepared.
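A simple check for this kind of batching: if a handful of timestamps account for a large share of the records, the field is recording processing time, not observation time. Here's a minimal sketch, with hypothetical file and column names.

```python
import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("malware_events.csv", parse_dates=["timestamp"])

# If many unrelated events share exactly the same timestamp, the
# "observation time" is probably a batch processing time.
top = df["timestamp"].value_counts().head(10)
print(top)
print(f"Top 10 timestamps cover {top.sum() / len(df):.1%} of all records")
```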

Provenance, chain of custody, and false accuracy

You should never entirely trust your data. All kinds of processing issues can give strange results. You can end up trying to explain shark attacks in South Dakota rather than knowing you're looking at data entry errors or processing issues.

In the geo-IP case, location was given by country/region/town fields and by latitude and longitude. If the region and/or town wasn't known, those fields were null, so the lack of certainty was correctly coded. However, the latitude and longitude were still given precisely. This story points out the need to properly understand your data: how it's coded and its level of accuracy.
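A cross-field consistency check makes this kind of false precision visible: records with null town and region fields shouldn't carry pinpoint coordinates, and if they do, those coordinates are worth investigating. Here's a minimal sketch, with hypothetical file and column names.

```python
import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("geoip.csv")

# Precision implied by the text fields...
coarse = df["town"].isna() & df["region"].isna()

# ...versus the apparent precision of the coordinates.
print(f"{coarse.mean():.1%} of rows are country-level only")
print(df.loc[coarse, ["latitude", "longitude"]].value_counts().head())
```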

I've seen similar issues with the number of digits given. People have assumed the time of an event is known very precisely because exact timestamps are given, when in fact the timestamp is driven by the needs of the data processing or storage system. For example, the timing of a modern volcanic eruption might be known and stored in a database whose timestamp field requires a specific date and time. That's fine for modern eruptions, but if the database is extended to include ancient eruptions, you either change the database format or give specific dates and times for eruptions thousands or millions of years ago. In these kinds of systems, the year is often given in units of 10 (so 1600 CE or 10,000 BCE) and the time is often given as 12:00 or 0:00. If you're doing date and time analysis on volcanic data, you could get some very weird results.
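Placeholder values like these are easy to screen for: times of exactly 12:00 or 0:00, and years divisible by 10, will be far more common than chance allows. Here's a minimal sketch for a data set of modern eruptions (pandas datetimes can't represent dates thousands of years BCE, so older records would need a different representation); the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("eruptions.csv", parse_dates=["eruption_time"])

ts = df["eruption_time"]
placeholder_time = ts.dt.strftime("%H:%M").isin(["12:00", "00:00"])
round_year = ts.dt.year % 10 == 0

print(f"Placeholder times: {placeholder_time.mean():.1%} of records")
print(f"Years divisible by 10: {round_year.mean():.1%} of records")
```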

If you're analyzing data, here's what you need to do:

  • Understand where it comes from (provenance). Do you trust the supplier?
  • Understand how it was collected and processed (chain of custody). 
    • Do the fields in the data mean what you think they mean?
    • Do data storage requirements affect the data?
  • Check that the accuracy of the data is consistent across fields and across observations.
Know your data!

Tuesday, July 22, 2025

Visualizing multi-dimensional data: score distributions in English football

Sometimes, visualization is very, very difficult

I want to visualize the frequency of scores in English league football from 1888 to today, so I can see if there are any discernible patterns, e.g., is a score of 2-1 becoming more common than a score of 1-2? The trouble is, I've run into problems figuring out how to visualize my data in a way that provides insight. This blog post is all about the things I've tried and what I've discovered about score distributions. If you're trying to visualize complex data, this might be helpful.

The dataset is scores and their frequency for English league football games by season and by league, so it looks like this:

Season     League tier  Home goals  Away goals  Frequency
1888-1889  1            0           0           0.015152
1888-1889  1            0           1           0.007576
1888-1889  1            0           2           0.037879
1888-1889  1            0           3           0.015152
1888-1889  1            0           4           0.022727
...

I want to investigate whether there are differences by league and changes over time. The obvious problem is that this data set has five dimensions.
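For the sketches in this post, the starting point is the same slicing step: pick one league and one season, and pivot the rows into a home-goals by away-goals frequency matrix. Here's a minimal sketch in Python with pandas; the file name is hypothetical, and the column names follow the table above.

```python
import pandas as pd

df = pd.read_csv("scores.csv")  # hypothetical file name

# One season, one tier, pivoted into a home x away frequency matrix.
one = df[(df["Season"] == "1888-1889") & (df["League tier"] == 1)]
matrix = one.pivot_table(index="Home goals", columns="Away goals",
                         values="Frequency", fill_value=0)
print(matrix)
```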

3D bar charts

For one league and one season, I can plot a 3D bar chart. Here are two examples for the top tier: the 1888-1889 season (left) and the 2024-2025 season (right). In 2025 it was called the Premier League; in 1888 it was the only tier. There's plainly a big change in score distributions, so how can we investigate it over time?
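Here's a minimal sketch of how a chart like this can be drawn with matplotlib's bar3d, reusing the `matrix` DataFrame from the pivot sketch above (an assumption).

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# One bar per (home goals, away goals) cell, with height = frequency.
xs, ys = np.meshgrid(matrix.index, matrix.columns, indexing="ij")
ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(xs.size),
         0.8, 0.8, matrix.to_numpy().ravel())
ax.set_xlabel("Home goals")
ax.set_ylabel("Away goals")
ax.set_zlabel("Frequency")
plt.show()
```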

Although these 3D bar charts have some use, they suffer from a number of problems:

  • The chart perspective means some scores can be hidden by others; you can change the viewpoint to reduce the occlusion, but it's not a good solution.
  • I can't show multiple charts on the same diagram because the 3D nature of the plot makes things confusing. Comparing two charts side by side means you can only pick out gross differences.
  • Finally, there are over 100 seasons for the top tier and there are currently five tiers, which is an awful lot of charts to compare.

Bottom line: 3D bar charts don’t fit the bill.

3D animated bar charts

I can animate the 3D bar chart so I can see changes over time. This doesn’t let me see different leagues on the same chart, but I can show leagues in separate charts. The two charts below show how scoring evolves for the top tier (currently called the Premier League) from 1888, and the lowest tier (currently called the National League) from 1979.



The animations show two different things. For the Premier League, large numbers of goals have become less frequent, and it's much more common now to see low-scoring games (e.g., 0-0 or 2-1) than high-scoring games (e.g., 5-4 or 6-0). For both leagues, there's notable variation season to season.
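For what it's worth, here's a minimal sketch of how an animation like this can be built with matplotlib's FuncAnimation. It assumes the `df` data frame from earlier and a hypothetical `matrix_for(season)` helper that returns the pivoted frequency matrix for one season.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

seasons = sorted(df["Season"].unique())
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

def draw(i):
    # Redraw the 3D bars for season i.
    ax.clear()
    m = matrix_for(seasons[i])  # hypothetical helper
    xs, ys = np.meshgrid(m.index, m.columns, indexing="ij")
    ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(xs.size),
             0.8, 0.8, m.to_numpy().ravel())
    ax.set_title(seasons[i])

anim = FuncAnimation(fig, draw, frames=len(seasons), interval=200)
anim.save("scores.gif")  # requires the pillow writer
```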

Other than that, it’s slim pickings. Animated 3D charts still suffer from many of the same problems as non-animated 3D charts. Time to try something else.

Heatmaps

Here are two heatmaps for two seasons for the top tier. 

I find these easier to analyze than the 3D bar charts because bars aren't blocking the view of other bars. On the downside, it's hard to resolve fine differences by color, and of course, it's hard to see multiple leagues on the same chart. There are over 100 seasons' worth of data, and I don't want to look at over 100 charts.
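Here's a minimal sketch of the heatmap, again reusing the pivoted `matrix` from the earlier sketch (an assumption).

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# origin="lower" puts the 0-0 score in the bottom-left corner.
im = ax.imshow(matrix, origin="lower", cmap="viridis")
ax.set_xlabel("Away goals")
ax.set_ylabel("Home goals")
fig.colorbar(im, ax=ax, label="Frequency")
plt.show()
```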

Animated heatmaps

To investigate changes over time, I animated the heatmaps for the top tier and the fifth tier.

For my money, this is a better representation than the animated 3D bar chart. It's clear that scores have become concentrated around 1-1 since 1888, but there are exceptions: big scores still happen.

Of course, animations are nice, but maybe a slider control would be a bit better? I decided to try a different plotting package to see if a slider makes things easier. On the chart below, click on the slider and move it around to see the changes over time.
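For reference, here's a minimal sketch of a slider-driven heatmap using Plotly Express; setting animation_frame adds the slider and play button automatically. It assumes the `df` data frame and the column names from the table earlier, and it's a sketch of the general approach rather than the exact package and settings used for the chart below.

```python
import plotly.express as px

top_tier = df[df["League tier"] == 1]
fig = px.density_heatmap(top_tier,
                         x="Away goals", y="Home goals", z="Frequency",
                         histfunc="sum", animation_frame="Season")
fig.show()
```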

The problem with heatmaps is that it can be hard to see smallish changes; although the animation and sliders are helpful, they don't really let you see changes over time very clearly.

Line chart grids

One obvious data representation is to draw a grid, with the axes representing home and away goals. Each box in the grid represents a score, and each box contains a line chart showing the change in frequency over time for that score. That's what I've done in the chart below.

All of these charts share the same x-axis and y-axis scales: the x-axis is the season, from 1888 to the present, and the y-axis is frequency. Some scores don't occur at all in the data set, so their boxes are empty. I've labeled each chart with the score it represents. Here's a zoom-in on the 1-1 score chart.

On this chart, the x-axis is the year and the y-axis is the frequency; to keep clutter down, I haven't labeled the axes.
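Here's a minimal sketch of how a grid like this can be built with matplotlib subplots, using the `df` data frame from earlier and a hypothetical `max_goals` cutoff to limit the grid size.

```python
import matplotlib.pyplot as plt

max_goals = 4  # hypothetical cutoff; the full grid is much larger
fig, axes = plt.subplots(max_goals + 1, max_goals + 1,
                         sharex=True, sharey=True, figsize=(12, 12))

for home in range(max_goals + 1):
    for away in range(max_goals + 1):
        ax = axes[home, away]
        score = df[(df["Home goals"] == home) &
                   (df["Away goals"] == away) &
                   (df["League tier"] == 1)].sort_values("Season")
        ax.plot(score["Season"], score["Frequency"])
        ax.set_title(f"{home}-{away}", fontsize=8)
        ax.set_xticks([])  # keep clutter down, as in the original chart

plt.tight_layout()
plt.show()
```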

The big grid chart is very squished-up here on this blog post, but on my big screen, I can make sense of it. I can see some trends:

  • ‘Extreme’ scores are rare and there’s been little change since 1888, but they do occur.
  • Low scores (e.g., 5 or fewer goals per match) are much more common than extreme scores, and that’s been the way mostly since 1888.
  • Home advantage exists, but it’s getting smaller.
  • All of the 'action' is in the bottom left corner.

The obvious downside is how incredibly busy the big chart is. It’s helpful to just look at a subset (the bottom left) as I’ve shown below.

It would be possible to show multiple leagues on this type of plot. Maybe not on the full grid, but certainly on a subset.

Where I’ve ended up

Reviewing all of these charts, here are my takeaways about chart choice for multi-dimensional data.

  • The best approach is the line chart grid. The downside is that the grid can get huge; the upside is that it shows you where to focus your analysis. Notably, the coding effort for the line chart grid was the lowest of all the methods I tried.
  • Heatmap animations are helpful, but they're really something that looks good rather than something that gives you a lot of insight. Heatmaps, whether animated or not, are better than 3D bar charts.
  • 3D bar charts look pretty, but they’re not very useful.

And what did I discover about scores? The frequency of different scores hasn’t changed much over time. Low scores are much more frequent than high scores, but high scoring games still occur. In the top tier, there’s a noticeable drop in home wins.

The role of AI code generation

I used AI code generation to help with this investigation. Cursor was my editor, and I used its code completion extensively. Notably, Cursor did really badly at generating animations, so I used Claude to generate example code to get me started. Claude gave me a good starting place, but I had to extensively modify what it gave me. Cursor's code generation for some of the more complex Pandas operations wasn't good either.

I found getting started with an example far more helpful than using documentation or StackOverflow; in fact, a big waste of time for me was trying to get examples sourced from websites to work. Once I told Claude what I wanted, things went much more quickly.

The bottom line is, AI was a very helpful tool, but not magic fairy dust.
