Inland shark attacks and police raids
I saw a post on LinkedIn that purported to show the frequency of shark attacks in the United States. There were little red dots along the coastline in all the areas you might expect. But there was one dot way inland, roughly in South Dakota, over 1,000 miles from the nearest ocean. People on LinkedIn were chiming in with explanations, but I knew why that dot was there. I knew it represented a warning that you need to know how your data's been processed if you're to understand it properly. I also knew why the same problem had led to a couple going through the living nightmare of multiple police raids for no reason. Let me tell you a couple of related stories of data processing problems and what they mean for analysis, and of course, shark attacks in South Dakota.
Latitude and longitude of IP addresses
Company X (I'm withholding the real name of the company) collected and sold data on the geographic location of IP addresses. For example, it might determine that an IP address is located in Arlington, Virginia or Lyon, France. The company used a number of methods to determine location, but despite its best efforts, there were some IP addresses it could only resolve to the country level or to a rough geographic area.
The company's location data gave the latitude and longitude of the IP's geographic location. That's great for IPs it can nail down to a specific location, like Moscow, Russia, but what about IPs that it could only locate at a coarser level? In these cases, it gave the geographical center of the region or country. In the case of IPs it could only locate to the US, it assigned them the latitude and longitude of the geographical center of the US (in South Dakota) or the latitude and longitude of the geographical center of the contiguous US (in Kansas). That's where the trouble started.
Police raids
There's a lot of criminal activity on the internet. The police can sometimes track the IP addresses criminals use; if they can get a latitude and longitude for the IP address, they can locate the criminals. It sounds good, doesn't it? It's a staple of TV shows where they show the bad guy tracked to a specific address. But, as you've probably guessed, there's a problem.
An innocent couple bought a farm near the geographic center of the contiguous US. This happened to correspond to the latitude and longitude Company X used for IP addresses it could only resolve to the contiguous US.
Various law enforcement agencies had the IP addresses criminals were using, and using Company X's data, they thought they'd found the location of the criminals. Of course, some of these IPs were only resolved at the country level, so the latitude and longitude pointed to the couple's farm. Law enforcement raided their farm numerous times (see, for example, https://www.bbc.com/news/technology-37048521), making their lives a misery.
After the couple took legal action, Company X solved the problem by moving the center latitude and longitude to the middle of a lake and taking other steps you can read about online. They reached a settlement with the couple.
Huge IP clusters in the middle of nowhere
I have some personal experience of this geo-IP problem. I had geo-IP data I was analyzing and I was looking at the data for the UK. It showed a huge concentration of IP addresses in the north of England, in a forest. I knew this was wrong. After some digging, I found it was the same issue; the data supplier was geolocating IP addresses it couldn't precisely locate to the geographical center of the UK. I spotted the problem and changed my data processing to properly handle this kind of low-resolution data. I was running the project, so making changes was easy. Had I not known the UK, though, I might have thought there was a data center there or maybe theorized some other explanation.
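One way to spot this kind of pile-up without local knowledge is to count how many rows share exactly the same coordinates. What follows is only a rough sketch, not the processing I actually used: the function name, the min_share threshold, and the sample coordinates are all made up for illustration.

```python
from collections import Counter

def find_coordinate_pileups(points, min_share=0.05, decimals=4):
    """Return ((lat, lon), count) pairs that hold a large share of all rows.

    points: iterable of (lat, lon) tuples.
    min_share: hypothetical threshold; a coordinate is flagged when at least
        this fraction of all rows sits on it.
    decimals: coordinates are rounded to this many places before counting.
    """
    rounded = [(round(lat, decimals), round(lon, decimals)) for lat, lon in points]
    total = len(rounded)
    if total == 0:
        return []
    counts = Counter(rounded)
    return [(coord, n) for coord, n in counts.most_common() if n / total >= min_share]

# Made-up sample: three distinct points plus seven rows on one arbitrary spot.
sample = [(51.5074, -0.1278), (53.4808, -2.2426), (55.9533, -3.1883)] + [(54.0, -2.0)] * 7
print(find_coordinate_pileups(sample, min_share=0.5))
```

A flagged coordinate isn't proof of a problem (a real data center would also show up), but it tells you where to start digging.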
Inland shark attacks
The internet shark attack data was geo data. You might expect that you would know for sure the location of a shark attack, but sometimes data is partially or badly entered. If shark attack data in the US is incompletely entered and the location isn't specified, then the most precise location will be the US (country-level location, nothing more), and the latitude and longitude of the attack will be the geographical center of the US, which is in South Dakota.
(LinkedIn image - I couldn't find the original or the copyright holder)
What surprised me about the LinkedIn "discussion" was the lack of knowledge of the possible cause of the problem. As soon as I saw the map I knew what the likely explanation was. The only surprise was the lack of more inland dots. For example, there should have been dots for the center of New York state etc. (representing attacks where only the state is known). Of course, it's impossible to investigate further because the original source of the data was never given.
COVID data
A similar sort of data collection issue happened during COVID. In many places, coroners' offices are closed over the weekend. This means a death might be registered as happening on a Monday when it actually happened on a Friday night. Of course, there can also be reporting delays etc. (see https://ourworldindata.org/excess-mortality-covid). The net effect was that COVID death figures were updated unevenly, with later corrections. People who don't normally watch or handle data didn't understand what was going on. Conspiracy theorists seized on this as evidence of malfeasance.
Malware data
It's not just sharks, police raids, and COVID. I was analyzing some malware data and I noticed some oddities. Attacks were bunched around certain times in a very weird pattern. I also had trouble matching the timestamps of attacks with other data sources I had on the same attacks. It turned out the data collection process had reassigned the timestamp: it was a processing timestamp, not an observation timestamp. That was why the times were so weird; data on different attacks were being processed at the same time. The bottom line for me was, I couldn't trust the malware data because of the way it had been prepared.
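A quick check for this sort of bunching is to measure how many events share the same exact timestamp. The sketch below is an illustration rather than the analysis I ran; the function name, the top_n parameter, and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

def share_in_largest_batches(timestamps, top_n=3):
    """Fraction of events that fall on the top_n most common exact timestamps.

    A value close to 1.0 hints that the timestamps describe batch processing
    runs rather than the moments the events were observed.
    """
    timestamps = list(timestamps)
    if not timestamps:
        return 0.0
    counts = Counter(timestamps)
    in_top = sum(n for _, n in counts.most_common(top_n))
    return in_top / len(timestamps)

# Made-up sample: eight events stamped in two batches, two with distinct times.
sample = (
    [datetime(2024, 1, 1, 2, 0, 0)] * 5
    + [datetime(2024, 1, 1, 3, 0, 0)] * 3
    + [datetime(2024, 1, 1, 2, 17, 41), datetime(2024, 1, 1, 2, 48, 9)]
)
print(share_in_largest_batches(sample, top_n=2))  # 0.8
```

What counts as "too bunched" depends on the data, so treat the number as a prompt to ask the supplier what the timestamp field really means.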
Provenance, chain of custody, and false accuracy
You should never entirely trust your data. All kinds of processing issues can give strange results. You can end up trying to explain shark attacks in South Dakota rather than knowing you're looking at data entry errors or processing issues.
In the geo-IP case, location was given by country/region/town fields and latitude and longitude. If the region and/or town wasn't known, the data was null, so the lack of certainty was correctly coded. However, the latitude and longitude were still precisely given. This story points out the need to properly understand your data, how it's coded, and its level of accuracy.
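Because the null fields already encode the uncertainty, you can use them to label each record with the finest level at which its location is actually known, and then ignore the coordinates of anything coarser than you need. This is a minimal sketch assuming records are dictionaries with country, region, town, lat, and lon keys; those field names and the sample rows are my own invention, not the supplier's format.

```python
def location_resolution(record):
    """Return the finest location level that is populated in the record."""
    if record.get("town") is not None:
        return "town"
    if record.get("region") is not None:
        return "region"
    if record.get("country") is not None:
        return "country"
    return "unknown"

# Made-up rows using documentation-only IP addresses and illustrative coordinates.
records = [
    {"ip": "192.0.2.1", "country": "US", "region": "Virginia", "town": "Arlington",
     "lat": 38.88, "lon": -77.10},
    {"ip": "192.0.2.2", "country": "US", "region": None, "town": None,
     "lat": 39.0, "lon": -98.0},
    {"ip": "192.0.2.3", "country": "US", "region": "Kansas", "town": None,
     "lat": 38.5, "lon": -98.0},
]

print([location_resolution(r) for r in records])  # ['town', 'country', 'region']

# Only plot or act on coordinates whose resolution matches the question asked.
town_level = [r["ip"] for r in records if location_resolution(r) == "town"]
print(town_level)  # ['192.0.2.1']
```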
I've seen similar issues with number of digits. People have assumed the time of an event is known very precisely because exact timestamps are given, but instead the timestamp is driven by the needs of the data processing or storage system. For example, the timing of a modern volcanic eruption might be known and stored in a database. The timestamp field might require a specific date or time. This is great for modern eruptions, but if the database is extended to include ancient eruptions, you either change the database format or give specific dates and times for eruptions thousands or millions of years ago. In these kinds of systems, the year is often a round number, such as 1600 CE or 10,000 BCE, and the time is often given as 12:00 or 0:00. If you're doing date and time analysis on volcanic data, you could get some very weird results.
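A simple sanity check here is to see what fraction of timestamps land exactly on midnight or noon, since real events rarely do. The sketch below assumes the timestamps are Python datetime objects (which can't represent BCE dates, so the sample sticks to CE years); the function name and the sample values are made up.

```python
from datetime import datetime, time

# Times of day that often act as placeholders when the real time is unknown.
PLACEHOLDER_TIMES = {time(0, 0), time(12, 0)}

def placeholder_time_share(timestamps):
    """Fraction of timestamps whose time of day is exactly 00:00 or 12:00."""
    timestamps = list(timestamps)
    if not timestamps:
        return 0.0
    hits = sum(1 for t in timestamps if t.time() in PLACEHOLDER_TIMES)
    return hits / len(timestamps)

# Made-up sample: three placeholder-looking entries and one precise one.
sample = [
    datetime(1600, 1, 1, 0, 0),
    datetime(1800, 1, 1, 12, 0),
    datetime(1900, 1, 1, 0, 0),
    datetime(2021, 3, 4, 8, 32, 17),
]
print(placeholder_time_share(sample))  # 0.75
```

A high share suggests the time-of-day part of the field carries no information for those rows, so any hour-of-day analysis should exclude them.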
If you're analyzing data, here's what you need to do:
- Understand where it comes from (provenance). Do you trust the supplier?
- Understand how it was collected and processed (chain of custody). Do the fields in the data mean what you think they mean?
- Check that the accuracy of the data is consistent across fields and across observations.