Monday, November 8, 2021

Football crazy: predicting Premier League football match results

I can get a qualification and be rich!

A long time ago, I was part of a gambling syndicate. A friend of mine had some software that predicted the results of English football (soccer) matches and at the time, betting companies offered fixed-price odds for certain types of bets. My friend noticed his software predicted 3-2 away wins more often than the betting company's odds would suggest. Over the course of a season, we had a 20% return on our gambling investment. 

During the COVID lockdown, I took the opportunity to learn R and did a long course that included a capstone project. I decided to see if I could forecast English Premier League (EPL) matches. If I succeeded, I could get a qualification and get rich too! What's not to like? Here's the story of what I did and what happened.

Premier League data

There's an eighteenth-century recipe for a hare dish that supposedly includes the instructions "First, catch your hare." The first step in any project like this is getting your data.

I got match results going back to the start of the league (1993) from football-data. The early data is only match results, but later data includes red cards and some other measurements.

TransferMarkt has data on transfer fees, foreign-born players, and team age, but the data's only available from 2011.

At the time of the project, I couldn't find any other free data sources. There were and are paid-for sources, but they were way beyond what I was willing to pay.

I knew going into the next phase of the project that this was a small data set without many fields. As it turned out, data was a severely limiting factor.

What factors are important?

I had a set of initial hypotheses for factors that might be important for final match scores. Here are most of them:

  • team cost - more expensive teams should win more games
  • team age - younger teams perform better
  • prior points - teams with more points win against teams with fewer points
  • foreign-born players - the more non-English players on the team, the more the team will win
  • previous match results - successful (winning) teams win more matches
  • home-field advantage
  • disciplinary record - red and yellow card history might be an indicator of risk-taking
  • season effects - as the season wears on, teams take more risks to win matches

I found evidence that most of these did in fact have an impact.

Here's strong evidence of home-field advantage. Note how it goes away during the 2020-2021 season when matches were played without fans.

Here's goal difference vs. team cost difference. The more expensive team tends to win.

Here's goal difference vs. mean prior goal difference. Teams that scored more goals before tend to score more goals in their current match.

I found more relationships you can read about if you're interested.

Machine learning

Thinking back to my gambling syndicate, I decided to forecast the score of each match rather than just win/lose/draw. My loss function was the RMSE of the goal difference between the predicted score and the actual score. To avoid COVID oddities, I removed the 2020-2021 season (the price being a smaller data set). Of course, I used a training and holdout dataset and cross-validation. 
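To make the loss function concrete, here is a minimal Python sketch of RMSE (the project itself was done in R, so this is an illustration rather than the code I used). The function name `rmse` and the example numbers are made up for this sketch; it treats RMSE as the root of the mean squared gap between predicted and actual goals.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical numbers: predicted vs. actual away goals for four matches.
predicted_away_goals = [1.2, 0.9, 1.5, 1.1]
actual_away_goals = [1, 0, 3, 1]
print(round(rmse(predicted_away_goals, actual_away_goals), 3))
```

Because the errors are squared before averaging, one badly missed match (the 3-goal game above) dominates the score.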

The obvious question is, which machine learning models work? I decided to try a whole bunch of them:

  • Naive mean score model. A simple model that’s just the mean scores of the (training) data set.
  • Generalized Linear Model. A generalization of ordinary linear regression.
  • Glmnet. Fits lasso and elastic-net regularized generalized linear models.
  • SVM. Support Vector Machines - boundary-based regression. After some experimentation, I selected the svmRadial form of SVM, which uses a non-linear kernel function.
  • KNN. K-nearest neighbors. Given that EPL scores are all in close proximity to one another, we might expect this model to return good results.
  • Neural nets.
  • XGB Linear. This is linear modeling with extreme gradient boosting. Extreme gradient boosting has gathered a lot of attention over the last few years and may be one of the most used machine learning models today.
  • XGB Tree. This is a decision tree model with extreme gradient boosting.
  • Random Forest.

The model results weren't great. For the KNN model, here's how the RMSE for full-time away goals varied with n, the number of neighbors.

Note the RMSE scale - the lowest it goes to is 1.1 goals and it's plain that increasing n will only take us a little closer to 1.1. Bear in mind, football is a low-scoring game, and being off by 1 goal is a big miss.

It was the same story for random forest.

In fact, it was the same story for all of the models. Here are my final results. My model forecast home goals and away goals.

The naive means model is the simplest, and all my sophisticated models could do was give me a few percentage points of improvement.
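Here is a rough Python sketch of what a naive mean baseline looks like and how a "few percentage points" of improvement can be measured against it. The goal counts, variable names, and the `percent_improvement` helper are all invented for illustration; they are not my project data or my R code.

```python
import math
from statistics import mean


def rmse(predicted, actual):
    """Root-mean-square error between predictions and outcomes."""
    return math.sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))


def percent_improvement(model_rmse, baseline_rmse):
    """How much lower (in percent) a model's RMSE is than the baseline's."""
    return 100 * (baseline_rmse - model_rmse) / baseline_rmse


# Made-up home-goal counts standing in for training and holdout matches.
train_home_goals = [2, 1, 0, 3, 1, 2]
holdout_home_goals = [1, 2, 0, 4]

# The naive model predicts the training mean for every holdout match.
naive_prediction = mean(train_home_goals)
baseline_rmse = rmse([naive_prediction] * len(holdout_home_goals), holdout_home_goals)

print(naive_prediction, baseline_rmse)
# A hypothetical model RMSE, compared against the baseline.
print(percent_improvement(1.425, baseline_rmse))
```

Any model worth keeping has to beat this baseline by a clear margin on the holdout data, which is the comparison my final results make.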

Improving the results

Perhaps the most obvious way forward is combining models to improve RMSE. I'm reluctant to do that until I can get better individual model results. There's a philosophical issue at play; for me, the ensemble approach feels a bit "spray and pray".

In my view, data shortage is the main problem:

  • My data set was only in the low thousands of matches. 
  • Some teams join the Premier League for just a season and then get relegated - I don't model their history prior to joining the league. 
  • I removed the COVID season of 2020-2021. 
  • I only had team value and disciplinary data for ten or so seasons. 
  • Of course, I only modeled the Premier League.

Football is a low-scoring game, famous for its upsets. It may well be that it's just too random underneath to make useful predictions at the individual match level. 

What's next?

I wasn't able to predict EPL results with any great accuracy, but I submitted my report and got my grade. If you want to read my report, you can read it here.

At the end of the 2021 season, I saw some papers published on the COVID effect on match results. I had similar results months before. Perhaps I should have submitted a paper myself.

At some point, I might revive this project if I can get new data. I still occasionally hunt for new data sources, but sadly, I haven't found any. My dreams of retiring to a yacht on the Mediterranean will have to wait.

Monday, November 1, 2021

Why conditional probability screwiness matters for business

Things are not what they seem

Many business decisions come down to common sense or relatively simple math. But applying common sense to conditional probability problems can lead to very wrong results as we'll see. As data science becomes more and more important for business, decisions involving conditional probability will arise more often. In this blog post, I'm going to talk through some counter-intuitive conditional probability examples and where I can, I'll tell you how they arise in a business context.

(These two pieces of track are the same size. Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.)

Testing for diseases

This is the problem with the clearest links to business. I'll explain the classical form of the problem and show you how it can come up in a business context.

Imagine there's some disease affecting a small fraction of the population, say 1%. A university develops a great test for the disease:

  • If you have the disease, the test will give you a positive result 99% of the time. 
  • If you don't have the disease, the test will give you a negative result 99% of the time.

You take the test and it comes back positive. What's the probability you have the disease? 

(COVID test kit. Centers for Disease Control and Prevention, Public domain, via Wikimedia Commons)

The answer is 50%.

If you want an explanation of the 50% number, read the section "The math"; if you want to know how it comes up in business, skip to the section "How it comes up in business".

The math

What's driving the result is the low prevalence of the disease (1%). 99% of the people who take the test will be uninfected and it's this that pushes down the probability of having the disease if you test positive. 

There are at least two ways of analyzing this problem: one uses a tree diagram and the other uses Bayes' Theorem. In a previous blog post, I went through the math in detail, so I'll just summarize the simpler explanation using a tree diagram. To make it easier to understand, I'll assume a population of 10,000.

Of the 10,000 people, 100 have the disease, and 9,900 do not. Of the 100, 99 will test positive for the disease. Of the 9,900, 99 will test positive for the disease. In total 99 + 99 will test positive, of which only 99 will have the disease. So 50% of those who test positive will have the disease.
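The same arithmetic can be written as a short Python sketch. The function and parameter names (`posterior_given_positive`, `prevalence`, `sensitivity`, `specificity`) are my own labels for this illustration; exact fractions are used so no rounding creeps in.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, sensitivity, specificity):
    """Probability of having the condition given a positive test (Bayes' Theorem).

    prevalence:  fraction of the population with the condition
    sensitivity: P(test positive | condition present)
    specificity: P(test negative | condition absent)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# The numbers from the text: 1% prevalence, 99% / 99% test accuracy.
p = posterior_given_positive(Fraction(1, 100), Fraction(99, 100), Fraction(99, 100))
print(p)  # 1/2

# The tree-diagram version with a population of 10,000.
population = 10_000
sick = population // 100           # 100 people have the disease
healthy = population - sick        # 9,900 do not
sick_positive = sick * 99 // 100   # 99 true positives
healthy_positive = healthy // 100  # 99 false positives
print(sick_positive / (sick_positive + healthy_positive))  # 0.5
```

Trying other prevalence values shows how quickly the answer moves: the rarer the condition, the less a positive result tells you.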

How it comes up in business

Instead of disease tests, let's think of websites and algorithms. Imagine you're the CEO of a web-based business. 1% of the visitors to your website become customers. You want to identify who'll become a customer, so you task your data science team with developing an algorithm based on users' web behavior. You tell them the test is to distinguish customers from non-customers.

They come back with a complex test that's positive for 99% of customers and negative for 99% of non-customers. Do you have a test that can predict who will become a customer and who won't?

This is the same problem as before: if the test is positive for a user, there's only a 50% chance they'll become a customer.

How many daughters?

This is a classic problem and shows the importance of describing a problem exactly. Exactly, in this case, means using very precise English.

Here's the problem in its original form from Martin Gardner:

  1. Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?
  2. Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?
(What's the probability of two girls? Circle of Robert Peake the elder, Public domain, via Wikimedia Commons)

The solution to the first problem is simple. Assuming boys or girls are equally likely, then it's 50%.

The second problem isn't simple and has generated a great deal of debate, even 60 years after Martin Gardner published the puzzle. Depending on how you read the question, the answer is either 50% or 33%. Here's Khovanova's explanation:

"(i) Pick all the families with two children, one of which is a boy. If Mr. Smith is chosen randomly from this list, then the answer is 1/3.

(ii) Pick a random family with two children; suppose the father is Mr. Smith. Then if the family has two boys, Mr. Smith says, “At least one of them is a boy.” If he has two girls, he says, “At least one of them is a girl.” If he has a boy and a girl he flips a coin to say one or another of those two sentences. In this case, the probability that both children are the same sex is 1/2."

In fact, there are several other possible interpretations.
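A quick simulation makes the difference between the two readings visible. This is a minimal Python sketch under stated assumptions: boys and girls are equally likely, and reading (ii) is modeled by having the father mention the sex of one child picked at random, which matches Khovanova's coin-flip rule. The function name and trial count are my own choices.

```python
import random


def simulate_two_children(trials=200_000, seed=1):
    """Estimate P(both boys) under the two readings of the Mr. Smith puzzle."""
    rng = random.Random(seed)
    filtered_total = filtered_both_boys = 0
    said_boy_total = said_boy_both_boys = 0

    for _ in range(trials):
        children = (rng.choice("BG"), rng.choice("BG"))
        both_boys = children == ("B", "B")

        # Reading (i): keep every family that has at least one boy.
        if "B" in children:
            filtered_total += 1
            filtered_both_boys += both_boys

        # Reading (ii): the father mentions the sex of a randomly chosen child;
        # keep only the families where he says "at least one of them is a boy".
        if rng.choice(children) == "B":
            said_boy_total += 1
            said_boy_both_boys += both_boys

    return filtered_both_boys / filtered_total, said_boy_both_boys / said_boy_total


p_filtered, p_statement = simulate_two_children()
print(f"(i)  filtered families:   {p_filtered:.3f}")   # close to 1/3
print(f"(ii) father's statement:  {p_statement:.3f}")  # close to 1/2
```

The families are identical in both cases; only the process that produced the sentence "at least one of them is a boy" differs.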

What does this mean for business? Some things that sound simple aren't and differences in the precise way a problem is formulated can give wildly different answers.

Airline seating

Here's the problem as stated in an MIT handout:

"There are 100 passengers about to board a plane with 100 seats. Each passenger is assigned a distinct seat on the plane. The first passenger who boards has forgotten his seat number and sits in a randomly selected seat on the plane. Each passenger who boards after him either sits in his or her assigned seat if it is empty or sits in a randomly selected seat from the unoccupied seats. What is the probability that the last passenger to board the plane sits in her assigned seat?"

You can imagine a lot of seat confusion, so it seems natural to assume that the probability of the final passenger sitting in her assigned seat is tiny. 

(Ken Iwelumo, GFDL 1.2, via Wikimedia Commons)

Actually, the probability of her sitting in her assigned seat is 50%.

StackOverflow has a long discussion on the solution to the problem that I won't repeat here.
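If you'd rather check the 50% figure than take it on trust, here is a minimal simulation sketch in Python. It follows the boarding rules in the handout; the function names, the trial count, and the convention that passenger i is assigned seat i are assumptions made for the sketch.

```python
import random


def last_passenger_gets_own_seat(n_seats, rng):
    """Simulate one boarding; passenger i is assigned seat i (0-based)."""
    free = set(range(n_seats))

    # The first passenger has forgotten their seat and picks one at random.
    free.remove(rng.randrange(n_seats))

    # Everyone except the first and last passenger boards in order.
    for passenger in range(1, n_seats - 1):
        if passenger in free:
            free.remove(passenger)                 # own seat is empty: take it
        else:
            free.remove(rng.choice(sorted(free)))  # otherwise pick a random free seat

    # Exactly one seat is left; is it the last passenger's own?
    return (n_seats - 1) in free


def estimate_last_seat_probability(n_seats=100, trials=20_000, seed=1):
    rng = random.Random(seed)
    hits = sum(last_passenger_gets_own_seat(n_seats, rng) for _ in range(trials))
    return hits / trials


print(round(estimate_last_seat_probability(), 3))  # close to 0.5
```

Changing `n_seats` to other values of 2 or more should leave the estimate near 0.5, which hints at why the answer doesn't depend on the size of the plane.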

What does this mean for business? It's yet another example of our intuition letting us down.

The Monty Hall problem 

This is the most famous of all conditional probability problems and I've written about it before. Here's the problem as posed by Vos Savant:

"A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.

The host asked the contestant to choose a door, but not open it.

Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.

Here’s the key question: the host asks the contestant "do you want to change doors?".

Once the contestant decided whether to switch or not, the host opens the contestant's chosen door and the contestant wins the car or a goat.

Should the contestant change doors when asked by the host? Why?"

Here are the results.

  • If the contestant sticks with their initial choice, they have a ⅓ chance of winning.
  • If the contestant changes doors, they have a ⅔ chance of winning.
I go through the math in two previous blog posts: "The Monty Hall Problem" and "Am I diseased? An introduction to Bayes theorem".
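The ⅓ and ⅔ figures are easy to check with a simulation. Here's a minimal Python sketch, assuming the host always opens a goat door the contestant didn't pick and chooses at random when two such doors are available; the function names and trial count are mine.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)

    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])

    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)

    return pick == car


def win_rate(switch, trials=50_000, seed=1):
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"stick:  {win_rate(switch=False):.3f}")  # close to 1/3
print(f"switch: {win_rate(switch=True):.3f}")   # close to 2/3
```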

Once again, this shows how counter-intuitive probability questions can be.

What should your takeaway be? What can you do?

Probability is a complex area and common sense can lead you wildly astray. Even problems that sound simple can be very hard. Things are made worse by ambiguity; what seems a reasonable problem description in English might actually be open to several possible interpretations which give very different answers.

(Sound judgment is needed when dealing with probability. You need to think like a judge, but you don't have to dress like one. InfoGibraltar, CC BY 2.0, via Wikimedia Commons)

If you do have a background in probability theory, it doesn't hurt to remind yourself occasionally of its weirder aspects. Recreational puzzles like the daughters' problem are a good refresher.

If you don't have a background in probability theory, you need to realize you're liable to make errors of judgment with potentially serious business consequences. It's important to listen to technical advice. If you don't understand the advice, you have three choices: get other advisors, get someone who can translate, or hand the decision to someone who does understand. 

Sunday, October 3, 2021

Battle leadership: some lessons for managers from World War I

Battle leadership

Books on military leadership and management have been popular in the business world for a long time. "The Art of War" is a best-seller 2,500 or so years after it was written, and books authored by US military leaders have consistently sold well. To state the obvious, business is not war and companies are not armies, but even so, military leaders still have lessons to offer; the art is picking out what applies and what doesn't.

I recently stumbled across an old military leadership book dating back to World War I. Although the world has changed greatly since then, I found some of the ideas and discussions still relevant to today's business environments. Read on to find out more.


(The book and a German soldier in World War I. Internet Archive Book Images, No restrictions, via Wikimedia Commons.)

The book and its history

The book is "Battle Leadership" and was first published in 1933 in the US. The author was Adolf von Schell (1893-1967), an officer in the German army in the First World War. He led troops in the European theater during the war, winning a number of medals and commendations [https://de.wikipedia.org/wiki/Adolf_von_Schell_(General,_1893)]. After the war, he trained at the US Army's Fort Benning where he was asked to speak about his experiences leading men in combat. His talks became the book, "Battle Leadership", which was published in English (strangely, the German translation was published much later).

The book is an odd mix of psychology, management, and battlefield stories, not all of which are relevant to the world of business. Plainly, warfare has moved on a lot since the First World War; tactics and strategy have changed greatly, but what hasn't changed are some of the core ideas of people management, as we shall see.

Core ideas

Battlefield psychology

Schell states that in "modern" warfare, people fight in small groups, often as individuals, against an enemy they can't see. Therefore, commanders need to know how individuals are likely to react and how they can be influenced. A unit may well have soldiers from diverse backgrounds, so the commander has to have an appreciation of their culture. Similar lessons apply further up the command hierarchy; a general must know his subordinates and how to motivate them: "Furthermore, each one reacts differently at different times, and must be handled each time according to his particular reaction". Schell gives a crude example in the book, but despite the crudity, the underlying lesson is clear.

He has some valid points to make about the need for individuals to exert some measure of control over their situation. Soldiers who wait under hostile fire have time to think and become stressed because they can't change their situation; they lie waiting for bullets to hit them. Soldiers on patrol are more at risk, but their destiny is in their hands so they're willing to go out and take control of their situation. He talks about soldiers under fire moving their position to be more secure: "it makes no difference whether or not the security is real; it is simply a question of feeling that it is".

Men under fire need some measure of security. Schell gives the example of a commander who ordered his men to have haircuts while their position was being shelled. The point isn't the haircuts; the point is the sense of normality the haircutting process gave. Even though men died during the shelling, morale stayed high because the team had a sense of security.

Experience matters in many ways

In several places throughout the book, Schell gives examples of how experienced troops behave compared to troops that had not seen combat. He gives an example from the early days of the war when his company of inexperienced troops first crossed into Belgium; despite meeting no resistance, they shot at shadows in the forest and spent a restless first night afraid of an attack that never came. By contrast, later in the war, he led battle-hardened troops in Russia. Despite similar circumstances, they didn't shoot at shadows in the forest; instead, they posted two sentries and the rest of the company slept soundly, even though they were in enemy territory.

As a practical matter, he recommends mixing inexperienced and battle-hardened troops. He comments that even on a day-to-day basis, and away from battle, seasoned troops coach the inexperienced troops on what to do and how to behave. He similarly cautions against changing commanders; a commander has to get to know his troops, and wartime is not an ideal time to do it:
"If we give these inexperienced troops a backbone of experienced soldiers and experienced commanders their efficiency will be tremendously increased and they will be spared heavy losses."

False data and preparation

I'm going to quote two of his lessons verbatim:
"(1) At the commencement of war, soldiers of all grades are subject to a terrific nervous strain. Dangers are seen on every hand. Imagination runs riot. Therefore, teach your soldiers in peace what they may expect in war, for an event foreseen and prepared for will have little if any harmful effect.
(2) As leaders be careful both in sending and in receiving reports. At the commencement of a war, ninety percent of all reports are false or exaggerated."

Change the word "war" for "competitive situation" and you get something obviously relevant for business.

Orders based on incomplete information

Quite correctly, Schell points out that leaders have to make decisions based on partial information, and on information which is doubtful at best: 
"In open warfare a leader will have to give his orders without having complete information. At times only his own will is clear. If he waits for complete information before acting he will never make a decision."

Orders, improvisation, and maps

This is my favorite part of the book. It recalls an action where the Germans and Russians faced off against one another in Russian territory. The German commander received continual updates on the Russians' position and he changed his tactics in response to the new information. His commands to his men were clear, simple, and to the point. Here's his summary:
"This example shows clearly that difficult situations can be solved only by simple decisions and simple orders. The more difficult the situation the less time there will be to issue a long order, and the less time your men will have to understand it. Moreover, the men will be high-strung and tense. Only the simplest order can be executed under such conditions."

In the same action, the Germans were facing a larger Russian opponent. They needed to watch a Russian position but didn't have the troops to. A corporal solved the problem. There was a large herd of cows in a nearby village, so the Germans moved the cows to a field between them and the Russians. Whenever the Russians tried to advance across the field, they disturbed the cows, so alerting the Germans.

(Cows are free watchdogs. Jonas Eppler, CC BY-SA 4.0, via Wikimedia Commons)

Eventually, of course, superior numbers prevailed and the Germans had to retreat. The commander knew where to retreat to, but didn't have a map. Waiting for a map would have meant defeat, so based only on a rough knowledge of geography, they retreated, eventually joining up with other companies. The commander had to take a risk by retreating into unknown territory, but not retreating would have been more dangerous.

Mission orders

Notably, Schell discusses 'mission tactics' which are better known as 'mission orders' today. The idea is simple: commanders will achieve more if they can exert some control, so orders should focus on the mission and not on how it's to be executed. There's a sound operational reason too: "This is done because the commander on the ground is the only one who can correctly judge existing conditions and take the proper action if a change occurs in the situation". The relevance to modern business is obvious: give senior people goals to achieve and give them the freedom to do it in whatever way they can.

Some miscellaneous quotes

I found little nuggets of wisdom throughout the book; here are some I want to share:

  • "To leave the bulk of the artillery behind may strike the reader as dangerous but I believe the decision to do so was correct. The Germans were pursuing and almost anything can be dared where opposed to a beaten opponent. Everything had to be sacrificed to speed if the Russians were to be overtaken. In this situation legs were the important thing, not cannon."
  • "The importance of surprise in war cannot be overestimated. As it becomes increasingly difficult to obtain so does it become increasingly effective when it is obtained. No effort should be spared to make the decisive element of surprise work for us in war."
  • "There is only one opportunity to issue detailed orders and that is before battle. When the action has actually begun, orders must be short and simple."
  • "Every fight develops differently than is expected. Officers and troops must realize this in peace, in order that they will not lose courage when the unexpected occurs in war."

There's very little new in management

This book was first published in 1933 based on Schell's experience in war over the period 1914-1918. There's more in this short book than I've seen in some much longer and more recent management books, and frankly, there's more of substance in this book than I've heard from some highly-paid consultants I've worked with. Is this the only management book you should read? Absolutely not. Does it contain some interesting insights? Yes it does.

The book was published in 1933 and the author died in 1967. It's not clear to me what the copyright situation is. You can buy a copy cheaply on Amazon, but you can also find free PDFs available online from legitimate sources.

Monday, August 16, 2021

The seven dysfunctionalities of management books

The problems with popular management books

Over the years, I've read many management books ranging from the excellent to the terrible. I've noticed several dysfunctionalities that creep into even some of the best books. I'm going to list them out in what I think is their order of importance. See what you think.

The seven dysfunctionalities

My idea is worth 30 pages, I'll write 300

With few exceptions, books fall into this trap. The author could express their ideas in a few pages and provide supporting evidence that would fill a few pages more. Of course, the economics of books means they can't. There's no market and no money in a 30-page pamphlet (when was the last time you paid $20 for 30 pages?) but there's a huge market for books. The logic is clear: spin out your idea to book-length and make some money.

This is a little odd for two reasons:

  • Business writing emphasizes brevity and getting to the point quickly - neither of which management books usually do.
  • No one has disrupted the market. Maybe our business culture and market economics mean disruption is impossible?

What I say is important, I worked with important people at important companies

This is a relatively new dysfunction. The author claims their work is important, not because of its widespread adoption, or because many people had success with it, but because they held senior positions at well-known companies in Silicon Valley. Usually, these books have lots of stories of famous people, some of which offer insight and some of which don't. In a few cases, the storytelling degenerates into name-dropping.

My evidence will be stories or bad data

The plural of anecdote is not data. Why should I believe your experience generalizes to me? Storytelling is important, but it doesn't amount to a coherent management framework. According to the esteemed Karl Popper, science is about making falsifiable statements - what falsifiable statements do stories make?

The other form of dysfunctional evidence is bad data. The problems here are usually regression to the mean, small sample sizes, or a misunderstanding of statistics. There are examples of management gurus developing theories of winning companies but whose theories were proved wrong almost as soon as the ink was dry on their books. This might be why newer books focus on storytelling instead.

I'll write a worse sequel and then an even worse sequel to that

Even the best authors fall prey to this trap. They publish a best-selling book and the temptation is there to write a sequel. The second book is usually so-so, but might sell well. So they write a third book which is even worse, and so on.

I'll create new words for old ideas

Here the author rediscovers psychology or sociology that's been known for decades. Sometimes, they'll admit it and provide a new twist on old ideas; but sometimes it's just known ideas repackaged. In any case, the author usually creates a trendy buzzy phrase for their idea, preferably one they can trademark for their consultancy practice.

I'll talk about my time in the military

The military does have some very interesting things to teach managers. Unfortunately, most of the military books for business management focus on events without providing much in the way of context for what happened and why. When they explain how it can be used in a civilian setting, it feels clunky and unconvincing. These military books also tend to focus on successes and brush over failures (if they mention any at all). This is sad because I've read some really great older military management books that have something to offer today's managers.

I'll push my consulting company

This is the original sin and the cause of many of the other sins. After the success of their book, the author forms a consultancy company. They create a 2nd edition that includes cherry-picked success stories from their consulting company, or maybe they write a second book with anecdotes from their consulting work. The book then becomes a 'subtle' promo for their consulting work.

Don't throw the baby out with the bathwater

I'm not saying that popular business management books have no value; I'm saying very few of them will have value in ten years' time when the hype has passed. Think back to the business books published ten or twenty years ago. How many stand up now?

Despite the faddish nature of the genre, most business management books have the core of some good ideas; you just have to wade through the nonsense to get there.

What should you do?

Every manager needs a framework for decision-making. My suggestion is to get that framework from training and courses and not popular business books. Use quotes to get some extra insight. Management business books are useful for a refresher of core ideas, even if you have to wade through 300 pages instead of 30. If nothing else, the popular books are a handy guide to what your peers are reading now.

Monday, August 9, 2021

Criminal innovations: narco-subs

How do you transport lots of drugs internationally without getting caught?

The United States is one of the world's largest consumers of illegal drugs but the majority of the illegal drugs it consumes are manufactured in South America. Illegal drug producers need to transport their product northwards at the lowest price while evading detection. They've tried flying, but radar and aircraft have proved effective at stopping them, and they've tried boats, but coastguard patrols and radar have again stopped them. If you can't go over the water, and you can't go on the water, then how about going under the water? Drug cartels have turned to submarines and their variants for stealthy transportation. These submarines go by the generic name of narco-subs. As we'll see, it's not just the South Americans who are building submarines for illegal activities.

South American narco-subs

The experts on transporting drugs long distances by sea are the South American drug cartels; they've shown an amazing amount of innovative thinking over the years. Currently, they're using three main types of craft: low-profile vessels, submarines, and torpedoes. Low-profile vessels and submarines typically have small crews of 2-4 people, while torpedoes are uncrewed.

Low-profile vessels (LPVs)

To avoid radar and spotter planes, the cartels have turned to stealth technology; they've designed boats that have a very low radar cross-section with the smallest possible above-the-sea structures. 

(A low-profile vessel that was intercepted. Image source: US Customs and Border Protection.)
(Another low-profile vessel. Image source: US Customs and Border Protection.)

These vessels originally started as variations on existing commercial speedboats, with modifications to make them run lower in the water. Now, they're custom designs, typically long and thin, designed to pierce waves rather than ride over them. A typical newer LPV might be 3m wide by 30m long - quite a long vessel, but very narrow. H.I. Sutton describes several types of LPV in his Forbes article.

Submarines

There are various types of narco-subs, ranging from semi-submersibles to full-on submarines.

Semi-submersibles ride just below the surface, typically at snorkel depth. This image of a 2019 semi-submersible captured off Peru gives you the general idea.

(Semi-submersible narco-sub, Peru, 2019. Image source: Wikimedia Commons.)

The vessel is plainly based on a 'standard' boat and is designed to run just under the water. The very few above-surface structures make the vessel hard to spot with radar, or even from the air.

The Peruvian vessel is plainly a modified boat, but custom-built vessels exist. Here's an image of one custom semi-submersible used by Colombian drug smugglers just before its capture in 2007. The blue paint job is camouflage.

(Semi-submersible narco-sub caught in 2007. Image source: Wikimedia Commons.)

This September 2019 image shows USCG boarding a 12m semi-submersible in the eastern Pacific. It had a crew of 4 and was carrying $165mn in cocaine.

(Source: Navy Times)

The drug cartels have created true submarines capable of traveling under the water to depths of a few hundred feet. Some of these submarines have even reached the astonishing length of 22m, making them comparable to midget submarines used by the world's navies (see Covert Shores comparison). 

In 2010, this 22 m-long monster was discovered in the Ecuadorian jungle. NPR has a long segment on how it was found and what happened next. The sub is estimated to have a range of 6,800 nautical miles and a dive depth of 62 feet. These numbers aren't impressive by military standards but bear in mind, this sub is designed for stealth, not warfare.

(22m long, fully submersible narco-sub. Image source: Wikimedia Commons.)

This isn't even the largest sub found. Hannah Stone reports on one narco-sub with a length of 30m, a crew of 4, air conditioning, and a small kitchen!

In November 2019, a narco-sub was caught in Galicia in Spain. Although the design was nothing new, its origin was. Authorities believe it started its journey in Brazil, crossing the Atlantic Ocean to get to Spain (Covert Shores). This vessel was a semi-submersible design.

Bear in mind, all these submarines were built surreptitiously, often far away from population centers, which means no cutting-edge machine tools or precision parts and limited material supply. The subs are often constructed using wood and fiberglass - not special-purpose alloys.

Torpedoes

This is a relatively new innovation. Torpedoes are submersible vessels typically towed behind fishing vessels or other ships. If the ship is intercepted, the torpedo is cut loose, and after a period of time, it surfaces a camouflaged marker, allowing it to be retrieved after the authorities have gone.

This article on Insight Crime describes how torpedoes work in practice.

European narco-subs

It's not just the South Americans who are creating narco-subs; the Europeans are at it too. In February 2020, Spanish police raided a warehouse in Málaga where they found a very sophisticated narco-sub under construction. This is a well-constructed vessel, using hi-tech parts imported from countries around Europe. The paint job isn't accidental either - it's all about stealth.


(Image source: Europol)

Covert Shores reports that this is the fourth narco-sub caught in Spain.

Transporting cars illegally

So far, I've focused on narco-subs and drug trafficking, but similar technology has been used for other criminal activities. In China, Armored Stealth Boats have been used to traffic stolen luxury cars. The whole thing seems to be so James Bond, it can't be true, but it is. Covert Shores has an amazing article and images on the whole thing.

Some disturbing thoughts

There's a tremendous amount of risk-taking going on here; how many of these subs end up at the bottom of the sea? On the flip side, how many are getting through undetected? Of course, if large amounts of drugs can be transported this way, what about other contraband? Many of these subs are constructed with relatively primitive equipment and materials. What could a rogue nation-state do with up-to-date machine tools and modern materials?

Innovation - but for the wrong ends

All this innovation is amazing. The idea of constructing a submarine in the jungles of South America with limited materials and piloting it across the Atlantic is incredible. The sad thing is, all this creative effort is in support of criminal activity. It would be great if this get-up-and-go could be directed at something that benefits people instead. It seems to me that the fundamental problem is the economic incentive system - drugs pay well and there are few alternatives in the jungle. 

Reading more

The expert on narco-subs, and indeed on many OSINT aspects of naval warfare, is H.I. Sutton, who produces the website Covert Shores. If you want to read more details about narco-subs, check out his great website.

USNI covers stories on narco-subs and other naval topics.

"Narco-Submarines: Specially Fabricated Vessels Used for Drug Smuggling Purposes" is a little old, but it's still good background reading.

Monday, August 2, 2021

Poleaxed opinion polls: the ongoing 2020 disaster

Why the polls failed in the US Presidential Election of 2020

In the wake of the widespread failure of opinion polls to accurately predict the outcome of the 2020 US Presidential election, the American Association for Public Opinion Research (AAPOR) commissioned a study to investigate the causes and make recommendations. Their findings were recently released.

(This is the key question for 2020 opinion pollsters. The answer is yes, but they don't know why. Image source: Wikimedia)

Summary of AAPOR's findings

I've read the report and I've dug through the findings. Here's my summary:

  1. The polls overstated support for Democratic candidates.
  2. We don't really know why.
  3. Er... that's it.

Yes, I'm being harsh, but I'm underwhelmed by the report and I find some of the statements in it unconvincing. I'll present some of their main findings and talk through them. I encourage you to read the report for yourself and reach your own conclusions.

(We don't know why we didn't get the results right.)

Factors they ruled out for 2020

  • Late-breaking changes in favor of Republican candidates. This happened in 2016 but didn't happen in 2020. The polls were directionally consistent throughout the campaign.
  • Weighting for education. In 2016, most polls didn't weight for education and education did seem to be a factor. In 2020, most polls did weight for education. Educational weighting wasn't a factor.
  • Pollsters got the demographics wrong. Pollsters don't use random sampling, they often use stratified sampling based on demographics. There's no evidence that errors in demographics led to widespread polling errors in 2020.
  • People were afraid to say they voted for Trump. In races not involving Trump, the opinion polls were still wrong and still favored Democratic candidates. Trump wasn't the cause.
  • Intention to vote vs. actually voting. The results can't be explained by voters saying they were going to vote but who didn't actually vote. For example, if Democratic voters said they were going to vote Democratic and didn't actually vote, this would explain the error, but it didn't happen.
  • Proportion of early voters or election day voters. Early voting/election day voting didn't make a difference to the polling error.

Factors they couldn't rule out

  • Republican voters chose not to take part in surveys at a higher rate than Democratic voters.
  • The weighting model used to adjust sampling may have been wrong. Pollsters use models of the electorate to adjust their results. If these models are wrong, the results will be biased.
  • Many more people voted in 2020 than in 2016 ("new voters" in the report) - maybe pollsters couldn't model these new voters very well.

Here's a paragraph from the report:

"Unfortunately, the ability to determine the cause or causes of polling error in 2020 is limited by the available data. Unless the composition of the overall electorate is known, looking only at who responded says nothing about who did not respond. Not knowing if the Republicans (or unaffiliated voters, or new voters) who responded to polls were more supportive of Biden than those who did not respond, for example, it is impossible to identify the primary source of polling error."

Let me put that paragraph another way: we don't have enough data to investigate the problem so we can't say what went wrong.

Rinse and repeat - or just don't

I'm going to quote some sentences from the report's conclusions and comments:

  • "Considering that the average margin of error among the state-level presidential polls in 2020 was 3.9 points, that means candidate margins smaller than 7.8 points would be difficult to statistically distinguish from zero using conventional levels of statistical significance. Furthermore, accounting for uncertainty of statistical adjustments and other factors, the total survey error would be even larger."
  • "Most pre-election polls lack the precision necessary to predict the outcome of semi-close contests."
  • "Our investigation reveals a systemic overstatement of the Democratic-Republican margin in nearly every contest, regardless of mode or proximity to the election. This overstatement is largest in states with more Republican supporters"
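The arithmetic behind the first quote can be sketched as follows. This is a minimal sketch, not the report's method: it assumes simple random sampling, a 95% confidence level (z = 1.96), and a hypothetical sample size of 631 chosen only because it gives a margin of error of about 3.9 points. The function names are my own.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error on one candidate's share (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n)


def margin_of_error_on_lead(n, p1, p2, z=1.96):
    """Margin of error on the lead p1 - p2 when both shares come from
    the same sample (multinomial variance of a difference)."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


n = 631  # hypothetical sample size, not taken from the report
single = margin_of_error(n)
lead = margin_of_error_on_lead(n, 0.5, 0.5)
print(f"margin of error on one share: {100 * single:.1f} points")
print(f"margin of error on the lead:  {100 * lead:.1f} points")
```

For an evenly split two-candidate race, the margin of error on the lead is twice the margin of error on a single share, which is where the report's 7.8 points comes from.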

Some of the report's statements are extraordinary if you stop and think for a moment. I want you to ponder the key question: "what use are polls?"

The people paying for polls are mostly (but not completely) political campaigns and the media. The media want to report on an accurate snapshot of where the election is now and make an assessment of who will win. Political campaigns largely want the same thing. 

In places like Alaska or Hawaii, polls aren't very useful because voters tend to vote strongly Democratic or Republican. For example, Wyoming is overwhelmingly a Republican stronghold, and Washington D.C. a Democratic stronghold. My forecast for 2024 is simple: Wyoming will vote Republican and Washington D.C. Democratic. 

Polls are useful where the race is close, or, in the words of the report "semi-close". But, according to the report, polls in semi-close states don't have sufficient accuracy to predict the result.

So, if polls aren't useful in strongly Democratic or Republican states, and they lack predictive power in "semi-close" races, what use are they? Why should anyone pay for them?

There's an even deadlier issue for polling organizations. You can very clearly judge the accuracy of political opinion polls. Opinion poll companies run all kinds of polls on all kinds of topics, not just elections. How accurate are they in other areas where their success is harder to assess?

Where to next?

The polling industry has an existential credibility crisis. It can't continue to sell a product that doesn't work. It's extraordinary that an industry that's been around for nearly 100 years doesn't have the data to diagnose its failures. The industry needs to come together to fix its problems as soon as possible - or face irrelevancy in the near future.

Monday, July 26, 2021

Reconstructing an unlabelled chart

What were the numbers?

Often in business, we're presented with charts where the y-axis is unlabeled because the presenter wants to conceal the numbers. Are there ways of reconstructing the labels and figuring out what the data is? Surprisingly, yes there are.

Given a chart like this:

you can often figure out what the chart values should be.

The great Evan Miller posted on this topic several years ago ("How To Read an Unlabeled Sales Chart"). He discussed two methods:

  • Greatest common divisor (gcd)
  • Poisson distribution

In this blog post, I'm going to take his gcd work a step further and present code and a process for reconstructing numbers under certain circumstances. In another blog post, I'll explain the Poisson method.

The process I'm going to describe here will only work:

  • Where the underlying data is integers
  • Where there's 'enough' range in the underlying data.
  • Where the maximum underlying data is less than about 200.
  • Where the y-axis includes zero. 

The results

Let's start with some results and the process.

I generated this chart without axis labels, the goal being to recreate the underlying data. I measured screen y-coordinates of the top and bottom plot borders (187 and 677) and I measured the y coordinates of the top of each of the bars. Using the process and code I describe below, I was able to correctly recreate the underlying data values, which were \([33, 30, 32, 23, 32, 26, 18, 59, 47]\).

How plotting packages work

To understand the method, we need to understand how a plotting package will render a set of integers on a chart.

Let's take the list of numbers \([1, 2, 3, 5, 7, 11, 13, 17, 19, 23]\) and call them \(y_o\). 

When a plotting package renders \(y_o\) on the screen, it will put them into a chart with screen x-y coordinates. It's helpful to think about the chart on the screen as a viewport with x and y screen dimensions. Because we only care about the y dimensions, that's what I'll talk about. On the screen, the viewport might go from 963 pixels to 30 pixels on the y-axis, a total range of 933 y-pixels.

Here's how the numbers \(y_o\) might appear on the screen and how they map to the viewport y-coordinates. Note the screen origin is top left, not bottom left. I'll "correct" for the different origin.

The plotting package will translate the numbers \(y_o\) to a set of screen coordinates I'll call \(y_s\). Assuming our viewport starts from 0, we have:

\[y_s = my_o\]

Let's just look at the longest bar that corresponds to the number 23. My measurements of the start and end are 563 and 27, which gives a length of 536. \(m\) in this case is 536/23, or 23.3.

There are three things to bear in mind:

  • The set of numbers \(y_o\) are integers
  • The set of numbers \(y_s\) are integers - we can't have half a pixel for example.
  • The scalar \(m\) is a real number
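The mapping from data to pixels can be sketched in a few lines. This is a minimal illustration, assuming the plotting package simply scales each value by \(m\) and rounds to the nearest pixel; real packages may round differently. The baseline coordinate of 563 comes from the measurement above, and the variable names are my own.

```python
y_o = [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]

m = 536 / 23   # pixels per data unit, from the longest bar
bottom = 563   # screen y-coordinate of the bottom of the bars

# Bar lengths in whole pixels: y_s = m * y_o, rounded.
y_s = [round(m * v) for v in y_o]

# Screen coordinates of the bar tops (the screen origin is top left,
# so a longer bar has a smaller y-coordinate).
bar_tops = [bottom - length for length in y_s]

print(y_s)
print(bar_tops)
```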

Integer only solutions for \(m\) 

In Evan Miller's original post, he only considered integer values of \(m\). If we restrict ourselves to integers, then most of the time:

\[m = gcd(y_s)\]

where gcd is the greatest common divisor.

To see how this works, let's take:

\[y_o = [1 , 2,  3]\]

and

\[m = 8\]

These numbers give us:

\[y_s = [8, 16, 24]\]

To find the gcd in Python:

np.gcd.reduce([8, 16, 24])

which gives \(m = 8\), which is correct.
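Once we have \(m\), recovering the original data is a division. Here's a small standard-library sketch of the whole round trip for this example (the variable names are mine):

```python
from functools import reduce
from math import gcd

y_s = [8, 16, 24]

# Estimate m as the greatest common divisor of the screen lengths.
m = reduce(gcd, y_s)

# Divide through by m to recover the original data.
y_o = [v // m for v in y_s]

print(m, y_o)
```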

If we could guarantee \(m\) was an integer, we'd have an answer; we'd be able to reconstruct the original data just using the gcd function. But we can't do that in practice for three reasons:

  1. \(m\) isn't always an integer.
  2. There are measurement errors which means there will be some uncertainty in our \(y_s\) values.
  3. It's possible the original data set \(y_o\) has a gcd which is not 1.

In practice, we gather screen coordinates using a manual process, which introduces errors. We're likely to be off by a few pixels at most for each measurement, but even the smallest error means the gcd method won't work. For example, if the value on the screen should be 500 but we measure it as 499, the method fails (there is a way around this failure that works for small measurement errors).

If our original data set has a gcd greater than 1, the method won't work. Let's say our data was:

\[y_o = [2, 4, 6] \]

and:

\[m=8\]

we would have:

\[y_s = [16, 32, 48]\]

which has a gcd of 16, which is an incorrect estimate of \(m\). In practice, the odds of the original data set \(y_o\) having a gcd > 1 are low.
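Both failure modes are easy to reproduce. In this sketch, `recover` is a hypothetical helper of mine and the screen values \([100, 250, 500]\) are made up for illustration; the \([16, 32, 48]\) case is the one above.

```python
from functools import reduce
from math import gcd


def recover(y_s):
    """Estimate m as gcd(y_s) and divide through to estimate y_o."""
    m = reduce(gcd, y_s)
    return m, [v // m for v in y_s]


# Exact measurements: m = 50 and the data is recovered.
print(recover([100, 250, 500]))

# One measurement off by a single pixel: the gcd collapses to 1.
print(recover([100, 250, 499]))

# Original data with a gcd of 2: m is overestimated as 16, not 8.
print(recover([16, 32, 48]))
```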

The real killer for this approach is the fact that \(m\) is highly likely in practice to be a real number.

Real solutions for \(m\)

The only way I've found for solving for \(m\) is to try different values for \(m\) to see what succeeds. To get this to work, we have to constrain \(m\) because otherwise there would be an infinite number of values to try. Here's how I constrain \(m\):

  • I limit the steps for different \(m\) values to 0.01.
  • I start my \(m\) values from just over 1 and I stop at a maximum \(m\) value, which I get by assuming the smallest value I measure on the screen corresponds to a data value of 1. For example, if the smallest measurement is 24 pixels and the smallest possible original data value is 1, the maximum value for \(m\) is 24. 
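These two constraints define a finite grid of candidate values. Here's a minimal standard-library sketch of that grid; `candidate_dividers` is a hypothetical name, the lengths are made up, and this version ignores measurement error in the upper bound.

```python
STEP = 0.01


def candidate_dividers(y_lengths, step=STEP):
    """Candidate m values from just over 1 up to the smallest length."""
    max_m = min(y_lengths)
    n_steps = int(round((max_m - 1) / step))
    # Multiply rather than repeatedly add to limit floating point drift.
    return [1 + step * k for k in range(1, n_steps + 1)]


candidates = candidate_dividers([24, 48, 120])
print(len(candidates), candidates[0], candidates[-1])
```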

Now we've constrained \(m\), how do we evaluate \(y_s = my_o\)? First off, we define an error function. We want our estimates of the original data \(y_o\) to be integers, so the further away we are from an integer, the worse the error. For the \(i\)th element of our estimate of \(y_o\), the error estimate is:

\[round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}}\]

we're choosing the least square error, which means minimizing:

\[ \frac{1}{n} \sum  \left ( round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}} \right )^2 \]

in code, this comes out as:

sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

Our goal is to try different values of \(m\) and choose the solution that yields the lowest error estimate.
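To get a feel for the error function, here's a small sketch that evaluates it for two candidate values of \(m\). The lengths are the bar lengths from the worked example in this post; the two candidate values are just ones I picked for illustration.

```python
def mse(y, div):
    """Mean squared distance of y/div from the nearest integers."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


y = [195, 177, 188, 136, 188, 154, 106, 348, 278]

good = mse(y, 5.9)  # a divider close to the true scale
bad = mse(y, 7.0)   # an arbitrary wrong divider

print(good, bad)
```

The divider near the true scale leaves every value close to an integer, so its error is much smaller than the error for the wrong divider.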

The solution in practice

Before I show you how this works, there are two practicalities. The first is that \(m=1\) is always a solution and will always give a zero error, but it's probably not the right solution, so we're going to ignore \(m=1\). Secondly, there will be an error in our measurements due to human error. I'm going to assume the maximum error is 3 pixels for any measurement. To calculate a length, we take a measurement of the start and end of the bar (if it's a bar chart), which means our maximum uncertainty is 2*3. That's why I set my maximum \(m\) to be min(y) + 2*MAX_ERROR.
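Both practicalities can be checked quickly. This sketch reuses the error function and the bar lengths from the worked example in this post; `upper_bound` is a name I made up.

```python
MAX_ERROR = 3


def mse(y, div):
    """Mean squared distance of y/div from the nearest integers."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


y = [195, 177, 188, 136, 188, 154, 106, 348, 278]

# Integer measurements divided by 1 are already integers, so the
# error at m = 1 is exactly zero whatever the data is.
error_at_one = mse(y, 1)

# Largest m worth trying: the shortest bar plus the worst-case error
# from measuring its start and its end.
upper_bound = min(y) + 2*MAX_ERROR

print(error_at_one, upper_bound)
```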

To show you how this works, I'll talk you through an example.

The first step is measurement. We need to measure the screen y-coordinates of the plot borders and the top of the bars (or the position of the points on a scatter chart). If the plot doesn't have borders, just measure the position of the bottom of the bars and the coordinate of the highest bar. Here are some measurements I took.

Here are the measurements of the top of the bars (_y_measured): \([482, 500, 489, 541, 489, 523, 571, 329, 399]\)

Here are the start and stop coordinates of the plot borders (_start, _stop):  \(677, 187\)

To convert these to lengths, the code is just: [_start - _y_m for _y_m in _y_measured]

The length of the screen from the top to the bottom (y_extent) is: _start - _stop = \(490\)

This gives us measured length (y_measured): \([195, 177, 188, 136, 188, 154, 106, 348, 278]\)
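Put together as runnable code, the conversion looks like this (same names and numbers as above):

```python
# Screen y-coordinates of the top of each bar.
_y_measured = [482, 500, 489, 541, 489, 523, 571, 329, 399]

# Screen y-coordinates of the bottom and top plot borders.
_start, _stop = 677, 187

# Bar lengths in pixels, measured up from the bottom border.
y_measured = [_start - _y_m for _y_m in _y_measured]

# Height of the plot area in pixels.
y_extent = _start - _stop

print(y_measured)
print(y_extent)
```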

Now we run this code:

import numpy as np
import pandas as pd

MAX_ERROR = 3           # assumed maximum measurement error, in pixels
STEP = 0.01             # step between candidate values of m
ERROR_THRESHOLD = 0.01  # reject results whose error is above this


def mse(y, div):
    """Mean squared error calculation."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


def find_divider(y):
    """Return the non-integer that minimizes the error function."""
    error_list = []
    # Try every candidate from just over 1 up to the largest plausible m.
    for _div in np.arange(1 + STEP,
                          min(y) + 2*MAX_ERROR,
                          STEP):
        error_list.append({"divider": _div,
                           "error": mse(y, _div)})
    df_error = pd.DataFrame(error_list)
    df_error.plot(x='divider', y='error', kind='scatter')
    # Pick the divider with the smallest error.
    _slice = df_error[df_error['error'] == df_error['error'].min()]
    divider = _slice['divider'].to_list()[0]
    error = _slice['error'].to_list()[0]
    if error > ERROR_THRESHOLD:
        raise ValueError('The estimated error is {0} which is '
                         'too large for a reliable result.'.format(error))
    return divider


def find_estimate(y, y_extent):
    """Make an estimate of the underlying data."""
    # Use the argument y here, not the global y_measured.
    if (max(y) - min(y))/y_extent < 0.1:
        raise ValueError('Too little range in the data to make an estimate.')
    m = find_divider(y)
    return [round(_e/m) for _e in y], m


# Measured bar lengths and plot height from the example above.
y_measured = [195, 177, 188, 136, 188, 154, 106, 348, 278]
y_extent = 490
estimate, m = find_estimate(y_measured, y_extent)

This gives us this output:

Original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Measured y values: [195, 177, 188, 136, 188, 154, 106, 348, 278]

Divider (m) estimate: 5.900000000000004

Estimated original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Which is correct.

Limitations of this approach

Here's when it won't work:

  • If there's little variation in the numbers on the chart, then measurement errors tend to overwhelm the calculations and the results aren't good.
  • In a similar vein, if the numbers are all close to the top or the bottom of the chart, measurement errors lead to poor results.
  • If \(m < 1\). The maximum y viewport range is usually 500-900 pixels, so the method won't work for numbers greater than about 500.
  • I've found in practice that if \(m < 3\) the results can be unreliable. Arbitrarily, I call any error greater than 0.01 too high to protect against poor results. Maybe I should limit the results to \(m > 3\).

I'm not entirely convinced my error function is correct; I'd like an error function that better discriminates between values. I tried a couple of alternatives, but they didn't give good results. Perhaps you can do better.

Notice that the error function is 'denser' closer to 1, suggesting I should use a variable step size or a different algorithm. It might be that the closer you get to 1, the more errors and the effects of rounding overwhelm the calculation. I've played around with smaller step sizes and not had much luck.
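One possible variable step size is a geometric grid, where each candidate is a fixed ratio larger than the previous one, so the steps are smallest near 1. This is only a sketch of the idea, not something I've validated; `geometric_dividers` and the ratio of 1.001 are assumptions for illustration.

```python
def geometric_dividers(m_min, m_max, ratio=1.001):
    """Candidate m values whose spacing grows in proportion to m."""
    candidates = []
    m = m_min
    while m <= m_max:
        candidates.append(m)
        m *= ratio
    return candidates


grid = geometric_dividers(1.01, 112)

# The grid is fine near 1 and coarse near the upper bound.
print(len(grid), grid[1] - grid[0], grid[-1] - grid[-2])
```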

Future work

If the data is Poisson distributed, there's an easier approach you can take. In a future blog post, I'll talk you through it.

Where to get the code

I've put the code on my GitHub page here: https://github.com/MikeWoodward/CodeExamples/blob/master/UnlabeledChart/approxrealgcd.py