Showing posts with label machine learning. Show all posts

Wednesday, June 25, 2025

AI networking in the Boston area

A lot's happening in Boston - where should I find out more?

There's a lot of AI work going on in the Boston area covering the whole spectrum, from foundational model development, to new AI applications, to corporates developing new AI-powered apps, to entrepreneurs creating new businesses, to students building prototypes in 12 hours. Pretty much every night of the week you can go to a group where you can find out more; there are a ton of different groups out there. But not all of them are created equal. I've been to a lot of groups and here are my recommendations for the best ones that meet on a regular basis. The list is alphabetical.

(Google Gemini)

AI Tinkerers

What it is

Monthly meeting where participants show the AI projects they've been working on. Mostly, but not exclusively, presentations from the Sundai Club (Harvard and MIT weekly hackathons). Attendance is over 150.

Commentary

This is where I go to when I want to see what's possible and find out more about the cutting edge. It's where I found out what tools like Cursor could really do. There are a number of VCs in attendance watching for anything interesting.

How often it meets

Once a month at Microsoft NERD.

Positives

You get to see what the cutting edge is really like.

Negatives

I found networking at this event less useful than some of the other events.

How to join

https://boston.aitinkerers.org/

AI Woodstock

What it is 

A networking event for people interested in AI. It attracts practitioners,  some VCs, recruiters, academics, and entrepreneurs. Attendee numbers vary, but typically over 100.

Commentary

This is networking only; there are no presentations or speakers of any kind. You turn up to the venue, introduce yourself to other people, and get talking. I've met people who are starting companies, people who are working on side gigs, and people who are working in AI for large companies.

The quality is high; I've learned a lot about what's going on and what companies in the Boston area are doing. 

The venue is both good and bad. It's held in a corner of the Time Out Market near Fenway Park. This is a large space with lots of food and drink vendors, and it attracts the bright young things of the Boston area who go there to eat and drink after work. AI Woodstock doesn't take over the whole space or rope off a portion of it, and AI Woodstock attendees are only identified by name badges. This means you're chatting away to someone about their AI enabled app while someone is walking by with their drink and app to meet their friends. The background noise level can be really high at times.

How often it meets 

Once a month at the Time Out Market near Fenway Park.

Positives

  • Networking. This is one of the best places to meet people who are active in AI in Boston.
  • Venue. It's nice to meet somewhere that's not Cambridge and the food and drink offerings are great.

Negatives

  • Venue. The noise level can get high and it can get quite crowded. The mix of bright young things out to have a good time and AI people is a bit odd.

How to join

https://www.meetup.com/ai-woodstock/ - choose Boston

Boston Generative AI Meetup

What it is

This is a combination of networking and panel session. During the networking, I've met VCs, solo entrepreneurs, AI staff at large companies, academics, and more. Attendance varies, but typically over 200.

Commentary

This is held in Microsoft NERD in Cambridge and it's the only event in the space. This means it starts a bit later and has to finish on time. 

Quality is very high and I've met a lot of interesting people. I met someone who showed me an app they'd developed and told me how they'd done it, which was impressive and informative.

The panel sessions have been a mixed bag; it's interesting to see people speak, and I found out a lot of useful information, but the panel topics were just so-so for me. Frankly, what the panelists said was useful but the overall topic was not.

How often it meets

About once a month.

Positives

  • Networking. 
  • Venue.
  • Information. The panels have mentioned things I found really useful.

Negatives

  • Panel session topics were a bit blah.

How to join

https://www.meetup.com/boston-generative-ai-meetup/

PyData Boston

What it is

Presentations plus networking. This is almost all machine learning/data science/AI practitioners in the Boston area (no VCs, no business people, instead there are academics and engineers). The presentations are mostly on technical topics, e.g. JAX. Attendance varies, but usually 50-100.

Commentary

I've learned more technical content from this group than any other. The presentations are in-depth and not for people who don't have a goodish background in Python or data science.

How often it meets

Once a month, usually at the Moderna building in Cambridge.

Positives

  • Best technical event. In-depth presentations have helped educate me and point out areas where I need to learn more. Conversations have been (technically) informative.
  • Probably the friendliest group of all of them.

Negatives

  • No entrepreneurs, no VCs, no executive management.

How to join

https://www.meetup.com/pydata-boston-cambridge/

Common problems

There's a refrain I've heard from almost all event organizers and that's the problem of no-shows. The no-show rate is typically 40% or so, which is hugely frustrating as there's often a waiting list of attendees. Some of these events have instituted a sign-in policy: if you don't turn up and sign in, you can't attend future events. I can see more events doing it in future. If you sign up, go.

One-off events

As well as these monthly events, there are also one-off events that happen sporadically. Obviously, I can't review them here, but I will say this: the quality is mostly very high but it is variable.

What's missing

I'm surprised by what I'm not hearing at these events. I'm not hearing implementation stories from existing ("mature") companies. Through private channels, I'm hearing that the failure rate for AI projects can be quite high, but by contrast I've been told that insurance companies are embracing AI for customer facing work and getting great results. I've met developers working on AI enabled apps for insurance companies and they tell me their projects have management buy-in and are being rolled out.

I'd love to hear someone from one of these large companies get up and speak about what they did to encourage success and the roadblocks on the way. In other words, I'd like to see something like "Strategies and tactics for successful AI projects" run by people who've done it.

Your thoughts

I've surely missed off groups from this list. If you know of a good group, please let me know either through LinkedIn or commenting on this post.

Logistic regression - a simple briefing

A briefing on logistic regression

I've been looking again at logistic regression and going over some of the theory behind it. In a previous blog post, I talked about how I used Manus to get a report on logistic regression and I showed what Manus gave me. I thought it was good, B+, but not great, and I had some criticisms of what Manus produced. The obvious challenge is, could I do better? This blog post is my attempt to explain logistic regression better than Manus.

What problems are we trying to solve?

There's a huge class of problems where we’re trying to predict a binary result. Here are some examples:

  • The results of a referendum, e.g., whether or not to remain in or leave the EU.
  • Whether to give drug A or drug B to a patient with a condition.
  • Which team will win the World Cup or Super Bowl or World Series.
  • Is this transaction fraudulent?

Typically, we’ll have a bunch of different data we can use to base our prediction model on. For example, for a drug choice, we may have age, gender, weight, smoker or not and so on. These are called features. Corresponding to this feature data set, we’ll have a set of outcomes (also called labels), for example, for the drug case, it might be something like percentage survival (a% survived given drug A compared to b% for drug B). This makes logistic regression a supervised machine learning method.

In this blog post, I’ll show you how you can turn feature data into binary classification predictions using logistic regression. I’ll also show you how you can extend logistic regression beyond binary classification problems.

Before we dive into logistic regression, I need to define some concepts.

What are the odds?

Logistic regression relies on the odds or the odds ratio, so I’m going to define what it is using an example.

For two different drug treatments, we have different rates of survival. Here’s a table adapted from [1] that shows the survival counts for a fictitious study.

            Standard treatment    New treatment    Totals
Died        152 (38%)             17               169
Survived    248 (62%)             103              351
Totals      400 (100%)            120              520

Plainly, the new treatment is much better. But how much better?

In statistics, we define the odds as being the ratio of the probability of something happening to it not happening:

\[odds = \dfrac{p}{1 - p}\]

So, if there’s a 70% chance of something happening, the odds of it happening are 2.333. Probabilities can range from 0 to 1 (or 0% to 100%), whereas odds can range from 0 to infinity. Here’s the table above recast in terms of odds.

            Standard treatment    New treatment
Died        0.613                 0.165
Survived    1.632                 6.059

The odds ratio tells us how much higher the odds of an outcome are in one group than in another. A couple of examples should make this clearer.

The odds ratio for death with the standard treatment compared to the new is:

\[odds \: ratio = \dfrac{0.613}{0.165} = 3.71...\]

This means the odds of a patient dying are 3.71 times higher if they’re given the standard treatment compared to the new. (Note this is a ratio of odds, not of probabilities.)

More hopefully, the odds ratio for survival with the new treatment compared to the old is:

\[odds \: ratio = \dfrac{6.059}{1.632} = 3.71...\]

Unfortunately, most of the websites out there are a bit sloppy with their definitions. Many of them conflate “odds” and “odds ratio”. You should be aware that they’re two different things:

  • The odds is the probability of something happening divided by the probability of it not happening.
  • The odds ratio compares the odds of an event in one group to the odds of the same event in another group.

The odds are going to be important for logistic regression.
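To make the distinction concrete, here's a minimal sketch that recomputes the odds and the odds ratio from the counts in the survival table. The function and variable names are my own choices for illustration.

```python
def odds(p):
    """Odds of an event that has probability p (0 <= p < 1)."""
    return p / (1 - p)


# Counts taken from the survival table above
standard = {"died": 152, "survived": 248}
new = {"died": 17, "survived": 103}


def odds_of_death(group):
    """Odds of dying within one treatment group."""
    total = group["died"] + group["survived"]
    return odds(group["died"] / total)


odds_standard = odds_of_death(standard)
odds_new = odds_of_death(new)

# The odds ratio compares the odds of the same event in two groups
odds_ratio = odds_standard / odds_new

print(f"Odds for a 70% chance: {odds(0.7):.3f}")
print(f"Odds of death, standard treatment: {odds_standard:.3f}")
print(f"Odds of death, new treatment: {odds_new:.3f}")
print(f"Odds ratio: {odds_ratio:.2f}")
```

The odds describe a single group; the odds ratio only appears once you divide one group's odds by another's.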

The sigmoid function

Our goal is to model probability (e.g. the probability that the best treatment is drug A), so mathematically, we want a modeling function that has a y-value that varies between 0 and 1. Because we’re going to use gradient methods to fit values, we need the derivative of the function, so our modeling function must be differentiable. We don’t want gaps or ‘kinks’ in the modeling function, so we want it to be continuous.

There are many functions that fit these requirements (for example, the error function). In practice, the choice is the sigmoid function for deep mathematical reasons; if you analyze a two-class distribution using Bayesian analysis, the sigmoid function appears as part of the posterior probability distribution [2].  That's beyond where I want to go for this blog post, so if you want to find out more, chase down the reference.

Mathematically, the sigmoid function is:

\[\sigma(x) = \dfrac{1}{1 + e^{-x}} \]

And graphically, it looks like this:

I’ve shown the sigmoid function in one dimension, as a function of \(x\). It’s important to realize that the sigmoid function can have multiple parameters (e.g. \(\sigma(x, y, z)\)); it’s just much, much harder to draw.
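Here's a small sketch of the sigmoid in code, using only the standard library. The two-branch form is a common way to avoid overflow for large negative inputs; the sigmoid_of_features helper is a hypothetical name, and it assumes the multi-parameter case is handled by combining the inputs into one weighted sum first, which is what logistic regression does later in this post.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_of_features(weights, features):
    """Sigmoid of a weighted sum of several inputs (hypothetical helper)."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


for x in (-5, -1, 0, 1, 5):
    print(f"sigmoid({x}) = {sigmoid(x):.4f}")

print(sigmoid_of_features([0.5, -1.0, 2.0], [1.0, 2.0, 0.5]))
```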

The sigmoid and the odds

We can write the odds as:

\[odds = \dfrac{p}{1-p}\]

Taking the natural log of both sides (this is called the logit function):

\[ln(odds) = ln \left( \dfrac{p}{1-p} \right)\]

In machine learning, we're building a prediction function from \(n\) features \(x\), so we can write:

\[\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 \cdots + w_n \cdot x_n\]

For reasons I'll explain later, this is the log odds:

\[\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 \cdots + w_n \cdot x_n = ln \left( \dfrac{p}{1-p} \right)\]

With a little tedious rearranging, this becomes:

\[p = \dfrac{1}{1 + e^{-(w_1 \cdot x_1 + w_2 \cdot x_2 \cdots + w_n \cdot x_n)}}\]

Which is exactly the sigmoid function I showed you earlier.

So the probability \(p\) is modeled by the sigmoid function.
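For anyone who wants the rearranging spelled out, here's a sketch of the steps. I'm using \(z\) as shorthand for the weighted sum \(w_1 \cdot x_1 + w_2 \cdot x_2 \cdots + w_n \cdot x_n\), and starting from the definition of the odds as \(p / (1 - p)\).

\[ln \left( \dfrac{p}{1-p} \right) = z\]

\[\dfrac{p}{1-p} = e^{z}\]

\[p = e^{z} - p \cdot e^{z}\]

\[p \cdot (1 + e^{z}) = e^{z}\]

\[p = \dfrac{e^{z}}{1 + e^{z}} = \dfrac{1}{1 + e^{-z}}\]

The last step divides the top and bottom by \(e^{z}\).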

This is the "derivation" provided in most courses and textbooks, but it ought to leave you unsatisfied. The key point is unexplained: why is the log odds the function \(w_1 \cdot x_1 + w_2 \cdot x_2 \cdots + w_n \cdot x_n \)?

The answer is complicated and relies on a Bayesian analysis [3]. Remember, logistic regression is taught before Bayesian analysis, so lecturers or authors have a choice; either divert into Bayesian analysis, or use a hand-waving derivation like the one I've used above. Neither choice is good. I'm not going to go into Bayes here, I'll just refer you to more advanced references if you're interested [4].

Sigmoid to classification

In the previous section, I told you that we calculate a probability value. How does that relate to classification? Let's take an example.

Imagine two teams, A and B playing a game. The probability of team A winning is \(p(A)\) and the probability of team B winning is \(p(B)\). From probability theory, we know that \(p(A) + p(B) = 1\), which we can rearrange as \(p(B) = 1 - p(A)\). Let's say we're running a simulation of this game with the probability \(p = p(A)\). So when p is "close" to 1, we say A will win and when p is close to 0, we say B will win. 

What do we mean by close? By "default", we might say that if \(p \geq 0.5\) then we choose A and if \(p < 0.5\) we choose B. That seems sensible and it's the default choice of scikit-learn as we'll see, but it is possible to select other thresholds.

(Don't worry about the difference between \(p \geq 0.5\) and \(p > 0.5\) - that only becomes an issue under very specific circumstances.)

Features and functions

Before we dive into an example of using logistic regression, it's worth a quick detour to talk about some of the properties of the sigmoid function. 

  • The y axis varies from 0 to 1.
  • The x axis varies from \(-\infty\) to \(\infty\).
  • The curve is steepest around \(x=0\) and gets flatter as you move away from zero. In fact, once you go past \(x=5\) or \(x=-5\) the curve pretty much flattens. This can be a problem for some models.
  • The "transition region" between \(y=0\) and \(y=1\) is quite narrow, meaning we "should" be able to assign probabilities away from \(p=0.5\) most of the time, in other words, we can make strong predictions about classification.
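To put numbers on the flattening, here's a short sketch that evaluates the sigmoid and its gradient at a few points. It relies on the standard identity that the derivative of the sigmoid is \(\sigma(x)(1 - \sigma(x))\); the function names are mine.

```python
import math


def sigmoid(x):
    """Logistic sigmoid (fine for the moderate x values used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    """First derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10, -5, -1, 0, 1, 5, 10):
    print(f"x = {x:>3}: sigmoid = {sigmoid(x):.4f}, gradient = {sigmoid_gradient(x):.4f}")
```

The gradient peaks at 0.25 when \(x=0\) and is already below 0.01 at \(x=5\) and \(x=-5\).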

How logistic regression works

Calculating a cost function is key. However, it does involve some math that would take several pages and I don't want to turn this into a huge blog post. There are a number of blog posts online that delve into the details; if you want more, check out references [7, 8].

In linear regression, the method used to minimize the cost function is gradient descent (or a similar method like Adam). Logistic regression starts from a different place. Instead we use something called maximum likelihood estimation, and as its name suggests, this is based on maximizing the likelihood that our model will predict the data we see. This approach relies on calculating a log likelihood function and using a gradient ascent method to maximize it (equivalently, minimizing the negative log likelihood, often called log loss). This is an iterative process. You can read more in references [5, 6].
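Here's a minimal sketch of the idea for a single feature, using plain gradient ascent on the log likelihood. It is an illustration only, not how scikit-learn's solvers work internally; the toy data, learning rate, step count, and function names are all assumptions of mine.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, b, xs, ys):
    """Log likelihood of the labels ys given a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Gradient ascent on the (averaged) log likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log likelihood: sum of (label - predicted probability) terms
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        w += learning_rate * grad_w / n
        b += learning_rate * grad_b / n
    return w, b


# Toy data with some overlap between the classes (made up for illustration)
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.2]
ys = [0, 0, 0, 1, 1, 1, 1, 0]

w, b = fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"log likelihood before: {log_likelihood(0.0, 0.0, xs, ys):.3f}")
print(f"log likelihood after:  {log_likelihood(w, b, xs, ys):.3f}")
```

Each step nudges the parameters in the direction that makes the observed labels more probable, which is why the log likelihood rises from its starting value.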

Some code

I'm not going to show you a full set of code, but I am going to show you the "edited highlights". I created an example for this blog post, but all the ancillary stuff got in the way of what I wanted to tell you, so I just pulled out the pieces I thought that would be most helpful. For context, my code generates some data and attempts to classify it.

There are multiple libraries in Python that have logistic regression. I'm going to focus on the one most people use to explore ideas: scikit-learn.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

train_test_split splits the data into a test set and training set. I'm not going to show how that works; it's pretty standard.

Machine learning algorithms tend to work better when the features are scaled. A lot of the time, this isn't an issue, but if the ranges of the features are very, very different, this can be an issue for the numeric algorithms. Here's an example: let's say feature 1 ranges from 0.001 to 0.002 and feature 2 ranges from 1,000,000 to 2,000,000, then we may have a problem. The solution is to put the features on a common scale. The StandardScaler I use here rescales each feature to zero mean and unit variance (scikit-learn's MinMaxScaler is the option if you want a 0 to 1 range). Notably, scaling is a problem for many curve fitting type algorithms too. Here's the scaling code for my simple example:

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

Fitting is simply calling the fit method on the LogisticRegression model, so:

# Create and train scikit-learn logistic regression model
model = LogisticRegression(
    random_state=random_state,
    max_iter=max_iterations,
    solver='liblinear'
)
# Train the model on scaled features
model.fit(features_scaled, labels)

As you might expect, max_iter stops the fitting process from going on forever. random_state controls the random number generator used; it's only applicable to some solvers like the 'liblinear' one I've used here. The solver is the type of equation solver used. There's a choice of different solvers which have different properties and are therefore good for different sorts of data. I've chosen 'liblinear' because it's simple.

fit works exactly as you think it might.

Here's how we make predictions with the test and training data sets:

test_features_scaled = scaler.transform(test_features)
train_features_scaled = scaler.transform(train_features)
train_predictions = model.predict(train_features_scaled)
test_predictions = model.predict(test_features_scaled)

This is pretty straightforward, but I want to draw your attention to the scaling going on here. Remember, we scaled the features when we created the model, so we have to scale the features when we're making predictions. 

The predict method uses a 0.5 threshold as I explained earlier. If we'd wanted another threshold, say 0.7, we would have used the predict_proba method.
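Here's a sketch of what using another threshold might look like. It's self-contained, so it builds its own synthetic data with make_classification rather than using the data from my example; the variable names and the 0.7 threshold are for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption for this sketch)
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(features, labels)

# predict_proba returns one column per class; column 1 is the probability of class 1
positive_probability = model.predict_proba(features)[:, 1]

# Apply our own threshold instead of the default used by predict
threshold = 0.7
custom_predictions = (positive_probability >= threshold).astype(int)
default_predictions = model.predict(features)

print("Positives with default threshold:", int(default_predictions.sum()))
print("Positives with 0.7 threshold:", int(custom_predictions.sum()))
```

Raising the threshold means the model has to be more confident before it assigns the positive class, so the number of positive predictions can only stay the same or fall.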

We can measure how good our model is with the function accuracy_score.

train_accuracy = accuracy_score(train_labels, train_predictions)
test_accuracy = accuracy_score(test_labels, test_predictions)

This gives a simple number for the accuracy of the train and test set predictions. 

You can get a more detailed report using classification_report:

        classification_report(test_labels, test_predictions)

This gives a set of various "correctness" measures.

Here's a summary of the stages:

  • Test/train split
  • Scaling
  • Fit the model
  • Predict results
  • Check the accuracy of the prediction.
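The stages above can be pulled together into one short runnable sketch. This isn't my original example code; it uses synthetic data from make_classification, and the parameter values are assumptions chosen to keep it small. Note that in this sketch the scaler is fitted on the training split only and then reused for the test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for real features and labels
features, labels = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)

# 1. Test/train split
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# 2. Scaling: fit on the training data, then apply the same scaling to the test data
scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)

# 3. Fit the model
model = LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)
model.fit(train_features_scaled, train_labels)

# 4. Predict results
train_predictions = model.predict(train_features_scaled)
test_predictions = model.predict(test_features_scaled)

# 5. Check the accuracy of the prediction
train_accuracy = accuracy_score(train_labels, train_predictions)
test_accuracy = accuracy_score(test_labels, test_predictions)
print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(test_labels, test_predictions))
```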

Some issues with the sigmoid

Logistic regression is core to neural nets (it's all in the activation function), and as you know, neural nets have exploded in popularity. So any issues with logistic regression take on an outsize importance. 

Sigmoids suffer from the "vanishing gradient" problem I hinted at earlier. As \(x\) becomes more positive or negative, the \(y\) value gets closer to 0 or 1, so the gradient (first derivative) gets smaller and smaller. In turn, this can make training deep neural nets harder.

Sigmoids aren't zero centered, which can cause problems for modeling some systems.

Exponential calculations cost more computation time than other, simpler functions. If you have thousands, or even millions of nets, that soon adds up.

Fortunately, sigmoids aren't the only game in town. There are a number of alternatives to the sigmoid, but I won't go into them here. You should just know they exist.

Beyond binary

In this post, I've talked about simple binary classification. The formula and examples I've given all revolve around simple binary splits. But what if you want to classify something into three or more buckets?  Logistic regression can be extended for more than two possible outputs and can be extended to the case where the outputs are ordered (ordinal).

In practice, we use more or less the same code we used for the binary classification case, but we make slightly different calls to the LogisticRegression function. The scikit-learn documentation has a really nice three-way classification demo you can see here: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_multinomial.html.
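As a rough illustration of how little the code changes, here's a sketch of a three-class problem using the iris data set that ships with scikit-learn. This is not the demo from the scikit-learn documentation; it's a minimal example of my own that relies on the default solver, which handles more than two classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Iris has three classes (labels 0, 1, and 2)
features, labels = load_iris(return_X_y=True)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)

# The default solver copes with more than two classes
model = LogisticRegression(max_iter=1000)
model.fit(train_features_scaled, train_labels)

# One probability per class for each sample; each row sums to 1
probabilities = model.predict_proba(test_features_scaled)
test_predictions = model.predict(test_features_scaled)

print("Classes:", model.classes_)
print("Probabilities for the first test sample:", probabilities[0].round(3))
print(f"Test accuracy: {accuracy_score(test_labels, test_predictions):.3f}")
```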

What did Manus say?

Previously, I asked Manus to give me a report on logistic regression. I thought its results were OK, but I thought I could do better. Here's what Manus did: https://blog.engora.com/2025/05/the-importance-of-logistic-regression.html, and of course, you're reading my take.

Manus got the main points of logistic regression, but overemphasized some areas and glossed over others. It was a B+ effort I thought. Digging into it, I can see Manus reported back on the consensus of the blogs and articles out there on the web. That's fine (the "wisdom of the crowd"), but it's limited. There's a lot of repetition and low-quality content out there, and Manus reflected that. It missed nuances because most of the stuff out there did too.

The code Manus generated was good and its explanation of the code was good. It did miss explaining some things I thought were important, but on the whole I was happy with it.

Overall, I'm still very bullish on Manus. It's a great place to start and may even be enough in itself for many people, but if you really want to know what's going on, you have to do the work.

References

[1] Sperandei S. Understanding logistic regression analysis. Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi: 10.11613/BM.2014.003. PMID: 24627710; PMCID: PMC3936971.

[2] Bishop, C.M. and Nasrabadi, N.M., 2006. Pattern recognition and machine learning (Vol. 4, No. 4, p. 738). New York: Springer.

[3] https://www.dailydoseofds.com/why-do-we-use-sigmoid-in-logistic-regression/

[4] Norton, E.C. and Dowd, B.E., 2018. Log odds and the interpretation of logit models. Health services research, 53(2), pp.859-878.

[5] https://www.geeksforgeeks.org/machine-learning/understanding-logistic-regression/

[6] https://www.countbayesie.com/blog/2019/6/12/logistic-regression-from-bayes-theorem

[7] https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d

[8] https://medium.com/data-science/introduction-to-logistic-regression-66248243c148

Friday, June 13, 2025

Don’t stop till you get enough – sample size in machine learning

How many samples of labeled data do you need?

It turns out, finding out how many labeled samples you need to “correctly” build a supervised machine learning (ML) model is a hard question with no clear answer. In this blog post, I’m going to run through the issues and finish with some advice for people managing ML model building.

(Canva)

Why does it matter?

Sample size plays into two big related themes for ML models:

  • Correctness. This means how correctly your model predicts results at a point in time.
  • Reliability. This means how correctly your model works over time.

Small sample sizes tend to give models that have a lower correctness and that give worse performance over time. This is all tied up with variance and the “law of small numbers”.

Let’s say your manager comes to you and asks you to build a ML model on a data set. When do you express concern at the size of the data set? When it’s 10, 100, 1000, 10000, or 100000 samples? What happens if your manager asks you to justify your concern?  

For a correct, stable model, you typically need a “big enough” data set to train with, but how much is “big enough”? 

What does sample size mean?

Before I dive into this some more, I should define what I mean by sample size. I mean the size of the labeled data set used for training a supervised machine learning model excluding cross-validation and hold out data sets. For example, if you use 20% of your data for hold outs, and 80% of your cross-validation data is training, only 0.8*0.8 = 0.64 of your data counts towards sample size. 
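The arithmetic is simple, but it's easy to forget when someone quotes a headline data set size. Here's a tiny sketch; the function name and the 10,000 sample example are mine.

```python
def training_sample_size(total_samples, holdout_fraction=0.2, cv_training_fraction=0.8):
    """Number of samples that actually train the model after hold out and cross-validation splits."""
    remaining = total_samples * (1 - holdout_fraction)
    return round(remaining * cv_training_fraction)


# 10,000 labeled samples, 20% held out, 80% of the rest used for training
print(training_sample_size(10_000))
```

With those fractions, only 64% of the labeled data counts towards sample size in the sense I'm using here.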

Why is this a hard problem?

There’s very little in the literature, there’s almost nothing in the leading books on machine learning, and it’s only mentioned in passing on machine learning courses. It’s an area of active research, which means there’s nothing packaged for easy use.

I’ve spent hours searching for and reading papers on this topic but I’ve not found anything useful. What I did find is that the field that’s most advanced is medicine. Researchers are increasingly using ML models for clinical trials and they need to know how many patients to enroll in their trials. It seems that they’re mostly using statistical tests (see below) for sample size; however, some researchers are trying to develop robust statistical methods to independently estimate sample size. As of June 2025, there’s no consensus on the best approach.

What do other disciplines do?

In frequentist statistics, there’s a recipe for determining sample size given significance, power, and effect size for a single comparison test (formally, a null-hypothesis test). The code exists in R and Python libraries, so all you have to do is put the numbers into a formula and you get your minimum sample size. Everyone doing randomized controlled trials (RCTs, AKA A/B tests) works out sample size before running a test.
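To show the kind of recipe I mean, here's a sketch of one common normal-approximation formula for comparing two proportions, written with only the standard library. Statistical packages use slightly different approximations, so treat the numbers it gives as indicative rather than as what any particular library would return. The function name, the default values, and the example proportions are assumptions of mine.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate samples needed per group for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# A small effect needs far more samples than a large one
print("10% vs 12%:", sample_size_per_group(0.10, 0.12))
print("10% vs 15%:", sample_size_per_group(0.10, 0.15))
print("10% vs 12%, 90% power:", sample_size_per_group(0.10, 0.12, power=0.9))
```

The effect size sits in the denominator as a square, which is why halving the effect you want to detect roughly quadruples the sample you need.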

The nearest statistical equivalent to ML is multi-comparison null-hypothesis tests, which is really something different, but it does give you some idea of sample size. The math is more complex and most people use something called the Bonferroni correction to go from single comparison to multi-comparison testing. To give you an idea of numbers, the table below shows the minimum sample size for a proportion z-test with a significance level of 5%, a power of 85%, a baseline proportion of 5%, and a 5% effect size, with Bonferroni correction.

Comparisons    Sample size
1              272,944
2              409,416
3              545,888
4              682,360
5              818,832
...

Two things here: the sample size starts at 272,944 and it goes up for each test you add. 

Notably, the sample size for a null-hypothesis test depends on the effect size; a big effect leads to smaller tests. This is why most drug trials have sample sizes in the low hundreds: the effect they’re looking for is large. Conversely, in retail, effect sizes can be small, leading to sample sizes in the high hundreds of thousands or even millions. This might be an important clue for ML sample sizes.

What rules of thumb are there?

The general consensus is, if you have n samples and f features, then n >> f. I’ve heard people talk about a 50x, 100x, or 1,000x ratio as being minimal. So, if you have 5 features, you need a minimum of 250-5,000 samples. But even this crude figure might not be enough depending on the model.

What do people do in practice?

I’ve never come across a data scientist who estimates needed sample size before building a model. People use the cost function instead: if the cost function is “good enough” this suggests the sample size is good enough too. There are variations on this with people using confusion matrices, precision-recall, etc. etc. as “proxies”; if the metric is good enough the sample size is good enough.

But relying on the cost function or metrics alone isn’t enough. I’ve seen people develop models using under a hundred samples with over five features. The cost function results were OK, but as you might expect, the model wasn’t very robust and gave poor results some of the time.

Let me draw a comparison with an RCT to evaluate a new drug. All trials have an initial estimate of the sample size needed, but let’s say they didn’t and relied on metrics (e.g., fraction of patients cured). Do you think this would be OK, would you take the drug? Would you take the drug if the sample size was 10, 100, or 1000 patients? Or would you prefer there to be a robust estimate of the needed sample size?

My recommendations

The situation isn’t very satisfactory. Frequentist statistics suggests hundreds of thousands of samples, which looks very different from the 50x-1,000x rule of thumb. Even the 50x-1,000x rule of thumb gives a huge range of answers. Using the cost function or metrics alone doesn’t feel very safe either.

I’m not in a position to give a robust statistical recipe to calculate sample size. All I can do is offer some advice. Take it for what it's worth.

  1. Ideally, have a sample size of at least 100,000, but make sure you have at least 1,000x as much data as you have features. If you really have to model with fewer than 100,000 samples, recognize you're on very slippery ground.
  2. Run a feature importance analysis. If you have many features each with a small contribution, that’s a warning sign; you should consider increasing your sample size.
  3. Regularly performance check your model and have pre-determined thresholds for taking action.

Don't stop till you get enough

I was thinking of this song when I was writing this.




Wednesday, April 23, 2025

The basics of regularization in machine learning

The problem

Machine learning models are trained on a set of sampled data (the training set). Data scientists use these trained models to make predictions from new data. For example, a recommender system might be trained on a data set of movies people have watched, then used to make recommendations on the movies people might like to watch. Key to the success of machine learning models is their accuracy; recommending the wrong movie, predicting the wrong sales volume, or misdiagnosing a medical image all have moral and financial consequences.

There are two causes of machine learning failure closely related to model training: underfitting and overfitting. 

Underfitting is where the model is too simple to correctly represent the data. The symptoms are a poor fit to the training data set. This chart shows the problem.


Years ago, I saw a very clear case of underfitting. The technical staff in a data center were trying to model network traffic coming in so they could forecast the computing power they needed. Clearly, the data wasn’t linear; it was a polynomial of at least order 2 plus a lot of noise. Unfortunately, they only knew how to do linear regression, so they tried to model the data using a series of linear regressions. Sadly, this meant their forecasts were next to useless. Frankly, their results would have been better if they’d extrapolated by hand using a pencil.

Overfitting is where the model is too complex, meaning it tries to fit noise instead of just the underlying trends. The symptoms are an excellent fit to the training data, but poor results when the model is exposed to real data or extrapolated. This chart shows the problem. The curve was overfit (the red dotted line), so when the curve is extrapolated, it produces nonsense.
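If you'd like to reproduce the effect yourself, here's a small sketch that fits polynomials of different orders to noisy quadratic data and then checks them just outside the training range. The data, noise level, and polynomial orders are made up for illustration; they're not the data behind the charts.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_curve(x):
    """Underlying trend: a polynomial of order 2."""
    return 1.0 + 2.0 * x + 3.0 * x**2


# Noisy training data
x_train = np.linspace(-1, 1, 20)
y_train = true_curve(x_train) + rng.normal(0, 0.3, x_train.size)

# Points just beyond the training range, to test extrapolation
x_future = np.linspace(1.0, 1.5, 10)
y_future = true_curve(x_future)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_future, y_future))
    print(f"order {degree}: training error = {results[degree][0]:.3f}, "
          f"extrapolation error = {results[degree][1]:.3f}")
```

The order 1 fit can't follow the curve (underfitting), while the order 9 fit gets the lowest training error by chasing noise and will typically do much worse once you extrapolate (overfitting).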

In another company, I saw an analyst try to forecast sales data. He used a highly complex data set and a very, very, very complex model. It fit the data beautifully well. Unfortunately, it gave clearly wrong sales predictions for the next year (e.g., negative sales). He tweaked the model and got some saner predictions, but as it turned out, his predictions were way off. He had overfit his data, so when you extrapolated to the next year, it gave nonsense. When he tweaked his model, it gave less obviously bad results, but because it overfit, its forecast was very wrong.

Like all disciplines, machine learning has a set of terminology aimed at keeping outsiders out. Underfitting is called bias and overfitting is called variance. These are not helpful terms in my view, but we’re stuck with them. I’m going to use the proper terminology (bias and variance) and the more straightforward terms (underfitting and overfitting) for clarity in this blog post.

Let’s look at how machine learning copes with this problem by using regularization.

Regularization

Let’s start with a simple linear machine learning model where we have a set of \(n\) features (\(X = \{x_1, x_2, ...x_n\}\)) and we’re trying to model a target variable \(y\) with \(m\) observations. \(\hat{y}\) is our estimate of \(y\) using the features \(X\), so we have:

\[\hat{y}^{(i)} = wx^{(i)} + b\]

Where \(i\) varies from 1 to \(m\).

The cost function is the difference between our model predictions and the actual values. 

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2\]
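As a minimal sketch, the model and cost function translate into a few lines of plain Python. The function names and the tiny data set are my own illustrative choices; the sum runs over the observations, as in the formula.

```python
def predict(w, b, x):
    """Linear model for one observation x (a list of features): w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def cost(w, b, X, y):
    """J(w, b): half the mean squared error over all observations."""
    m = len(X)
    return sum((predict(w, b, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)

# Made-up data: 3 observations with 2 features each.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 7.5]

print(cost([2.0, 1.5], 0.0, X, y))   # parameters that fit this data exactly
print(cost([0.0, 0.0], 0.0, X, y))   # parameters that ignore the features
```

The first call gives a cost of zero because those parameters reproduce every \(y\) value; any worse choice of \(w\) and \(b\) pushes the cost up.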

To find the model parameters \(w\), we minimize the cost function (typically, using gradient descent, Adam, or something like that). Overfitting manifests itself when some of the \(w\) parameters are too big. 

The idea behind regularization is that it introduces a penalty for adding more complexity to the model, which means keeping the \(w\) values as small as possible. With the right choices, we can make the model fit the 'baseline' without being too distracted by the noise.

As we'll see in a minute, there are several different types of regularization. For the simple machine learning model we're using here, we'll use the popular L2 form of regularization. 

Regularization means altering the cost function to penalize more complicated models. Specifically, it introduces an extra term to the cost function, called the regularization term.

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_{j}^{2}\]

\(\lambda\) is the regularization parameter and we set \(\lambda > 0\). Because \(\lambda > 0\) we're penalizing the cost function for higher values of \(w\), so gradient descent will tend to avoid them when we're minimizing. The regularization term is a square term; this modified cost function is a ridge regression or L2 form of regularization.
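Here's a rough sketch of what the penalty does in practice, using plain batch gradient descent in pure Python. The data, learning rate, step count, and the \(\lambda\) value of 10 are all illustrative assumptions, not recommendations.

```python
def predictions(w, b, X):
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]

def ridge_cost(w, b, X, y, lam):
    """Squared-error cost plus the L2 penalty lam/(2m) * sum(w_j^2); b is not penalized."""
    m = len(X)
    data_term = sum((p - yi) ** 2 for p, yi in zip(predictions(w, b, X), y)) / (2 * m)
    return data_term + lam / (2 * m) * sum(wj ** 2 for wj in w)

def fit(X, y, lam, lr=0.05, steps=5000):
    """Plain batch gradient descent on ridge_cost."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        errors = [p - yi for p, yi in zip(predictions(w, b, X), y)]
        # Gradient of the data term plus the gradient of the penalty (lam/m * w_j).
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / m + lam / m * w[j]
                  for j in range(n)]
        grad_b = sum(errors) / m
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Made-up data: 4 observations, 2 features.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [1.0, 3.1, 5.2, 6.9]

w_plain, b_plain = fit(X, y, lam=0.0)
w_ridge, b_ridge = fit(X, y, lam=10.0)
print("no penalty:", w_plain)
print("lambda=10: ", w_ridge)
```

With the penalty switched on, the weights come out smaller overall and the fit to the training data gets slightly worse. That's the trade regularization makes.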

You might think that regularization would reduce some of the \(w\) parameters to zero, but in practice, that’s not what happens. It reduces their contribution substantially, but often not totally. You can still end up with a model that’s more computationally complex than it needs to be, but it won’t overfit.

You probably noticed that \(b\) appears in the model but not in the regularization term. That's because in practice, penalizing \(b\) makes very little difference, but it does complicate the math, so I'm leaving it out to make our lives easier.

Types of regularization

This is the ridge regression or L2 form of regularization (that we saw in the previous section):

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_{j}^{2}\]

The L1 form is a bit simpler. It's sometimes known as lasso, which is an acronym for Least Absolute Shrinkage and Selection Operator.

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} |w_{j}|\]

Of course, you can combine L1 and L2 regularization, which is called elastic net regularization. It can be more accurate than L1 or L2 alone, but the computational complexity is higher.
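To compare the penalty terms side by side, here's a small sketch in plain Python. The L1 and L2 functions follow the formulas above; the elastic net mix and its `l1_ratio` knob are my own illustrative parametrization (libraries each have their own scaling conventions, so check their documentation).

```python
def l2_penalty(w, lam, m):
    """Ridge term: lam/(2m) * sum of squared weights."""
    return lam / (2 * m) * sum(wj ** 2 for wj in w)

def l1_penalty(w, lam, m):
    """Lasso term: lam/(2m) * sum of absolute weights."""
    return lam / (2 * m) * sum(abs(wj) for wj in w)

def elastic_net_penalty(w, lam, m, l1_ratio=0.5):
    """Weighted mix of the two (l1_ratio is a hypothetical knob: 0 = pure L2, 1 = pure L1)."""
    return l1_ratio * l1_penalty(w, lam, m) + (1 - l1_ratio) * l2_penalty(w, lam, m)

w = [3.0, -0.5, 0.0]   # made-up weights
lam, m = 2.0, 4

print(l2_penalty(w, lam, m))            # 0.25 * (9 + 0.25) = 2.3125
print(l1_penalty(w, lam, m))            # 0.25 * (3 + 0.5)  = 0.875
print(elastic_net_penalty(w, lam, m))   # halfway between the two
```

Notice that the big weight dominates the L2 penalty because it's squared, while the L1 penalty treats every unit of weight the same.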

A more complex form of regularization is entropy regularization which is used a lot in reinforcement learning.

For most cases, the L2 form works just fine.

Regularization in more complex machine learning models - dropping out

Linear machine learning models are very simple, but what about logistic models or the more complex neural nets? As it turns out, regularization works for neural nets and other complex models too.

Overfitting in neural nets can occur due to "over-reliance" on a small number of nodes and their connections. To regularize the network, we randomly drop out nodes during the training process. This is called dropout regularization, and for once, we have a well-named piece of jargon. The net effect of dropout regularization is a "smoother" network that models the baseline and not the noise.
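Here's a minimal sketch of the idea applied to one layer's outputs, in plain Python. I'm assuming the common "inverted dropout" variant, where the surviving nodes are scaled up during training so the expected output stays the same; the function name and the values are illustrative.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training and
    scale the survivors by 1/(1 - drop_prob). At prediction time, do nothing."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [0.2, 1.5, -0.7, 0.9, 0.3, -1.1]   # made-up node outputs

dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
print(dropped)
print(dropout(layer_output, drop_prob=0.5, rng=rng, training=False))
```

Because a different random set of nodes vanishes on every training pass, the network can't lean too heavily on any one of them.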

Regularization in Python

The scikit-learn package has the functionality you need. In particular, check out the Lasso, Ridge, ElasticNet, and GridSearchCV classes. Dropout regularization in neural networks is a bit more complicated and in my view it needs a little more standardization in the libraries (which is a fancy way of saying, you'll need to check the current state of the documentation).
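As a starting point, here's a short sketch using those classes, assuming scikit-learn and NumPy are installed. The data is synthetic and the parameter values are arbitrary. One thing to watch: scikit-learn calls the regularization strength `alpha` rather than \(\lambda\).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Made-up data: 5 features, but only the first two actually drive y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)

# Cross-validated search over candidate alpha values for ridge.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
```

Comparing the two sets of coefficients is a quick way to see the difference in behavior between the L1 and L2 penalties on the features that don't matter.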

Seeking \(\lambda\)

Given that \(\lambda\) is a hyperparameter and important, how do we calculate it? The answer is using cross-validation. We can either set up a search or step through various \(\lambda\) values to see which values give the lowest error on held-out data. This probably doesn't seem very satisfactory to you and frankly, it isn't. How to cheaply find \(\lambda\) is an area of research so maybe we'll have better answers in a few years' time. 
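Stepping through \(\lambda\) values by hand looks something like this sketch. To keep it short, I'm assuming a one-feature model with no intercept, where the regularized cost has a closed-form minimum; the data, the fold scheme, and the candidate values are all made up for illustration.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of the L2-regularized cost for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared error on the held-out fold, averaged over k folds."""
    fold_errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((w * x - y) ** 2 for x, y in valid) / len(valid))
    return sum(fold_errors) / k

rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(scores)
print("best lambda:", best_lam)
```

The key design choice is that each candidate is scored on data the model didn't see during fitting. Scoring on the training data would always favor the smallest penalty.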

The bottom line

Underfitting (bias) and overfitting (variance) can kill machine learning models (and models in general). Regularization is a powerful method for preventing these problems. Despite the large equations, it's actually quite easy to implement. 

Monday, November 8, 2021

Football crazy: predicting English Premier League football match results

I can get a qualification and be rich!

A long time ago, I was part of a gambling syndicate. A friend of mine had some software that predicted the results of English football (soccer) matches and at the time, betting companies offered fixed-price odds for certain types of bets. My friend noticed his software predicted 3-2 away wins more often than the betting company's odds would suggest. Over the course of a season, we had a 20% return on our gambling investment. 

During the COVID lockdown, I took the opportunity to learn R and did a long course that included a capstone project. I decided to see if I could forecast English Premier League (EPL) matches. If I succeeded, I could get a qualification and get rich too! What's not to like? Here's the story of what I did and what happened.

Premier League data

There's an eighteenth-century recipe for a hare dish that supposedly includes the instructions "First, catch your hare." The first step in any project like this is getting your data.

I got match results going back to the start of the league (1993) from football-data. The early data is only match results, but later data includes red cards and some other measurements.

TransferMarkt has data on transfer fees, foreign-born players, and team age, but the data's only available from 2011.

At the time of the project, I couldn't find any other free data sources. There were and are paid-for sources, but they were way beyond what I was willing to pay.

I knew going into the next phase of the project that this wasn't a very big data set with not that many fields. As it turned out, data was a severely limiting factor.

What factors are important?

I had a set of initial hypotheses for factors that might be important for final match scores. Here are most of them:

  • team cost - more expensive teams should win more games
  • team age - younger teams perform better
  • prior points - teams with more points win against teams with fewer points
  • foreign-born players - the more non-English players on the team, the more the team will win
  • previous match results - successful (winning) teams win more matches
  • home-field advantage
  • disciplinary record - red and yellow card history might be an indicator of risk-taking
  • season effects - as the season wears on, teams take more risks to win matches

I found evidence that most of these did in fact have an impact.

Here's strong evidence of home-field advantage. Note how it goes away during the 2020-2021 season when matches were played without fans.

Here's goal difference vs. team cost difference. The more expensive team tends to win.

Here's goal difference vs. mean prior goal difference. Teams that scored more goals before tend to score more goals in their current match.

I found more relationships you can read about if you're interested.

Machine learning

Thinking back to my gambling syndicate, I decided to forecast the score of each match rather than just win/lose/draw. My loss function was the RMSE of the goal difference between the predicted score and the actual score. To avoid COVID oddities, I removed the 2020-2021 season (the price being a smaller data set). Of course, I used a training and holdout dataset and cross-validation. 

The obvious question is, which machine learning models work? I decided to try a whole bunch of them:

  • Naive mean score model. A simple model that’s just the mean scores of the (training) data set.
  • Generalized Linear Model. A generalization of ordinary linear regression.
  • Glmnet. Fits lasso and elastic-net regularized generalized linear models.
  • SVM. Support Vector Machines - boundary-based regression. After some experimentation, I selected the svmRadial form of SVM, which uses a non-linear kernel function.
  • KNN. K-nearest neighbors. Given that EPL scores are all in close proximity to one another, we might expect this model to return good results.
  • Neural nets.
  • XGB Linear. This is linear modeling with extreme gradient boosting. Extreme gradient boosting has gathered a lot of attention over the last few years and may be one of the most used machine learning models today.
  • XGB Tree. This is a decision tree model with extreme gradient boosting.
  • Random Forest.
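The project itself was done in R, but the naive mean score model and the RMSE measure from the list above are easy to sketch in Python. The scores below are made up for illustration; they are not real EPL results.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Made-up (home goals, away goals) results.
train = [(2, 1), (0, 0), (3, 1), (1, 2), (1, 1), (2, 0)]
holdout = [(1, 0), (2, 2), (0, 1)]

# The naive model predicts the training-set mean for every match.
mean_home = sum(h for h, _ in train) / len(train)
mean_away = sum(a for _, a in train) / len(train)

home_rmse = rmse([mean_home] * len(holdout), [h for h, _ in holdout])
away_rmse = rmse([mean_away] * len(holdout), [a for _, a in holdout])
print(f"naive home-goal RMSE: {home_rmse:.3f}")
print(f"naive away-goal RMSE: {away_rmse:.3f}")
```

Any more sophisticated model has to beat these baseline numbers on the holdout set to justify its complexity.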

The model results weren't great. For the KNN model, here's how the RMSE for full-time away goals varied with n.

Note the RMSE scale - the lowest it goes to is 1.1 goals and it's plain that adding more n will only take us a little closer to 1.1. Bear in mind, football is a low-scoring game, and being off by 1 goal is a big miss.

It was the same story for random forest.

In fact, it was the same story for all of the models. Here are my final results. My model forecast home goals and away goals.

The naive means model is the simplest and all my sophisticated models could do is give me a few percentage points improvement.

Improving the results

Perhaps the most obvious way forward is combining models to improve RMSE. I'm reluctant to do that until I can get better individual model results. There's a philosophical issue at play; for me, the ensemble approach feels a bit "spray and pray".

In my view, data shortage is the main problem:

  • My data set was only in the low thousands of matches. 
  • Some teams join the Premier League for just a season and then get relegated - I don't model their history prior to joining the league. 
  • I removed the COVID season of 2020-2021. 
  • I only had team value and disciplinary data for ten or so seasons. 
  • Of course, I only modeled the Premier League.

Football is a low-scoring game, famous for its upsets. It may well be that it's just too random underneath to make useful predictions at the individual match level. 

What's next?

I wasn't able to predict EPL results with any great accuracy, but I submitted my report and got my grade. If you want to read my report, you can read it here.

At the end of the 2021 season, I saw some papers published on the COVID effect on match results. I had similar results months before. Perhaps I should have submitted a paper myself.

At some point, I might revive this project if I can get new data. I still occasionally hunt for new data sources, but sadly, I haven't found any. My dreams of retiring to a yacht on the Mediterranean will have to wait.

Monday, November 16, 2020

Geese or enemy aircraft? Receiver Operating Characteristic curves in machine learning

In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.

Is it geese or is it enemy planes?

In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler.  To weaken the country, and break the will of the people, Nazi aircraft heavily bombed British cities, which was the infamous blitz. I've seen estimates of over 43,000 people killed and of course, there was huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential one.

(Office workers in London going to work through bomb damage. Image source: Wikimedia Commons, License: Public Domain.)

It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings; for example, you could turn the gain (amplification) up. But what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.

Ways of being right and wrong

I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).


|  | Actual: enemy aircraft | Actual: geese |
| --- | --- | --- |
| Prediction: enemy aircraft | True Positive | False Positive |
| Prediction: geese | False Negative | True Negative |

From the counts of the True Positives, False Negatives, etc. we can define two quantities:

\[TPR = \frac{TP}{TP + FN} = 1 - FNR\]
\[FPR = \frac{FP}{FP + TN}\]

TPR is the True Positive Rate (also called sensitivity, recall, or hit rate) and FPR is the False Positive Rate (also called fall-out).
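As a quick worked example of the two rates, here's a sketch in Python. The counts are hypothetical; I've invented a radar log with 40 real raids and 60 flocks of geese.

```python
def true_positive_rate(tp, fn):
    """Fraction of real positives we caught."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Fraction of real negatives we wrongly flagged."""
    return fp / (fp + tn)

tp, fn = 36, 4     # raids detected, raids missed
fp, tn = 15, 45    # geese flagged as aircraft, geese correctly ignored

tpr = true_positive_rate(tp, fn)    # 36 / 40
fpr = false_positive_rate(fp, tn)   # 15 / 60
print(tpr, fpr)
```

At this one setting, the receiver catches 90% of raids but also scrambles fighters for a quarter of the geese.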

There are an overly large number of other quantities we can define to help us evaluate classification. But these quantities and numbers are points: they allow us to evaluate an algorithm at a point, or under a single operating condition.

A picture is worth a thousand words

The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this. 

We get a curve by varying a parameter and measuring FPR and TPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR. 
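In machine learning, the parameter we vary is usually a decision threshold on the classifier's score. Here's a minimal sketch of building the curve that way; the scores and labels are made up, and the function name is my own.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, sweeping from strict to lenient.
    labels: 1 = enemy aircraft (positive), 0 = geese (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 1)
        fp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 0)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # higher = more aircraft-like
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR {fpr:.2f}  TPR {tpr:.2f}")
```

Lowering the threshold plays the same role as turning up the gain: you catch more real aircraft, but you flag more geese too.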

Let's imagine a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^\circ\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.

An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate.

The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.

A more mathematical way of doing the same thing is to take the ROC curves and work out the area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line), and in general, the higher the AUC, the better.
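One simple way to get the area is the trapezoid rule over the curve's points. This is a sketch with hand-written (FPR, TPR) points for a chance line, an ideal receiver, and a made-up in-between receiver.

```python
def auc(points):
    """Area under a curve given as (FPR, TPR) points sorted by FPR (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

chance = [(0.0, 0.0), (1.0, 1.0)]
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
example = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
           (0.5, 0.75), (0.5, 1.0), (1.0, 1.0)]

print(auc(chance))    # 0.5
print(auc(ideal))     # 1.0
print(auc(example))   # 0.8125
```

The chance line scores 0.5 and the ideal receiver scores 1, so a useful receiver should land somewhere between the two.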

Machine learning

Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the users' browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and the appropriate algorithmic settings to use.

Uses of ROC curves outside of machine learning

ROC curves are used in a wide range of disciplines:

More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s. 

Limitations of ROC curves - precision-recall

In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: a biased data set can give you a false sense of the accuracy of your data. If your data is biased, a precision-recall plot may be more appropriate.

Going back to the confusion matrix, here's how we define precision and recall.

\[Precision = \frac{TP}{TP + FP}\]
\[Recall = \frac{TP}{TP + FN}\]

Here's a typical precision-recall curve.

Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.
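Here's a small sketch of why a biased data set can make the ROC view look better than it should. The counts are hypothetical: 10 positives hidden among 1,000 cases.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fn = 8, 2      # positives found, positives missed
fp, tn = 40, 950   # negatives wrongly flagged, negatives correctly ignored

rec = recall(tp, fn)        # 8 / 10, the same thing as TPR
fpr = fp / (fp + tn)        # 40 / 990
prec = precision(tp, fp)    # 8 / 48

print(f"recall (TPR): {rec:.2f}")
print(f"FPR:          {fpr:.3f}")
print(f"precision:    {prec:.3f}")
```

On a ROC chart this point looks excellent (high TPR, low FPR), yet only about one in six of the flagged cases is a real positive. Precision exposes that; FPR hides it.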

Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.

Don't lose sight of the end goal

ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.

Reading more

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S1481803500013336

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7628&rep=rep1&type=pdf

Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong

How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos? 

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient of itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.


|  | Actual: cat | Actual: not cat |
| --- | --- | --- |
| Prediction: cat | True Positive | False Positive |
| Prediction: not cat | False Negative | True Negative |
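Filling the matrix with counts is just a matter of tallying each (actual, predicted) pair. Here's a minimal sketch in plain Python with made-up labels; the function name is my own.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="cat"):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = ["cat", "cat", "not cat", "cat", "not cat", "not cat", "cat", "not cat"]
predicted = ["cat", "not cat", "not cat", "cat", "cat", "not cat", "cat", "not cat"]

counts = confusion_counts(actual, predicted)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])
```

Every metric later in this post is some ratio of these four numbers.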

All of this sounds great. It looks like we can define some rates and be done.  Let's start with some definitions and see where we get to.

We might want to know how often we said it was a cat when it actually was a cat; in other words, when it actually was positive, how often did we say it was positive? This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them in this table to give you a flavor.


|  | Actual: cat | Actual: not cat | Parameter |
| --- | --- | --- | --- |
| Prediction: cat | True Positive | False Positive | Precision (positive predictive value) \(\frac{TP}{FP + TP}\); False Discovery Rate \(\frac{FP}{FP + TP}\) |
| Prediction: not cat | False Negative | True Negative | False Omission Rate \(\frac{FN}{FN + TN}\) |
| Parameter | True Positive Rate (Recall, Sensitivity) \(\frac{TP}{TP + FN}\) | True Negative Rate (specificity) \(\frac{TN}{TN + FP}\); False Positive Rate \(\frac{FP}{TN + FP}\) |  |

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.

Prevalence

One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionally images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you're white.

To test for bias in the dataset, we look at a number called prevalence which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat lovers, but not so great if the site was trying to grow non-cat sales.
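The 95% cat example is easy to check for yourself. Here's a sketch of the lazy "always say cat" algorithm, with accuracy, prevalence, and the True Negative Rate computed from the counts.

```python
actual = ["cat"] * 95 + ["not cat"] * 5
predicted = ["cat"] * 100   # the algorithm says cat every single time

tp = sum(1 for a, p in zip(actual, predicted) if a == "cat" and p == "cat")
tn = sum(1 for a, p in zip(actual, predicted) if a == "not cat" and p == "not cat")
fp = sum(1 for a, p in zip(actual, predicted) if a == "not cat" and p == "cat")
fn = sum(1 for a, p in zip(actual, predicted) if a == "cat" and p == "not cat")

accuracy = (tp + tn) / (tp + fp + tn + fn)
prevalence = actual.count("cat") / len(actual)
tnr = tn / (tn + fp)

print(f"accuracy:   {accuracy:.2f}")
print(f"prevalence: {prevalence:.2f}")
print(f"TNR:        {tnr:.2f}")
```

The accuracy exactly matches the prevalence, which is the giveaway: the algorithm has learned nothing, and the True Negative Rate of zero confirms it never gets a non-cat right.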

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall). 

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

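\(\beta\) isn't calculated from the confusion matrix; it's a weight we choose to say how much more recall matters than precision. Choosing \(\beta < 1\) favors precision (so false positives hurt more), choosing \(\beta > 1\) favors recall (so false negatives hurt more), and \(\beta = 1\) gives back the \(F_1\) score. With \(\beta\) chosen, we can create a revised F score as: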

\[F_\beta =  \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]
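Both scores come from the same formula, so one function covers them. In this sketch, \(\beta\) is passed in as a plain argument; the counts and the values 0.5 and 2 are illustrative.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion matrix counts; beta=1 gives the F1 score."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

tp, fp, fn = 8, 4, 2   # made-up counts: precision 8/12, recall 8/10

f1 = f_beta(tp, fp, fn)
f_half = f_beta(tp, fp, fn, beta=0.5)
f_two = f_beta(tp, fp, fn, beta=2.0)

# Check F1 against the harmonic mean of precision and recall.
precision, recall = tp / (tp + fp), tp / (tp + fn)
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"F1   = {f1:.4f} (harmonic mean = {harmonic_mean:.4f})")
print(f"F0.5 = {f_half:.4f}")
print(f"F2   = {f_two:.4f}")
```

Precision is the weaker of the two numbers here, so the score drops when \(\beta\) is below 1 and rises when \(\beta\) is above 1.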

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.
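For the Python side, here's a minimal sketch assuming scikit-learn is installed. The labels are made up. Be aware that scikit-learn puts the actual classes in rows and the predictions in columns, which is the transpose of the layout I used in the tables above.

```python
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 1, 0, 0, 1, 0]      # 1 = cat, 0 = not cat
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# With 0/1 labels the result is [[TN, FP], [FN, TP]].
matrix = confusion_matrix(actual, predicted)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```

Unpacking with `ravel()` gives you the four counts directly, ready to feed into whichever rates you care about.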

What's next?

The confusion matrix is just the start. There are several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history.  The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.