Wednesday, May 14, 2025

The Importance of Logistic Regression

Note

With the exception of this note, everything else in this blog post was automatically created by Manus. I'm providing it as an example of what you can create.

In this separate blog post, I explain how I created this report and I provide an evaluation of it.

If you want to get started with Manus, contact me and I'll share an invitation with you.

Mike

======================================

The Importance of Logistic Regression

Logistic regression stands as a cornerstone in the field of machine learning and statistics, primarily recognized for its efficacy in tackling binary classification problems. Its importance stems from a combination of its interpretability, efficiency, and the foundational understanding it provides for more complex algorithms. Unlike linear regression, which predicts continuous outcomes, logistic regression is specifically designed to predict the probability of an instance belonging to a particular class, typically one of two (e.g., yes/no, true/false, 0/1). This probabilistic output is crucial in many real-world scenarios where a clear-cut decision boundary is needed, but an understanding of the likelihood of each outcome is also valuable.

One of the key reasons for logistic regression’s widespread adoption is its relative simplicity and ease of implementation. It serves as an excellent starting point for individuals venturing into predictive modeling and classification tasks. The mathematical underpinnings, while involving concepts like the sigmoid function and log-odds, are generally more accessible than those of more sophisticated models like neural networks or support vector machines. This accessibility does not, however, detract from its power. Logistic regression can provide robust and accurate predictions, especially when the relationship between the independent variables and the log-odds of the dependent variable is approximately linear.

Furthermore, the interpretability of logistic regression models is a significant advantage. The coefficients derived from a trained logistic regression model can be directly interpreted in terms of the odds ratio. This allows practitioners to understand the influence of each independent variable on the likelihood of the outcome. For instance, in a medical diagnosis scenario, a logistic regression model can not only predict the probability of a patient having a certain disease but also quantify how factors like age, weight, or specific test results contribute to that probability. This level of insight is invaluable in fields where understanding the ‘why’ behind a prediction is as important as the prediction itself.

Logistic regression is also computationally efficient, making it suitable for large datasets and real-time applications. Training a logistic regression model is generally faster compared to more complex algorithms, and making predictions is also quick. This efficiency, combined with its good performance on many binary classification tasks, makes it a go-to algorithm for a wide range of applications. These applications span various domains, including medical diagnosis (e.g., predicting disease presence), finance (e.g., credit scoring, fraud detection), marketing (e.g., predicting customer churn or purchase likelihood), and social sciences (e.g., predicting voting behavior).

Moreover, logistic regression serves as a fundamental building block for understanding more advanced classification techniques. Many concepts introduced in logistic regression, such as the use of a link function (the sigmoid function), maximum likelihood estimation for parameter fitting, and the evaluation of model performance using metrics like accuracy, precision, recall, and AUC-ROC, are transferable to other machine learning algorithms. Therefore, a solid grasp of logistic regression provides a strong foundation for learning and applying more complex models.

In summary, the importance of logistic regression is multifaceted. It is a powerful yet relatively simple and interpretable classification algorithm that provides probabilistic outputs. Its computational efficiency, wide range of applications, and its role as a foundational concept in machine learning solidify its place as an essential tool in the data scientist’s and statistician’s toolkit. Whether used as a standalone model or as a baseline for comparison with more complex methods, logistic regression continues to be a highly relevant and valuable technique in the world of data analysis and predictive modeling.

The Theory and Math Behind Logistic Regression

Logistic regression, despite its name, is a statistical model used for binary classification tasks, meaning it predicts the probability of an instance belonging to one of two classes. The core idea is to model the probability that a given input point belongs to a certain class. To understand its mechanics, we need to delve into concepts like the odds, the logit function, the sigmoid (or logistic) function, and the method of maximum likelihood estimation for fitting the model.

From Linear Regression to Probabilities

Linear regression predicts a continuous output, y, based on a linear combination of input features, X. The equation for a simple linear regression with one feature is y = β₀ + β₁x. For multiple features, this becomes y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. However, the output of linear regression can range from -∞ to +∞, which is not suitable for probabilities that must lie between 0 and 1.

To address this, logistic regression transforms the linear combination of inputs using a function that maps any real-valued number into the (0, 1) interval. This function is the sigmoid function, also known as the logistic function.

The Sigmoid (Logistic) Function

The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

Here, ‘z’ represents the linear combination of input features and their corresponding coefficients (weights): z = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. The output of the sigmoid function, σ(z), is the estimated probability P(Y=1|X), i.e., the probability that the dependent variable Y is 1 (e.g., ‘pass’, ‘yes’, ‘disease present’) given the input features X. As z approaches +∞, e^(-z) approaches 0, and σ(z) approaches 1. Conversely, as z approaches -∞, e^(-z) approaches +∞, and σ(z) approaches 0. This S-shaped curve is ideal for modeling probabilities.
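
To make this behavior concrete, here is a minimal sketch of the sigmoid function in Python (assuming numpy is available; the z values are just illustrative):

import numpy as np

def sigmoid(z):
    # Maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z gives a probability near 0; large positive z gives a probability near 1
for z in [-10, -2, 0, 2, 10]:
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")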

Odds and Log-Odds (Logit)

To understand the derivation of the logistic regression model, it’s helpful to consider the concept of odds. The odds of an event occurring is the ratio of the probability of the event occurring to the probability of it not occurring:

Odds = P(Y=1|X) / P(Y=0|X)

Since P(Y=0|X) = 1 - P(Y=1|X), we can write:

Odds = P(Y=1|X) / (1 - P(Y=1|X))

If we let p(X) = P(Y=1|X) = σ(z) = 1 / (1 + e^(-z)), then:

1 - p(X) = 1 - [1 / (1 + e^(-z))] = (1 + e^(-z) - 1) / (1 + e^(-z)) = e^(-z) / (1 + e^(-z))

So, the odds become:

Odds = [1 / (1 + e^(-z))] / [e^(-z) / (1 + e^(-z))] = 1 / e^(-z) = e^z

Now, taking the natural logarithm of the odds gives us the log-odds, also known as the logit function:

logit(p(X)) = ln(Odds) = ln(e^z) = z

Thus, we have:

ln(p(X) / (1 - p(X))) = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

This equation shows that the log-odds of the outcome is a linear function of the input features. This is the fundamental relationship that logistic regression models. The coefficients (β) can be interpreted in terms of the change in log-odds for a one-unit change in the corresponding feature, holding other features constant. Exponentiating a coefficient gives the odds ratio.
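
As a quick numerical illustration (the coefficient value here is made up), a coefficient of 0.8 corresponds to an odds ratio of e^0.8 ≈ 2.23, meaning a one-unit increase in that feature multiplies the odds of Y=1 by about 2.23, holding the other features constant:

import numpy as np

beta_1 = 0.8                   # hypothetical coefficient for one feature
odds_ratio = np.exp(beta_1)    # multiplicative change in the odds per one-unit increase
print(f"Odds ratio: {odds_ratio:.2f}")  # approximately 2.23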

Model Fitting: Maximum Likelihood Estimation (MLE)

Unlike linear regression, where coefficients are typically estimated using Ordinary Least Squares (OLS), logistic regression coefficients are estimated using Maximum Likelihood Estimation (MLE). MLE is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the given data.

For a dataset with ‘n’ independent observations {(xᵢ, yᵢ)}, where xᵢ is the vector of features for the i-th observation and yᵢ is its binary outcome (0 or 1), the likelihood function L(β) is the product of the probabilities of observing each yᵢ given xᵢ and the parameters β:

L(β) = Πᵢ [p(xᵢ) ^ yᵢ] * [(1 - p(xᵢ)) ^ (1 - yᵢ)]

where p(xᵢ) = σ(β₀ + β₁x₁ᵢ + … + βₚxₚᵢ) is the predicted probability for the i-th observation.

It is often easier to work with the log-likelihood function, ll(β), because it converts the product into a sum:

ll(β) = ln(L(β)) = Σᵢ [yᵢ * ln(p(xᵢ)) + (1 - yᵢ) * ln(1 - p(xᵢ))]

Substituting p(xᵢ) = 1 / (1 + e^(-zᵢ)) and 1 - p(xᵢ) = e^(-zᵢ) / (1 + e^(-zᵢ)), where zᵢ = β₀ + β₁x₁ᵢ + … + βₚxₚᵢ, the log-likelihood becomes:

ll(β) = Σᵢ [yᵢ * zᵢ - ln(1 + e^(zᵢ))]

To find the values of β that maximize this log-likelihood function, we typically use iterative optimization algorithms like Gradient Ascent (since we are maximizing) or Newton-Raphson. These algorithms start with initial estimates for β and iteratively update them until the log-likelihood converges to a maximum. There is no closed-form solution for the β coefficients in logistic regression, unlike in linear regression.
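
The sketch below shows the idea with plain gradient ascent on synthetic data (numpy only; the learning rate, iteration count, and 'true' coefficients are arbitrary choices for illustration, not a substitute for a library implementation):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, an intercept column, and 2 features
n_obs = 200
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

# Gradient ascent on the log-likelihood ll(beta)
beta = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # current predicted probabilities
    gradient = X.T @ (y - p) / n_obs     # gradient of the (averaged) log-likelihood
    beta += learning_rate * gradient

print("Estimated coefficients:", np.round(beta, 2))  # should be roughly close to true_beta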

Assumptions of Logistic Regression

While logistic regression is more flexible than linear regression, it still relies on a few key assumptions:

  1. Binary Dependent Variable: The dependent variable must be binary or dichotomous (e.g., 0/1, yes/no). For more than two categories, extensions like multinomial or ordinal logistic regression are used.
  2. Independence of Observations: The observations should be independent of each other. This is a common assumption for many statistical models.
  3. Linearity of Log-Odds: The relationship between the independent variables and the log-odds of the outcome is assumed to be linear. This can be checked using techniques like the Box-Tidwell test or by plotting residuals.
  4. Absence of Multicollinearity: There should be little or no multicollinearity among the independent variables. High multicollinearity can make it difficult to estimate the individual effects of the predictors.
  5. Large Sample Size: Logistic regression typically requires a reasonably large sample size to achieve stable and reliable estimates of the coefficients.

Understanding these theoretical and mathematical underpinnings is crucial for effectively applying logistic regression, interpreting its results, and diagnosing potential issues.

Worked Example: Logistic Regression in Python

This section provides a practical, step-by-step demonstration of how to implement logistic regression using Python. We will leverage popular libraries such as pandas for data manipulation, scikit-learn for machine learning tasks including model building and evaluation, and numpy for numerical operations. For this example, we will use the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which is conveniently available within scikit-learn. This dataset presents a binary classification problem: predicting whether a breast mass is malignant or benign based on several computed features from digitized images of fine needle aspirates (FNA).

1. Importing Necessary Libraries

The first step in any Python-based data science task is to import the required libraries. We will need pandas for creating and managing DataFrames, numpy for numerical computations (though its direct use might be minimal here, it underpins scikit-learn), and several modules from scikit-learn for data splitting, model implementation, preprocessing, and metrics.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

2. Loading and Exploring the Dataset

We load the breast cancer dataset using load_breast_cancer() from sklearn.datasets. The data and feature names are then used to create a pandas DataFrame for easier manipulation and inspection. The target variable, indicating whether a tumor is malignant (0) or benign (1), is added as a new column to this DataFrame.

# Load the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

Before proceeding with modeling, it is crucial to perform some initial exploratory data analysis (EDA). We display the first few rows of the DataFrame using df.head() to get a feel for the data, df.info() to understand the data types and check for missing values, and df["target"].value_counts() to see the distribution of the target classes.

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

This initial exploration helps confirm that the dataset is loaded correctly, identify the nature of the features (all appear to be numerical in this case), and understand the balance of the classes in the target variable, which is important for classification tasks.

3. Defining Features and Target Variable

Next, we separate the dataset into features (independent variables, denoted as X) and the target variable (dependent variable, denoted as y). X will contain all columns except the ‘target’ column, and y will consist solely of the ‘target’ column.

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

4. Splitting Data into Training and Testing Sets

To evaluate the performance of our logistic regression model on unseen data, we split the dataset into a training set and a testing set. The model will be trained on the training set, and its predictive performance will be assessed on the testing set. We use train_test_split from sklearn.model_selection for this purpose. A common split is 80% for training and 20% for testing. Setting random_state ensures that the split is the same every time the code is run, making the results reproducible. The stratify=y argument ensures that the proportion of the target classes is maintained in both the training and testing sets, which is particularly important for imbalanced datasets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

5. Feature Scaling

Many machine learning algorithms, including logistic regression (especially when using certain solvers like ‘lbfgs’ or when regularization is applied), perform better when the input numerical features are on a similar scale. Feature scaling standardizes the range of independent variables. We use StandardScaler from sklearn.preprocessing, which standardizes features by removing the mean and scaling to unit variance. The scaler is fit only on the training data to prevent data leakage from the test set, and then used to transform both the training and testing data.

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

6. Initializing and Training the Logistic Regression Model

With the data prepared, we can now initialize and train our logistic regression model. We create an instance of the LogisticRegression class from sklearn.linear_model. For this example, we specify the solver="liblinear", which is a good choice for smaller datasets and binary classification, and set random_state for reproducibility. The max_iter parameter is increased to ensure the solver has enough iterations to converge. The model is then trained using the fit() method with the scaled training features (X_train_scaled) and the training target variable (y_train).

# Initialize and train the Logistic Regression model
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

7. Making Predictions

Once the model is trained, we can use it to make predictions on the test set (X_test_scaled). The predict() method returns the predicted class labels (0 or 1 in this case). We also use the predict_proba() method to obtain the predicted probabilities for each class. This provides the likelihood of an instance belonging to class 0 (malignant) and class 1 (benign).

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

8. Evaluating the Model

Model evaluation is crucial to understand how well our logistic regression model performs. We use several common metrics for classification tasks:

  • Accuracy: This is the proportion of correctly classified instances. It is calculated using accuracy_score.
  • Confusion Matrix: This table provides a detailed breakdown of correct and incorrect classifications for each class (True Positives, True Negatives, False Positives, False Negatives). It is generated using confusion_matrix.
  • Classification Report: This report, generated by classification_report, includes precision, recall, F1-score, and support for each class. These metrics provide a more nuanced view of performance, especially if the classes are imbalanced.
    • Precision measures the accuracy of positive predictions (TP / (TP + FP)).
    • Recall (or Sensitivity) measures the model’s ability to identify all actual positives (TP / (TP + FN)).
    • F1-score is the harmonic mean of precision and recall, providing a single score that balances both.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

The output of these evaluations will indicate the model’s effectiveness. For instance, a high accuracy and balanced precision/recall scores suggest good performance.

9. Interpreting Predicted Probabilities

To further understand the model’s output, we can look at the predicted probabilities for a few samples from the test set. This shows the model’s confidence in its predictions.

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

Each row in y_pred_proba contains two probabilities: the first for class 0 (malignant) and the second for class 1 (benign). The predict() method assigns the class with the higher probability (equivalent to a 0.5 threshold in the binary case).
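
To see this explicitly, you can apply the 0.5 threshold to the class 1 (benign) probabilities yourself and compare with the labels returned by predict(); the two should agree (this snippet reuses the y_pred_proba and y_pred variables defined above):

# Manually apply a 0.5 threshold to the probability of class 1 (benign)
manual_pred = (y_pred_proba[:, 1] >= 0.5).astype(int)
print("Manual threshold matches predict():", np.array_equal(manual_pred, y_pred))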

10. Interpreting Model Coefficients

Finally, we can examine the coefficients (weights) learned by the logistic regression model. These coefficients indicate the relationship between each feature and the log-odds of the outcome. A positive coefficient suggests that an increase in the feature’s value increases the log-odds of the outcome being class 1 (benign), while a negative coefficient suggests the opposite. We can also exponentiate these coefficients to get odds ratios, which are often easier to interpret. An odds ratio greater than 1 means the odds of the outcome (benign) increase with an increase in the feature, while an odds ratio less than 1 means the odds decrease.

# Interpreting Coefficients
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

This step provides insights into which features are most influential in the model’s predictions. It is important to remember that these interpretations refer to the scaled features, since feature scaling was applied here.

This worked example covers the end-to-end process of applying logistic regression, from data loading and preprocessing to model training, evaluation, and basic interpretation. The specific results (accuracy, coefficients, etc.) will depend on the dataset and the chosen parameters, but the methodology remains consistent.

# Python Worked Example for Logistic Regression

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

# Load the dataset
# The breast cancer dataset is a classic binary classification dataset.
# Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
# They describe characteristics of the cell nuclei present in the image.
# The target variable is whether the mass is malignant (0) or benign (1).
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

# Split the data into training and testing sets
# We use 80% of the data for training and 20% for testing.
# random_state is set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# Feature Scaling
# Logistic regression can benefit from feature scaling, especially when using solvers that are sensitive to feature magnitudes.
# StandardScaler standardizes features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model
# We use the liblinear solver, which works well for smaller datasets and binary classification.
# max_iter is increased to ensure convergence for some solvers.
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

# Evaluate the model
# Accuracy: The proportion of correctly classified instances.
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

# Confusion Matrix: A table showing the performance of a classification model.
# Rows represent the actual classes, and columns represent the predicted classes.
# TN | FP
# FN | TP
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

# Classification Report: Provides precision, recall, F1-score, and support for each class.
# Precision: TP / (TP + FP) - Ability of the classifier not to label as positive a sample that is negative.
# Recall (Sensitivity): TP / (TP + FN) - Ability of the classifier to find all the positive samples.
# F1-score: 2 * (Precision * Recall) / (Precision + Recall) - Weighted average of Precision and Recall.
# Support: The number of actual occurrences of the class in the specified dataset.
class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

# Interpreting Coefficients (Optional, but good for understanding)
# The coefficients represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable,
# holding other variables constant.
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

# To get odds ratios, we can exponentiate the coefficients
odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

print("\n--- End of Worked Example ---")

Monday, May 5, 2025

I can’t believe they said that! How to speak difficult truths

Comedians can tell truths others can’t

I heard something intriguing in a comedian’s podcast and it wasn’t what you might think. The host was interviewing a comedian and talking about her latest set. It was all about some very dark and harrowing things that had happened to her.  She’d managed to create a comedy set that enabled her to talk about those things and she explained how she’d structured her performance to do it. 

Although not as extreme, I’ve seen and heard comedians talk about some very difficult subjects. This isn’t new; famously, court jesters could speak truth to power and not be executed for it. The court jester appears in a modern form too. I’ve seen comedians at corporate events say some things that are very close to the bone and get away with it. 

(The Court Jester by John Watson Nicol, Public domain, via Wikimedia Commons)

This poses the question: how do comedians do it?

Trust and safety

On the podcast, the ‘harrowing’ comedian explained how she made the audience feel safe at the start of her act. The audience knew the subject matter would be difficult, but they had to trust her as their guide. She talked about how she did that: the jokes she told, her use of language, how she interacted with the audience, and so on. Only once the audience were in a position where they felt safe and they trusted her did she start on her more difficult journey.

The idea of safety also applies to the court jester and his or her modern counterparts. The jester will never be king, so they’re not a threat to the established order. In fact, the king is paying the jester, and of course, the payment could end at any time. Payments set limits on how far you can go, so the court jester knows to be concerned with audience safety too.

We can gain some insight into why audience safety is important through some of the theories of comedy.

Theories of comedy: benign violation

There’s very active research into what comedy is and why it appeals to us. Researchers have developed a multitude of theories that explain why we find different types of jokes funny, but there’s no accepted grand unified theory.

The comedy theory that’s most applicable to us here is Benign Violation theory. This theory says we find things funny that violate our expectations of reality in some way but only if they don’t feel threatening to us. Threatening can mean different things at different times, but it also gets to expectation. If I go to a stage play about a difficult subject, I might expect the play to make me cry for the characters. If I go to a comedy show, I want to laugh, not cry with empathy. In comedy, I have to feel safe with the comedian, meaning they’re not going to take me to bad emotional places.

Using this theory, we can understand how a comedian can structure an act about a time they were mugged. Let’s say there were some absurdities about the robbery itself. If the comedian talks about how awful they felt during and after the robbery and how it affected them, this is all very serious and not funny; the audience will empathize but not laugh, so it isn’t benign; the audience isn’t safe emotionally. The comedian has to remove the sting somehow which they could do by letting everyone know they were OK after the robbery. Once the audience knows it’s safe, the comedian can proceed and focus on the absurdities (AKA violations). 

If you want to see an extraordinary example of this for real, see Tig Notaro’s act about her breast cancer and double mastectomy. She places the audience in a position of safety and only then talks about what happened. She focuses on some absurdities of her experience and real life and not on the harrowing side of it, so again the audience feels safe (benign) while she talks about difficult things (violation). It’s OK to laugh, because she’s OK with it and she’s laughing with us. 

(Gage Skidmore from Peoria, AZ, United States of America, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons)

Rule breaking: unsafe audiences 

This brings us to an interesting aspect of audience safety, audience interaction. 

I’ve seen comedians pick on people in the front rows and make fun of them. For example, make fun of their occupation or partner or where they’re from etc.  This goes to another theory of comedy, superiority theory, that says we laugh at the misfortune of others. If you’re not the person the comedian is picking on, it can be very funny, but if you are, it can be very threatening.

Think for a minute how the audience feels while the comedian is looking for a new target. There’s fear because some of the humor can cut deeply. Audiences know this and can be very wary. I’ve been to comedy acts where no one wants to sit near the stage and no one will volunteer anything to the comedian. The audience don’t feel safe doing so. 

Years ago, I went to see Eddie Izzard. He started his act asking the audience questions. No one answered. At the time, comedians were known to pick on audience members, so the audience didn’t feel safe. When finally someone did answer, he made fun of their home town. Later on in his act, Eddie Izzard commented about the audience’s English reserve and not interacting, but I think he was wrong and it was something else; they didn’t feel safe engaging with him because they didn’t want to be a target.

More recently, I was at a corporate event where there was a stand-up comedian. She said some very funny things about one of the C-level execs; it was cutting because it was true. When she asked for audience interaction, she got none, because no one wanted to be her next target.

Presentations and audience safety

Years ago, I was on a presentation training course. A nurse was presenting on a technical topic about the welfare of child patients. At one point, she seemed to get very upset at a memory and it was noticeable in her presentation. The class teacher called her out on it; she said that it didn't feel appropriate in the context of the presentation. By introducing strong emotion, she'd distracted the audience from her message. This seems harsh, but the class teacher was right.

Strong emotions are difficult for audiences to deal with, especially if they aren't expecting it. Strong emotions overwhelm everything else the presenter might say. This gets to audience emotional safety. My 'harrowing' comedian put a lot of effort into making her audience feel safe before discussing difficult subjects. Most presenters don't have anything like the skill level to do that, so they should stay away from expressing strong emotions.

Expectations and safety

There’s something that’s kind of obvious but hidden and that’s audience expectations and safety. If you go to see a late-night comedian after the pubs have shut, you might expect an expletive ridden show with all kinds of adult humor, and that’s OK because you know what it is. On the other hand, you have very different expectations for a comedian performing in front of 10-year-old children. Where the safety boundaries are varies depending on the audience.

In the case of my ‘harrowing’ comedian, she made it very clear in her show’s publicity material that her show contained very difficult material. On the Tig Notaro show I saw on TV, the channel made it clear it was an adult show covering difficult themes. In my view, this is responsible and also helps the audience to feel safe.

What all this means

As a presenter, if you want the audience to interact with you, they have to trust you. Don’t demean people who volunteer, it discourages everyone else. I suggest positivity. Let’s say an audience member tells you they come from a very run down town. You could riff on crime in that town, or you could tell a benign story about the town like losing your car in a huge parking lot there. Rewarding people for engaging with you encourages more engagement.

Audiences have to feel safe with you if you’re going to push any kind of boundary, and this is especially true if any of your material is difficult. You have to let your audience know that you’re OK and they’re OK, and they’ll be OK if they go on a journey with you; you’re going to make them laugh, not cry. 

Finally, you can speak truth to power through humor, but you need to know what you're doing and what the limits are. 

Wednesday, April 23, 2025

The basics of regularization in machine learning

The problem

Machine learning models are trained on a set of sampled data (the training set). Data scientists use these trained models to make predictions from new data. For example, a recommender system might be trained on a data set of movies people have watched, then used to make recommendations on the movies people might like to watch. Key to the success of machine learning models is their accuracy; recommending the wrong movie, predicting the wrong sales volume, or misdiagnosing a medical image all have moral and financial consequences.

There are two causes of machine learning failure closely related to model training: underfitting and overfitting. 

Underfitting is where the model is too simple to correctly represent the data. The symptoms are a poor fit to the training data set. This chart shows the problem.


Years ago, I saw a very clear case of underfitting. The technical staff in a data center were trying to model network traffic coming in so they could forecast the computing power they needed. Clearly, the data wasn’t linear; it was a polynomial of at least order 2 plus a lot of noise. Unfortunately, they only knew how to do linear regression, so they tried to model the data using a series of linear regressions. Sadly, this meant their forecasts were next to useless. Frankly, their results would have been better if they’d extrapolated by hand using a pencil.

Overfitting is where the model is too complex, meaning it tries to fit noise instead of just the underlying trends. The symptoms are an excellent fit to the training data, but poor results when the model is exposed to real data or extrapolated. This chart shows the problem. The curve was overfit (the red dotted line), so when the curve is extrapolated, it produces nonsense.

In another company, I saw an analyst try to forecast sales data. He used a highly complex data set and a very, very, very complex model. It fit the data beautifully. Unfortunately, it gave clearly wrong sales predictions for the next year (e.g., negative sales). He had overfit his data, so when the model was extrapolated to the next year, it gave nonsense. He tweaked the model and got some saner-looking predictions, but because it still overfit, its forecast turned out to be way off.

Like all disciplines, machine learning has a set of terminology aimed at keeping outsiders out. Underfitting is called bias and overfitting is called variance. These are not helpful terms in my view, but we’re stuck with them. I’m going to use the proper terminology (bias and variance) and the more straightforward terms (underfitting and overfitting) for clarity in this blog post.

Let’s look at how machine learning copes with this problem by using regularization.

Regularization

Let’s start with a simple linear machine learning model where we have a set of \(n\) features (\(X = \{x_1, x_2, ..., x_n\}\)) and we’re trying to model a target variable \(y\) using \(m\) observations. \(\hat{y}\) is our estimate of \(y\) using the features \(X\), so we have:

\[\hat{y}^{(i)} = wx^{(i)} + b\]

where \(i\) varies from 1 to \(m\).

The cost function measures how far our model's predictions are from the actual values:

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2\]

To find the model parameters \(w\), we minimize the cost function (typically, using gradient descent, Adam, or something like that). Overfitting manifests itself when some of the \(w\) parameters are too big. 

The idea behind regularization is that it introduces a penalty for adding more complexity to the model, which means keeping the \(w\) values as small as possible. With the right choices, we can make the model fit the 'baseline' without being too distracted by the noise.

As we'll see in a minute, there are several different types of regularization. For the simple machine learning model we're using here, we'll use the popular L2 form of regularization. 

Regularization means altering the cost function to penalize more complicated models. Specifically, it introduces an extra term to the cost function, called the regularization term.

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_{j}^{2}\]

\(\lambda\) is the regularization parameter and we set \(\lambda > 0\). Because \(\lambda > 0\) we're penalizing the cost function for higher values of \(w\), so gradient descent will tend to avoid them when we're minimizing. The regularization term is a square term; this modified cost function is a ridge regression or L2 form of regularization.
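
To make the penalty concrete, here's a minimal numpy sketch of gradient descent on the L2-regularized cost function above (the data, learning rate, and \(\lambda\) value are arbitrary; it's an illustration, not production code):

import numpy as np

rng = np.random.default_rng(1)

# Toy data: m observations, n features, where only the first two features matter
m, n = 100, 10
X = rng.normal(size=(m, n))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=m)

lam = 1.0      # the regularization parameter lambda
lr = 0.05      # learning rate
w = np.zeros(n)

for _ in range(2000):
    y_hat = X @ w                                   # predictions (ignoring b, as in the text)
    grad = (X.T @ (y_hat - y)) / m + (lam / m) * w  # gradient of the regularized cost
    w -= lr * grad

print(np.round(w, 2))  # the weights on the irrelevant features are pulled towards zero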

You might think that regularization would reduce some of the \(w\) parameters to zero, but in practice, that’s not what happens. It reduces their contribution substantially, but often not totally. You can still end up with a model that’s more computationally complex than it needs to be, but it won’t overfit.

You probably noticed that \(b\) appears in the model but not in the regularization term. That's because in practice, penalizing \(b\) makes very little difference, but it does complicate the math, so I'm ignoring it here to make our lives easier.

Types of regularization

This is the ridge regression or L2 form of regularization (that we saw in the previous section):

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_{j}^{2}\]

The L1 form is a bit simpler; it's sometimes known as the lasso, an acronym for Least Absolute Shrinkage and Selection Operator.

\[J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} |w_{j}|\]

Of course, you can combine L1 and L2 regularization, which is called elastic net regularization. It can outperform L1 or L2 alone, but the computational cost is higher.

A more complex form of regularization is entropy regularization, which is used a lot in reinforcement learning.

For most cases, the L2 form works just fine.

Regularization in more complex machine learning models - dropping out

Linear machine learning models are very simple, but what about logistic models or the more complex neural nets? As it turns out, regularization works for neural nets and other complex models too.

Overfitting in neural nets can occur due to "over-reliance" on a small number of nodes and their connections. To regularize the network, we randomly drop out nodes during the training process; this is called dropout regularization, and for once, we have a well-named piece of jargon. The net effect of dropout regularization is a "smoother" network that models the baseline and not the noise.
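
As an illustration (a sketch assuming PyTorch; the layer sizes and dropout probability are arbitrary), dropout is typically added as a layer in the network definition:

import torch.nn as nn

# A small network with dropout between the hidden layers.
# During training, each forward pass randomly zeroes 30% of the hidden activations;
# calling model.eval() switches dropout off for predictions.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 1),
)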

Regularization in Python

The scikit-learn package has the functionality you need. In particular, check out the Lasso, Ridge, ElasticNet, and GridSearchCV classes. Dropout regularization in neural networks is a bit more complicated, and in my view it needs a little more standardization in the libraries (which is a fancy way of saying you'll need to check the current state of the documentation).
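
For example, here's a minimal sketch of fitting ridge and lasso models with scikit-learn on synthetic data (note that scikit-learn calls the regularization parameter alpha rather than \(\lambda\); the values below are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can drive some coefficients to exactly zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))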

Seeking \(\lambda\)

Given that \(\lambda\) is an important hyperparameter, how do we choose it? The answer is cross-validation. We can either set up a search or step through various \(\lambda\) values to see which one gives the best cross-validated error. This probably doesn't seem very satisfactory to you and frankly, it isn't. How to cheaply find \(\lambda\) is an area of research, so maybe we'll have better answers in a few years' time.
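
As a sketch of that search (reusing the synthetic data idea above; the grid of candidate values is arbitrary), GridSearchCV steps through the candidates and keeps the one with the best cross-validated score:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Try a range of regularization strengths with 5-fold cross-validation
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])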

The bottom line

Underfitting (bias) and overfitting (variance) can kill machine learning models (and models in general). Regularization is a powerful method for preventing these problems. Despite the large equations, it's actually quite easy to implement. 

Monday, April 21, 2025

The parakeets of Belfast

It started in London

Over ten years ago, I was in suburban London and I got a shock; I thought I'd seen a parrot flying wild. I looked again, and this time, I saw two of them. They were about 30 cm (1 ft) long, bright green, with a rose-colored ring around their necks.


(Dr. Raju Kasambe, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons - this is what they look like in their native home)

I wasn't hallucinating, what I saw were wild parakeets that had established breeding colonies in London. Formally, they are rose-ringed parakeets or Psittacula krameri. A 1997 survey found there were about 3,500 breeding pairs, with a 2012 survey finding 32,000; these numbers were for London alone. There are likely far more now.

The birds seemed to have started off in south-west London before spreading to other parts of the city. Bear in mind, London has lots of quite large parks in urban and suburban areas that are only a short flight away from each other. Lots of people put out food for the birds, so there's plenty for them to eat.

(Parakeet in Garden, London N14 by Christine Matthews, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons)

Parakeets are natively found in a band from sub-Saharan Africa to India. Given that they're from a hot climate, the obvious question is, how do they survive the English winters? Part of the answer is the mild British weather; despite being quite far north, the UK climate is strongly affected by the Gulf Stream, which gives cooler summers and warmer winters. It rarely snows in the south of England and it rarely gets extremely cold, which means the birds can overwinter without dying off. The other part of the answer is the parakeets' range in their home environment; they're found as far as the foothills of the Himalayas, which are obviously pretty cool.

Jimi Hendrix or The African Queen or...?

The next most obvious question is, how did these parakeets get there? There are some great legends, so I'm going to tell them.

One story says it all goes back to the movie "The African Queen" which was partly filmed in Isleworth just outside London. The legend has it, the production company shipped in parakeets for filming and then let them loose at the end of the shoot. The birds moved to Twickenham (next door to Isleworth), which they found hospitable, and they spread from there.

If you don't like that legend, then maybe you'd like one of the others. Jimi Hendrix is said to have had parakeets when he lived in London in the 1960's. One day, he decided to set them free, and so the wild parakeet population got started.

(Warner/Reprise Records, uploaded by We hope at en.wikipedia, Public domain, via Wikimedia Commons)

There are other legends involving Henry VIII, the Great Storm of 1987, and more. You can read all about them online.

The reality is probably much more mundane. Parakeets were popular as pets. As people got bored of them, the easiest thing to do was to just release them. With enough people releasing birds, you've got a viable breeding population.

Talking

Parakeets are famously noisy birds, so they just add to the din in an already noisy city. Notably, parakeets can mimic human speech very clearly and are among the best talking parrots. It's a bit odd to think there are thousands of wild birds in London capable of mimicking human speech; maybe they'll have cockney accents.

Glasgow

By 2019, the parakeets had made their way north to Glasgow and set up home in Victoria Park, and from there, they've been colonizing Scotland. The population in Glasgow had the distinction of being the most northerly parrot population anywhere in the world, but it now looks as if the birds have moved even further north.

Here's a map from the NBN Atlas (https://species.nbnatlas.org/species/NHMSYS0000530792) showing current confirmed and unconfirmed sightings of the parakeets in the UK.

Dublin

Parakeets were first spotted in Dublin around 2012. By 2020, they'd started spreading outside Dublin into the surrounding countryside and towns.

As one of the local commentators said, the fact the parakeets are bright green seems appropriate for Ireland.

How did the parakeets get to Dublin? Bear in mind, Jimi Hendrix didn't live in Dublin and "The African Queen" wasn't shot there. Of course, they could have flown there from London, but the Irish Sea is a rough sea and it's a long way to fly across open water. The most likely explanation is the most mundane: people releasing their pets when they got bored of them.

Belfast

Recently (2025), parakeets have been spotted in Belfast. Only a small population of 15 or so, but they're there. If you want to go see them, head up to the Waterworks Park in the north of the city.

They're likely to have spread up from Dublin rather than having spread across the Irish Sea.

Brussels sprouts

It's not just the UK and Ireland that are host to the green invaders; there are something like 200 populations of parakeets in Europe. Brussels has them too, plus Alexandrine parakeets and monk parakeets.

(Frank Vassen from Brussels, Belgium, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons)

It is credible that parakeets could have spread from the UK across the Channel. You can clearly see France from Kent and birds regularly make the crossing. However, the timing and distribution don't work. What's much more likely is the accidental or deliberate release of pets.

It's not just the UK, Ireland, and Belgium that have parakeets, they've spread as far as Poland (see https://www.researchgate.net/publication/381577380_Parrots_in_the_wild_in_Polish_cities). The Polish article has the map above that reports on known parakeet populations in Europe. It's a little behind the times (the Irish parakeets aren't there), but it does give you a good sense of how far they've moved.

This is not good

Despite their cuteness, they're an invasive species and compete with native bird populations. Both the UK and Ireland are considering or have considered culls, but as of the time of writing, nothing has been decided.

The key Belfast question

Are you a Catholic parakeet or a Protestant parakeet?

Wednesday, April 16, 2025

Imagination in Action: The Good, the Bad, and the Ugly

What it Was

This was an all-day conference at MIT focused on AI—covering new innovations, business implications, and future directions. There were multiple stages with numerous talks, blending academia and industry. The event ran from 7 a.m. to around 7:30 p.m., and drew roughly 1,000 attendees.

The Good

The speakers were relevant and excellent. I heard firsthand how AI is being used in large insurance companies, automotive firms, and startups—all from people actively working in the field. Industry luminaries shared valuable insights; I particularly enjoyed Anshul Ramachandran from Windsurf, and of course, Stephen Wolfram is always engaging.

The academic speakers contributed thoughtful perspectives on the future of AI. This wasn’t an “academic” conference in the traditional sense—it was firmly grounded in real-world experience.

From what I gathered, some large businesses are well along the path of AI adoption, both internally and in customer-facing applications. Many have already gone through the growing pains and ironed out the kinks.

Both Harvard and MIT are producing graduates with strong AI skills who are ready to drive results. In other words, the local talent pool is robust. (Though I did hear a very entertaining story about a so-called “AI-native” developer and the rookie mistake they made…)

The networking was excellent. I met some wonderful people—some exploring AI applications, others contemplating new ventures, and many seasoned veterans. Everyone I spoke with was appropriately senior and had thoughtful, engaging perspectives.

The Bad

Not much to complain about, but one observation stood out. I was in a smaller session where a senior speaker had just finished. As the next speaker began, the previous one started a loud conversation with another senior attendee—right by the entrance, less than an arm’s length from the door. Even after being asked to be quieter, they continued. I found this disrespectful and discourteous, especially considering their seniority. Unfortunately, I witnessed similar behavior a few other times.

The Ugly

One thing really stuck with me. Several speakers were asked about AI’s impact on employment. The answers were nearly identical: “It will change employment, but overall demand will increase, so I’m not worried.” Urghhh...

Yes, historically, new technologies have increased employment rather than reduced it—but this glosses over the pain of transition. In every technological shift, people have been left behind, often facing serious economic consequences. I’ve seen it firsthand.

Here’s a thought experiment to make the point: imagine you’ve been a clerk in a rural Alabama town for twenty years. AI takes your job. What now? The new AI-driven jobs are likely in big cities you can’t move to, requiring skills you don’t have and can’t easily acquire. Local job options are limited and pay less. For you, AI is a major negative, and no amount of job creation elsewhere will make up for it. My point is: the real world is more than just developers. We need to acknowledge that people will experience real hardship in this transition.

The Bottom Line

This was a worthwhile use of my time. It gave me a clear sense of where early adopters are with AI in business, and also helped me realize I know more than I thought. Will I return next year? Probably.

Monday, April 14, 2025

Why a lot of confidence intervals are wrong

Lots of things are proportions

In statistics, a proportion is a number that can vary from 0 to 1. Proportions come up all the time in business and here are just a few examples.

  • Conversion rates on websites (fraction of visitors who buy something).
  • Opinion poll results (e.g. fraction of businesses who think the economy will improve in the next six months).
  • Market share.
If you can show something meaningful on a pie chart, it's probably a proportion.

(Amousey, CC0, via Wikimedia Commons)

Often, these proportions are quoted with a confidence interval or margin of error, so you hear statements like "42% said they would vote for X and 44% for Y. The survey had a 3% margin of error". In this blog post, I'm going to show you why the confidence interval, or margin of error, can be very wrong in some cases.

We're going to deal with estimates of the actual mean. In many cases, we don't actually know the true (population) mean, we're estimating based on a sample. The mean of our sample is our best guess at the population mean and the confidence interval gives us an indication of how confident we are in our estimate of the mean. But as we'll see, the usual calculation of confidence interval can go very wrong.

We're going to start with some text book math, then I'm going to show you when it goes badly astray, then we're going to deal with a more meaningful way forward.

Estimating the mean and the confidence interval

Estimating the population mean is very straightforward and very obvious. Let's take a simple example to help explain the math. Imagine a town with 38,000 residents and they'll vote on whether the town government should build a new fire station or not. We'll call the actual vote result (the proportion in favor of the fire station) the population mean. You want to forecast the result of the vote, so you run a survey; the proportion you get from the survey is the sample mean. Let's say you survey 500 people (the sample size) and 350 said yes (the number of successes). Assuming the survey is unbiased, our best estimate of the population mean is given by the sample mean:

\(\hat{p} = \dfrac{m}{n} = \dfrac{350}{500} = 0.7\)

But how certain are we of this number? If we had surveyed all 38,000 residents, we'd probably get a very, very accurate number, but the cost of the survey goes up with the number of respondents. On the other hand, if we asked 10 residents, our results aren't likely to be accurate. So how many people do we need to ask? Another way of saying this is, how certain are we that our sample mean is close to the population mean?

The textbook approach to answering this question is to use a confidence interval. To greatly simplify, the confidence interval is two numbers (an upper and lower number) between which we think there's a 95% probability the population mean lies. The probability doesn't have to be 95%, but that's the usual choice. The other usual choice is to express the confidence interval relative to the sample mean, so the lower bound is the sample mean minus a value, and the upper bound is the sample mean plus the same value. For our fire station example, we might say something like \(0.7 \pm 0.04\), which is a 4% margin of error.

Here's the formula:

\(\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)

You sometimes hear people call this the Wald interval, named after Abraham Wald. The symbol \(z_{\frac{\alpha}{2}}\) comes from the normal distribution, and for a 95% confidence interval, it's close to 1.96. This formula is an approximation. It's been used for decades because it's easy to use and cheap to calculate, which was important when computations were expensive.

Let's plug some numbers into the Wald formula. Going back to our fire station opinion poll, we can put the numbers in and get a 95% confidence interval. Here's how it works out:

\(0.7 \pm 1.96 \sqrt{\dfrac{0.7(1-0.7)}{500}} = 0.7 \pm 0.04\)

So we think our survey is pretty accurate: we're 95% sure the real mean is between 0.66 and 0.74. This is exactly the calculation people use for opinion polls; in our case, the margin of error is 4%.
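
If you want to check the arithmetic yourself, here's a minimal sketch in Python (the wald_interval helper is just for illustration; the 1.96 comes from the normal distribution's 97.5% quantile):

from scipy.stats import norm

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    p_hat = successes / n                     # the sample mean
    z = norm.ppf(1 - (1 - confidence) / 2)    # about 1.96 for a 95% interval
    margin = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin

# Fire station example: 350 "yes" answers from 500 people surveyed
lower, upper = wald_interval(350, 500)
print(f"{lower:.2f} to {upper:.2f}")          # roughly 0.66 to 0.74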

So far so good, but there are problems...

(The actual meaning of the confidence interval is more nuanced and more complicated. If we were to repeat the survey an infinite number of times and generate an infinite number of confidence intervals, then 95% of the confidence intervals would contain the population mean. This definition gets us into the deeper meaning of statistics and is harder to understand, so I've given the usual 'simpler' explanation above. Just be aware that this stuff gets complicated and language matters a lot.) 

It all goes wrong at the extremes - and the extremes happen a lot

What most of the textbooks don't tell you is that the formula for the confidence interval is an approximation and that it breaks down:

  • when \(\hat{p}\) is close to 0 or 1.
  • when n is small.

Unfortunately, in business, we often run into these cases. Let's take a look at a conversion rate example. Imagine we run a very short test and find that of 100 website visitors, only 2 converted. We can express our conversion rate as:

\(0.02 \pm 1.96 \sqrt{\dfrac{0.02(1-0.02)}{100}} = 0.02 \pm 0.027\)

Before we go on, stop and look at this result. Can you spot the problem?

The confidence interval goes from -0.007 to 0.047. In other words, we're saying there's some probability the conversion rate is negative. This is plainly absurd.

Let's take another example. Imagine we want to know the proportion of dog lovers in a town of cat lovers. We ask 25 people whether they love cats or dogs, and all 25 say cats. Here's our estimate of the proportion of cat lovers and dog lovers:

Dog lovers = \(0 \pm 1.96 \sqrt{\dfrac{0.0(1-0)}{25}} = 0 \pm 0\)

Cat lovers = \(1 \pm 1.96 \sqrt{\dfrac{1(1-1)}{25}} = 1 \pm 0\)

These results suggest we're 100% sure everyone is a cat lover and no one is a dog lover. Does this really seem sensible to you? Instead of cats and dogs, imagine it's politicians. Even in areas that vote heavily for one party, there are some supporters of other parties. Intuitively, our confidence interval shouldn't be zero.
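
You can see both failure modes by plugging these numbers into the same Wald calculation; here's a small self-contained sketch (again, the helper is just illustrative):

from scipy.stats import norm

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - confidence) / 2)
    margin = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin

print(wald_interval(2, 100))   # about (-0.007, 0.047): a negative lower bound
print(wald_interval(0, 25))    # (0.0, 0.0): zero-width interval for dog lovers
print(wald_interval(25, 25))   # (1.0, 1.0): zero-width interval for cat lovers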

The Wald interval breaks down because it's based on an approximation. When the approximation no longer holds, you get nonsense results. 

In the next section, I'll explain how you can do better.

(I've seen "analysts" with several years' experience argue that these type of results are perfectly fine. They didn't understand the math but they were willing to defend obviously wrong results because it came out of a formula they know. This is really bad for business, Amazon would never make these kinds of mistakes and neither should your business.)

A better alternative #1: Wilson score intervals

The Wilson score interval is based on a different set of approximations from the Wald interval, making it more accurate but more complicated to calculate. I'm going to avoid the theory for now and jump straight into the formula:

\(\dfrac{\hat{p} + \dfrac{z^2_{\frac{\alpha}{2}}}{2n} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2_{\frac{\alpha}{2}}}{4n^2}}}{1 + \dfrac{z^2_{\frac{\alpha}{2}}}{n}}\)

This is a scary looking formula and it's much harder to implement than the Wald interval, but the good news is that there are several implementations in Python. I'll show you two: the first uses statsmodels and the second uses scipy.

from statsmodels.stats.proportion import proportion_confint
from scipy import stats

# Sample data
n = 100  # number of observations
k = 2    # number of successes

# Calculate the Wilson score interval using statsmodels
wilson_ci = proportion_confint(k, n, alpha=0.05, method='wilson')
print("Wilson Score Interval (statsmodels):")
print(f"Lower bound: {wilson_ci[0]:.4f}")
print(f"Upper bound: {wilson_ci[1]:.4f}")

# Calculate the Wilson score interval using scipy's binomtest
wilson_ci_scipy = stats.binomtest(k, n).proportion_ci(method='wilson')
print("\nWilson Score Interval (scipy):")
print(f"Lower bound: {wilson_ci_scipy.low:.4f}")
print(f"Upper bound: {wilson_ci_scipy.high:.4f}")

As you might expect, the two methods give the same results. 

For the conversion rate example (100 visitors, 2 purchases), we get a lower bound of 0.0055 and an upper bound of 0.0700, which is an improvement because the lower bound is above zero. The Wilson score interval makes sense.

For the cats and dogs example, we get lower=0, upper=0.1332 for dogs and lower=0.8668, upper=1 for cats. This seems much better too: we've allowed for the town to have dog lovers in it, which chimes with our intuition.
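
If you want to reproduce these numbers yourself, the same statsmodels function used above gives them directly:

from statsmodels.stats.proportion import proportion_confint

# Conversion rate example: 2 purchases from 100 visitors
print(proportion_confint(2, 100, alpha=0.05, method='wilson'))   # about (0.0055, 0.0700)

# Cats and dogs example: 0 dog lovers and 25 cat lovers out of 25 people
print(proportion_confint(0, 25, alpha=0.05, method='wilson'))    # about (0.0, 0.1332)
print(proportion_confint(25, 25, alpha=0.05, method='wilson'))   # about (0.8668, 1.0)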

The Wilson score interval has several neat properties:

  • It will never go below 0
  • It will never go above 1
  • It gives accurate answers when n is small and when \(\hat{p}\) is close to 0 or 1.
  • The Wald interval will sometimes collapse to a single value; the Wilson score interval always gives you a genuine range (which is what you want).
  • For large n, and when \(\hat{p}\) isn't close to 0 or 1, the Wilson score interval is close to the Wald interval.

You can read more about the Wilson score interval in this excellent blog post: https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/ Take a look at the charts; they show you that the Wilson score interval gives much more accurate results for small n and when \(\hat{p}\) is close to 0 or 1.

This reference provides a fuller explanation of the theory: https://www.mwsug.org/proceedings/2008/pharma/MWSUG-2008-P08.pdf

A better alternative #2: Agresti-Coull

The Agresti-Coull interval is another alternative, like the Wilson score interval. Again, it's based on a different set of approximations and a very simple idea: take the data and add two success observations and two failure observations. Using the labels I gave you earlier (m is the number of successes and n the total number of measurements), the Agresti-Coull interval uses m + 2 and n + 4. Here's what it looks like in code:

# Calculate the Agresti-Coull interval using statsmodels
ag_ci = proportion_confint(k, n, alpha=0.05, method='agresti_coull')
print("Agresti-Coull Interval (statsmodels):")
print(f"Lower bound: {ag_ci[0]:.4f}")
print(f"Upper bound: {ag_ci[1]:.4f}")

The Agresti-Coull interval is an approximation to the Wilson score interval, so unless there's a computational reason to do something different, you should use the Wilson score interval.

Other alternatives

As well as Wilson and Agresti-Coull, there are a bunch of alternatives, including Clopper-Pearson, Jeffreys (Bayesian), and more. Most libraries offer a range of methods you can apply.
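
For example, statsmodels exposes several of these through the same method argument; a quick comparison might look like this (method names as I understand the statsmodels API, where 'beta' is Clopper-Pearson and 'normal' is the Wald interval):

from statsmodels.stats.proportion import proportion_confint

k, n = 2, 100   # the conversion rate example
for method in ['normal', 'agresti_coull', 'wilson', 'beta', 'jeffreys']:
    lower, upper = proportion_confint(k, n, alpha=0.05, method=method)
    print(f"{method:>15}: {lower:.4f} to {upper:.4f}")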

What to do

Generally speaking, be sure to know the limitations of all the statistical methods you use and select the right methods for your data. Don't assume that something is safe to use because "everyone" is using it. Occasionally, the methods you use will flag up junk results (e.g. implying a negative conversion rate). If this happens to you, it should be a sign that your algorithms have broken down and that it's time to go back to theory.

For proportions, if your proportion mean is "close" to 0.5 and your sample size is large (say, over 100), use the Wald interval. Otherwise, use the Wilson score interval. If you have to use one and only one method, use the Wilson score interval.
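
If you want to encode that rule of thumb in code, a small helper might look like the sketch below (the cut-offs for "close to 0.5" and "large" are illustrative, not hard rules):

from statsmodels.stats.proportion import proportion_confint

def proportion_ci(successes, n, alpha=0.05):
    """Rule of thumb: Wald is fine for large samples with p_hat near 0.5; otherwise use Wilson."""
    p_hat = successes / n
    if n > 100 and 0.2 < p_hat < 0.8:
        method = 'normal'   # statsmodels' name for the Wald interval
    else:
        method = 'wilson'
    return proportion_confint(successes, n, alpha=alpha, method=method)

print(proportion_ci(350, 500))   # large sample, p_hat = 0.7: uses the Wald interval
print(proportion_ci(2, 100))     # p_hat close to 0: falls back to the Wilson score interval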

 

Tuesday, April 8, 2025

Identifying people by how they type - a unique "fist"

The other day, I read a breathless article on how AI could identify a human by what they typed and how they typed it. The idea was, each person has a unique typing "fingerprint" or "fist", meaning a combination of their speed, the mistakes they make, their pauses etc. Obviously, systems have been around for some years now that distinguish between machine typing and human typing, but the new systems go further than that; they identify individuals.

(Salino01, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

The article suggested this was something new and unique, but I'm not sure it is. Read the following paragraph and guess when it was written:

"Radio Security would be monitoring the call, as they monitored every call from an agent. Those instruments which measured the minute peculiarities in an operator's 'fist' would at once detect it wasn't Strangways at the key. Mary Trueblood had been shown the forest of dials in the quiet room on the top floor at headquarters, had watched as the dancing hands registered the weight of each pulse, the speed of each cipher group, the stumble over a particular letter. The Controller had explained it all to her when she had joined the Caribbean station five years before--how a buzzer would sound and the contact be automatically broken if the wrong operator had come on the air."

The excerpt is from Ian Fleming's Dr. No and was written in 1957 (published in 1958). However, this idea goes back further in time. I've read articles about World War II radio communication where the women working in the receiving stations could identify who was sending morse code by their patterns of transmission (using the same methods Ian Fleming talked about). There's even mention of it on a Wikipedia page and there are several online articles about ham radio operators recognizing each other's "fists". 

What AI is doing here isn't new and unique. It's doing something that's been possible for a long time, but doing it more quickly and more cheaply. The latter part is the most important piece of the story: by reducing the cost, AI enables the technology to be widely used.

In the past, the press and other commentators have missed important societal changes brought on by rapid technology cost reductions. This happened because reporters focused on technical gee-whiz 'breakthrough' stories rather than cost reduction stories. The obvious example is containerization and the consequent huge reduction in shipping costs, which enabled global competition in manufactured goods and from there led to regional deindustrialization. Low shipping costs are one of the main reasons why we can't easily go back to the good old days of manufacturing in deindustrialized areas. But how often do you see shipping costs discussed in the press? Given the press missed the impact of containerization, what are they going to miss about the impact of AI?

Journalists have limited word counts for articles. The article I read about typing "fists" should have talked about the implications of the cost reduction instead of the technical 'breakthrough' aspect. Some journalists (and newspapers) just seem to miss the point.

Tuesday, April 1, 2025

Platypus are weird

Weird stuff

I was looking on the internet for something and stumbled on some weird facts about platypus that I didn't know. I thought it would be fun to blog about it.
(Charles J. Sharp, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

Obvious weirdness

There are a couple of facts most people know about platypus, so I'll only mention them in passing:

  • They are one of the few mammals to lay eggs.
  • They have a beak, or more formally, a bill.
  • When the first samples were brought to the UK, scientists thought they were fake.
Let's get on to the more interesting facts.

Venom

Only a handful of mammals are venomous, including the platypus. The male has a venom spur on its hind legs as you can see in the image below. 

(The original uploader was Elonnon at English Wikipedia., CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons)

Biologically, it's a modified sweat gland that produces venom.  It's thought the males use these spurs to fight other males for access to females. 

The venom is quite powerful and can affect humans quite strongly. Here's an alarming description from Wikipedia:

Although powerful enough to paralyze smaller animals, the venom is not lethal to humans. Still, it produces excruciating pain that may be intense enough to incapacitate a victim. Swelling rapidly develops around the entry wound and gradually spreads outward. Information obtained from case studies shows that the pain develops into a long-lasting hyperalgesia that can persist for months but usually lasts from a few days to a few weeks. A clinical report from 1992 showed that the severe pain was persistent and did not respond to morphine.


Electrosense

The platypus' bill is filled with sensory receptors that can detect incredibly small movements in the water like those made by the freshwater shrimp it feeds on. It also has a large number of electroreceptors that can sense biological electrical signals, for example, the muscle contractions of its prey.  It can combine these two signals as a location mechanism. (See "The platypus bill, push rods and electroreception.")

No functional stomach

A stomach is an organ that secretes digestive enzymes and acids to break down food. The platypus doesn't have one. Instead, its food goes directly to its intestines. It chews its food so thoroughly that there's not much need for digestive acids and enzymes, and it eats so frequently that there's not much need for storage. (See "Some platypus myths.")

What does platypus taste like?

Platypus are protected in Australia and you can't hunt and eat them. The Aboriginal people didn't eat them because of their smell. In the 1920s, some miners did eat one and reported the taste was “a somewhat oily dish, with a taste between those of red herring and wild duck”. There's surprisingly little else published on their taste. You can read more here.

What does dead platypus milk taste like?

Platypus females produce milk through their skin (they don't have nipples). This means of milk production is more susceptible to bugs, so it's probably no surprise that platypus milk contains antibiotics (see this reference).

But what does platypus milk taste like? More specifically, what does milk from a dead platypus taste like? It turns out we actually know: it doesn't taste or smell of anything.

Plurals

This article goes into the subject in some depth. To cut to the chase, platypi is definitely wrong, but both platypus and platypuses are correct.

Baby platypus

Baby platypus are mostly called puggles, although there's some pushback against that name.

Theme tune

Apparently, there was a Disney TV series called "Phineas and Ferb" that featured a platypus. Here's his theme song.

There aren't many other songs about platypus. The only other one I could find was "Platypus (I hate you)" by Green Day which doesn't seem to have a lot to do with Australian mammals.