Wednesday, April 23, 2025

The basics of regularization in machine learning

The problem

Machine learning models are trained on a set of sampled data (the training set). Data scientists use these trained models to make predictions from new data. For example, a recommender system might be trained on a data set of movies people have watched, then used to make recommendations on the movies people might like to watch. Key to the success of machine learning models is their accuracy; recommending the wrong movie, predicting the wrong sales volume, or misdiagnosing a medical image all have moral and financial consequences.

There are two causes of machine learning failure closely related to model training: underfitting and overfitting. 

Underfitting is where the model is too simple to correctly represent the data. The symptoms are a poor fit to the training data set. This chart shows the problem.


Years ago, I saw a very clear case of underfitting. The technical staff in a data center were trying to model network traffic coming in so they could forecast the computing power they needed. Clearly, the data wasn’t linear; it was a polynomial of at least order 2 plus a lot of noise. Unfortunately, they only knew how to do linear regression, so they tried to model the data using a series of linear regressions. Sadly, this meant their forecasts were next to useless. Frankly, their results would have been better if they’d extrapolated by hand using a pencil.

Overfitting is where the model is too complex, meaning it tries to fit noise instead of just the underlying trends. The symptoms are an excellent fit to the training data, but poor results when the model is exposed to real data or extrapolated. This chart shows the problem. The curve was overfit (the red dotted line), so when the curve is extrapolated, it produces nonsense.

In another company, I saw an analyst try to forecast sales data. He used a highly complex data set and a very, very, very complex model, and it fit the training data beautifully. Unfortunately, it gave clearly wrong sales predictions for the next year (e.g., negative sales). He tweaked the model and got some saner-looking predictions, but as it turned out, those predictions were way off too. He had overfit his data, so when the model was extrapolated to the next year, it gave nonsense; tweaking it made the results less obviously bad, but because it still overfit, its forecast was very wrong.

Like all disciplines, machine learning has a set of terminology aimed at keeping outsiders out. Underfitting is called bias and overfitting is called variance. These are not helpful terms in my view, but we’re stuck with them. I’m going to use the proper terminology (bias and variance) and the more straightforward terms (underfitting and overfitting) for clarity in this blog post.

Let’s look at how machine learning copes with this problem by using regularization.

Regularization

Let’s start with a simple machine linear learning model where we have a set of \(m\) features (\(X = {x_1, x_2, ...x_m}\)) and we’re trying to model a target variable \(y\) with \(n\) observations. \(\hat{y}\) is our estimate of \(y\) using the features \(X\), so we have:

\[\hat{y}^{(i)} = w \cdot x^{(i)} + b\]

where \(i\) varies from 1 to \(n\).

The cost function measures the difference between our model's predictions and the actual values.

\[J(w, b) = \frac{1}{2n}\sum_{i=1}^{n}( \hat{y}^{(i)} - y^{(i)} )^2\]

To find the model parameters \(w\) and \(b\), we minimize the cost function (typically using gradient descent, Adam, or something like that). Overfitting manifests itself when some of the \(w\) parameters become too big.

The idea behind regularization is that it introduces a penalty for adding more complexity to the model, which means keeping the \(w\) values as small as possible. With the right choices, we can make the model fit the 'baseline' without being too distracted by the noise.

As we'll see in a minute, there are several different types of regularization. For the simple machine learning model we're using here, we'll use the popular L2 form of regularization. 

Regularization means altering the cost function to penalize more complicated models. Specifically, it introduces an extra term to the cost function, called the regularization term.

\[J(w, b) = \frac{1}{2n}\sum_{i=1}^{n}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2n}\sum_{j=1}^{m} w_{j}^{2}\]

\(\lambda\) is the regularization parameter and we set \(\lambda > 0\). Because \(\lambda > 0\), we're penalizing the cost function for higher values of \(w\), so gradient descent will tend to avoid them when we're minimizing. The regularization term is a squared term; this modified cost function gives ridge regression, the L2 form of regularization.
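Here's a minimal NumPy sketch of the regularized cost function above, just to make the formula concrete (the variable names and toy data are mine):

import numpy as np

def ridge_cost(w, b, X, y, lam):
    """L2-regularized (ridge) cost for a linear model y_hat = X @ w + b.

    X is an (n, m) array of n observations and m features, w is an (m,)
    weight vector, and lam is the regularization parameter lambda.
    """
    n = len(y)
    y_hat = X @ w + b                                  # model predictions
    error_term = np.sum((y_hat - y) ** 2) / (2 * n)    # ordinary squared-error cost
    reg_term = lam * np.sum(w ** 2) / (2 * n)          # L2 penalty on the weights
    return error_term + reg_term

# Toy example: 200 observations, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
print(ridge_cost(np.ones(5), 0.0, X, y, lam=1.0))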

You might think that regularization would reduce some of the \(w\) parameters to zero, but in practice, that’s not what happens. It reduces their contribution substantially, but often not totally. You can still end up with a model that’s more computationally complex than it needs to be, but it won’t overfit.

You probably noticed the \(b\) values appeared in the model but not in the cost function or the regularized cost function. That's because in practice, the \(b\) value makes very little difference, but it does complicate the math, so I'm ignoring it here to make our lives easier.

Types of regularization

This is the ridge regression or L2 form of regularization (that we saw in the previous section):

\[J(w, b) = \frac{1}{2n}\sum_{i=1}^{n}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2n}\sum_{j=1}^{m} w_{j}^{2}\]

The L1 form is a bit simpler. It's sometimes known as the lasso, an acronym for Least Absolute Shrinkage and Selection Operator.

\[J(w, b) = \frac{1}{2n}\sum_{i=1}^{n}( \hat{y}^{(i)} - y^{(i)} )^2 + \frac{\lambda}{2n}\sum_{j=1}^{m} |w_{j}|\]

Of course, you can combine L1 and L2 regularization, which is called elastic net regularization. It can give better results than either L1 or L2 alone, but the computational cost is higher.

A more complex form of regularization is entropy regularization which is used a lot in reinforcement learning.

For most cases, the L2 form works just fine.

Regularization in more complex machine learning models - dropping out

Linear machine learning models are very simple, but what about logistic models or the more complex neural nets? As it turns out, regularization works for neural nets and other complex models too.

Overfitting in neural nets can occur due to "over-reliance" on a small number of nodes and their connections. To regularize the network, we randomly drop out nodes during the training process. This is called dropout regularization, and for once, we have a well-named piece of jargon. The net effect of dropout regularization is a "smoother" network that models the baseline and not the noise.
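To make this concrete, here's a minimal sketch of what dropout looks like in PyTorch (the layer sizes and the dropout probability are arbitrary choices for illustration, not recommendations):

from torch import nn

# A small fully connected network with dropout between the layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # randomly zeroes 30% of activations during training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 1),
)

model.train()   # training mode: dropout is active, random nodes are dropped each pass
model.eval()    # evaluation mode: dropout is disabled, the full network makes predictions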

Regularization in Python

The scikit-learn package has the functionality you need. In particular, check out Lasso, Ridge, ElasticNet, and GridSearchCV. Dropout regularization in neural networks is a bit more complicated, and in my view it needs a little more standardization across the libraries (which is a fancy way of saying you'll need to check the current state of the documentation).
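As a rough sketch of how this looks in practice (the toy data and alpha values are mine, and scikit-learn calls the regularization parameter alpha rather than \(\lambda\)):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data: 5 features, only 3 of which actually matter.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

models = [("Ridge (L2)", Ridge(alpha=1.0)),
          ("Lasso (L1)", Lasso(alpha=0.1)),
          ("ElasticNet (L1 + L2)", ElasticNet(alpha=0.1, l1_ratio=0.5))]

for name, model in models:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))   # note how the penalties shrink the weights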

Seeking \(\lambda\)

Given that \(\lambda\) is an important hyperparameter, how do we find it? The answer is cross-validation. We can either set up a search or step through various \(\lambda\) values to see which minimizes the cost function on held-out data. This probably doesn't seem very satisfactory to you, and frankly, it isn't. How to find \(\lambda\) cheaply is an active area of research, so maybe we'll have better answers in a few years' time.
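Here's roughly how that search might look with scikit-learn's GridSearchCV (the alpha grid and toy data are arbitrary; scikit-learn calls \(\lambda\) alpha):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy data (same shape as the earlier snippet).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Try a grid of candidate regularization strengths with 5-fold cross-validation.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated score:", search.best_score_)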

The bottom line

Underfitting (bias) and overfitting (variance) can kill machine learning models (and models in general). Regularization is a powerful method for preventing these problems. Despite the large equations, it's actually quite easy to implement. 

Monday, April 21, 2025

The parakeets of Belfast

It started in London

Over ten years ago, I was in suburban London and I got a shock; I thought I'd seen a parrot flying wild. I looked again, and this time, I saw two of them. They were about 30 cm (1 ft) long, bright green, with a rose-colored ring around their necks.


(Dr. Raju Kasambe, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons - this is what they look like in their native home)

I wasn't hallucinating, what I saw were wild parakeets that had established breeding colonies in London. Formally, they are rose-ringed parakeets or Psittacula krameri. A 1997 survey found there were about 3,500 breeding pairs, with a 2012 survey finding 32,000; these numbers were for London alone. There are likely far more now.

The birds seemed to have started off in south-west London before spreading to other parts of the city. Bear in mind, London has lots of quite large parks in urban and suburban areas that are only a short flight away from each other. Lots of people put out food for the birds, so there's plenty for them to eat.

(Parakeet in Garden, London N14 by Christine Matthews, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons)

Parakeets are natively found in a band from sub-Saharan Africa to India. Given that they're from a hot climate, the obvious question is, how do they survive the English winters? Part of the answer is the mild British weather; despite being quite far north, the UK climate is strongly affected by the Gulf Stream, which gives cooler summers and warmer winters. It rarely snows in the south of England and it rarely gets extremely cold, which means the birds can overwinter without dying off. The other part of the answer is the parakeets' range in their home environment; they're found as far as the foothills of the Himalayas, which are obviously pretty cool.

Jimi Hendrix or The African Queen or...?

The next most obvious question is, how did these parakeets get there? There are some great legends, so I'm going to tell them.

One story says it all goes back to the movie "The African Queen" which was partly filmed in Isleworth just outside London. The legend has it, the production company shipped in parakeets for filming and then let them loose at the end of the shoot. The birds moved to Twickenham (next door to Isleworth), which they found hospitable, and they spread from there.

If you don't like that legend, then maybe you'd like one of the others. Jimi Hendrix is said to have had parakeets when he lived in London in the 1960's. One day, he decided to set them free, and so the wild parakeet population got started.

(Warner/Reprise Records, uploaded by We hope at en.wikipedia, Public domain, via Wikimedia Commons)

There are other legends involving Henry VIII, the Great Storm of 1987, and more. You can read all about them online.

The reality is probably much more mundane. Parakeets were popular as pets. As people got bored of them, the easiest thing to do is just release them. With enough people releasing birds, you've got a viable breeding population.

Talking

Parakeets are famously noisy birds, so they just add to the din in an already noisy city. Notably, parakeets can mimic human speech very clearly and are among the best talking parrots. It's a bit odd to think there are thousands of wild birds in London capable of mimicking human speech; maybe they'll have cockney accents.

Glasgow

By 2019, the parakeets had made their way north to Glasgow and set up home in Victoria Park, and from there, they've been colonizing Scotland. The population in Glasgow had the distinction of being the most northerly parrot population anywhere in the world, but it now looks as if the birds have moved even further north.

Here's a map from the NBN Atlas (https://species.nbnatlas.org/species/NHMSYS0000530792) showing current confirmed and unconfirmed sightings of the parakeets in the UK.

Dublin

Parakeets were first spotted in Dublin around 2012. By 2020, they'd started spreading outside Dublin into the surrounding countryside and towns.

As one of the local commentators said, the fact the parakeets are bright green seems appropriate for Ireland.

How did the parakeets get to Dublin? Bear in mind, Jimi Hendrix didn't live in Dublin and "The African Queen" wasn't shot there. Of course, they could have flown there from London, but the Irish Sea is a rough sea and it's a long way to fly across open water. The most likely explanation is the most mundane: people releasing their pets when they got bored of them.

Belfast

Recently (2025), parakeets have been spotted in Belfast. Only a small population of 15 or so, but they're there. If you want to go see them, head up to the Waterworks Park in the north of the city.

They're likely to have spread up from Dublin rather than having crossed the Irish Sea.

Brussels sprouts

It's not just the UK and Ireland that are host to the green invaders; there are something like 200 populations of parakeets in Europe. Brussels has them too, plus Alexandrine parakeets and monk parakeets.

(Frank Vassen from Brussels, Belgium, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons)

It is credible that parakeets could have spread from the UK across the Channel. You can clearly see France from Kent, and birds regularly make the crossing. However, the timing and distribution don't work. What's much more likely is the accidental or deliberate release of pets.

It's not just the UK, Ireland, and Belgium that have parakeets, they've spread as far as Poland (see https://www.researchgate.net/publication/381577380_Parrots_in_the_wild_in_Polish_cities). The Polish article has the map above that reports on known parakeet populations in Europe. It's a little behind the times (the Irish parakeets aren't there), but it does give you a good sense of how far they've moved.

This is not good

Despite their cuteness, they're an invasive species and compete with native bird populations. Both the UK and Ireland are considering or have considered culls, but as of the time of writing, nothing has been decided.

The key Belfast question

Are you a Catholic parakeet or a Protestant parakeet?

Wednesday, April 16, 2025

Imagination in Action: The Good, the Bad, and the Ugly

What it Was

This was an all-day conference at MIT focused on AI—covering new innovations, business implications, and future directions. There were multiple stages with numerous talks, blending academia and industry. The event ran from 7 a.m. to around 7:30 p.m., and drew roughly 1,000 attendees.

The Good

The speakers were relevant and excellent. I heard firsthand how AI is being used in large insurance companies, automotive firms, and startups—all from people actively working in the field. Industry luminaries shared valuable insights; I particularly enjoyed Anshul Ramachandran from Windsurf, and of course, Stephen Wolfram is always engaging.

The academic speakers contributed thoughtful perspectives on the future of AI. This wasn’t an “academic” conference in the traditional sense—it was firmly grounded in real-world experience.

From what I gathered, some large businesses are well along the path of AI adoption, both internally and in customer-facing applications. Many have already gone through the growing pains and ironed out the kinks.

Both Harvard and MIT are producing graduates with strong AI skills who are ready to drive results. In other words, the local talent pool is robust. (Though I did hear a very entertaining story about a so-called “AI-native” developer and the rookie mistake they made…)

The networking was excellent. I met some wonderful people—some exploring AI applications, others contemplating new ventures, and many seasoned veterans. Everyone I spoke with was appropriately senior and had thoughtful, engaging perspectives.

The Bad

Not much to complain about, but one observation stood out. I was in a smaller session where a senior speaker had just finished. As the next speaker began, the previous one started a loud conversation with another senior attendee—right by the entrance, less than an arm’s length from the door. Even after being asked to be quieter, they continued. I found this disrespectful and discourteous, especially considering their seniority. Unfortunately, I witnessed similar behavior a few other times.

The Ugly

One thing really stuck with me. Several speakers were asked about AI’s impact on employment. The answers were nearly identical: “It will change employment, but overall demand will increase, so I’m not worried.” Urghhh...

Yes, historically, new technologies have increased employment rather than reduced it—but this glosses over the pain of transition. In every technological shift, people have been left behind, often facing serious economic consequences. I’ve seen it firsthand.

Here’s a thought experiment to make the point: imagine you’ve been a clerk in a rural Alabama town for twenty years. AI takes your job. What now? The new AI-driven jobs are likely in big cities you can’t move to, requiring skills you don’t have and can’t easily acquire. Local job options are limited and pay less. For you, AI is a major negative, and no amount of job creation elsewhere will make up for it. My point is: the real world is more than just developers. We need to acknowledge that people will experience real hardship in this transition.

The Bottom Line

This was a worthwhile use of my time. It gave me a clear sense of where early adopters are with AI in business, and also helped me realize I know more than I thought. Will I return next year? Probably.

Monday, April 14, 2025

Why a lot of confidence intervals are wrong

Lots of things are proportions

In statistics, a proportion is a number that can vary from 0 to 1. Proportions come up all the time in business and here are just a few examples.

  • Conversion rates on websites (fraction of visitors who buy something).
  • Opinion poll results (e.g. fraction of businesses who think the economy will improve in the next six months).
  • Market share.
If you can show something meaningful on a pie chart, it's probably a proportion.

(Amousey, CC0, via Wikimedia Commons)

Often, these proportions are quoted with a confidence interval or margin of error, so you hear statements like "42% said they would vote for X and 44% for Y. The survey had a 3% margin of error". In this blog post, I'm going to show you why the confidence interval, or margin of error, can be very wrong in some cases.

We're going to deal with estimates of the actual mean. In many cases, we don't actually know the true (population) mean, we're estimating based on a sample. The mean of our sample is our best guess at the population mean and the confidence interval gives us an indication of how confident we are in our estimate of the mean. But as we'll see, the usual calculation of confidence interval can go very wrong.

We're going to start with some text book math, then I'm going to show you when it goes badly astray, then we're going to deal with a more meaningful way forward.

Estimating the mean and the confidence interval

Estimating the population mean is straightforward. Let's take a simple example to help explain the math. Imagine a town with 38,000 residents who will vote on whether the town government should build a new fire station. We'll call the actual vote result (the proportion in favor of the fire station) the population mean. You want to forecast the result of the vote, so you run a survey; the proportion you get from the survey is a sample mean. Let's say you survey 500 people (the sample size) and 350 say yes (the number of successes). Assuming the survey is unbiased, our best estimate of the population mean is given by the sample mean:

\(\hat{p} = \dfrac{m}{n} = \dfrac{350}{500} = 0.7\)

But how certain are we of this number? If we had surveyed all 38,000 residents, we'd probably get a very, very accurate number, but the cost of the survey goes up with the number of respondents. On the other hand, if we asked 10 residents, our results aren't likely to be accurate. So how many people do we need to ask? Another way of saying this is, how certain are we that our sample mean is close to the population mean?

The textbook approach to answering this question is to use a confidence interval. To greatly simplify, the confidence interval is two numbers (an upper and lower number) between which we think there's a 95% probability the population mean lies. The probability doesn't have to be 95%, but that's the usual choice. The other usual choice is to express the confidence interval relative to the sample mean, so the lower bound is the sample mean minus a value, and the upper bound is the sample mean plus the same value. For our fire station example, we might say something like \(0.7 \pm 0.04\), which is a 4% margin of error.

Here's the formula:

\(\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)

You sometimes hear people call this the Wald interval, named after Abraham Wald. The symbol \(z_{\frac{\alpha}{2}}\) comes from the normal distribution, and for a 95% confidence interval, it's close to 1.96. This formula is an approximation. It's been used for decades because it's easy to use and cheap to calculate, which was important when computations were expensive.

Let's plug some numbers into the Wald formula as an example. Going back to our fire station opinion poll, we can put the numbers in and get a 95% confidence interval. Here's how it works out:

\(0.7 \pm 1.96 \sqrt{\dfrac{0.7(1-0.7)}{500}} = 0.7 \pm 0.04\)

So we think our survey is pretty accurate: we're 95% sure the real mean is between 0.66 and 0.74. This is exactly the calculation people use for opinion polls; in our case, the margin of error is 4%.
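If you want to check the arithmetic yourself, here's a small Python sketch of the Wald calculation; statsmodels' proportion_confint with method='normal' computes the same normal-approximation interval:

import math
from statsmodels.stats.proportion import proportion_confint

# Fire station survey: 350 "yes" answers from 500 respondents.
m, n = 350, 500
p_hat = m / n
z = 1.96                                        # z-value for a 95% confidence interval
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.2f} +/- {margin:.3f}")          # roughly 0.70 +/- 0.040

print(proportion_confint(m, n, alpha=0.05, method="normal"))   # same Wald interval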

So far so good, but there are problems...

(The actual meaning of the confidence interval is more nuanced and more complicated. If we were to repeat the survey an infinite number of times and generate an infinite number of confidence intervals, then 95% of the confidence intervals would contain the population mean. This definition gets us into the deeper meaning of statistics and is harder to understand, so I've given the usual 'simpler' explanation above. Just be aware that this stuff gets complicated and language matters a lot.) 

It all goes wrong at the extremes - and the extremes happen a lot

What most of the textbooks don't tell you is that the formula for the confidence interval is an approximation and that it breaks down:

  • when \(\hat{p}\) is close to 0 or 1.
  • when n is small.

Unfortunately, in business, we often run into these cases. Let's take a look at a conversion rate example. Imagine we run a very short test and find that of 100 website visitors, only 2 converted. We can express our conversion rate as:

\(0.02 \pm 1.96 \sqrt{\dfrac{0.02(1-0.02)}{100}} = 0.02 \pm 0.027\)

Before we go on, stop and look at this result. Can you spot the problem?

The confidence interval goes from -0.007 to  0.047. In other words, we're saying there's a probability the conversion rate can be negative. This is plainly absurd.

Let's take another example. Imagine we want to know the proportion of dog lovers in a town of cat lovers. We ask 25 people whether they love cats or dogs, and all 25 say cats. Here's our estimate of the proportion of cat lovers and dog lovers:

Dog lovers = \(0 \pm 1.96 \sqrt{\dfrac{0.0(1-0)}{25}} = 0 \pm 0\)

Cat lovers = \(1 \pm 1.96 \sqrt{\dfrac{1(1-1)}{25}} = 1 \pm 0\)

These results suggest we're 100% sure everyone is a cat lover and no one is a dog lover. Does that really seem sensible to you? Instead of cats and dogs, imagine it's politicians. Even in areas that vote heavily for one party, there are some supporters of other parties. Intuitively, our confidence interval shouldn't have zero width.

The Wald interval breaks down because it's based on an approximation. When the approximation no longer holds, you get nonsense results. 

In the next section, I'll explain how you can do better.

(I've seen "analysts" with several years' experience argue that these types of results are perfectly fine. They didn't understand the math, but they were willing to defend obviously wrong results because they came out of a formula they knew. This is really bad for business; Amazon would never make these kinds of mistakes and neither should your business.)

A better alternative #1: Wilson score intervals

The Wilson score interval makes a different set of approximations than the Wald interval, making it more accurate but more complicated to calculate. I'm going to skip the theory for now and jump straight to the formula:

\(\dfrac{\hat{p} + \dfrac{z^2_{\frac{\alpha}{2}}}{2n} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2_{\frac{\alpha}{2}}}{4n^2}}}{1 + \dfrac{z^2_{\frac{\alpha}{2}}}{n}}\)

This is a scary-looking formula and it's much harder to implement than the Wald interval, but the good news is there are several implementations in Python. I'll show you two: the first uses statsmodels and the second uses scipy.

from statsmodels.stats.proportion import proportion_confint
from scipy import stats

# Sample data
n = 100  # number of observations
k = 2    # number of successes

# Calculate the Wilson score interval using statsmodels
wilson_ci = proportion_confint(k, n, alpha=0.05, method='wilson')
print("Wilson Score Interval (statsmodels):")
print(f"Lower bound: {wilson_ci[0]:.4f}")
print(f"Upper bound: {wilson_ci[1]:.4f}")

# Calculate the Wilson score interval using scipy's binomial test result
wilson_ci_scipy = stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method='wilson')
print("\nWilson Score Interval (scipy):")
print(f"Lower bound: {wilson_ci_scipy.low:.4f}")
print(f"Upper bound: {wilson_ci_scipy.high:.4f}")

As you might expect, the two methods give the same results. 

For the conversion rate example (100 visitors, 2 purchases), we get a lower bound of 0.0055 and an upper bound of 0.0700, which is an improvement because the lower bound is above zero. The Wilson score interval makes sense.

For the cats and dogs example, we get lower=0 and upper=0.1332 for dogs, and lower=0.8668 and upper=1 for cats. This seems much better too; we've allowed for the town to have some dog lovers in it, which chimes with our intuition.

The Wilson score interval has several neat properties:

  • It will never go below 0.
  • It will never go above 1.
  • It gives accurate answers when \(n\) is small and when \(\hat{p}\) is close to 0 or 1.
  • The Wald interval can collapse to a single value (a zero-width interval); the Wilson score interval always gives you a genuine interval (which is what you want).
  • The Wilson score interval is close to the Wald interval for large \(n\) when \(\hat{p}\) is not too close to 0 or 1.

You can read more about the Wilson score interval in this excellent blog post: https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/ Take a look at the charts, they show you that the Wilson score interval gives much more accurate results for small n and when \(\hat{p}\) is close to zero or 1.

This reference provides a fuller explanation of the theory: https://www.mwsug.org/proceedings/2008/pharma/MWSUG-2008-P08.pdf

A better alternative #2: Agresti-Coull

The Agresti-Coull interval is another alternative, like the Wilson score interval. Again, it's based on a different set of approximations and a very simple idea: take the data and add two success observations and two failure observations. Using the labels I gave you earlier (m is the number of successes and n the total number of observations), the Agresti-Coull interval uses m + 2 and n + 4. Here's what it looks like in code:

# Calculate the Agresti-Coull interval using statsmodels
ag_ci = proportion_confint(k, n, alpha=0.05, method='agresti_coull')
print("Agresti-Coull Interval (statsmodels):")
print(f"Lower bound: {ag_ci[0]:.4f}")
print(f"Upper bound: {ag_ci[1]:.4f}")

The Agresti-Coull interval is an approximation to the Wilson score interval, so unless there's a computational reason to do something different, you should use the Wilson score interval.

Other alternatives

As well as Wilson and Agresti-Coull, there are a bunch of alternatives, including Clopper-Pearson, Jeffreys (Bayesian), and more. Most libraries offer a range of methods you can apply.
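If you want to see how much the choice matters, statsmodels lets you compare several of these methods side by side. Here's a quick sketch using the 2-conversions-from-100-visitors example (in statsmodels, 'beta' is the Clopper-Pearson method):

from statsmodels.stats.proportion import proportion_confint

k, n = 2, 100   # 2 conversions from 100 visitors
for method in ["normal", "wilson", "agresti_coull", "beta", "jeffreys"]:
    low, high = proportion_confint(k, n, alpha=0.05, method=method)
    print(f"{method:>14}: ({low:.4f}, {high:.4f})")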

What to do

Generally speaking, be sure to know the limitations of all the statistical methods you use and select the right methods for your data. Don't assume that something is safe to use because "everyone" is using it. Occasionally, the methods you use will flag up junk results (e.g. implying a negative conversion rate). If this happens to you, it should be a sign that your algorithms have broken down and that it's time to go back to theory.

For proportions, if your proportion mean is "close" to 0.5 and your sample size is large (say, over 100), use the Wald interval. Otherwise, use the Wilson score interval. If you have to use one and only one method, use the Wilson score interval.

 

Tuesday, April 8, 2025

Identifying people by how they type - a unique "fist"

The other day, I read a breathless article on how AI could identify a human by what they typed and how they typed it. The idea was, each person has a unique typing "fingerprint" or "fist", meaning a combination of their speed, the mistakes they make, their pauses etc. Obviously, systems have been around for some years now that distinguish between machine typing and human typing, but the new systems go further than that; they identify individuals.

(Salino01, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

The article suggested this was something new and unique, but I'm not sure it is. Read the following paragraph and guess when it was written:

"Radio Security would be monitoring the call, as they monitored every call from an agent. Those instruments which measured the minute peculiarities in an operator's 'fist' would at once detect it wasn't Strangways at the key. Mary Trueblood had been shown the forest of dials in the quiet room on the top floor at headquarters, had watched as the dancing hands registered the weight of each pulse, the speed of each cipher group, the stumble over a particular letter. The Controller had explained it all to her when she had joined the Caribbean station five years before--how a buzzer would sound and the contact be automatically broken if the wrong operator had come on the air."

The excerpt is from Ian Fleming's Dr. No and was written in 1957 (published in 1958). However, this idea goes back further in time. I've read articles about World War II radio communication where the women working in the receiving stations could identify who was sending morse code by their patterns of transmission (using the same methods Ian Fleming talked about). There's even mention of it on a Wikipedia page and there are several online articles about ham radio operators recognizing each other's "fists". 

What AI is doing here isn't new and unique. It's doing something that's been possible for a long time but doing it more quickly and more cheaply. The latter part is the most important piece of the story, by reducing the cost, AI enables the technology to be widely used. 

In the past, the press and other commentators have missed important societal changes brought on by rapid technology cost reductions. This happened because reporters focused on technical gee-whiz 'breakthrough' stories rather than cost reduction stories. The obvious example is containerization and the consequent huge reduction in shipping costs, which enabled global competition in manufactured goods and, from there, regional deindustrialization. Low shipping costs are one of the main reasons why we can't easily go back to the good old days of manufacturing in deindustrialized areas. But how often do you see shipping costs discussed in the press? Given the press missed the impact of containerization, what are they going to miss about the impact of AI?

Journalists have limited word counts for articles. The article I read about typing "fists" should have talked about the implications of the cost reduction instead of the technical 'breakthrough' aspect. Some journalists (and newspapers) just seem to miss the point.

Tuesday, April 1, 2025

Platypus are weird

Weird stuff

I was looking on the internet for something and stumbled on some weird facts about platypus that I didn't know. I thought it would be fun to blog about it.
(Charles J. Sharp, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

Obvious weirdness

There are a couple of facts most people know about platypus, so I'll only mention them in passing:

  • They are one of the few mammals to lay eggs.
  • They have a beak, or more formally, a bill.
  • When the first samples were brought to the UK, scientists thought they were fake.
Let's get on to the more interesting facts.

Venom

Only a handful of mammals are venomous, including the platypus. The male has a venom spur on its hind legs as you can see in the image below. 

(The original uploader was Elonnon at English Wikipedia., CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons)

Biologically, it's a modified sweat gland that produces venom.  It's thought the males use these spurs to fight other males for access to females. 

The venom is quite powerful and can affect humans quite strongly. Here's an alarming description from Wikipedia:

Although powerful enough to paralyze smaller animals, the venom is not lethal to humans. Still, it produces excruciating pain that may be intense enough to incapacitate a victim. Swelling rapidly develops around the entry wound and gradually spreads outward. Information obtained from case studies shows that the pain develops into a long-lasting hyperalgesia that can persist for months but usually lasts from a few days to a few weeks. A clinical report from 1992 showed that the severe pain was persistent and did not respond to morphine.


Electrosense

The platypus' bill is filled with sensory receptors that can detect incredibly small movements in the water like those made by the freshwater shrimp it feeds on. It also has a large number of electroreceptors that can sense biological electrical signals, for example, the muscle contractions of its prey.  It can combine these two signals as a location mechanism. (See "The platypus bill, push rods and electroreception.")

No functional stomach

A stomach is an organ that secretes digestive enzymes and acids to break down food. The platypus doesn't have one. Instead, its food goes directly to its intestines. It chews its food so thoroughly that there's not much need for digestive acids and enzymes, and it eats so frequently that there's not much need for storage. (See "Some platypus myths.")

What does platypus taste like?

Platypus are protected in Australia, so you can't hunt and eat them. The Aboriginal people didn't eat them because of their smell. In the 1920s, some miners did eat one and reported the taste was “a somewhat oily dish, with a taste between those of red herring and wild duck”. There's surprisingly little else published on their taste. You can read more here.

What does dead platypus milk taste like?

Platypus females produce milk through their skin (they don't have nipples). This means of milk production is more susceptible to bugs, so it's probably no surprise platypus milk contains antibiotics (see this reference.)

But what does platypus milk taste like? More specifically, what does milk from a dead platypus taste like? It turns out, we actually know: it doesn't taste or smell of anything.

Plurals

This article goes into the subject in some depth. To cut to the chase, platypi is definitely wrong, but either platypus or platypuses is correct.

Baby platypus

Baby platypus are mostly called puggles, although there's some push back to that name.

Theme tune

Apparently, there was a Disney TV series called "Phineas and Ferb" that featured a platypus. Here's his theme song.

There aren't many other songs about platypus. The only other one I could find was "Platypus (I hate you)" by Green Day which doesn't seem to have a lot to do with Australian mammals.

Tuesday, March 25, 2025

How to improve data quality

Why data is often bad and what you can do about it

In this blog post, I'm going to talk through what you can do to improve data quality. This covers three areas: automated issue detection, some techniques you can use to find errors, and most importantly, the people processes behind them.

(Bart Everson, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons)

Before we get going, it's important to note that bad data has a number of causes, some of which aren't under a company's control, e.g. ingesting 3rd party data. This means automated fixes aren't always possible. My belief is, even if you can't fix errors quickly, you need to know about them because your customers will.

Automate error detection

Automated error detection is the key to fixing data issues at a reasonable cost. The idea is that an automated system checks incoming data for a range of problems and flags errors or concerns. You can adapt these systems to give you error counts over time; the goal is to measure progress on reducing data issues, for example by producing a daily data quality score.

The obvious objection is: if you can spot the errors, you should fix them in your data ingestion process, so there's no need for an error detection system. Sadly, in the real world, things aren't so simple:

  • If your data ingestion team was doing this already, there would be no data issues. The fact that there are errors tells you that you need to do something new.
  • Ingestion systems focus on stability and handling known errors. Very rarely do they report on errors they can't fix. Frankly, for most dev teams, finding new data errors isn't a priority.
  • The lead time to add new data quality checks to ingestion systems can be weeks or months. I've seen people add new checks to standalone automated error checking systems in a day.

If possible, an error detection system should integrate with a company's error ticket system, for example, automatically creating and assigning Jira tickets. This has some consequences as we'll see. 

The people process 

We can introduce as much error detection automation as we wish, but ultimately it's down to people to fix data issues. The biggest problem that occurs in practice is the split of responsibilities. In a company, it's often true that one team creates an automated system to find errors ('spotter' team) while another team is responsible for fixing them ('fixer' team). This sets up the potential for conflict right from the start. To win, you have to manage the issues.

Typically, there are several areas of conflict:

  • The 'spotter' team is creating more work for the 'fixer' team. The more productive the 'spotter' team is, the harder the 'fixer' team has to work.
  • The 'spotter' team has to create meaningful error messages that the 'fixer' team has to be able to interpret. Frankly, technical people are often very bad at writing understandable error messages.
  • For reasons we'll go into later, sometimes automated systems produce a tsunami of error messages, flooding the 'fixer' team.
  • The 'fixer' team may have to work out of hours and resolve issues quickly, whereas the 'spotter' team works office hours and is under much less stress.
  • The 'fixer' team bears the consequences of any 'spotter' team failures.
  • Goals (meaning, OKRs etc.) aren't aligned. One team may have an incentive to reduce errors, while a team they are reliant on does not.

I could go on, but I think you get the point.

Here's how I've approached this problem.

  1. I've made sure I know the complete process for resolving data issues. This means knowing who is responsible for fixing errors, how they do it, and the level of effort. It's important to know any external constraints, for example, if data is externally sourced it may take some time to resolve issues.
  2. I make the 'spotter' team and the 'fixer' team sit down together to discuss the project and make sure that they understand each other's goals. To be clear, it's not enough to get the managers talking, the people doing the work have to talk. Managers sometimes have other goals that get in the way.
  3. The 'spotter' team must realize that the 'fixer' team is their customer. That means error messages must be in plain English (and must give easily understood steps for resolution) and the system mustn't flood the 'fixer' team with errors. More generally, the 'spotter' team must adapt their system to the needs of the 'fixer' team.
  4. Everyone must understand that there will be teething problems.
  5. Where possible, I've aligned incentives (e.g. objectives, OKRs) to make sure everyone is focused on the end goal. If you can't align incentives, this may well sink your project.

As I've hinted, the biggest impediment to success is company culture as represented by people issues. I've run into issues where managers (and teams) have been completely resistant to error detection (even when the company has known error issues) and/or resistant to some checks. I'm going to be frank, if you can't get the 'fixer' team to buy in, the project won't work.

Simplicity, plain English, error levels, and flooding

It's important to start automated error detection with easy errors, like absent or impossible data. For example, a conversion rate is a number between 0 and 1, so a conversion rate of 1.2 is an error. 
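Here's a minimal pandas sketch of this kind of check (the column names and data are made up for illustration):

import pandas as pd

# Hypothetical daily metrics feed.
df = pd.DataFrame({
    "date": ["2025-03-01", "2025-03-02", "2025-03-03"],
    "conversion_rate": [0.031, 1.2, None],
})

# Impossible values: a conversion rate must lie between 0 and 1.
bad_range = df[(df["conversion_rate"] < 0) | (df["conversion_rate"] > 1)]

# Absent values: the feed should never contain nulls.
missing = df[df["conversion_rate"].isna()]

print(f"ERROR: {len(bad_range)} row(s) with conversion_rate outside [0, 1]")
print(f"ERROR: {len(missing)} row(s) with a missing conversion_rate")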

For each check, make sure the text of the error message uses the simplest language you can. Ideally, bring in the 'fixer' team to work on the text. Error messages should clearly explain the problem, give enough information to locate it, and where possible, give next steps.

You should prioritize issues found by the error detection system using an easy-to-understand scheme, for example:

  • "FATAL" means an error will cause the system to fail in some way
  • "ERROR" means the results of the system will be affected negatively
  • "WARNING" means something isn't right and needs further investigation.
  • "INFO" means this is FYI and you don't need to take action.

In reality, INFO-type messages will be ignored, and you may receive complaints for generating them. However, they're often a good way to introduce new checks: a new error check might start off at the "INFO" level while you make sure the 'fixer' team knows how to handle it, then get promoted to "WARNING" to give the team time to adjust, finally becoming "ERROR" or "FATAL". The actual process you use is up to you, but you get the point.

Sometimes, a single problem can trigger multiple error tests. For example, imagine we're dealing with e-commerce data and we have a feed of sales and returns. The sales feed suffers a partial failure. Here's what might happen:

  • A sales volume error might be triggered.
  • There will be some returns unmatched with sales, so this may trigger another error.
  • The return rate figure might spike (because we're missing sales, not missing returns).
  • ...

So one failure might cause multiple error messages. This can flood the 'fixer' team with error messages without providing any helpful context. There are two things you have to do as a 'spotter' team:

  1. You can't flood 'fixer' teams with error messages. This is unhelpful and causes more confusion. They won't know where to start to fix the problem. You need to figure out how to meaningfully throttle messages.
  2. You have to provide higher level diagnosis and fixes. If multiple tests are triggered it may be because there's one cause. Where you can, consolidate messages and indicate next steps (e.g. "I'm seeing sales volume failures, unmatched return failures, and a return rate spike. This may be caused by missing sales data. The next step is to investigate if all sales data is present. Here are details of the failures I've found...")

These requirements have implications for how an automated error detection system is built.

I'm going to turn now to some math and some checks you should consider.

Floating checks

Let's say you're checking sales data from an automotive retailer. You know that sales go up and down over time and these fluctuations can be over the period of months or over the period of years. You want to detect sudden upward or downward spikes. 

The simplest way of detecting anomalies like this is to use maximum and minimum threshold checks. The problem is, with business changes over time, you can end up falsely triggering errors or missing failures. 

Let's imagine that sales are currently $1 million a day and you set an upper error detection threshold of $10 million and a lower threshold of $0.25 million. If you see sales numbers above $10 million or below $0.25 million, you flag an error. As the company grows, it may reach $10 million naturally, falsely triggering an alert. On the flip side, if sales are usually $10 million, a drop to $2 million should trigger an alert, but with a $0.25 million threshold, it won't. The solution is to use floating minimum and maximum values: for any given day, we look at the previous 10 days, work out a mean, set thresholds based on that mean (e.g. 2 × mean and 0.5 × mean), and then compare the day's sales to these floating thresholds. In reality, the process is more involved, but you get the point.

In practice, most error checks will use some form of floating check.
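Here's a minimal pandas sketch of a floating threshold check (the sales numbers and the 2x / 0.5x multipliers are made up; in a real system you'd tune them):

import pandas as pd

# Daily sales in $ millions (hypothetical data; the last day is a suspicious spike).
sales = pd.Series(
    [1.00, 1.05, 0.98, 1.10, 1.02, 1.07, 0.95, 1.12, 1.04, 1.08, 5.60],
    index=pd.date_range("2025-03-01", periods=11, freq="D"),
)

# Floating thresholds based on the mean of the previous 10 days (excluding today).
rolling_mean = sales.rolling(window=10).mean().shift(1)
upper = 2.0 * rolling_mean
lower = 0.5 * rolling_mean

alerts = sales[(sales > upper) | (sales < lower)]
print(alerts)   # flags the spike on the final day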

Deviations from expected distributions

This is more advanced topic, but you can sometimes use statistics to find when something is wrong. A couple of examples will help. 

Let's imagine you're a large online retailer. You have occasional problems with your system falsely duplicating order data, for example, a customer buys a pen but sometimes the data shows it as two pens. The problem with deduplicating this data is that some customers will really buy two pens. Given this, how might you detect the presence of duplicate data in your system?

The answer lies in analyzing the distribution of your data. 

Often, with this kind of data there's an expected distribution, let's say it's a Poisson distribution. In the absence of  duplication, your order size distribution might look like this.

With 100% order duplication, it looks like this. Although the distributions look the same, if you look more closely you'll see there are no odd number values and the maximum value is twice what it was for no duplication.

With 25% order duplication, it looks like this. Note the characteristic zig-zag pattern.

The nice thing is, you don't even need to know what the "real" distribution should look like. All you need to detect is the zig-zag pattern introduced by duplication, or even just the absence of odd values (see the sketch below). In fact, you can even attempt to quantify how much duplication is present.
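Here's a small simulation of the idea, assuming Poisson-ish order sizes (the numbers are invented; the point is just that duplication visibly depresses the share of odd order sizes):

import numpy as np

rng = np.random.default_rng(0)

# Simulated order sizes (Poisson, shifted so every order has at least one item).
orders = rng.poisson(lam=2.0, size=100_000) + 1

# Duplicate a random 25% of orders: an order of k pens shows up as 2k pens.
dup_mask = rng.random(orders.size) < 0.25
observed = np.where(dup_mask, 2 * orders, orders)

# Duplication pushes the share of odd order sizes down, because every
# duplicated order becomes an even number.
print("odd share, clean data     :", np.mean(orders % 2 == 1))
print("odd share, 25% duplicated :", np.mean(observed % 2 == 1))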

Sometimes, you can use expected distributions more directly. Let's say you're maintaining data on company size as measured by employee size. You have data on the number of companies with different numbers of employees. Theoretically, this should be a power law distribution. When you plot the distribution, you see something like this (comparing theoretical (line) and actual (dots)).

This plot tells you there are some potential anomalies at 10, 100, 1,000, etc., with a huge outlier at 100,000. It's often the case that data is based on estimates and that people use round numbers as estimates. The anomalies at 10, 100, and 1,000 might be perfectly OK (you don't need to alert on everything you find), but the count of companies with 100,000 employees seems way off. This kind of extreme discrepancy from an expected distribution may well be worth alerting on.

Anomaly detection

It will probably come as no surprise to you to hear that data scientists have applied machine learning to spot anomalies, creating a sub-discipline of "anomaly detection" techniques. The most commonly used method is something called an "isolation forest" which is available from the popular scikit-learn library.

I'm going to suggest some caution here. This approach may take some time to develop and deploy. The model has to be trained, and you have to be extremely careful about false positives. You also have to consider the action you want the 'fixer' team to take; it can't be "go look into this". For example, imagine a model that flags something as an anomaly. Without an explanation of why it's an anomaly, it's very difficult for the 'fixer' team to know what to do.

My suggestion is, detect obvious errors first, then develop a machine-learning based anomaly detector and see what it finds. It might be that you only run the anomaly detector on data that you think is clean.
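For what it's worth, here's roughly what an isolation forest looks like with scikit-learn (the data is synthetic and the contamination setting is a guess you'd have to tune):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Hypothetical daily metrics: [sales volume, return rate]. Most days are normal;
# the last three rows are injected anomalies.
normal_days = np.column_stack([rng.normal(100, 5, 500), rng.normal(0.05, 0.01, 500)])
odd_days = np.array([[100, 0.30], [20, 0.05], [250, 0.02]])
data = np.vstack([normal_days, odd_days])

clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(data)                  # -1 = anomaly, 1 = normal
print("rows flagged as anomalies:", np.where(labels == -1)[0])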

Putting it all together

You can use error detection systems to find data errors in your system. Because these systems aren't tied to your production system, you can move very fast and add new data checks very quickly. You can also use error detection systems to create data quality metrics.  

The main problem you'll face is people issues. These can be severe, so plan accordingly. Make sure goals are aligned and communication and trust are excellent.

Get started with an MVP using simple checks. Debug the people and technical process. Make sure people are resolving issues.

Add new checks as the system becomes more accepted. Make sure your system never produces a tsunami of tickets, and consolidate your findings where you can.

Statistical analysis can reveal errors that other forms of error check can't. Consider using these methods later on in the process.

Use advanced data science methods, like an isolation forest, sparingly and only when the rest of the system is up and running.

Thursday, March 20, 2025

Compliance!

Compliance

Compliance means a company, and its employees, are following the rules so the company doesn't get punished by regulators (e.g. fines), courts (e.g. adverse legal judgments), or the market (a stock price drop), or all three. Following the rules means obeying financial and privacy law, but also honoring contract terms. On the face of it, this all sounds like something only the finance and legal departments need to worry about, but increasingly data people (analysts, data scientists, data engineers) need to follow compliance rules too. In this blog post, I'll explain why compliance applies to you (data people) and what you can do about it.

(Get compliance wrong, and someone like this may be in your future. InfoGibraltar, CC BY 2.0, via Wikimedia Commons)

I'm not a lawyer, so don't take legal advice from me. What you should do is read this blog post, think about gaps in your compliance processes, and talk to your legal team.

Private data on people

By now, most data people understand that data that identifies individuals is covered by privacy laws and needs to be handled carefully. Data people also understand that there can be large fines for breaches or mishandling data. Unfortunately, this understanding often isn't enough and privacy laws are more complex and broader than many technical staff realize.

(Private Property sign by Oast House Archive, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons)

Several data privacy laws have an extraterritorial provision which means the law applies anywhere in the world (most notably, the GDPR). For example, a Mexican company processing data on French residents is covered by the GDPR even though the data processing takes place in Mexico. For a company operating internationally, this means obeying several sets of laws, which means in practice the strictest rules are used for everyone.

What is personally identifiable information (PII) sometimes isn't clear and can change suddenly. Most famously, the Court of Justice of the European Union (CJEU) ruled in the Breyer case that IP addresses can be PII under some circumstances. I'm not going to dive into the ruling here (you can look it up), but the court's logic is clear. What this ruling illustrates is that "common sense" views of what is and is not PII aren't good enough.  

The GDPR defines a subset of data on people as "special categories of personal data" which are subject to more stringent regulation (this guide has more details). This includes data on sexuality, religion, political activities etc. Once again, this seems obvious in theory, but in practice is much harder. For example, the name of someone's partner can reveal their sexuality and is therefore sensitive data.

There are two types of private data on people companies handle that are often overlooked. Employee data is clearly private, but is usually closely held for obvious reasons. Customer data in CRM systems is also private data on people but tends to be less protected. Most CRM systems have prospect and contact names, job titles, phone numbers etc. and I've even heard of systems that list customers' hobbies and interests. Data protection rules apply to these systems too.

I've only just scratched the surface of the rules surrounding processing data on people but hopefully I've made clear that things aren't as straightforward as they appear. A company can break the law and be fined if its staff (e.g. data analysts, data scientists, data engineers etc.) handle data in a way contrary to the law.

Trading based on confidential information

Many companies provide services to other companies, e.g. HR, payroll, internet, etc. This gives service providers' employees access to confidential information on their customers. If you're a service provider, should you let your employees make securities transactions based on confidential customer information?

(Harshitha BN, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

A hypothetical case can make the risks clearer. Let's imagine a payroll company provides services to other companies, including several large companies. A data analyst at the payroll company spots ahead of time that one of their customers is laying off a large number of its employees. The data analyst trades securities in that company based on this confidential information. Later on, the fact that the data analyst made those trades becomes publicly known.

There are several possible consequences here.

  • Depending on the jurisdiction, this may count as "insider trading" and be illegal. It could lead to arrests and consequential bad publicity and reputational damage.
  • This could be a breach of contract and could lead to the service provider losing a customer.
  • At the very least, there will be commercial repercussions because the service provider has violated customer trust.

Imagine you're a company providing services to other companies. Regardless of the law, do you think it's a good idea for your employees to be buying or selling securities based on their confidential customer knowledge?

Legal contracts

This is a trickier area and gets companies into trouble. It's easiest if I give you a hypothetical case and point out the problems.

(Staselnik, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A company, ServiceCo, sells services into the mining industry in different countries. As part of its services, it sells a "MiningNetwork" product that lists mining companies and the names of people in various jobs in them (e.g. safety officers, geologists and so on). It also produces regular reports on the mining industry that it makes available for free on its website as part of its marketing efforts, this is called the "Mining Today Report". 

For sales prospecting purposes, the sales team buys data from a global information vendor called GlobalData. The data ServiceCo buys lists all the mines owned by different companies (including joint ventures etc.) and has information on those mines (locations, what's being mined, workforce size etc.). It also lists key employees at each of those mines. This data is very expensive, in part because it costs GlobalData a great deal of money to collect. The ServiceCo sales team incorporates the GlobalData data into their CRM and successfully goes prospecting. Although the data is expensive, the sales team are extracting value from it and it's worth it to them.

Some time later, a ServiceCo data analyst finds this data in an internal database and they realize it could be useful elsewhere. In conjunction with product management, they execute a plan to use it:

  • They augment the "MiningNetwork" product with the GlobalData data. Some of this data ServiceCo already had, but the GlobalData data adds new mine sites and new people, and is a very significant addition. The added data comes directly from GlobalData without further processing.
  • They augment their free "Mining Today Report" with the GlobalData data. In this case, it's a very substantial upgrade, increasing the scope of the report by 50% or more. In some cases, the additions to the report are based on conclusions drawn from the GlobalData data, in other cases it's a direct lift (e.g. mine locations). 

Just prior to release, the analyst and the product manager report this work to the ServiceCo CTO and CEO in an internal pre-release demo call. The analyst is really happy to point out that this is a substantial new use for data that the company is paying a great deal of money for.

You are the CEO of ServiceCo. What do you do next and why?

Here's my answer. You ask the data analyst and the product manager if they've done a contract review with your legal team to check that this use of GlobalData's data is within the terms of the contract. You ask for the name of the lawyer they've worked with and you speak to the lawyer before the release goes out. If the answer isn't satisfactory, you stop the projects immediately regardless of any pre-announcements that have been made. 

Why?

These two projects could put the company in substantial legal jeopardy. When you buy data, it usually comes with an agreement specifying allowed uses. Anything else is forbidden. In this case, the data was bought for sales prospecting purposes from a large and experienced data supplier (GlobalData). It's very likely that usage of this data will be restricted to sales prospecting and for internal use only. Bear in mind, GlobalData may well be selling the same sort of data to mining companies and other companies selling to mining companies. So there are likely two problems here:

  1. The GlobalData data will be used for purposes beyond the original license agreement.
  2. The GlobalData data will be distributed to other companies free of charge (in the case of the "Mining Today Report"), or for a charge ("MiningNetwork"), with no royalty payments to GlobalData. In effect, ServiceCo will go from being a user of GlobalData's data to a distributor of GlobalData's data without paying them. ServiceCo will be doing this without an explicit agreement from GlobalData. This may well substantially damage GlobalData's business.

The second point is the most serious and could result in a lawsuit with substantial penalties.

The bottom line is simple. When you buy data, it comes with restrictions on how you use it. It's up to you to stick to the rules. If you don't, you may well get sued.

(I haven't mentioned "open source" data so far. Many freely available data sets have licensing provisions that forbid commercial use of the data. If that's the case, you can't use it for commercial purposes. Again, the onus is on you to check and comply.)

What can you do about it?

Fortunately, there are things you can do to manage the risk. Most of the actions revolve around having a repeatable process and/or controls. The nice thing about process and controls is, if something does go wrong, you can often reduce the impact, for example, if you breach the GDPR, you can show you treated it seriously and argue for a lesser fine. 

Let's look at some of the actions you should consider to manage data compliance risk.

Education

Everyone who handles data needs to go through training. This should include:

  • Privacy and PII training.
  • Trading on confidential information.
  • Rules around handling bought in data.
Initially, everyone needs to be trained, but that training needs to be refreshed every year or so. Of course, new employees must be trained too.

Restricted access/queries

Who has access to data needs to be regulated and controlled. For example, who needs to have access to CRM data? Plainly, the sales and marketing teams and the engineering people supporting the product, but who else? Who should not have access to the data? The first step here is to audit access, the second step is to control access, the third step is to set up a continuous monitoring process.

A piece that's often missed is controlling the nature of the queries run on the data. The GDPR limits querying of PII data to areas of legitimate business interest. An analyst may well run exploratory queries on their own initiative to see if the company could extract more value from the data, and that could be problematic. The solution here is education and supervision.

Encryption

There's an old cybersecurity mantra: "encrypt data at rest, encrypt data in transit". Your data needs to be protected by an appropriately secure algorithm and not one susceptible to rainbow table or other attacks.

Related to encryption is the idea of pseudonymization. To put it simply, this replaces key PII with a string, e.g. "John Smith" might be replaced with "Qe234-6jDfG-j56da-9M02sd", similarly, we might replace his passport number with a string, his credit card number with a string, his IP address with a string, his account number, and so on. The mapping of this PII data to strings is via a database table with very, very restricted access.

As it turns out, almost all analysis you might want to do on PII data works equally well with pseudonymization. For example, let's say you're a consumer company and you want to know how many customers you have in a city. You don't actually need to know who they are, you just need counts. You can count unique strings just the same as you can count unique names. 

There's a lot more to say about this technique, but all I'm going to say now is that you should be using it.
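To make the idea concrete, here's a toy sketch of pseudonymization (in a real system the lookup table would live in a tightly access-controlled database, not a Python dictionary, and the names here are invented):

import uuid
from collections import Counter

pseudonym_table = {}   # PII value -> token; access to this table must be tightly restricted

def pseudonymize(value):
    """Replace a piece of PII with a stable random token."""
    if value not in pseudonym_table:
        pseudonym_table[value] = str(uuid.uuid4())
    return pseudonym_table[value]

# Toy customer records: (name, city).
orders = [("John Smith", "Boston"), ("Jane Doe", "Boston"), ("John Smith", "Chicago")]
pseudonymized = [(pseudonymize(name), city) for name, city in orders]

# The analysis still works on tokens: customers per city and distinct customer counts.
print(Counter(city for _, city in pseudonymized))
print("distinct customers:", len({token for token, _ in pseudonymized}))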

Audit

This is the same as any audit, you go through the organization with a set of questions and checks. An audit is a good idea as an initial activity, but tends to be disruptive. After the initial audit, I favor annual spot checks. 

Standards compliance

There are a ton of standards out there covering data compliance: SOC 2, NIST, ISO 27000, FedRAMP, etc. It's highly likely that an organization will have to comply with one or more of them. You could try to deal with many or most compliance issues by conforming to a standard, but be aware that will still leave gaps. The problem with complying with a standard is that the certification becomes the goal rather than reducing risk. Standards are not enough.

Help line

A lot of these issues are hard for technical people to understand. They need ongoing support and guidance. A good idea is to make sure they know who to turn to for help. This process needs to be quick and easy.

(Something to watch out for is management retaliation. Let's say a senior analyst thinks a use of data breaches legal terms but their manager tells them to do nothing. The analyst reaches out to the legal team who confirms that the intended use is a breach. The manager cannot be allowed to retaliate against the analyst.)

The bottom line

As a technical person, you need to treat this stuff seriously. Assuming "common sense" can get you into a lot of trouble. Make friends with your legal team, they're there to help you.