Tuesday, May 27, 2025

What is Model Context Protocol?

Bottom line: MCP is an important technology, but as of May 2025, it's not ready for production deployment. It's immature, the documentation is poor, and it doesn't have the security features it needs. Unless your business has a compelling and immediate need for it, wait a while before starting experimentation.

I've been hearing a lot about MCP and how much of a game-changer it is, but there are three problems with most of the articles I've read:

  • They don't explain the what and the how very well.
  • They're either too technical or too high-level.
  • They smell too strongly of hype.

In this blog post, I'm going to dive into the why at a business level and into some of the how at a more technical level. This is going to be a hype-free zone.

(Chat GPT generated)

What problem are we trying to solve?

AI systems need to access data, but data is accessed in a huge number of different ways, which makes it hard for an AI to connect to and use that data. MCP is a way of presenting the 'same' interface for all data sources.

There are many different data sources, for example: JSON files, CSV files, XML files,  text files, different APIs, different database types, and so on. In any computer language, there are different ways of connecting to these data sources. Here are two Python code snippets that illustrate what I mean:

import requests

res = requests.get(
    url="https://www.gutenberg.org/files/132/132-h/132-h.htm",
    timeout=(10, 5)
)

and:

import lxml.etree
...
# Open the XML file and parse it
tree = lxml.etree.parse(zip_file_names[0])
...
# Get the children of the first element under the root
root = tree.getroot()
children = list(root[0])

There are a couple of important points here:
  • You use different Python libraries to access different data sources.
  • The API is different.
  • In some cases, the way you use the API is different (e.g. some sources use paging, others don't).
In other words, it can be time consuming and tricky to read in data from different sources.

This is bad enough if you're a programmer writing code to combine data from different sources, but it's even worse if you're an AI. An AI has to figure out what libraries to use, what data's available, whether or not to use paging, etc. In other words, different data source interfaces make life hard for people and for AIs.

There's a related problem, often called the NxM problem. Let's imagine there are M data sources and N LLMs. Each LLM has to create an interface to each data source, so we get a situation that looks like this:

(Claude generated)

This is a huge amount of duplication (NxM integrations). What's worse, if a data source changes its API (e.g., an AWS API update), we have to update N LLM integrations. If we could find some way of standardizing the interface to the data sources, we would have one set of code for each LLM (N of them) and one set of code for each data source (M of them), transforming this into an N+M problem. In this new world, if a data source API is updated, we just update one wrapper. Can we find some way of standardizing the interfaces?

(In the old days, this was a problem for hardware too. Desktop PCs would have a display port, an ethernet port, a printer port, and so on. These have pretty much all been replaced with USB-C ports. Can we do something similar in software?)

Some background

There have been moves to consolidate the interfaces to different sources, but they've been very limited. In the Python world, the DB-API standard (and libraries built on it, like SQLAlchemy) lets you connect to most databases using much the same interface, but that's about it. Until now, there just hasn't been a strong enough motivation for the community to work out how to provide consistent data access.

I want to go on two slight tangents to explain ideas that are important to MCP. Without these tangents, the choice of name is hard to understand, as are the core ideas.

At the end of the 1970s, Trygve Reenskaug was working at Xerox PARC on UI problems and came up with the Model-View-Controller abstraction. The idea is that a system can be divided into conceptual parts. The Model part represents the business data and the business logic. There's a code interface (API) to the Model that the View and Controller use to access data and get things done.

The Model part of this abstraction corresponds to the data sources we've been talking about, but it generalizes them to include business logic (meaning, doing something like querying a database). This same abstraction is a feature of MCP too. Sadly, there's a naming conflict we have to discuss. Model means data in Model-View-Controller, but it's also part of the name "large language model" (LLM). In MCP, the M is Model, but it means LLM; the data and business logic is called Context. I'm going to use the word Context from now on to avoid confusion.

Let's introduce another key idea to understand MCP, that of the 'translation' or 'interface' layer. This is a well-known concept in software engineering and comes up a number of times. The best known example is the operating system (OS). An OS provides a standardized way of accessing the same functionality on different hardware. The diagram below shows a simple example. Different manufacturers make different disk drives, each with a slightly different way of controlling the drives. The operating system has a translation layer that offers the same set of commands to the user, regardless of who made the disk drive.

(Chat GPT generated)

Languages like Python rely on these translation layers to work on different hardware.

Let's summarize the three key ideas before we get to MCP:

  • There's been very little progress to standardize data access functionality.
  • The term Context refers to the underlying data and functionality related to that data.
  • Translation layer software allows the same operations to work on different machines.

What MCP is

MCP stands for Model Context Protocol. It's a translation layer that sits on top of data sources and provides a consistent way of accessing them and their associated tools. For example, you can access database data and text file data using the same interface.

  • The Model part of the acronym refers to the LLM. This could be Claude, Gemini, GPT, DeepSeek or one of the many other Large Language Models out there.
  • Context refers to the data and the tools to access it.
  • Protocol refers to the communication between the LLM and the data (Context).

Here's a diagram showing the idea.

What's interesting about this architecture is that the MCP translation layer is a server. More on this later.

In MCP terminology, users of the MCP are called Hosts (mostly LLMs and IDEs like Cursor or Windsurf, but it could be something else). Hosts have Clients that are connectors to Servers. A Host can have a number of Clients; it'll have one for each data source (Server) it connects to. A Server connects to a data source and uses the data source's API to collect data and perform tasks. A Server has functions the Client uses to identify the tasks the Server can perform. A Client communicates with the Server using a defined Protocol.

Here's an expanded diagram providing a bit more detail.

I've talked about data sources like XML files, but it's important to point out that a data source could be GitHub, Slack, Google Sheets, or indeed any service. Each of these data sources has its own API, and the MCP Server provides a standardized way of using it. Note that the MCP Server could do some compute-intensive tasks too, for example running a time-consuming SQL query on a database.

I'll give you an expanded example of how this all works. Let's say a user asks the LLM (either standalone or in a tool like Cursor) to create a GitHub repo:

  • The Model, via its MCP Client, will ask the MCP Server for a list of capabilities for the GitHub service.
  • The MCP Server knows what it can do, so it will return a list of available actions, including the ability to create a repo.
  • The MCP Client will pass this data to the LLM.
  • Now the Model knows what GitHub actions it can perform and it can check that it can do what the user asked (create a repo).
  • The LLM instructs its MCP Client to create the repo, which in turn passes the request to the MCP Server, which in turn formats the request using the GitHub API. GitHub creates the repo and returns a status code to the MCP Server, which in turn informs the Client, which in turn informs the Host. (I'll sketch what one of these request messages might look like just after this list.)
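
Here's roughly what the "create the repo" request might look like on the wire between the Client and the Server. MCP messages use JSON-RPC 2.0; the tool name and arguments below are my guesses at what the GitHub Server exposes, so treat this as an illustrative sketch rather than the real schema:

{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "create_repository",
    "arguments": { "name": "my-new-repo", "private": true }
  }
}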

This is a lot of indirection, but it's needed for the whole stack to work.

This page: https://modelcontextprotocol.io/docs/concepts/architecture explains how the stack works in more detail.

How it works

How to set up the Host and Client

To understand the Host and Client setup, you need to understand that MCP is a communications standard (the Protocol part of the name). This means we only have to tell the Client a small amount of information about the Server, most importantly its location. Once it knows where the Server is, it can talk to it.

In Cursor (a Host), there's an MCP setting where we can tell Cursor about the MCP Servers we want to connect to. Here's the JSON to connect to the Github MCP Server:

{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e",
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "mcp/github"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<YOUR_TOKEN>"
      }
    }
  }
}

In this example, "mcp/github" is the Docker image containing the GitHub MCP Server; the command and args tell Cursor how to launch it locally and connect to it.

Setup is similar for LLM desktop apps; for example, Claude Desktop uses a claude_desktop_config.json file with the same mcpServers structure.

I'm not going to explain the above code in detail (you should look here for details of how the Client works). You should note a couple of things:

  • It's short and terse.
  • It has some security built in (the Personal Access Token).

How to set up the MCP Server

MCP Servers have several core concepts:
  • Resources. They expose data to your Host (e.g. the LLM) and are intended for light-weight and quick queries that don't have side effects, e.g. a simple data retrieval.
  • Tools. They let the Host tell the Server to take an action. They can be computationally expensive and can have side effects.
  • Prompts. These are templates that standardize common interactions.
  • Roots and Sampling. These are more advanced and I'm not going to discuss them here.

These are implemented in code using Python function decorators, a more advanced feature of Python.

Regardless of whether it's Prompts, Tools, or Resources, the Client has to discover them, meaning, it has to know what functionality is available. This is done using discovery functions called list_resources, list_prompts, and of course list_tools. So the Client calls the discovery functions to find out what's available and then calls the appropriate functions when it needs to do something. 

Resources

Here are two examples of resource functions. The first lets the Client find out what resources are available, which in this case is a single resource, the application log. The second is how the Client accesses the application log contents.

@app.list_resources()
async def list_resources() -> list[types.Resource]:
    return [
        types.Resource(
            uri="file:///logs/app.log",
            name="Application Logs",
            mimeType="text/plain"
        )
    ]

@app.read_resource()
async def read_resource(uri: AnyUrl) -> str:
    if str(uri) == "file:///logs/app.log":
        log_contents = await read_log_file()
        return log_contents

    raise ValueError("Resource not found")

Note the use of async and the decorator.  The async allows us to write efficient code for tasks that may take some time to complete.
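
The read_resource function above calls a read_log_file() helper that isn't shown. Here's a minimal sketch of what it might look like (the helper name and path come from the snippet above; the stdlib-only implementation is my assumption):

import asyncio

async def read_log_file() -> str:
    """Read the application log without blocking the event loop."""
    def _read() -> str:
        # Plain, blocking file I/O...
        with open("/logs/app.log", "r", encoding="utf-8") as f:
            return f.read()

    # ...run on a worker thread so the Server can keep handling other requests
    return await asyncio.to_thread(_read)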

Tools

Here are two examples of tool functions. As you might expect by now, the first function lets the Client discover which tools it can call.

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="calculate_sum",
            description="Add two numbers together",
            inputSchema={
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"}
                },
                "required": ["a", "b"]
            }
        )
    ]

The second is a function the Client can call once it has discovered it. (Note that this snippet uses the SDK's higher-level FastMCP style, @mcp.tool(), while the earlier snippets use the lower-level @app decorators; both achieve the same thing.)

@mcp.tool()
async def fetch_weather(city: str) -> str:
    """Fetch current weather for a city"""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.text

Here, the code is calling out to an external API to retrieve the weather for a city. Because the external API might take some time, the code uses await and async. This is a tool rather than a resource because it may take some time to complete.
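
Pulling these pieces together, here's what a minimal but complete Server might look like using the Python SDK's higher-level FastMCP helper (a sketch: the server name and the single tool are my own, and details may differ between SDK versions):

from mcp.server.fastmcp import FastMCP

# Create a named MCP Server
mcp = FastMCP("demo-server")

@mcp.tool()
def calculate_sum(a: float, b: float) -> float:
    """Add two numbers together"""
    return a + b

if __name__ == "__main__":
    # Serves requests over stdio by default, which is how Hosts like
    # Cursor or Claude Desktop typically launch and talk to local Servers
    mcp.run()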

Prompts

This is a longer code snippet to give you the idea. The list_prompts function is key: this is how the Client finds out the available prompts.

PROMPTS = {
    "git-commit": types.Prompt(
        name="git-commit",
        description="Generate a Git commit message",
        arguments=[
            types.PromptArgument(
                name="changes",
                description="Git diff or description of changes",
                required=True
            )
        ],
    ),
    "explain-code": types.Prompt(
        name="explain-code",
        description="Explain how code works",
        arguments=[
            types.PromptArgument(
                name="code",
                description="Code to explain",
                required=True
            ),
            types.PromptArgument(
                name="language",
                description="Programming language",
                required=False
            )
        ],
    )
}
...
@app.list_prompts()
async def list_prompts() -> list[types.Prompt]:
    return list(PROMPTS.values())
...

@app.get_prompt()
async def get_prompt(
    name: str, arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
    if name not in PROMPTS:
        raise ValueError(f"Prompt not found: {name}")

    if name == "git-commit":
        changes = arguments.get("changes") if arguments else ""
        return types.GetPromptResult(
            messages=[
                types.PromptMessage(
                    role="user",
                    content=types.TextContent(
                        type="text",
                        text=f"Generate a concise but descriptive commit message "
                             f"for these changes:\n\n{changes}"
                    )
                )
            ]
        )

You can read more about how prompts work in the documentation: https://modelcontextprotocol.io/docs/concepts/prompts#python

Messages everywhere

The whole chain of indirection relies on JSON-RPC message passing between code running in different processes. This can be difficult to debug. You can read more about MCP's message passing here: https://modelcontextprotocol.io/docs/concepts/transports
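
To give a flavor of what's on the wire, here's roughly what a Client's "what tools do you have?" request looks like (the method name comes from the MCP specification; the id is arbitrary):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/list",
  "params": {}
}

The Server replies with a JSON-RPC response carrying the same id and, in this case, the list of tools and their input schemas. Every hop in the chain is a pair of messages like this, so debugging usually means logging and inspecting this traffic.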

Documents, tutorials, and YouTube

At the time of writing (May 2025), the documentation for MCP is very sparse and lacks a lot of detail. There are a few tutorials people have written, but they're quite basic and again lack detail. What this means is, you're likely to run into issues that may take time to resolve.

There are videos on YouTube, but most of them have little technical content and seem to be hyping the technology rather than offering a thoughtful critique or a guide to implementation. Frankly, don't bother with them.

Skills needed

This is something I've hinted at in this blog post, but I'm going to say it explicitly. The skill level needed to implement a non-trivial MCP is high. Here's why:

  • The default setup process involves using uv rather than the usual pip.
  • The MCP API makes extensive use of function decorators, an advanced Python feature.
  • The Tools API uses async and await, again more advanced features.
  • Debugging can be hard because MCP relies on message passing.

The engineer needs to know about function decorators, asynchronous Python, and message passing between processes.

Where did MCP come from?

MCP was released by Anthropic in November 2024. After a "slowish" start, it's been widely adopted and has now become the dominant standard. Anthropic have open-sourced the entire protocol and placed it on GitHub. Frankly, I don't see anything usurping it in the short term.

Security and cost

This is a major concern. Let's go back to this diagram:

There could be three separate companies involved in this process:

  • The company that wants to use the LLM and MCP, we'll call this the User company.
  • The company that hosts the LLM, we'll call this the LLM company.
  • The company that hosts the data source, we'll call this the Data company.

The User company starts a job that uses an LLM at the LLM company. The job uses computationally expensive (and financially costly) resources located at the Data company. Let's say something goes wrong, or the LLM misunderstands something. The LLM could make multiple expensive calls to the data source through the MCP Server, racking up large bills. Are there ways to stop this? Yes, but it takes some effort.

The other concern is a hacked remote LLM. Remember, the LLM has the keys to the kingdom for your system, so hackers really could go to town, perhaps making rogue calls to burn up expensive computing resources or even writing malicious data.

There are a number of other concerns that you can read more about here: https://www.pillar.security/blog/the-security-risks-of-model-context-protocol-mcp and here: https://community.cisco.com/t5/security-blogs/ai-model-context-protocol-mcp-and-security/ba-p/5274394

The bottom line is, if you're running something unattended, you need to put guard rails around it.

Complexity - everything is a server?

As I've stated, this is a very complex beast under the hood. The LLM runs in its own process, the MCP Server runs in its own process, and maybe the underlying data sources do too (e.g. a web-based resource or a database). If any of these processes fails, the whole system fails, and the developers have to work out which of these servers failed first. Inter-process communication is harder to reason about than simple procedure calls, which means debugging is too.

All of the examples I've seen on the web have been relatively simple. I'm left wondering how complex it would be to develop a robust system with full debugging for something like a large-scale database. I'm not sure I want to be first to find out.

How can I get started?

I couldn't find tutorials or articles that are good enough for me to recommend. That of itself is telling.

Where we stand today

MCP was released in November 2024 and it's still an immature standard. 

  • Security in particular is not where it needs to be; you need to put guard rails up. 
  • Documentation is also sorely lacking and there are very few good tutorials out there. 
  • Debugging can be very hard; the message-passing infrastructure is more difficult to work with than a simple call stack.

Sadly, the hype machine has really got going, and you would think that MCP is ready for prime time and immediate deployment - it's not. This is definitely an over-hyped technology for where we are now.

Should you experiment with MCP? Only if you have a specific reason to, and then only with supervision and risk management. If you have the right use case, this is a very compelling technology with a lot of promise for the future.

Monday, May 19, 2025

What is a random variable?

Just because we can't predict something exactly doesn't mean we can't say anything about it at all

There are all kinds of problems where we can't say exactly what the value of something is, but we can still say useful things about it. Here are some examples.

  • The number of goals scored in a football or hockey match.  We might not be able to predict the number of goals scored in a particular match, but we can say something:
    • We know that the number of goals must be an integer greater than or equal to 0.
    • We know that the number of goals is likely to be low and that high scores are unlikely; seeing two goals is far more likely than seeing 100 goals.
  • The number of people buying tickets at a movie theater. We know this will depend on the time of year, the day of the week, the weather, the movies playing, and so on, but even allowing for these factors, there's randomness. People might go on dates (or cancel them) or decide on a whim to see a movie. In this case, we know the minimum number of tickets is zero, the maximum is the number of seats, and that only an integer number of tickets can be sold.
  • The speed of a car on the freeway. Plainly, this is affected by a number of factors, but there's also randomness at play. We know the speed will be a real number greater than zero. We know that in the absence of traffic, it's more likely the car will be traveling at the speed limit than say 20mph.
  • The score you get by rolling a dice.
(Image: Dietmar Rabich / Wikimedia Commons / "Würfel, gemischt -- 2021 -- 5577" / CC BY-SA 4.0, https://commons.wikimedia.org/wiki/File:W%C3%BCrfel,_gemischt_--_2021_--_5577.jpg / https://creativecommons.org/licenses/by-sa/4.0/)

In all these cases, we're trying to measure something, but randomness is at play, which means we can't predict an exact result. We can, however, still make probabilistic predictions, and we can do math with those predictions, which means we can use them to build computer models of how a system might behave.

The variables we're trying to measure are called random variables and I'm going to describe what they are in this blog post. I'm going to start by providing some background ideas we'll need to understand, then I'm going to show you why random variables are useful.

What is a mathematical function?

Functions are going to be important to this story, so bear with me.

In math, a function is some operation where you give it some input and it produces some output. The classic examples you may remember are the trigonometric functions like \(\sin(x)\), \(\cos(x)\), and \(\tan(x)\). A function could have several inputs, for example, this is a function: \(z = a_0 + a_1 x^1 + a_2 y^3\).

Functions are very common in math, so much so that it can be a little hard to spot them, as we'll see.

Describing randomness - distributions

A probability distribution is a math function that tells you how likely each outcome of a process is. For example, a traffic light can be red, yellow, or green. How likely is it that the next traffic light I come to will be red, yellow, or green? It must be one of them, so the probabilities must sum to one, but we know that the yellow phase is shorter than the red or green phases, so yellow is less likely. We could make similar arguments about the relative likelihood of red and green.

Probability distributions can get very complicated, but many of them follow well-known patterns. For example, when rolling an unbiased dice, the probability distribution is a discrete uniform distribution that looks like this:

The number of goals scored in a hockey or football match is known to be well-modeled by a (discrete) Poisson distribution that looks like this:

Male (or female) heights are well-modeled by a (continuous) normal distribution that looks like this:

There are hundreds of known distributions, but in practice, only a few are "popular".
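
If you want to play with these distributions, scipy makes it easy. This is just a sketch; the parameter values (an average of 2.5 goals per match, a mean height of 175 cm with a standard deviation of 7 cm) are illustrative assumptions:

from scipy import stats

# Discrete uniform: a fair six-sided dice (values 1 to 6, each with probability 1/6)
dice = stats.randint(1, 7)
print(dice.pmf(3))       # 0.1666...

# Poisson: goals in a match, assuming an average of 2.5 goals per game
goals = stats.poisson(mu=2.5)
print(goals.pmf(2))      # probability of seeing exactly 2 goals

# Normal: adult height, assuming a mean of 175 cm and a standard deviation of 7 cm
height = stats.norm(loc=175, scale=7)
print(height.pdf(183))   # probability *density* at 183 cm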

Discrete or continuous

There are two types of measurement we typically take: continuous and discrete.

Discrete measurements are things that come in discrete chunks, for example, the number of sheep in a flock, the number of goals in a match, the number of people in a movie theater, and so on. Categorical variables are "sort of" discrete, for example the colors of a traffic light, though they are a special case.

Continuous measurements are things that can take any value (including any number of digits after the decimal point). For example, the speed of a car on the freeway could be 72.15609... mph, someone's height might be 183.876... cm and so on. 

This seems clear, but sometimes we muddy the waters a bit. Let's say we're measuring height and we measure in whole cm. This transforms the measurement from a continuous one to a discrete one.

There are two types of probability distribution: continuous and discrete. We use continuous distributions for continuous quantities and discrete for discrete quantities. You should note that in the real world, it's often not this simple.

Random variables

A random variable is a math function whose output depends on some random process. The values of the random variable follow a probability distribution. Here are some examples of observations we can describe using random variables:

  • the lifetime of a lightbulb
  • goals scored
  • the result of rolling a dice
  • the speed of cars on a freeway
  • the height of a person
  • sales revenue

Dice are easy to understand, so I'll use them as an example. We don't know what the result of throwing a dice will be, but we know the probability distribution is discrete uniform, so the probability of throwing a 1 is \(\dfrac{1}{6}\), the probability of throwing a 2 is \(\dfrac{1}{6}\), and so on. Let's say we're gambling on dice, betting $1 and winning $6 if our number comes up. Using random variable math, we can work out what our gain or loss might be. In the dice example, it's trivial, but in other cases, it gets harder and we need some more advanced math.
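
Here's that dice bet worked through in code (a quick sketch; I'm assuming the $6 is the total returned to you when you win, stake included):

from fractions import Fraction

p_win = Fraction(1, 6)
stake, payout = 1, 6    # bet $1, receive $6 if our number comes up

# Expected net gain per game: win probability times payout, minus the stake
expected_gain = p_win * payout - stake
print(expected_gain)    # 0 -- on average, this bet breaks even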

Random variables have a set of all possible results, which can be finite or infinite; it's called the sample space and is denoted by \(\Omega\). For the dice example, the sample space is simply:

\[\Omega = \{1,2,3,4,5,6\}\]

For a continuous quantity, like the lifetime of a bulb:

\[\Omega = \{x \mid x \in \mathbb{R}, x \geq 0 \}\]

which means an infinite sample space. 

Infinite sample spaces, or large discrete sample spaces, mean we can't work things out by hand; we need more powerful math to do anything useful, and that's where things get hard.

A measurement (or observation) is the process of selecting a value from the sample space. Remember, the random variable has a probability distribution that tells you how likely different values are to be selected. 

Arithmetic with random variables - doing something useful

In this section and the next, I'll start to show you some interesting things you can do with random variables. To illustrate a key idea, we'll use a simple example. We'll work out the probability distribution for the combined scores we get by throwing two unbiased dice. 

We know the distribution is uniform for both dice, so we could work it out by hand like this:

Table 1: combining the scores of two dice

Dice 1   Dice 2   Combined score   Probability
  1        1            2          \(\dfrac{1}{36}\)
  1        2            3          \(\dfrac{1}{36}\)
  1        3            4          \(\dfrac{1}{36}\)
  ...
  2        1            3          \(\dfrac{1}{36}\)
  2        2            4          \(\dfrac{1}{36}\)
  2        3            5          \(\dfrac{1}{36}\)
  ...

The next step is adding up the probabilities of the combined scores:

  • there's only one way of getting 2, so its probability is \(\dfrac{1}{36}\)
  • there are two ways of getting 3, so its probability is \(\dfrac{1}{36} + \dfrac{1}{36}\)
  • ...

This is really tedious, and it would obviously be hugely expensive for a large sample space. There's a much faster way I'm going to show you.

To add two random variables, we use a process called convolution. This is a fancy way of saying that, for each possible combined score, we multiply and add the probabilities of all the ways of getting it. Mathematically, it looks like this for discrete random variables, where \(f\) is the distribution for the first dice and \(g\) is the distribution for the second dice:

\[(f * g)[n] = \sum_{m}{f[m] \, g[n-m]}\]

In Python, we need to do it in two stages: work out the sample space and work out the probabilities. Here's some code to do it for two dice.  

import numpy as np

score1, score2 = np.arange(1, 7), np.arange(1, 7)
prob1, prob2 = np.ones(6) / 6, np.ones(6) / 6

combo_score = list(range(score1[0] + score2[0], score1[-1] + score2[-1] + 1))
combo_prob = np.convolve(prob1, prob2)

print(combo_score)
print(combo_prob)

Running this prints the combined scores 2 through 12 and the familiar triangular set of probabilities: \(\dfrac{1}{36}\) for 2, rising to \(\dfrac{6}{36}\) for 7, then falling back to \(\dfrac{1}{36}\) for 12. This is easy to do by hand for two dice, but not when the data sets get a lot bigger; that's when we need computers.

The discrete case is easy enough, but the continuous case is harder and the math is more advanced. Let's take an example to make things more concrete: imagine a company with two sales areas, where an analyst is modeling each area's sales as a continuous random variable. How do we work out the total sales? The answer is the continuous convolution of the two sales distributions, which looks like this:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) \,d\tau\]

This is obviously a lot more complicated. It's so complicated, I'm going to spend a little time explaining how to do it.

Broadly speaking, there are three approaches to continuous convolution: special cases, symbolic calculation, and discrete approximations.

In a handful of cases, convolving two continuous random variables has known answers. For example, convolving normal distributions gives a normal distribution and convolving uniform distributions gives an Irwin-Hall distribution.

In almost all cases, it's possible to do a symbolic calculation using integration. You might think that something like SymPy could do it, but in practice, you need to do it by hand. Obviously, you need to be good at calculus. There are several textbooks that have some examples of the process and there are a number of discussions on StackOverflow. From what I've seen, college courses in advanced probability theory seem to have course questions on convolving random variables with different distributions and students have asked for help with them online. This should give you an inkling of the level of difficulty.

The final approach is to use discrete approximations to continuous functions and use discrete convolution. This tends to be the default in most cases.
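
Here's a sketch of the discrete-approximation approach: sample both densities on a common grid, use discrete convolution, and rescale by the grid spacing. The two normal distributions and the grid below are my own illustrative choices; using normals means we can sanity-check the result against the known exact answer:

import numpy as np
from scipy import stats

# Sample two continuous densities on a common, evenly spaced grid
x = np.arange(0, 400, 0.5)
dx = x[1] - x[0]

f = stats.norm(loc=100, scale=10).pdf(x)   # e.g. sales in region 1
g = stats.norm(loc=150, scale=20).pdf(x)   # e.g. sales in region 2

# Discrete convolution approximates the continuous one when scaled by dx
h = np.convolve(f, g) * dx
x_h = 2 * x[0] + dx * np.arange(len(h))    # grid for the combined variable

print(h.sum() * dx)        # ≈ 1.0, so h is (close to) a valid density
print(x_h[h.argmax()])     # ≈ 250.0, the sum of the two means, as expected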

Worked example with random variables: predicting revenue and net income

Let's say we want to model the total sales revenue (\(t\)) from several regions (\(s_0, s_1, ...s_n\)) that are independent. We also have a model of expenses for the company as a whole (\(e\)). How can we model total revenue and net income?

Let's assume the sales revenue in each region is modeled by a random variable, each having a normal distribution. We have mean values \(\mu_0, \mu_1, ..., \mu_n\) and standard deviations \(\alpha_0, \alpha_1, ..., \alpha_n\). To get total sales, we have to do convolution:

\[t = s_0 * s_1 * ... * s_n\]

This sounds complicated, but for the normal distribution, there's a shortcut. Convolving normal with normal gives normal: all we have to do is add the means and the variances. So the total sales number is a normal distribution with mean and variance:

\[\mu = \sum_{i=0}^{n}\mu_i\]

\[\alpha^2 = \sum_{i=0}^{n}\alpha_{i}^{2}\]
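
In code, the shortcut is only a couple of lines (the regional means and standard deviations here are made-up numbers):

import numpy as np

# Regional sales, each modeled as an independent normal distribution
mus    = np.array([1.2e6, 0.8e6, 2.1e6])    # means
alphas = np.array([0.2e6, 0.1e6, 0.4e6])    # standard deviations

total_mu    = mus.sum()                      # add the means
total_alpha = np.sqrt((alphas ** 2).sum())   # add the variances, then take the square root

print(total_mu, total_alpha)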

Getting net income is a tiny bit harder. If you remember your accountancy textbooks, net income \(ni\) is:

\[ni = t - e\]

If expenses are modeled by the normal distribution, the answer here is just a variation of the process I used for combining sales. But what if expenses are modeled by some other distribution? That's where things get tough. 

Combining random variables with different probability distributions is hard. I couldn't find a good inventory of known solutions on the web. You can do the symbolic calculation by hand, but that requires a good grasp of calculus. You might think that something like SymPy would work, but at the time of writing, SymPy doesn't have a good way of doing it. The final way of doing it is to use a discrete approximation, but that's time-consuming. Bottom line: there's no easy solution if the distributions aren't all normal or aren't all uniform.

Division and multiplication with random variables

Most problems using random variables seem to boil down to adding them. If you need to multiply or divide random variables, there are ways to do it. The book "The Probability Lifesaver" by Steven J. Miller explains how.

Minimum, maximum, and expected values

I said that convolving random variables can be very hard, but getting some values is pretty straightforward.

The maximum of the sum of two random variables \(f\) and \(g\) is simply \(\max(f) + \max(g)\).

The minimum of the sum is simply \(\min(f) + \min(g)\).

What about the mean? It turns out, getting the mean is easy too. The mean value of a random variable is often called the expectation value and is the result of a function called \(E\), so the mean of a random variable \(X\) is \(E(X)\). The formula for the mean of the sum of two random variables is:

\[E(X + Y) = E(X) + E(Y)\]

In simple words, we add the means. 

Note I didn't say what the underlying distributions were. That's because it doesn't matter.
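
Here's a quick numerical check that the means add even when the two distributions are completely different (the choice of distributions is arbitrary):

import numpy as np

rng = np.random.default_rng(0)

x = rng.integers(1, 7, size=1_000_000)           # dice rolls: discrete uniform on 1..6
y = rng.exponential(scale=3.0, size=1_000_000)   # a completely different distribution

print(x.mean() + y.mean())   # E(X) + E(Y)
print((x + y).mean())        # E(X + Y) -- the same, up to sampling noise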

What if we apply some function to a random variable? It turns out, you can calculate the mean of a function of a random variable fairly easily and the arithmetic for combining multiple means is well known. There are pages on Wikipedia that will show you how to do it (in general, search for "linear combinations of expectation values" to get started).

Bringing it all together

There are a host of business and technical problems where we can't give a precise answer, but we can model the distribution of answers using random variables. There's a ton of theory surrounding the properties and uses of random variables, but it does get hard. By combining random variables, we can build models of more complicated systems; for example, we could forecast the range of net incomes for a company for a year. In some cases (e.g. normal distributions), combining random variables is easy; in other cases, it takes us into the world of calculus or discrete approximations.

Yes, random variables are hard, but they're very powerful.

Wednesday, May 14, 2025

You need to use Manus

What is Manus - agentic AI

Manus is an AI agent capable of performing a number of high-level tasks that previously could only be done by humans. For example, it can research an area (e.g. a machine learning method) and produce an intelligible report, it can even turn a report into an interactive website. You can get started on it for free.

It created a huge fuss on its release, and rightly so. The capabilities it offers are ground-breaking. We're now a few months later and it's got even better.

In this blog post, I'm going to provide you with some definitions, show you what Manus can do, give you some warnings, and provide you with some next steps.

If you want to get an invitation to Manus, contact me.

How it works 

We need some definitions here. 

An LLM (Large Language Model) is a huge computer model that's been trained on large bodies of text. That could be human language (e.g. English, Chinese) or it could be computer code (e.g. Python, JavaScript). An LLM can do things like:

  • extract meaning from text e.g. given a news article on a football match, it can tell you the score, who won, who lost, and other details from the text
  • predict the next word in a sentence or the next sentence in a paragraph
  • produce entire "works", for example, you can ask an LLM to write a play on a given theme.

An agent is an LLM that controls other LLMs without human intervention. For example, you might set it the task of building a user interface using react.js. The agent will interpret your task and break it down into several sub-tasks. It will then ask LLMs to build code for each sub-task and stitch the code together. More importantly for this blog post, you can use an agent to build a report for you on a topic. The agent will break down your request into chunks, assign those chunks to LLMs, and build an answer for you. An example topic might be "build me a report on what to do during a 10 day vacation in Brazil".

Manus is an agentic AI. It will split your request into chunks, assign those chunks to LLMs (it could be the same LLM or it could be different ones depending on the task), and combine the results into a report.

An example

I gave the following instructions to Manus:

You are an experienced technical professional. You will write a report explaining how logistic regression works for your colleagues. Your report will be a Word document. Your report will include the following sections:

* Why logistic regression is important.

* The theory and math behind it.

* A worked example. This will include code in Python using the appropriate libraries.

You will include the various math formula using the correct notation. You will provide references where appropriate.

Here's how it got started:


After it started, I realized I needed to modify my instructions, here's the dialog:

It incorporated my request and did add more sections.

Here's an example of how it kept me updated:

After 20 minutes, it produced a report in Word format. After reading the report, I realized I wanted to turn it into a blog post, so I asked Manus to give me the report as an HTML document, which it did.

I've posted the report as a blog post and you can read it here: https://blog.engora.com/2025/05/the-importance-of-logistic-regression.html

A critique of the Manus report

I'm familiar with logistic regression so I can critique what Manus returned. I'd give it a B+. This may sound a bit harsh, but that's a very credible result for 20 minutes of effort. It's enough to get going with, but it's not enough on its own. Here's my assessment.

  • Writing style and use of English. Great. Better than most native English speakers.
  • Report organization. Great. Very clear and concise. Nicely formatted.
  • Technical correctness. I couldn't spot anything wrong with what it produced. It did miss some important things out though, and it had some oddities:
    • Logistic regression with more than two target classes: no mention of it.
    • The odds ratio can vary from 0 to +\(\infty\), but it didn't mention that. This is curious because it pointed out that linear regression can vary from -\(\infty\) to +\(\infty\) in the prior paragraphs.
    • Too terse a description of the sigmoid function. It should have included a chart and a deeper discussion of some of the relevant properties of the function.
    • No meaningful discussion of decision boundaries (one mention, in not enough detail).
  • Formulas. A curious mixed bag. In some cases, it gave very good formulas using the standard symbols; in other cases, it gave code-like formulas. This might be because I told it I wanted a Word report. By default, it uses markdown and it may be better to keep the report in markdown. It might be worth experimenting with telling it to use LaTeX for formulas.
  • Code. Great.
  • References. Not great. No links back to the several online books that talk about logistic regression in some detail. No links to academic papers. The references it did provide were kind of OK, but there weren't enough of them and overall they weren't of high enough quality.

To fix some of these issues, I could have tweaked my prompt, for example, telling it to use academic references, or giving it instructions to expand certain areas etc. This would cost more tokens. I could have told it to use high-effort reasoning which would also have cost me more tokens. 

Tokens in AI

Computation isn't free and that's especially true of AI. Manus, in common with many other AI services, uses a "token" model. This report cost me 511 tokens. Manus gives you a certain number of tokens for free, which is enough for experimentation but not enough for commercial use.

What's been written about it

Other people have written about Manus too. Here are some reviews:

Who owns Manus

Manus is owned by a Chinese company called Monica (also known as Butterfly Effect AI) based in Wuhan.

Some cautions

As with any LLM or agentic AI, I suggest that you do not share company confidential information or PII. This includes data, but also includes text. Some LLMs/agents will use any data (including text) you supply to help train their models. This might be OK, but it also might not be OK - proceed with caution.

Before you use any agentic AI or an LLM for "production" use, I suggest a legal and risk review.

  • What does their system do with the data you send it? Does it retain the data, does it train the model? Is it resold?
  • What does their system do with the output (e.g. final report, generated code)? 
  • Can you ask for your data to be removed from their model or system?

What this means - next steps

These types of agentic AI are game-changers. They will get you information you need far faster and far cheaper than a human could do it. The information isn't perfect and perhaps you wouldn't give it an A, but it's more than good enough to get started and frankly, most humans don't produce A work.

If you're involved in any kind of knowledge work, you should be experimenting with Manus and its competitors. This technology has obvious implications for employment and if you think you might be affected, it behoves you to understand what's going on.

If you want to get started, reach out to me to get an invitation to Manus and get extra free tokens.

The Importance of Logistic Regression

Note

With the exception of this note, everything else on this blog post was automatically created by Manus. I'm providing it as an example of what you can create.

In this separate blog post, I explain how I created this report and I provide an evaluation of it.

If you want to get started with Manus, contact me and I'll share an invitation with you.

Mike

======================================

The Importance of Logistic Regression

Logistic regression stands as a cornerstone in the field of machine learning and statistics, primarily recognized for its efficacy in tackling binary classification problems. Its importance stems from a combination of its interpretability, efficiency, and the foundational understanding it provides for more complex algorithms. Unlike linear regression, which predicts continuous outcomes, logistic regression is specifically designed to predict the probability of an instance belonging to a particular class, typically one of two (e.g., yes/no, true/false, 0/1). This probabilistic output is crucial in many real-world scenarios where a clear-cut decision boundary is needed, but an understanding of the likelihood of each outcome is also valuable.

One of the key reasons for logistic regression’s widespread adoption is its relative simplicity and ease of implementation. It serves as an excellent starting point for individuals venturing into predictive modeling and classification tasks. The mathematical underpinnings, while involving concepts like the sigmoid function and log-odds, are generally more accessible than those of more sophisticated models like neural networks or support vector machines. This accessibility does not, however, detract from its power. Logistic regression can provide robust and accurate predictions, especially when the relationship between the independent variables and the log-odds of the dependent variable is approximately linear.

Furthermore, the interpretability of logistic regression models is a significant advantage. The coefficients derived from a trained logistic regression model can be directly interpreted in terms of the odds ratio. This allows practitioners to understand the influence of each independent variable on the likelihood of the outcome. For instance, in a medical diagnosis scenario, a logistic regression model can not only predict the probability of a patient having a certain disease but also quantify how factors like age, weight, or specific test results contribute to that probability. This level of insight is invaluable in fields where understanding the ‘why’ behind a prediction is as important as the prediction itself.

Logistic regression is also computationally efficient, making it suitable for large datasets and real-time applications. Training a logistic regression model is generally faster compared to more complex algorithms, and making predictions is also quick. This efficiency, combined with its good performance on many binary classification tasks, makes it a go-to algorithm for a wide range of applications. These applications span various domains, including medical diagnosis (e.g., predicting disease presence), finance (e.g., credit scoring, fraud detection), marketing (e.g., predicting customer churn or purchase likelihood), and social sciences (e.g., predicting voting behavior).

Moreover, logistic regression serves as a fundamental building block for understanding more advanced classification techniques. Many concepts introduced in logistic regression, such as the use of a link function (the sigmoid function), maximum likelihood estimation for parameter fitting, and the evaluation of model performance using metrics like accuracy, precision, recall, and AUC-ROC, are transferable to other machine learning algorithms. Therefore, a solid grasp of logistic regression provides a strong foundation for learning and applying more complex models.

In summary, the importance of logistic regression is multifaceted. It is a powerful yet relatively simple and interpretable classification algorithm that provides probabilistic outputs. Its computational efficiency, wide range of applications, and its role as a foundational concept in machine learning solidify its place as an essential tool in the data scientist’s and statistician’s toolkit. Whether used as a standalone model or as a baseline for comparison with more complex methods, logistic regression continues to be a highly relevant and valuable technique in the world of data analysis and predictive modeling.

The Theory and Math Behind Logistic Regression

Logistic regression, despite its name, is a statistical model used for binary classification tasks, meaning it predicts the probability of an instance belonging to one of two classes. The core idea is to model the probability that a given input point belongs to a certain class. To understand its mechanics, we need to delve into concepts like the odds, the logit function, the sigmoid (or logistic) function, and the method of maximum likelihood estimation for fitting the model.

From Linear Regression to Probabilities

Linear regression predicts a continuous output, y, based on a linear combination of input features, X. The equation for a simple linear regression with one feature is y = β₀ + β₁x. For multiple features, this becomes y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. However, the output of linear regression can range from -∞ to +∞, which is not suitable for probabilities that must lie between 0 and 1.

To address this, logistic regression transforms the linear combination of inputs using a function that maps any real-valued number into the (0, 1) interval. This function is the sigmoid function, also known as the logistic function.

The Sigmoid (Logistic) Function

The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

Here, ‘z’ represents the linear combination of input features and their corresponding coefficients (weights): z = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ. The output of the sigmoid function, σ(z), is the estimated probability P(Y=1|X), i.e., the probability that the dependent variable Y is 1 (e.g., ‘pass’, ‘yes’, ‘disease present’) given the input features X. As z approaches +∞, e^(-z) approaches 0, and σ(z) approaches 1. Conversely, as z approaches -∞, e^(-z) approaches +∞, and σ(z) approaches 0. This S-shaped curve is ideal for modeling probabilities.

Odds and Log-Odds (Logit)

To understand the derivation of the logistic regression model, it’s helpful to consider the concept of odds. The odds of an event occurring is the ratio of the probability of the event occurring to the probability of it not occurring:

Odds = P(Y=1|X) / P(Y=0|X)

Since P(Y=0|X) = 1 - P(Y=1|X), we can write:

Odds = P(Y=1|X) / (1 - P(Y=1|X))

If we let p(X) = P(Y=1|X) = σ(z) = 1 / (1 + e^(-z)), then:

1 - p(X) = 1 - [1 / (1 + e^(-z))] = (1 + e^(-z) - 1) / (1 + e^(-z)) = e^(-z) / (1 + e^(-z))

So, the odds become:

Odds = [1 / (1 + e^(-z))] / [e^(-z) / (1 + e^(-z))] = 1 / e^(-z) = e^z

Now, taking the natural logarithm of the odds gives us the log-odds, also known as the logit function:

logit(p(X)) = ln(Odds) = ln(e^z) = z

Thus, we have:

ln(p(X) / (1 - p(X))) = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

This equation shows that the log-odds of the outcome is a linear function of the input features. This is the fundamental relationship that logistic regression models. The coefficients (β) can be interpreted in terms of the change in log-odds for a one-unit change in the corresponding feature, holding other features constant. Exponentiating a coefficient gives the odds ratio.

Model Fitting: Maximum Likelihood Estimation (MLE)

Unlike linear regression, where coefficients are typically estimated using Ordinary Least Squares (OLS), logistic regression coefficients are estimated using Maximum Likelihood Estimation (MLE). MLE is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the given data.

For a dataset with ‘n’ independent observations {(xᵢ, yᵢ)}, where xᵢ is the vector of features for the i-th observation and yᵢ is its binary outcome (0 or 1), the likelihood function L(β) is the product of the probabilities of observing each yᵢ given xᵢ and the parameters β:

L(β) = Πᵢ [p(xᵢ) ^ yᵢ] * [(1 - p(xᵢ)) ^ (1 - yᵢ)]

where p(xᵢ) = σ(β₀ + β₁x₁ᵢ + … + βₚxₚᵢ) is the predicted probability for the i-th observation.

It is often easier to work with the log-likelihood function, ll(β), because it converts the product into a sum:

ll(β) = ln(L(β)) = Σᵢ [yᵢ * ln(p(xᵢ)) + (1 - yᵢ) * ln(1 - p(xᵢ))]

Substituting p(xᵢ) = 1 / (1 + e^(-zᵢ)) and 1 - p(xᵢ) = e^(-zᵢ) / (1 + e^(-zᵢ)), where zᵢ = β₀ + β₁x₁ᵢ + … + βₚxₚᵢ, the log-likelihood becomes:

ll(β) = Σᵢ [yᵢ * zᵢ - ln(1 + e^(zᵢ))]

To find the values of β that maximize this log-likelihood function, we typically use iterative optimization algorithms like Gradient Ascent (since we are maximizing) or Newton-Raphson. These algorithms start with initial estimates for β and iteratively update them until the log-likelihood converges to a maximum. There is no closed-form solution for the β coefficients in logistic regression, unlike in linear regression.

Assumptions of Logistic Regression

While logistic regression is more flexible than linear regression, it still relies on a few key assumptions:

  1. Binary Dependent Variable: The dependent variable must be binary or dichotomous (e.g., 0/1, yes/no). For more than two categories, extensions like multinomial or ordinal logistic regression are used.
  2. Independence of Observations: The observations should be independent of each other. This is a common assumption for many statistical models.
  3. Linearity of Log-Odds: The relationship between the independent variables and the log-odds of the outcome is assumed to be linear. This can be checked using techniques like the Box-Tidwell test or by plotting residuals.
  4. Absence of Multicollinearity: There should be little or no multicollinearity among the independent variables. High multicollinearity can make it difficult to estimate the individual effects of the predictors.
  5. Large Sample Size: Logistic regression typically requires a reasonably large sample size to achieve stable and reliable estimates of the coefficients.

Understanding these theoretical and mathematical underpinnings is crucial for effectively applying logistic regression, interpreting its results, and diagnosing potential issues.

Worked Example: Logistic Regression in Python

This section provides a practical, step-by-step demonstration of how to implement logistic regression using Python. We will leverage popular libraries such as pandas for data manipulation, scikit-learn for machine learning tasks including model building and evaluation, and numpy for numerical operations. For this example, we will use the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which is conveniently available within scikit-learn. This dataset presents a binary classification problem: predicting whether a breast mass is malignant or benign based on several computed features from digitized images of fine needle aspirates (FNA).

1. Importing Necessary Libraries

The first step in any Python-based data science task is to import the required libraries. We will need pandas for creating and managing DataFrames, numpy for numerical computations (though its direct use might be minimal here, it underpins scikit-learn), and several modules from scikit-learn for data splitting, model implementation, preprocessing, and metrics.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

2. Loading and Exploring the Dataset

We load the breast cancer dataset using load_breast_cancer() from sklearn.datasets. The data and feature names are then used to create a pandas DataFrame for easier manipulation and inspection. The target variable, indicating whether a tumor is malignant (1) or benign (0), is added as a new column to this DataFrame.

# Load the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

Before proceeding with modeling, it is crucial to perform some initial exploratory data analysis (EDA). We display the first few rows of the DataFrame using df.head() to get a feel for the data, df.info() to understand the data types and check for missing values, and df["target"].value_counts() to see the distribution of the target classes.

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

This initial exploration helps confirm that the dataset is loaded correctly, identify the nature of the features (all appear to be numerical in this case), and understand the balance of the classes in the target variable, which is important for classification tasks.

3. Defining Features and Target Variable

Next, we separate the dataset into features (independent variables, denoted as X) and the target variable (dependent variable, denoted as y). X will contain all columns except the ‘target’ column, and y will consist solely of the ‘target’ column.

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

4. Splitting Data into Training and Testing Sets

To evaluate the performance of our logistic regression model on unseen data, we split the dataset into a training set and a testing set. The model will be trained on the training set, and its predictive performance will be assessed on the testing set. We use train_test_split from sklearn.model_selection for this purpose. A common split is 80% for training and 20% for testing. Setting random_state ensures that the split is the same every time the code is run, making the results reproducible. The stratify=y argument ensures that the proportion of the target classes is maintained in both the training and testing sets, which is particularly important for imbalanced datasets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

5. Feature Scaling

Many machine learning algorithms, including logistic regression (especially when using certain solvers like ‘lbfgs’ or when regularization is applied), perform better when the input numerical features are on a similar scale. Feature scaling standardizes the range of independent variables. We use StandardScaler from sklearn.preprocessing, which standardizes features by removing the mean and scaling to unit variance. The scaler is fit only on the training data to prevent data leakage from the test set, and then used to transform both the training and testing data.

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

6. Initializing and Training the Logistic Regression Model

With the data prepared, we can now initialize and train our logistic regression model. We create an instance of the LogisticRegression class from sklearn.linear_model. For this example, we specify the solver="liblinear", which is a good choice for smaller datasets and binary classification, and set random_state for reproducibility. The max_iter parameter is increased to ensure the solver has enough iterations to converge. The model is then trained using the fit() method with the scaled training features (X_train_scaled) and the training target variable (y_train).

# Initialize and train the Logistic Regression model
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

7. Making Predictions

Once the model is trained, we can use it to make predictions on the test set (X_test_scaled). The predict() method returns the predicted class labels (0 or 1 in this case). We also use the predict_proba() method to obtain the predicted probabilities for each class. This provides the likelihood of an instance belonging to class 0 (malignant) and class 1 (benign); note that in scikit-learn's breast cancer dataset, 0 means malignant and 1 means benign.

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

8. Evaluating the Model

Model evaluation is crucial to understand how well our logistic regression model performs. We use several common metrics for classification tasks:

  • Accuracy: This is the proportion of correctly classified instances. It is calculated using accuracy_score.
  • Confusion Matrix: This table provides a detailed breakdown of correct and incorrect classifications for each class (True Positives, True Negatives, False Positives, False Negatives). It is generated using confusion_matrix.
  • Classification Report: This report, generated by classification_report, includes precision, recall, F1-score, and support for each class. These metrics provide a more nuanced view of performance, especially if the classes are imbalanced.
    • Precision measures the accuracy of positive predictions (TP / (TP + FP)).
    • Recall (or Sensitivity) measures the model’s ability to identify all actual positives (TP / (TP + FN)).
    • F1-score is the harmonic mean of precision and recall, providing a single score that balances both. (A short sketch after the evaluation code below recomputes these metrics directly from the confusion-matrix counts.)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

The output of these evaluations indicates how effective the model is. For instance, high accuracy together with balanced precision and recall suggests good performance.
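
To make the formulas above concrete, the short sketch below unpacks the confusion matrix computed above (treating class 1, benign, as the positive class) and recomputes precision, recall, and F1 by hand; the values should match the class-1 row of the classification report.

# Recompute precision, recall, and F1 for class 1 (benign) from the confusion matrix.
tn, fp, fn, tp = conf_matrix.ravel()   # sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision (class 1): {precision:.4f}")
print(f"Recall    (class 1): {recall:.4f}")
print(f"F1-score  (class 1): {f1:.4f}")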

9. Interpreting Predicted Probabilities

To further understand the model’s output, we can look at the predicted probabilities for a few samples from the test set. This shows the model’s confidence in its predictions.

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

Each row in y_pred_proba contains two probabilities: the first for class 0 (malignant) and the second for class 1 (benign), matching the order of log_reg_model.classes_. For binary classification, predict() assigns whichever class has the higher probability, which is equivalent to applying a 0.5 threshold to the class-1 probability.
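
If the default 0.5 cut-off is not appropriate, you can threshold the probabilities yourself. The sketch below uses an arbitrary illustrative threshold of 0.7 on the class-1 (benign) probability, which makes the model more conservative about predicting benign and therefore tends to raise recall for malignant cases; it reuses the arrays computed above.

# Apply a custom decision threshold to the class-1 (benign) probabilities.
# 0.7 is purely illustrative: a higher threshold means fewer benign predictions,
# which increases recall for the malignant class at the cost of more false alarms.
threshold = 0.7
y_pred_custom = (y_pred_proba[:, 1] >= threshold).astype(int)

print(f"\nPredictions changed by the new threshold: {(y_pred_custom != y_pred).sum()}")
print(confusion_matrix(y_test, y_pred_custom))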

10. Interpreting Model Coefficients

Finally, we can examine the coefficients (weights) learned by the logistic regression model. These coefficients indicate the relationship between each feature and the log-odds of the outcome. A positive coefficient suggests that an increase in the feature's value increases the log-odds of the outcome being class 1 (benign), while a negative coefficient suggests the opposite. We can also exponentiate these coefficients to get odds ratios, which are often easier to interpret. An odds ratio greater than 1 means the odds of the outcome (benign) increase as the feature increases, while an odds ratio less than 1 means the odds decrease.

# Interpreting Coefficients
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

This step provides insights into which features are most influential in the model's predictions. Because the model was trained on standardized features, each coefficient describes the effect of a one-standard-deviation change in the original feature, not a one-unit change in the original measurement.
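
Since StandardScaler divides each feature by its training-set standard deviation, you can approximately recover coefficients per original unit by dividing the fitted coefficients by scaler.scale_ (the intercept absorbs the shift by the mean). The sketch below shows this, reusing the objects defined above.

# Convert standardized coefficients back to the original feature units.
# For z = (x - mean) / std, a coefficient beta on z corresponds to beta / std on x.
coef_original_units = log_reg_model.coef_[0] / scaler.scale_
coef_original_df = pd.DataFrame(coef_original_units, X.columns, columns=["Coefficient (per original unit)"])
print("\n--- Coefficients on the Original Feature Scale ---")
print(coef_original_df.sort_values(by="Coefficient (per original unit)", ascending=False))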

This worked example covers the end-to-end process of applying logistic regression, from data loading and preprocessing to model training, evaluation, and basic interpretation. The specific results (accuracy, coefficients, etc.) will depend on the dataset and the chosen parameters, but the methodology remains consistent. For convenience, the complete script is collected below.

# Python Worked Example for Logistic Regression

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer # Using a built-in dataset for simplicity

# Load the dataset
# The breast cancer dataset is a classic binary classification dataset.
# Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
# They describe characteristics of the cell nuclei present in the image.
# The target variable indicates whether the mass is malignant (0) or benign (1).
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

print("--- Dataset Head ---")
print(df.head())
print("\n--- Dataset Info ---")
df.info()
print("\n--- Target Value Counts ---")
print(df["target"].value_counts())

# Define features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

# Split the data into training and testing sets
# We use 80% of the data for training and 20% for testing.
# random_state is set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\n--- Shape of Training Data ---")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"--- Shape of Testing Data ---")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# Feature Scaling
# Logistic regression can benefit from feature scaling, especially when using solvers that are sensitive to feature magnitudes.
# StandardScaler standardizes features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model
# We use the liblinear solver, which is well suited to smaller datasets and binary classification.
# max_iter is increased to ensure convergence for some solvers.
log_reg_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)

print("\n--- Model Training Complete ---")

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_scaled)
y_pred_proba = log_reg_model.predict_proba(X_test_scaled) # Get probabilities

print("\n--- Predictions Made ---")

# Evaluate the model
# Accuracy: The proportion of correctly classified instances.
accuracy = accuracy_score(y_test, y_pred)
print(f"\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")

# Confusion Matrix: a table showing the performance of a classification model.
# Rows represent the actual classes, and columns represent the predicted classes.
# Treating class 1 (benign) as the positive class, the layout is:
# TN | FP
# FN | TP
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_matrix}")

# Classification Report: Provides precision, recall, F1-score, and support for each class.
# Precision: TP / (TP + FP) - Ability of the classifier not to label as positive a sample that is negative.
# Recall (Sensitivity): TP / (TP + FN) - Ability of the classifier to find all the positive samples.
# F1-score: 2 * (Precision * Recall) / (Precision + Recall) - Weighted average of Precision and Recall.
# Support: The number of actual occurrences of the class in the specified dataset.
class_report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(f"\nClassification Report:\n{class_report}")

# Display some predicted probabilities for the first few test samples
print("\n--- Predicted Probabilities for first 5 test samples (Benign, Malignant) ---")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test.iloc[i]}, Predicted Proba={y_pred_proba[i]}, Predicted Class={y_pred[i]}")

# Interpreting Coefficients (Optional, but good for understanding)
# The coefficients represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable,
# holding other variables constant. Because the features were standardized, one unit here means one standard deviation
# of the original feature.
coefficients = pd.DataFrame(log_reg_model.coef_[0], X.columns, columns=["Coefficient"])
print("\n--- Model Coefficients (Log-Odds) ---")
print(coefficients.sort_values(by="Coefficient", ascending=False))

# To get odds ratios, we can exponentiate the coefficients
odds_ratios = np.exp(log_reg_model.coef_[0])
odds_ratios_df = pd.DataFrame(odds_ratios, X.columns, columns=["Odds Ratio"])
print("\n--- Model Odds Ratios ---")
print(odds_ratios_df.sort_values(by="Odds Ratio", ascending=False))

print("\n--- End of Worked Example ---")

References

  1. GeeksforGeeks. (2025, February 3). Logistic Regression in Machine Learning. GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/understanding-logistic-regression/
  2. Rai, K. (2020, June 14). The math behind Logistic Regression. Analytics Vidhya on Medium. Retrieved from https://medium.com/analytics-vidhya/the-math-behind-logistic-regression-c2f04ca27bca
  3. Wikipedia contributors. (2024, May 9). Logistic regression. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Logistic_regression
  4. Scikit-learn developers. (n.d.). sklearn.linear_model.LogisticRegression. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  5. Scikit-learn developers. (n.d.). sklearn.datasets.load_breast_cancer. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
  6. Scikit-learn developers. (n.d.). sklearn.model_selection.train_test_split. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
  7. Scikit-learn developers. (n.d.). sklearn.preprocessing.StandardScaler. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  8. Scikit-learn developers. (n.d.). sklearn.metrics module. Scikit-learn. Retrieved from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
  9. Pandas development team. (n.d.). Pandas documentation. Pandas. Retrieved from https://pandas.pydata.org/pandas-docs/stable/
  10. NumPy developers. (n.d.). NumPy documentation. NumPy. Retrieved from https://numpy.org/doc/