Wednesday, June 4, 2025

Recommendations for rolling out generative AI to data science and technical coding teams

Summary - proceed with caution

This report gives guidance on rolling out code generation to data science teams. One size doesn't fit all, so use it as a guide to shape your thinking, not as a fixed recipe.

There are substantial productivity gains to be had from rolling out generative AI for code generation to data science teams, but there are major issues to be managed and overcome. Without effective leadership, including expectation setting, roll-outs will fail. 

Replacing notebooks with an agentic IDE like Cursor will not succeed. The most successful strategy is likely the combined use of notebooks and an agentic AI IDE, which will give data scientists an understanding of both the benefits of the technology and its limitations. This prepares teams for the probable appearance of agentic notebook products in the near future.

For groups that use IDEs (like software developers), I recommend immediate use of Cursor or one of its competitors. I'm covering this in a separate report.

(Perplexity.AI)

Introduction

Why, who, and how

This is a guide for rolling out generative AI (meaning code generation) for data science teams. It covers the benefits you might expect to see, the issues you'll encounter, and some suggestions for coping with them. 

My comments and recommendations are based on my use of Cursor (an agentic IDE) along with Claude, OpenAI, and other code generation LLMs. I'm using them on multiple data science projects.

As of June 2025, no agentic AI notebook for data science has reached widespread adoption; in my opinion, that's likely to change later in 2025. Data science teams that understand the use of agentic AI for code generation will have an advantage over teams that don't, so early adoption is important.

Although I'm focused on data science, all my comments apply to anyone doing technical coding, by which I mean code that's algorithmically complex or uses "advanced" statistics. This can include people with the job titles "Analyst" or "Software Engineer".

I'm aware that not everyone knows what Cursor and the other agentic AI-enabled IDEs are, so I'm writing a separate blog post about them.

(Gemini)

The situation for software engineers

For more traditional software engineering roles, agentic AI IDEs offer substantial advantages and don't suffer from the "not a notebook" problem. Despite the limitations and drawbacks of code generation, the gains are such that I recommend an immediate, managed, and thoughtful roll-out. A managed and thoughtful roll-out means setting realistic goals, providing proper training, and maintaining clear communications.

  • Realistic goals cover productivity gains; promising gains of 100% or more is unrealistic.
  • Proper training means educating the team on when to use code gen and when not to use it. 
  • Clear communications means the team must be able to share their experiences and learn from one another during the roll-out phase.

I have written a separate report for software engineering deployment.

Benefits for data science

Cursor can automate a lot of the "boring" work that consumes data scientists' time but isn't core algorithm development (the main thing they're paid to do). Here's a list:

  • Commenting code. This includes function commenting using, for example, the Google function documentation format (see the docstring example after this list).
  • Documentation. This means documenting how code works and how it's structured, e.g. create a markdown file explaining how the code base works.
  • Boilerplate code. This includes code like reading in data from a data source.
  • Test harnesses, test code, and test data. Code generation is excellent at generating regression test frameworks, including test data.
  • PEP8 compliance. Cursor can restructure code to meet PEP8 requirements.
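To make the first item concrete, here's the kind of Google-style docstring a tool like Cursor will typically produce. The function itself is a hypothetical example, not taken from a real project:

def clip_outliers(values: list[float], lower: float, upper: float) -> list[float]:
    """Clips values to the range [lower, upper].

    Args:
        values: The numeric values to clip.
        lower: The smallest allowed value.
        upper: The largest allowed value.

    Returns:
        A new list with each value limited to the [lower, upper] range.

    Raises:
        ValueError: If lower is greater than upper.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]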

There are other key advantages too:

  • Code completion. Given a comment or a specific prompt, Cursor can generate code blocks, including using the correct API parameters. This means less time looking up how to use APIs.
  • Code generation. Cursor can generate the outline of functions and much of the functionality, but this has to be well-managed.

Overall, if used correctly, Cursor can give a significant productivity boost for data science teams.

Problems for data science

It's not plain sailing; there are several issues to overcome to get the productivity benefits. You should be aware of them and have a plan to address them.

It's not a notebook

(Gemini)

On the whole, data scientists don't use IDEs; they use notebooks. Cursor, and all the other agentic IDEs, are not notebooks. This is the most important issue to deal with, and it's probably going to be the biggest cause of roll-out failure.

Notebooks have features that IDEs don't, specifically the ability to do "data interactive" development and debugging, which is the key reason data scientists use them. Unfortunately, none of the agentic AI systems have anything that comes close to a notebook's power. Cursor's debugging is not AI-enabled and does not easily allow notebook-cell-like data investigations.

Getting data scientists to abandon notebooks and move wholesale to an agentic IDE like Cursor is an uphill task and is unlikely to succeed. 

A realistic view of code generation for data science

Complex code is not a good match

Cursor, and LLMs in general, are bad at generating technically complex code, e.g. code using "advanced" statistical methods. For example, asking for code to demonstrate random variable convolution can sometimes yield weird and wrong answers. The correctness of the solution depends precisely on the prompt, and the data scientist must closely review the generated code. Given that you need to know the answer and you need to experiment to get the right prompt, the productivity gain of using code generation in these cases is very low or even negative.

It's also worth pointing out that for Python code generation, code gen works very poorly for Pandas dataframe manipulation beyond simple transformations.

Code completion

Code completion is slightly different from code generation and suffers from fewer problems, but it can sometimes yield crazily wrong code.

Data scientists are not software engineers and neither is Cursor

Data scientists focus on building algorithms, not on complete systems. In my experience, data scientists are bad at structuring code (e.g. functional decomposition), a situation made worse by notebooks. Neither Cursor, nor any of its competitors or LLMs, will make up for this shortcoming. 

Refactoring is risky

Sometimes, code needs to be refactored. This means changing variable names, removing unused code, structuring code better, etc. From what I've seen, asking Cursor to do this can introduce serious errors. Although refactoring can be done successfully, it needs careful and limited AI prompting.

"Accept all" will lead to failure

I'm aware of real-world cases where junior staff have blindly accepted all generated code and it hasn't ended well. Bear in mind, generated code can sometimes be very wrong. All generated code (and code completion code) must be reviewed. 

Code generation roll-out recommendations

Run a pilot program first

A successful roll-out will require some experience, but where does this experience come from? There are two possibilities:

  • "Hidden" experience. It's likely that some staff have experimented with AI code gen, even if they're not data scientists. You can co-opt this experience.
  • Running a pilot program. Get a small number of staff to experiment intensively for a short period.

Where possible, I recommend a short pilot program prior to any widespread roll-out. The program should use a small number of staff and run for a month. Here are some guidelines for running a pilot program:

  • Goals:
    • To learn the strengths and weaknesses of agentic AI code generation for data science.
    • To learn enough to train others.
    • To produce a first-pass "rules of engagement".
  • Staff:
    • Use experienced/senior staff only. 
    • Use a small team, five people or fewer.
    • If you can, use people who have experimented with Cursor and/or code generation.
    • Don't use skeptics or people with a negative attitude.
  • Communication:
    • Frequent staff meetings to discuss learnings. Strong meeting leadership to ensure participation and sharing.
    • Slack (or the equivalent) channels.
  • Tasks:
    • Find a way of using agentic IDEs (e.g. Cursor) with notebooks. This is the most important task. The project will fail if you don't get a workable answer.
    • Work out "rules of engagement".
    • Work out how to train others.
  • Duration:
    • Start to end, a month.

If you don't have any in-house experience, how do you "cold start" a pilot program? Here are my suggestions:

  • Go to local meetup.com events and see what others are doing.
  • Find people who have done this elsewhere (LinkedIn!) and pay them for advice.
  • Watch YouTube videos (but be aware, this is a low-productivity exercise).

Don't try to roll out AI code generation blind.

Expectation setting

There are some wild claims about productivity benefits for code generation. In some cases they're true: you really can substantially reduce the time and cost of some projects. But for other projects (especially data science projects) the savings are smaller. Overstating the benefits has several consequences:

  • Loss of credibility with company leadership.
  • Loss of credibility with staff and harm to morale.

You need to have a realistic sense of the impact on your projects. You need to set realistic expectations right from the start.

How can you get that realistic sense? Through a pilot program.

Clear goals and measuring success

All projects need clear goals and some form of success metric. The overall goal here is to increase productivity using code generation while avoiding the implementation issues. Direct measures of success here are hard as few organizations have measures of code productivity and data science projects vary wildly in complexity. Some measures might be:

  • Fraction of code with all functions documented correctly.
  • Fraction of projects with regression tests.
  • High levels of staff usage of agentic AI IDEs.

The ultimate measure is, of course, that projects are developed faster.

At an individual level, metrics might include:

  • Contributions to "rules of engagement".
  • Contributions to Slack channel (or the equivalent).

Initial briefing and on-going communications 


(Canva)

Everyone in the process must have a realistic sense of the benefits and problems of this technology. This includes the staff doing the work, their managers, and all executive and C-level staff.

Here are my suggestions:

  • Written briefing on benefits and problems.
  • Briefing meetings for all stakeholders.
  • Written "rules of engagement" stating how code is to be used and not used. These rules will be updated as the project proceeds.
  • Regular feedback sessions for hands-on participants. These sessions are where people share their experiences.
  • Regular reports to executives on project progress.
  • On-going communications forum. This could be something like a Slack channel.
  • Documentation hub. This is a single known place where users can go to get relevant materials, e.g.
    • Set-up instructions
    • Cursor rules (or the equivalent)
    • "Rules of engagement"

Clear lines of responsibility

Assuming there are multiple people involved in an evaluation or roll-out, we need to define who does what. For this project, this means:

  • One person to act as the (Cursor) rules controller. The quality of generated code depends on rules; if everyone uses wildly different rules, the results will be inconsistent. The rules controller will provide recommended rules that everyone should use. Participants can experiment with rules, but they must keep the controller informed.
  • One person to act as the recommendations controller. As I've explained, there are "dos" and "don'ts" for working with code generation; these are the "rules of engagement". One person should be responsible for continually keeping this document up to date.

Limits on project scope

There are multiple IDEs on the market and there are multiple LLMs that will generate code. Evaluating all of them will take considerable time and be expensive. My recommendation is to choose one IDE (e.g. Cursor, Windsurf, Lovable, or one of the others) and one agentic AI. It's OK to have some experimentation at the boundaries, e.g. experimenting with a different agentic AI, but this needs to be managed - as always, project discipline is important.

Training

(Canva)

Just setting people up and telling them to get started won't work. Almost all data scientists will be unfamiliar with Cursor and the VS Code IDE it's based on. Cursor works differently from other IDEs, and there's little in the way of useful tutorials online. This raises the question: how do you get the expertise to train your team?

The answer is a pilot program as I've explained. This should enable you to bootstrap your initial training needs using in-house experience.

You should record the training so everyone can access it later if they run into trouble. Training must include what not to do, including pointing out failure modes (e.g. blindly accepting generated code); this is where the "rules of engagement" come in.

It may also be worth re-training people partway through the project with the knowledge gained so far.

(Don't forget, data scientists mostly don't use IDEs, so part of your training must cover basic IDE usage.)

Notebook and Cursor working together

This is the core problem for data science. Figuring out a way of using an agentic IDE and a notebook together will be challenging. Here are my recommendations.

  1. Find a way of ensuring the agentic IDE and the notebook can use the same code file. Most notebooks can read in Python files, and there are sometimes ways of preserving cell boundaries in Python (e.g. using the "# %%" format; see the sketch after this list).
  2. Edit the same Python file in Cursor and in the notebook (this may mean refreshing the notebook so it picks up any changes, Cursor seems to pick up changes by itself).
  3. Use Cursor for comments, code completion etc. Use the notebook for live code development and debugging.
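To make point 1 concrete, here's a sketch of what a shared Python file with "# %%" cell markers might look like. Tools that understand the percent format (e.g. Jupyter via Jupytext, or VS Code's interactive window) treat each "# %%" as a cell, while Cursor just sees an ordinary Python file. The file contents and the CSV name are hypothetical:

# %% [markdown]
# Exploratory look at daily sales (hypothetical example)

# %%
import pandas as pd

# Hypothetical input file
sales = pd.read_csv("daily_sales.csv", parse_dates=["date"])

# %%
# Quick sanity check on the data before any modeling
print(sales.describe())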

(Canva)

Precisely how to do this will depend on the exact choice of agentic IDE and notebook.

This process is awkward, but it's the best of the options right now.

(Cursor) Rules

Agentic IDEs rely on a set of rules that guide code generation. These are like settings but expressed in English prose. These rules will help govern the style of the generated code. What these rules are called will vary from IDE to IDE but in Cursor, they're called "Rules".

I suggest you start with a minimal set of Rules, perhaps 10 or so. Here are three to get you started:

"Act as an experienced data scientist creating robust, re-usable, and readable code.

Use the latest Python features, including the walrus operator. Use list comprehensions rather than loops where it makes sense.

Use meaningful variable names. Do not use df as the name of a dataframe variable."

There are several sites online that suggest Rules. Most suggest verbose and long Rules. My experience is that shorter, more concise Rules work better.

Regression tests

As part of the development process, use Cursor to generate test cases for your code, which includes generating test data. This is one of Cursor's superpowers and one of the places where you can see big productivity improvements.

Cursor can occasionally introduce errors into existing code. Part of the "rules of engagement" must be running regression tests periodically or when the IDE has made substantial changes. In traditional development, this is expensive, but agentic IDEs substantially reduce the cost.
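As an illustration, a generated regression test might look something like the pytest sketch below. The function under test (clip_outliers, from the docstring example earlier) and the expected values are hypothetical; the point is the shape of the test, parameterized cases plus generated test data:

import pytest

from mymodule import clip_outliers  # hypothetical module and function


@pytest.mark.parametrize(
    "values, lower, upper, expected",
    [
        ([1.0, 5.0, 10.0], 2.0, 8.0, [2.0, 5.0, 8.0]),  # both tails clipped
        ([3.0, 4.0], 0.0, 10.0, [3.0, 4.0]),            # nothing to clip
        ([], 0.0, 1.0, []),                             # empty input
    ],
)
def test_clip_outliers_regression(values, lower, upper, expected):
    assert clip_outliers(values, lower, upper) == expected


def test_clip_outliers_rejects_bad_bounds():
    with pytest.raises(ValueError):
        clip_outliers([1.0], lower=5.0, upper=1.0)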

GitHub

Cursor integrates with GitHub and you can update Git repositories with a single prompt. However, it can occasionally mess things up. You should have a good set of tactics for GitHub integration, including having an in-house expert who can fix issues should they arise.

"Rules of engagement"

I've referred to this document a number of times. This is a written document that describes how to use code gen AI and how not to use it. Here are the kinds of things it should contain:

"Use code generation via the prompt to create function and code outlines, e.g. specifying that a file will contain 5 functions with a description of what the functions do. Most of the time, it's better to ask the agent to product code stubs. However, if a function is boilerplate, e.g. reading a CSV file into a dataframe, then you can prompt for full code generation for that function.
...
Do not use code generation or code completion for medium to complex dataframe manipulations. You can use it for simple dataframe manipulations. You can use code completion to get a hint, but you shouldn't trust it.
...
Use the prompt to comment your code, but be clear in your prompt that you want comments only and no other changes.
... 

Before running regression tests, prompt the AI to comment your code. 

"

You should periodically update the rules of engagement and make sure users know the rules have changed. As I stated earlier, one person should be responsible for maintaining and updating the rules of engagement.

Conclusions

Successfully rolling out agentic AI code generation to data scientists is not a trivial task. It will require a combination of business and technical savvy. As ever, there are political waters to navigate, both up and down the organization.

There are some key ideas I want to reiterate:

  • Agentic IDEs are not notebooks. You need to find a way of working that combines notebooks and IDEs. Success depends on this.
  • Pilot programs will let you bootstrap a roll-out; without them, you'll find roll-outs difficult to impossible.
  • Training, "rules of engagement", and communication are crucial.

Other resources

I'm in the process of developing a very detailed analysis of using Cursor for data science. This analysis would form the basis of the "rules of engagement". I'm also working on a document similar to this for more traditional software engineering. If you're interested in chatting, contact me on LinkedIn: https://www.linkedin.com/in/mikewoodward/.


Tuesday, May 27, 2025

What is Model Context Protocol?

Bottom line: MCP is an important technology, but as of May 2025, it's not ready for production deployment. It's immature, the documentation is poor, and it doesn't have the security features it needs. Unless your business has a compelling and immediate need for it, wait a while before starting experimentation.

I've been hearing a lot about MCP and how much of a game-changer it is, but there are three problems with most of the articles I've read:

  • They don't explain the what and the how very well.
  • They're either too technical or too high-level.
  • They smell too strongly of hype.

In this blog post, I'm going to dive into the why at a business level and do some of the how at a more technical level. This is going to be a hype free zone. 

(Chat GPT generated)

What problem are we trying to solve?

AI systems need to access data, but data is accessed in a huge number of ways, making it harder for an AI to connect to and use data. MCP is a way of presenting the 'same' interface for all data types.

There are many different data sources, for example: JSON files, CSV files, XML files, text files, different APIs, different database types, and so on. In any computer language, there are different ways of connecting to these data sources. Here are two Python code snippets that illustrate what I mean:

import requests

res = requests.get(
    url="https://www.gutenberg.org/files/132/132-h/132-h.htm",
    timeout=(10, 5)
)

and:

from lxml import etree
...
# Open the XML file and parse it
tree = etree.parse(zip_file_names[0])
...
# Completely parse the first element
root = tree.getroot()
children = root.getchildren()[0].getchildren()

There are a few important points here:
  • You use different Python libraries to access different data sources.
  • The APIs are different.
  • In some cases, the way you use the API is different (e.g. some sources use paging, others don't).

In other words, it can be time-consuming and tricky to read in data from different sources.

This is bad enough if you're a programmer writing code to combine data from different sources, but it's even worse if you're an AI. An AI has to figure out what libraries to use, what data's available, whether or not to use paging, etc. In other words, different data source interfaces make life hard for people and for AIs.

There's a related problem, often called the NxM problem. Let's imagine there are M data sources and N LLMs. Each LLM has to create an interface to each data source, so we get a situation that looks like this:

(Claude generated)

This is a huge amount of duplication (NxM). What's worse, if a data source changes its API (e.g. an AWS API update), we have to change N LLMs. If we could find some way of standardizing the interface to the data sources, we would have one set of code for each LLM (N) and one set of code for each data source (M), transforming this into an N+M problem. For example, 5 LLMs and 20 data sources means 100 bespoke integrations, but only 25 standardized wrappers. In this new world, if a data source API is updated, that just means updating one wrapper. Can we find some way of standardizing the interfaces?

(In the old days, this was a problem for hardware too. Desktop PCs would have a display port, an ethernet port, a printer port, and so on. These have pretty much all been replaced with USB-C ports. Can we do something similar in software?)

Some background

There has been some movement toward consolidating the interfaces to different sources, but it's been very limited. In the Python world, the DB-API standard (PEP 249) lets you connect to most databases using the same interface, but that's about it. Until now, there just hasn't been a strong enough motivation for the community to work out how to provide consistent data access.
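As a concrete illustration of that consistent interface, here's a minimal sketch using the standard library's sqlite3 module; the same connect/cursor/execute/fetch pattern works with any DB-API compliant driver (e.g. psycopg2 for PostgreSQL), only the connect() call changes:

import sqlite3

# The connect() call is database-specific; everything after it follows
# the common DB-API pattern.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.execute("INSERT INTO sales VALUES (?, ?)", ("north", 1250.0))
conn.commit()

cur.execute("SELECT region, amount FROM sales")
print(cur.fetchall())
conn.close()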

I want to go on two slight tangents to explain ideas that are important to MCP. Without these tangents, the choice of name is hard to understand, as are the core ideas.

At the end of the 1970s, Trygve Reenskaug was working at Xerox PARC on UI problems and came up with the Model-View-Controller abstraction. The idea is that a system can be divided into conceptual parts. The Model part represents the business data and the business logic. There's a code interface (API) to the Model that the View and Controller use to access data and get things done.

The Model part of this abstraction corresponds to the data sources we've been talking about, but it generalizes them to include business logic (meaning, doing something like querying a database). This same abstraction is a feature of MCP too. Sadly, there's a naming conflict we have to discuss. Model means data in Model-View-Controller, but it's also part of the name "large language model" (LLM). In MCP, the M is Model, but it means LLM; the data and business logic is called Context. I'm going to use the word Context from now on to avoid confusion.

Let's introduce another key idea to understand MCP, that of the 'translation' or 'interface' layer. This is a well-known concept in software engineering and comes up a number of times. The best known example is the operating system (OS). An OS provides a standardized way of accessing the same functionality on different hardware. The diagram below shows a simple example. Different manufacturers make different disk drives, each with a slightly different way of controlling the drives. The operating system has a translation layer that offers the same set of commands to the user, regardless of who made the disk drive.

(Chat GPT generated)

Languages like Python rely on these translation layers to work on different hardware.

Let's summarize the three key ideas before we get to MCP:

  • There's been very little progress to standardize data access functionality.
  • The term Context refers to the underlying data and functionality related to that data.
  • Translation layer software allows the same operations to work on different machines.

What MCP is

MCP stands for Model Context Protocol. It's a translation layer on top of a data source that provides a consistent way of accessing different data sources and their associated tools. For example, you can access database data and text files data using the same interface.

  • The Model part of the acronym refers to the LLM. This could be Claude, Gemini, GPT, DeepSeek or one of the many other Large Language Models out there.
  • Context refers to the data and the tools to access it.
  • Protocol refers to the communication between the LLM and the data (Context).

Here's a diagram showing the idea.

What's interesting about this architecture is that the MCP translation layer is a server. More on this later.

In MCP terminology, users of the MCP are called Hosts (mostly LLMs and IDEs like Cursor or Windsurf, but it could be something else). Hosts have Clients that are connectors to Servers. A Host can have a number of Clients; it'll have one for each data source (Server) it connects to. A Server connects to a data source and uses the data source's API to collect data and perform tasks. A Server has functions the Client uses to identify the tasks the Server can perform. A Client communicates with the Server using a defined Protocol.

Here's an expanded diagram providing a bit more detail.

I've talked about data sources like XML files etc., but it's important to point out that a data source could be GitHub, Slack, Google Sheets, or indeed any service. Each of these data sources has its own API, and the MCP Server provides a standardized way of using it. Note that the MCP Server could do some compute-intensive tasks too, for example running a time-consuming SQL query on a database.

I'll give you an expanded example of how this all works. Let's say a user asks the LLM (either standalone or in a tool like Cursor) to create a GitHub repo:

  • The Model, via its MCP Client, will ask the MCP Server for a list of capabilities for the GitHub service.
  • The MCP Server knows what it can do, so it will return a list of available actions, including the ability to create a repo.
  • The MCP Client will pass this data to the LLM.
  • Now the Model knows what GitHub actions it can perform, and it can check it can do what the user asked (create a repo).
  • The LLM instructs its MCP Client to create the repo; the Client passes the request to the MCP Server, which formats the request using the GitHub API. GitHub creates the repo and returns a status code to the MCP Server, which informs the Client, which in turn informs the Host.

This is a lot of indirection, but it's needed for the whole stack to work.
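For a flavour of what this looks like in code, here's a rough sketch of the Client side using the MCP Python SDK as I understand it. The tool name ("create_repository") and its arguments are assumptions about what the GitHub MCP Server exposes, and the Docker invocation mirrors the JSON configuration shown later in this post; treat this as an illustration of the discover-then-call pattern, not a definitive implementation:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the GitHub MCP Server locally (it also needs a
    # GITHUB_PERSONAL_ACCESS_TOKEN in its environment).
    server = StdioServerParameters(
        command="docker",
        args=["run", "-i", "--rm", "mcp/github"],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discovery: ask the Server what it can do.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Action: ask the Server to create a repo. The tool name and
            # arguments here are assumptions about the GitHub Server's API.
            result = await session.call_tool(
                "create_repository", arguments={"name": "demo-repo"}
            )
            print(result)


asyncio.run(main())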

This page: https://modelcontextprotocol.io/docs/concepts/architecture explains how the stack works in more detail.

How it works

How to set up the Host and Client

To understand the Host and Client setup, you need to understand that MCP is a communications standard (the Protocol part of the name). This means we only have to tell the Client a small amount of information about the Server, most importantly its location. Once it knows where the Server is, it can talk to it.

In Cursor (a Host), there's an MCP setting where we can tell Cursor about the MCP Servers we want to connect to. Here's the JSON to connect to the Github MCP Server:

{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e",
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "mcp/github"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<YOUR_TOKEN>"
      }
    }
  }
}

In this example, the line "mcp/github" is the location of the GitHub MCP server.

Setup is similar for LLMs, for example, the Claude desktop. 

I'm not going to explain the above code in detail (you should look here for details of how the Client works). You should note a couple of things:

  • It's very short and terse.
  • It has some built-in security (the Personal Access Token).

How to set up the MCP Server

MCP Servers have several core concepts:
  • Resources. They expose data to your Host (e.g. the LLM) and are intended for light-weight and quick queries that don't have side effects, e.g. a simple data retrieval.
  • Tools. They let the Host tell the Server to take an action. They can be computationally expensive and can have side effects.
  • Prompts. These are templates that standardize common interactions.
  • Roots and Sampling. These are more advanced and I'm not going to discuss them here.

These are implemented in code using Python function decorators, a more advanced Python feature.

Regardless of whether it's Prompts, Tools, or Resources, the Client has to discover them, meaning, it has to know what functionality is available. This is done using discovery functions called list_resources, list_prompts, and of course list_tools. So the Client calls the discovery functions to find out what's available and then calls the appropriate functions when it needs to do something. 

Resources

Here are two examples of resource functions. The first lets the Client find out what resources are available, which in this case is a single resource, the application log. The second is how the Client accesses the application log contents.

@app.list_resources()
async def list_resources() -> list[types.Resource]:
    return [
        types.Resource(
            uri="file:///logs/app.log",
            name="Application Logs",
            mimeType="text/plain"
        )
    ]

@app.read_resource()
async def read_resource(uri: AnyUrl) -> str:
    if str(uri) == "file:///logs/app.log":
        log_contents = await read_log_file()
        return log_contents

    raise ValueError("Resource not found")

Note the use of async and the decorator.  The async allows us to write efficient code for tasks that may take some time to complete.

Tools

Here are two example tool functions. As you might expect by now, the first lets the Client discover which tools it can call.

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="calculate_sum",
            description="Add two numbers together",
            inputSchema={
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"}
                },
                "required": ["a", "b"]
            }
        )
    ]

The second is a function the Client can call once it has discovered it.

@mcp.tool()
async def fetch_weather(city: str) -> str:
    """Fetch current weather for a city"""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.text

Here, the code is calling out to an external API to retrieve the weather for a city. Because the external API might take some time, the code uses await and async. This is a tool rather than a resource because it may take some time to complete.

Prompts

This is a longer code snippet to give you the idea. The list_prompts function is key: this is how the Client finds out the available prompts.

PROMPTS = {
    "git-commit": types.Prompt(
        name="git-commit",
        description="Generate a Git commit message",
        arguments=[
            types.PromptArgument(
                name="changes",
                description="Git diff or description of changes",
                required=True
            )
        ],
    ),
    "explain-code": types.Prompt(
        name="explain-code",
        description="Explain how code works",
        arguments=[
            types.PromptArgument(
                name="code",
                description="Code to explain",
                required=True
            ),
            types.PromptArgument(
                name="language",
                description="Programming language",
                required=False
            )
        ],
    )
}
...
@app.list_prompts()
async def list_prompts() -> list[types.Prompt]:
    return list(PROMPTS.values())
...

@app.get_prompt()
async def get_prompt(
    name: str, arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
    if name not in PROMPTS:
        raise ValueError(f"Prompt not found: {name}")

    if name == "git-commit":
        changes = arguments.get("changes") if arguments else ""
        return types.GetPromptResult(
            messages=[
                types.PromptMessage(
                    role="user",
                    content=types.TextContent(
                        type="text",
                        text=f"Generate a concise but descriptive commit message "
                             f"for these changes:\n\n{changes}"
                    )
                )
            ]
        )

You can read more about how prompts work in the documentation: https://modelcontextprotocol.io/docs/concepts/prompts#python

Messages everywhere

The whole chain of indirection relies on JSON message passing between code running in different processes. This can be difficult to debug. You can read more about MCP's message passing here: https://modelcontextprotocol.io/docs/concepts/transports
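To give you a flavour, MCP messages are JSON-RPC 2.0, so a Client asking a Server to list its tools sends something roughly like this (a sketch; the exact fields can vary with protocol version):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/list",
  "params": {}
}

The Server's reply carries the same "id", which is how responses are matched to requests when many messages are in flight.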

Documents, tutorials, and YouTube

At the time of writing (May 2025), the documentation for MCP is very sparse and lacks a lot of detail. There are a few tutorials people have written, but they're quite basic and again lack detail. What this means is, you're likely to run into issues that may take time to resolve.

There are videos on YouTube, but most of them have little technical content and seem to be hyping the technology rather than offering a thoughtful critique or a guide to implementation. Frankly, don't bother with them.

Skills needed

This is something I've hinted at in this blog post, but I'm going to say it explicitly. The skill level needed to implement a non-trivial MCP is high. Here's why:

  • The default setup process involves using uv rather than the usual pip.
  • The MCP API makes extensive use of function decorators, an advanced Python feature.
  • The Tools API uses async and await, again more advanced features.
  • Debugging can be hard because MCP relies on message passing.

The engineer needs to know about function decorators, asynchronous Python, and message passing between processes.

Where did MCP come from?

MCP was released by Anthropic in November 2024. After a "slowish" start, it's been widely adopted and has now become the dominant standard. Anthropic have open-sourced the entire protocol and placed it on GitHub. Frankly, I don't see anything usurping it in the short term.

Security and cost

This is a major concern. Let's go back to this diagram:

There could be three separate companies involved in this process:

  • The company that wants to use the LLM and MCP, we'll call this the User company.
  • The company that hosts the LLM, we'll call this the LLM company.
  • The company that hosts the data source, we'll call this the Data company.

The User company starts a job that uses an LLM in the LLM company. The job uses computationally expensive (and financially costly) resources located at the Data company. Let's say something goes wrong, or the LLM misunderstands something. The LLM could make multiple expensive calls to the data source through the MCP Server, racking up large bills. Are there ways to stop this? Yes, but it takes some effort.

The other concern is a hacked remote LLM. Remember, the LLM has the keys to the kingdom for your system, so hackers really could go to town, perhaps making rogue calls to burn up expensive computing resources or even writing malicious data.

There are a number of other concerns that you can read more about here: https://www.pillar.security/blog/the-security-risks-of-model-context-protocol-mcp and here: https://community.cisco.com/t5/security-blogs/ai-model-context-protocol-mcp-and-security/ba-p/5274394

The bottom line is, if you're running something unattended, you need to put guard rails around it.

Complexity - everything is a server?

As I've stated, this is a very complex beast under the hood. The LLM will run in its own process, the MCP Server will run in its own process, and maybe the underlying data sources will too (e.g. a web-based resource or a database). If any of these processes fail, the whole system fails, and the developers have to work out which of these servers failed first. Inter-process communication is harder than simple procedure calls, which means debugging is too.

All of the examples I've seen on the web have been relatively simple. I'm left wondering how complex it would be to develop a robust system with full debugging for something like a large-scale database. I'm not sure I want to be first to find out.

How can I get started?

I couldn't find tutorials or articles that are good enough for me to recommend. That of itself is telling.

Where we stand today

MCP was released in November 2024 and it's still an immature standard. 

  • Security in particular is not where it needs to be; you need to put guard rails up. 
  • Documentation is also sorely lacking and there are very few good tutorials out there. 
  • Debugging can be very hard; the message-passing infrastructure is more difficult to work with than a simple call stack.

Sadly, the hype machine has really got going and you would think that MCP is ready for prime time and immediate deployment - it's not. This is definitely an over-hyped technology for where we are now.

Should you experiment with MCP? Only if you have a specific reason to, and then only with supervision and risk management. If you have the right use case, this is a very compelling technology with a lot of promise for the future.

Monday, May 19, 2025

What is a random variable?

Just because we can't predict something exactly doesn't mean we can't say anything about it at all

There are all kinds of problems where we can't say exactly what the value of something is, but we can still say useful things about it. Here are some examples.

  • The number of goals scored in a football or hockey match.  We might not be able to predict the number of goals scored in a particular match, but we can say something:
    • We know that the number of goals must be an integer greater than or equal to 0.
    • We know that the number of goals is likely to be low and that high scores are unlikely; seeing two goals is far more likely than seeing 100 goals.
  • The number of people buying tickets at a movie theater. We know this will depend on the time of year, the day of the week, the weather, the movies playing, etc., but even allowing for these factors, there's randomness. People might go on dates (or cancel them) or decide on a whim to see a movie. In this case, we know the minimum number of tickets is zero, the maximum is the number of seats, and that only an integer number of tickets can be sold.
  • The speed of a car on the freeway. Plainly, this is affected by a number of factors, but there's also randomness at play. We know the speed will be a real number greater than zero. We know that in the absence of traffic, it's more likely the car will be traveling at the speed limit than at, say, 20 mph.
  • The score you get by rolling a dice.
(Dietmar Rabich / Wikimedia Commons / “Würfel, gemischt -- 2021 -- 5577” / CC BY-SA 4.0)

In all these cases, we're trying to measure something, but randomness is at play, which means we can't predict an exact result, but we can still make probabilistic predictions. We can also do math with these predictions, which means we can use them to build computer models and make predictions about how a system might behave.

The variables we're trying to measure are called random variables and I'm going to describe what they are in this blog post. I'm going to start by providing some background ideas we'll need to understand, then I'm going to show you why random variables are useful.

What is a mathematical function?

Functions are going to be important to this story, so bear with me.

In math, a function is some operation where you give it some input and it produces some output. The classic examples you may remember are the trigonometric functions like \(sin(x)\), \(cos(x)\), and \(tan(x)\). A function could have several inputs, for example, this is a function: \(z = a_0 + a_1 x + a_2 y^3\).

Functions are very common in math, so much so that it can be a little hard to spot them, as we'll see.

Describing randomness - distributions

A probability distribution is a math function that tells you how likely each outcome of a process is. For example, a traffic light can be red, yellow, or green. How likely is it that the next traffic light I come to will be red, yellow, or green? It must be one of them, so the probabilities must sum to one, but we know that yellow is shorter than red or green, so yellow is less likely. We can reason similarly about the relative likelihood of red and green.

Probability distributions can get very complicated, but many of them follow well-known patterns. For example, when rolling an unbiased dice, the probability distribution is a discrete uniform distribution that looks like this:

The number of goals scored in a hockey or football match is known to be well-modeled by a (discrete) Poisson distribution that looks like this:

Male (or female) heights are well-modeled by a (continuous) normal distribution that looks like this:

There are hundreds of known distributions, but in practice, only a few are "popular".
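If you want to play with the three distributions I've just mentioned, here's a minimal scipy sketch. The Poisson mean of 2.5 goals and the normal parameters (mean 175 cm, standard deviation 7 cm) are illustrative assumptions, not fitted values:

import numpy as np
from scipy import stats

# Discrete uniform: a fair six-sided dice
dice = stats.randint(1, 7)
print(dice.pmf(np.arange(1, 7)))          # each face has probability 1/6

# Poisson: goals per match, assuming a mean of 2.5 (illustrative)
goals = stats.poisson(2.5)
print(goals.pmf(np.arange(0, 8)))

# Normal: heights in cm, assuming mean 175 and standard deviation 7 (illustrative)
heights = stats.norm(175, 7)
print(heights.pdf(np.array([160.0, 175.0, 190.0])))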

Discrete or continuous

There are two types of measurements we typically take: continuous and discrete.

Discrete measurements are things that come in discrete chunks, for example, the number of sheep in a flock, the number of goals in a match, the number of people in a movie theater, and so on. Categorical variables are "sort of" discrete, for example the colors of a traffic light, though they are a special case.

Continuous measurements are things that can take any value (including any number of digits after the decimal point). For example, the speed of a car on the freeway could be 72.15609... mph, someone's height might be 183.876... cm and so on. 

This seems clear, but sometimes we muddy the waters a bit. Let's say we're measuring height and we measure in whole cm. This transforms the measurement from a continuous one to a discrete one.

There are two types of probability distribution: continuous and discrete. We use continuous distributions for continuous quantities and discrete for discrete quantities. You should note that in the real world, it's often not this simple.

Random variables

A random variable is a math function the output of which depends on some random process. The values of the random variable follow a probability distribution. Here are some examples of observations that we can describe using random variables:

  • the lifetime of a lightbulb
  • goals scored
  • the result of rolling a dice
  • the speed of cars on a freeway
  • the height of a person
  • sales revenue

Dice are easy to understand, so I'll use them as an example. We don't know what the result of throwing a dice will be, but we know the probability distribution is uniform discrete, so the probability of throwing a 1 is \(\dfrac{1}{6}\), the probability of throwing a 2 is \(\dfrac{1}{6}\), and so on. Let's say we're gambling on dice, betting $1 and winning $6 if our number comes up. Using random variable math, we can work out what our gain or loss might be. In the dice example, it's trivial, but in other cases it gets harder and we need more advanced math.
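To make that concrete, read the bet as: you pay $1 to play and receive $6 when your number comes up, so the net gain is $5 on a win and -$1 otherwise. The expected net gain is then:

\[E[\text{gain}] = \frac{1}{6}(5) + \frac{5}{6}(-1) = 0\]

In other words, under this reading the bet is exactly fair; a payout of $7 would tip the expectation in your favor.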

Random variables have a set of all possible results, which can be finite or infinite, called the sample space. The sample space is a set denoted by \(\Omega\). For the dice example, the sample space is simply:

\[\Omega = \{1,2,3,4,5,6\}\]

For a continuous quantity, like the lifetime of a bulb:

\[\Omega = \{x \mid x \in \mathbb{R}, x \ge 0 \} \]

which means an infinite sample space. 

Infinite sample spaces, or large discrete sample spaces, mean we can't work things out by hand; we need more powerful math to do anything useful, and that's where things get hard.

A measurement (or observation) is the process of selecting a value from the sample space. Remember, the random variable has a probability distribution that tells you how likely different values are to be selected. 

Arithmetic with random variables - doing something useful

In this section and the next, I'll start to show you some interesting things you can do with random variables. To illustrate a key idea, we'll use a simple example. We'll work out the probability distribution for the combined scores we get by throwing two unbiased dice. 

We know the distribution is uniform for both dice, so we could work it out by hand like this:

Table 1: combining the scores of two dice

Dice 1 | Dice 2 | Combined score | Probability
1      | 1      | 2              | \(\dfrac{1}{36}\)
1      | 2      | 3              | \(\dfrac{1}{36}\)
1      | 3      | 4              | \(\dfrac{1}{36}\)
...    |        |                |
2      | 1      | 3              | \(\dfrac{1}{36}\)
2      | 2      | 4              | \(\dfrac{1}{36}\)
2      | 3      | 5              | \(\dfrac{1}{36}\)
...    |        |                |

The next step is adding up the probabilities of the combined scores:

  • there's only one way of getting 2, so its probability is \(\dfrac{1}{36}\)
  • there are two ways of getting 3, so its probability is \(\dfrac{1}{36} + \dfrac{1}{36}\)
  • ...

This is really tedious, and it would obviously be hugely expensive for a large sample space. There's a much faster way, which I'm going to show you.

To add two random variables, we use a process called convolution. Loosely speaking, we consider every pair of outcomes (one from each random variable), multiply their probabilities, and sum the products for each possible combined value. Mathematically, it looks like this for discrete random variables, where \(f\) is the distribution for the first dice and \(g\) the distribution for the second dice:

\[(f * g)[n] = \sum_{m=-M}^{M}{f[n-m] \, g[m]}\]

In Python, we need to do it in two stages: work out the sample space and work out the probabilities. Here's some code to do it for two dice.

import numpy as np

# Sample spaces (1-6) and uniform probabilities for each dice
score1, score2 = np.arange(1, 7), np.arange(1, 7)
prob1, prob2 = np.ones(6) / 6, np.ones(6) / 6

# Combined sample space (2-12) and the convolved probabilities
combo_score = list(range(score1[0] + score2[0], score1[-1] + score2[-1] + 1))
combo_prob = np.convolve(prob1, prob2)

print(combo_score)
print(combo_prob)

This is easy to do by hand for two dice, but not when the data sets get a lot bigger; that's when we need computers.

The discrete case is easy enough, but the continuous case is harder and the math is more advanced. Let's take an example to make things more concrete. Let's imagine a company with two sales areas. An analyst is modeling them as continuous random variables. How do we work out the total sales? The answer is the continuous convolution of the two sales distributions, which looks like this:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) \,d\tau\]

This is obviously a lot more complicated. It's so complicated, I'm going to spend a little time explaining how to do it.

Broadly speaking, there are three approaches to continuous convolution: special cases, symbolic calculation, and discrete approximations.

In a handful of cases, convolving two continuous random variables has known answers. For example, convolving normal distributions gives a normal distribution and convolving uniform distributions gives an Irwin-Hall distribution.

In almost all cases, it's possible to do a symbolic calculation using integration. You might think that something like SymPy could do it, but in practice, you need to do it by hand. Obviously, you need to be good at calculus. There are several textbooks that have some examples of the process and there are a number of discussions on StackOverflow. From what I've seen, college courses in advanced probability theory seem to have course questions on convolving random variables with different distributions and students have asked for help with them online. This should give you an inkling of the level of difficulty.

The final approach is to use discrete approximations to continuous functions and use discrete convolution. This tends to be the default in most cases.
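Here's a minimal sketch of that discrete-approximation approach, using two normal distributions chosen so we can check the result against the known answer (the means and standard deviations are made up for illustration):

import numpy as np
from scipy import stats

# Put both distributions on the same grid and convert densities to
# approximate probability masses per grid cell.
step = 0.1
x = np.arange(-50.0, 50.0, step)
f = stats.norm(10, 3).pdf(x) * step
g = stats.norm(20, 4).pdf(x) * step

# Discrete convolution approximates the continuous convolution.
h = np.convolve(f, g)
support = 2 * x[0] + step * np.arange(len(h))

# Known result: the sum of N(10, 3^2) and N(20, 4^2) is N(30, 5^2).
exact = stats.norm(30, 5).pdf(support) * step
print(np.max(np.abs(h - exact)))   # small approximation error
print((support * h).sum())         # mean of the sum, close to 30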

Worked example with random variables: predicting revenue and net income

Let's say we want to model the total sales revenue (\(t\)) from several regions (\(s_0, s_1, ...s_n\)) that are independent. We also have a model of expenses for the company as a whole (\(e\)). How can we model total revenue and net income?

Let's assume the sales revenue in each region is modeled by random variables, each having a normal distribution. We have mean values \(\mu_0, \mu_1, ..., \mu_n\) and standard deviations \(\sigma_0, \sigma_1, ..., \sigma_n\). To get total sales, we have to do a convolution:

\[t = s_0 * s_1 * ... * s_n\]

This sounds complicated, but for the normal distribution there's a short-cut. Convolving a normal with a normal gives a normal: all we have to do is add the means and the variances. So total sales follows a normal distribution with mean and variance:

\[\mu = \sum_{i=0}^{n}\mu_i\]

\[\sigma^2 = \sum_{i=0}^{n}\sigma_{i}^{2}\]
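Here's a minimal numpy sketch of that shortcut, using three made-up regions, with a simulation as a sanity check:

import numpy as np

# Hypothetical per-region revenue models: means and standard deviations
mu = np.array([12.0, 8.5, 20.0])
sigma = np.array([2.0, 1.5, 3.0])

# Total revenue is normal with summed means and summed variances
total_mu = mu.sum()
total_sigma = np.sqrt((sigma ** 2).sum())
print(total_mu, total_sigma)

# Sanity check by simulation
rng = np.random.default_rng(42)
samples = rng.normal(mu, sigma, size=(1_000_000, 3)).sum(axis=1)
print(samples.mean(), samples.std())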

Getting net income is a tiny bit harder. If you remember your accountancy textbooks, net income \(ni\) is:

\[ni = t - e\]

If expenses are modeled by the normal distribution, the answer here is just a variation of the process I used for combining sales. But what if expenses are modeled by some other distribution? That's where things get tough. 

Combining random variables with different probability distributions is hard. There's no good inventory of known solutions that I could find on the web. You can do the symbolic calculation by hand, but that requires a good grasp of calculus. You might think that something like SymPy would work, but at the time of writing, SymPy doesn't have a good way of doing it. The final way of doing it is using a discrete approximation, but that's time-consuming to do. Bottom line: there's no easy solution if the distributions aren't all normal or aren't all uniform.

Division and multiplication with random variables

Most problems using random variables seem to boil down to adding them. If you need to multiply or divide random variables, there are ways to do it. The book "The Probability Lifesaver" by Steven J. Miller explains how.

Minimum, maximum, and expected values

I said that convolving random variables can be very hard, but getting some values is pretty straightforward.

The maximum possible value of the sum of two random variables \(f\) and \(g\) is simply \(max(f) + max(g)\).

The minimum possible value of the sum is simply \(min(f) + min(g)\).

What about the mean? It turns out that getting the mean is easy too. The mean value of a random variable is often called the expectation value and is the result of a function called \(E\), so the mean of a random variable \(X\) is \(E(X)\). The formula for the mean of the sum of two random variables is:

\[E(X + Y) = E(X) + E(Y)\]

In simple words, we add the means. 

Note I didn't say what the underlying distributions were. That's because it doesn't matter.

What if we apply some function to a random variable? It turns out, you can calculate the mean of a function of a random variable fairly easily and the arithmetic for combining multiple means is well known. There are pages on Wikipedia that will show you how to do it (in general, search for "linear combinations of expectation values" to get started).

Bringing it all together

There are a host of business and technical problems where we can't give a precise answer, but we can model the distribution of answers using random variables. There's a ton of theory surrounding the properties and uses of random variables, but it does get hard. By combining random variables, we can build models of more complicated systems; for example, we could forecast the range of net incomes for a company for a year. In some cases (e.g. normal distributions), combining random variables is easy; in other cases, it takes us into the world of calculus or discrete approximations.

Yes, random variables are hard, but they're very powerful.

Wednesday, May 14, 2025

You need to use Manus

What is Manus - agentic AI

Manus is an AI agent capable of performing a number of high-level tasks that previously could only be done by humans. For example, it can research an area (e.g. a machine learning method) and produce an intelligible report, it can even turn a report into an interactive website. You can get started on it for free.

It created a huge fuss on its release, and rightly so. The capabilities it offers are ground-breaking. We're now a few months later and it's got even better.

In this blog post, I'm going to provide you with some definitions, show you what Manus can do, give you some warnings, and provide you with some next steps.

If you want to get an invitation to Manus, contact me.

How it works 

We need some definitions here. 

An LLM (Large Language Model) is a huge computer model that's been trained on large bodies of text. That could be human language (e.g. English, Chinese) or it could be computer code (e.g. Python, JavaScript). An LLM can do things like:

  • extract meaning from text e.g. given a news article on a football match, it can tell you the score, who won, who lost, and other details from the text
  • predict the next word in a sentence or the next sentence in a paragraph
  • produce entire "works", for example, you can ask an LLM to write a play on a given theme.

An agent is an LLM that controls other LLMs without human intervention. For example, you might set it the task of building a user interface using react.js. The agent will interpret your task and break it down into several subtasks. It will then ask LLMs to build code for each subtask and stitch the code together. More importantly for this blog post, you can use an agent to build a report for you on a topic. The agent will break down your request into chunks, assign those chunks to LLMs, and build an answer for you. An example topic might be "build me a report on what to do during a 10 day vacation in Brazil".

Manus is an agentic AI. It will split your request into chunks, assign those chunks to LLMs (it could be the same LLM or it could be different ones depending on the task), and combine the results into a report.

An example

I gave the following instructions to Manus:

You are an experienced technical professional. You will write a report explaining how logistic regression works for your colleagues. Your report will be a Word document. Your report will include the following sections:

* Why logistic regression is important.

* The theory and math behind it.

* A worked example. This will include code in Python using the appropriate libraries.

You will include the various math formula using the correct notation. You will provide references where appropriate.

Here's how it got started:


After it started, I realized I needed to modify my instructions; here's the dialog:

It incorporated my request and did add more sections.

Here's an example of how it kept me updated:

After 20 minutes, it produced a report in Word format. After reading the report, I realized I wanted to turn it into a blog post, so I asked Manus to give me the report as a HTML document, which it did. 

I've posted the report as a blog post and you can read it here: https://blog.engora.com/2025/05/the-importance-of-logistic-regression.html

A critique of the Manus report

I'm familiar with logistic regression, so I can critique what Manus returned. I'd give it a B+. This may sound a bit harsh, but that's a very credible result for 20 minutes of effort. It's enough to get going with, but it's not enough on its own. Here's my assessment.

  • Writing style and use of English. Great. Better than most native English speakers.
  • Report organization. Great. Very clear and concise. Nicely formatted.
  • Technical correctness. I couldn't spot anything wrong with what it produced. It did miss some important material, though, and had some oddities:
    • Logistic regression with more than two target variables, no mention of it.
    • The odds ratio can vary from 0 to +\(\infty\), but it didn't mention this. That's curious, because it pointed out that linear regression can vary from -\(\infty\) to +\(\infty\) in the prior paragraphs.
    • Too terse a description of the sigmoid function. It should have included a chart and a deeper discussion of the function's relevant properties.
    • No meaningful discussion of decision boundaries (one mention in not enough detail).
  • Formulas. A curious mixed bag. In some cases it gave very good formulas using the standard symbols, and in other cases it gave code-like formulas. This might be because I told it I wanted a Word report. By default, it uses markdown, and it may be better to keep the report in markdown. It might be worth experimenting with telling it to use LaTeX for formulas.
  • Code. Great.
  • References. Not great. No links back to the several online books that talk about logistic regression in some detail. No links to academic papers. The references it did provide were kind of OK, but really not enough and overall, not high quality enough.

To fix some of these issues, I could have tweaked my prompt, for example, telling it to use academic references, or giving it instructions to expand certain areas etc. This would cost more tokens. I could have told it to use high-effort reasoning which would also have cost me more tokens. 

Tokens in AI

Computation isn't free and that's especially true of AI. Manus, in common with many other AI services, uses a "token" model. This report cost me 511 tokens. Manus gives you a certain number of tokens for free, which is enough for experimentation but not enough for commercial use.

What's been written about it

Other people have written about Manus too. Here are some reviews:

Who owns Manus

Manus is owned by a Chinese company called Monica (also known as Butterfly Effect AI) based in Wuhan.

Some cautions

As with any LLM or agentic AI, I suggest that you do not share company confidential information or PII. This includes data, but also includes text. Some LLMs/agents will use any data (including text) you supply to help train their models. This might be OK, but it also might not be OK - proceed with caution.

Before you use any agentic AI or an LLM for "production" use, I suggest a legal and risk review.

  • What does their system do with the data you send it? Does it retain the data, does it train the model? Is it resold?
  • What does their system do with the output (e.g. final report, generated code)? 
  • Can you ask for your data to be removed from their model or system?

What this means - next steps

These types of agentic AI are game-changers. They will get you information you need far faster and far cheaper than a human could do it. The information isn't perfect and perhaps you wouldn't give it an A, but it's more than good enough to get started and frankly, most humans don't produce A work.

If you're involved in any kind of knowledge work, you should be experimenting with Manus and its competitors. This technology has obvious implications for employment and if you think you might be affected, it behoves you to understand what's going on.

If you want to get started, reach out to me to get an invitation to Manus and get extra free tokens.