
Wednesday, June 25, 2025

AI networking in the Boston area

A lot's happening in Boston - where should I find out more?

There's a lot of AI work going on in the Boston area covering the whole spectrum: foundational model development, new AI applications, corporates building AI-powered apps, entrepreneurs creating new businesses, and students building prototypes in 12 hours. Pretty much every night of the week there's a group you can go to to find out more, but not all of them are created equal. I've been to a lot of them, and here are my recommendations for the best ones that meet on a regular basis. The list is alphabetical.

(Google Gemini)

AI Tinkerers

What it is

Monthly meeting where participants show the AI projects they've been working on. Mostly, but not exclusively, presentations from the Sundai Club (Harvard and MIT weekly hackathons). Attendance is over 150.

Commentary

This is where I go when I want to see what's possible and find out more about the cutting edge. It's where I found out what tools like Cursor can really do. There are a number of VCs in attendance watching for anything interesting.

How often it meets

Once a month at Microsoft NERD.

Positives

You get to see what the cutting edge is really like.

Negatives

I found networking at this event less useful than some of the other events.

How to join

https://boston.aitinkerers.org/

AI Woodstock

What it is 

A networking event for people interested in AI. It attracts practitioners, some VCs, recruiters, academics, and entrepreneurs. Attendee numbers vary, but typically over 100.

Commentary

This is networking only, there are no presentations or speakers of any kind. You turn up to the venue and introduce yourself to other people, and get talking. I've met people who are starting companies, people who are working on side gigs, and people who are working in AI for large companies. 

The quality is high; I've learned a lot about what's going on and what companies in the Boston area are doing. 

The venue is both good and bad. It's held in a corner of the Time Out Market near Fenway Park. This is a large space with lots of food and drink vendors, and it attracts the bright young things of the Boston area who go there to eat and drink after work. AI Woodstock doesn't take over the whole space or rope off a portion of it; attendees are only identified by name badges. This means you can be chatting away to someone about their AI-enabled app while someone else walks past with their drink on the way to meet their friends. The background noise level can be really high at times.

How often it meets 

Once a month at the Time Out Market near Fenway Park.

Positives

  • Networking. This is one of the best places to meet people who are active in AI in Boston.
  • Venue. It's nice to meet somewhere that's not Cambridge and the food and drink offerings are great.

Negatives

  • Venue. The noise level can get high and it can get quite crowded. The mix of bright young things out to have a good time and AI people is a bit odd.

How to join

https://www.meetup.com/ai-woodstock/ - choose Boston

Boston Generative AI Meetup

What it is

This is a combination of networking and panel session. During the networking, I've met VCs, solo entrepreneurs, AI staff at large companies, academics, and more. Attendance varies, but typically over 200.

Commentary

This is held in Microsoft NERD in Cambridge and it's the only event in the space. This means it starts a bit later and has to finish on time. 

Quality is very high and I've met a lot of interesting people. I met someone who showed me an app they'd developed and told me how they'd done it, which was impressive and informative.

The panel sessions have been a mixed bag: it's interesting to see people speak, and I picked up a lot of useful information, but the panel topics themselves were just so-so for me. Frankly, what the panelists said was more useful than the topics they were asked to speak on.

How often it meets

About once a month.

Positives

  • Networking. 
  • Venue.
  • Information. The panels have mentioned things I found really useful.

Negatives

  • Panel session topics were a bit blah.

How to join

https://www.meetup.com/boston-generative-ai-meetup/

PyData Boston

What it is

Presentations plus networking. This is almost all machine learning/data science/AI practitioners in the Boston area (no VCs, no business people, instead there are academics and engineers). The presentations are mostly on technical topics, e.g. JAX. Attendance varies, but usually 50-100.

Commentary

I've learned more technical content from this group than any other. The presentations are in-depth and assume a reasonable background in Python or data science.

How often it meets

Once a month, usually at the Moderna building in Cambridge.

Positives

  • Best technical event. In-depth presentations have helped educate me and point out areas where I need to learn more. Conversations have been (technically) informative.
  • Probably the friendliest group of all of them.

Negatives

  • No entrepreneurs, no VCs, no executive management.

How to join

https://www.meetup.com/pydata-boston-cambridge/

Common problems

There's a refrain I've heard from almost all event organizers: the problem of no-shows. The no-show rate is typically 40% or so, which is hugely frustrating as there's often a waiting list of attendees. Some of these events have instituted a sign-in policy: if you don't turn up and sign in, you can't attend future events. I can see more events doing this in future. If you sign up, go.

One-off events

As well as these monthly events, there are also one-off events that happen sporadically. Obviously, I can't review them here, but I will say this: the quality is mostly very high, though it is variable.

What's missing

I'm surprised by what I'm not hearing at these events: implementation stories from existing ("mature") companies. Through private channels, I'm hearing that the failure rate for AI projects can be quite high, but by contrast I've also been told that insurance companies are embracing AI for customer-facing work and getting great results. I've met developers working on AI-enabled apps for insurance companies, and they tell me their projects have management buy-in and are being rolled out.

I'd love to hear someone from one of these large companies get up and speak about what they did to encourage success and the roadblocks on the way. In other words, I'd like to see something like "Strategies and tactics for successful AI projects" run by people who've done it.

Your thoughts

I've surely left groups off this list. If you know of a good group, please let me know, either through LinkedIn or by commenting on this post.

Logistic regression - a simple briefing

A briefing on logistic regression

I've been looking again at logistic regression and going over some of the theory behind it. In a previous blog post, I talked about how I used Manus to get a report on logistic regression and I showed what Manus gave me. I thought it was good, B+, but not great, and I had some criticisms of what Manus produced. The obvious challenge is, could I do better? This blog post is my attempt to explain logistic regression better than Manus.

What problems are we trying to solve?

There's a huge class of problems where we're trying to predict a binary result; here are some examples:

  • The results of a referendum, e.g., whether or not to remain in or leave the EU.
  • Whether to give drug A or drug B to a patient with a condition.
  • Which team will win the World Cup or Super Bowl or World Series.
  • Is this transaction fraudulent?

Typically, we'll have a bunch of different data we can use to base our prediction model on. For example, for a drug choice, we may have age, gender, weight, smoker or not, and so on. These are called features. Corresponding to this feature data set, we'll have a set of outcomes (also called labels); for the drug case, the label for each patient would be their outcome on the drug they were given, which aggregates to something like a% survival on drug A versus b% on drug B. This makes logistic regression a supervised machine learning method.

In this blog post, I’ll show you how you can turn feature data into binary classification predictions using logistic regression. I’ll also show you how you can extend logistic regression beyond binary classification problems.

Before we dive into logistic regression, I need to define some concepts.

What are the odds?

Logistic regression relies on the odds or the odds ratio, so I’m going to define what it is using an example.

For two different drug treatments, we have different rates of survival. Here's a table adapted from [1] that shows the probability of survival for a fictitious study.

           Standard treatment   New treatment   Totals
Died       152 (38%)            17 (14%)        169
Survived   248 (62%)            103 (86%)       351
Totals     400 (100%)           120 (100%)      520

Plainly, the new treatment is much better. But how much better?

In statistics, we define the odds as being the ratio of the probability of something happening to it not happening:

\[odds = \dfrac{p}{1 - p}\]

So, if there’s a 70% chance of something happening, the odds of it happening are 2.333. Probabilities can range from 0 to 1 (or 0% to 100%), whereas odds can range from 0 to infinity. Here’s the table above recast in terms of odds.

           Standard treatment   New treatment
Died       0.613                0.165
Survived   1.632                6.059

The odds ratio tells us how much more likely an outcome is. A couple of examples should make this clearer. 

The odds ratio for death with the standard treatment compared to the new is:

\[odds \: ratio = \dfrac{0.613}{0.165} = 3.71...\]

This means the odds of dying are 3.71 times higher for a patient given the standard treatment than for one given the new treatment.

More hopefully, the odds ratio for survival with the new treatment compared to the old is:

\[odds \: ratio = \dfrac{6.059}{1.632} = 3.71...\]

Unfortunately, most of the websites out there are a bit sloppy with their definitions. Many of them conflate “odds” and “odds ratio”. You should be aware that they’re two different things:

  • The odds is the probability of something happening divided by the probability of it not happening.
  • The odds ratio compares the odds of an event in one group to the odds of the same event in another group.

The odds are going to be important for logistic regression.
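
If you'd like to check the arithmetic, here's a tiny Python sketch using the numbers from the tables above:

# Counts from the fictitious study above
standard = {"died": 152, "survived": 248}
new = {"died": 17, "survived": 103}

def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)

# Probability and odds of dying on each treatment
p_died_standard = standard["died"] / (standard["died"] + standard["survived"])  # 0.38
p_died_new = new["died"] / (new["died"] + new["survived"])                      # ~0.14

odds_died_standard = odds(p_died_standard)  # ~0.613
odds_died_new = odds(p_died_new)            # ~0.165

# Odds ratio for dying: standard treatment vs. new treatment
print(odds_died_standard / odds_died_new)   # ~3.71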

The sigmoid function

Our goal is to model probability (e.g. the probability that the best treatment is drug A), so mathematically, we want a modeling function that has a y-value that varies between 0 and 1. Because we’re going to use gradient methods to fit values, we need the derivative of the function, so our modeling function must be differentiable. We don’t want gaps or ‘kinks’ in the modeling function, so we want it to be continuous.

There are many functions that fit these requirements (for example, the error function). In practice, the choice is the sigmoid function for deep mathematical reasons; if you analyze a two-class distribution using Bayesian analysis, the sigmoid function appears as part of the posterior probability distribution [2].  That's beyond where I want to go for this blog post, so if you want to find out more, chase down the reference.

Mathematically, the sigmoid function is:

\[\sigma(x) = \dfrac{1}{1 + e^{-x}} \]

And graphically, it looks like this:

I’ve shown the sigmoid function in one dimension, as a function of \(x\). It’s important to realize that the sigmoid function can have multiple parameters (e.g. \(\sigma(x, y, z)\)), it’s just much, much harder to draw.
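
If you want to generate the curve yourself, here's a minimal sketch (assuming numpy and matplotlib are installed):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """The sigmoid (logistic) function."""
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("sigma(x)")
plt.title("The sigmoid function")
plt.show()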

The sigmoid and the odds

We can write the odds as:

\[odds = \dfrac{p}{1 - p}\]

Taking the natural log of both sides (this is called the logit function):

\[ln(odds) = ln \left( \dfrac{p}{1-p} \right)\]

In machine learning, we're building a prediction function from \(n\) features \(x\), so we can write:

\[\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_n \cdot x_n\]

For reasons I'll explain later, this is the log odds:

\[\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_n \cdot x_n = ln \left( \dfrac{p}{1-p} \right)\]

With a little tedious rearranging, this becomes:

\[p = \dfrac{1}{1 + e^{-(w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_n \cdot x_n)}}\]

Which is exactly the sigmoid function I showed you earlier.
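
If you want to check the rearranging, the steps are: exponentiate both sides to remove the logarithm, then solve for \(p\) (writing \(\hat{y}\) for the weighted sum):

\[\dfrac{p}{1-p} = e^{\hat{y}} \implies p = e^{\hat{y}} - p \, e^{\hat{y}} \implies p \left( 1 + e^{\hat{y}} \right) = e^{\hat{y}} \implies p = \dfrac{e^{\hat{y}}}{1 + e^{\hat{y}}} = \dfrac{1}{1 + e^{-\hat{y}}}\]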

So the probability \(p\) is modeled by the sigmoid function.

This is the "derivation" provided in most courses and textbooks, but it ought to leave you unsatisfied. The key point is unexplained: why is the log odds the linear function \(w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_n \cdot x_n\)?

The answer is complicated and relies on a Bayesian analysis [3]. Remember, logistic regression is taught before Bayesian analysis, so lecturers or authors have a choice; either divert into Bayesian analysis, or use a hand-waving derivation like the one I've used above. Neither choice is good. I'm not going to go into Bayes here, I'll just refer you to more advanced references if you're interested [4].

Sigmoid to classification

In the previous section, I told you that we calculate a probability value. How does that relate to classification? Let's take an example.

Imagine two teams, A and B, playing a game. The probability of team A winning is \(p(A)\) and the probability of team B winning is \(p(B)\). From probability theory, we know that \(p(A) + p(B) = 1\), which we can rearrange as \(p(B) = 1 - p(A)\). Let's say we're running a simulation of this game with the probability \(p = p(A)\). So when \(p\) is "close" to 1, we say A will win, and when \(p\) is close to 0, we say B will win.

What do we mean by close? By "default", we might say that if \(p \geq 0.5\) then we choose A and if \(p < 0.5\) we choose B. That seems sensible, and it's the default choice of scikit-learn as we'll see, but it is possible to select other thresholds.

(Don't worry about whether the boundary value \(p = 0.5\) goes to A or B - that only becomes an issue under very specific circumstances.)

Features and functions

Before we dive into an example of using logistic regression, it's worth a quick detour to talk about some of the properties of the sigmoid function. 

  • The y axis varies from 0 to 1.
  • The x axis varies from \(-\infty\) to \(\infty\).
  • The gradient changes rapidly around \(x=0\) but much more slowly as you move away from zero. In fact, once you go past \(x=5\) or \(x=-5\) the curve pretty much flattens. This can be a problem for some models.
  • The "transition region" between \(y=0\) and \(y=1\) is quite narrow, meaning we "should" be able to assign probabilities away from \(p=0.5\) most of the time; in other words, we can make strong predictions about classification.

How logistic regression works

Calculating a cost function is key; however, it involves some math that would take several pages, and I don't want to turn this into a huge blog post. There are a number of blog posts online that delve into the details if you want more; check out references [7, 8].

In linear regression, we fit the model by minimizing a cost function (for example, using gradient descent or a similar optimizer like Adam). Logistic regression is framed differently: we use maximum likelihood estimation which, as its name suggests, is based on maximizing the likelihood that our model would produce the data we see. This relies on calculating a log likelihood function and using a gradient ascent method to maximize it. This is an iterative process. You can read more in references [5, 6].
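
To make the idea concrete, here's a minimal sketch of gradient ascent on the log likelihood using numpy only; the function names, variable names, and learning rate are my own choices, not from any library:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def fit_logistic(features, labels, learning_rate=0.1, iterations=1000):
    """Fit logistic regression weights by gradient ascent on the log likelihood."""
    n_samples, n_features = features.shape
    weights = np.zeros(n_features)
    for _ in range(iterations):
        predictions = sigmoid(features @ weights)
        # Gradient of the log likelihood with respect to the weights
        gradient = features.T @ (labels - predictions)
        weights += learning_rate * gradient / n_samples
    return weights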

Some code

I'm not going to show you a full set of code, but I am going to show you the "edited highlights". I created an example for this blog post, but all the ancillary stuff got in the way of what I wanted to tell you, so I've just pulled out the pieces I thought would be most helpful. For context, my code generates some data and attempts to classify it.

There are multiple Python libraries that implement logistic regression; I'm going to focus on the one most people use to explore ideas: scikit-learn.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

train_test_split splits the data into a training set and a test set. I'm not going to show how that works; it's pretty standard.

Machine learning algorithms tend to work better when the features are scaled. A lot of the time this isn't an issue, but if the values of the features have very different ranges, the numeric algorithms can struggle. Here's an example: if feature 1 ranges from 0.001 to 0.002 and feature 2 ranges from 1,000,000 to 2,000,000, we may have a problem. The solution is to put the features on a comparable scale; StandardScaler does this by standardizing each feature to zero mean and unit variance. Scaling is a problem for many curve-fitting algorithms too. Here's the scaling code for my simple example:

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

Fitting is simply calling the fit method on the LogisticRegression model, so:

# Create and train scikit-learn logistic regression model
model = LogisticRegression(
    random_state=random_state,
    max_iter=max_iterations,
    solver='liblinear'
)

# Train the model on scaled features
model.fit(features_scaled, labels)

As you might expect, max_iter stops the fitting process from going on forever. random_state controls the random number generator; it's only applicable to some solvers, like the 'liblinear' one I've used here. solver selects the underlying optimization algorithm; there's a choice of different solvers with different properties, which makes them suited to different sorts of data. I've chosen 'liblinear' because it's simple.

fit works exactly as you think it might.

Here's how we make predictions with the test and training data sets:

test_features_scaled = scaler.transform(test_features)
train_features_scaled = scaler.transform(train_features)
train_predictions = model.predict(train_features_scaled)
test_predictions = model.predict(test_features_scaled)

This is pretty straightforward, but I want to draw your attention to the scaling going on here. Remember, we scaled the features when we created the model, so we have to scale the features when we're making predictions. 

The predict method uses a 0.5 threshold, as I explained earlier. If we wanted another threshold, say 0.7, we would use the predict_proba method and apply the threshold ourselves.
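
For example, applying a 0.7 threshold might look something like this (a sketch reusing the variable names from the snippets above):

# Column 1 of predict_proba is the probability of the positive class
test_probabilities = model.predict_proba(test_features_scaled)[:, 1]
# Apply a 0.7 threshold instead of the default 0.5
test_predictions_07 = (test_probabilities >= 0.7).astype(int)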

We can measure how good our model is with the accuracy_score function:

train_accuracy = accuracy_score(train_labels, train_predictions)
test_accuracy = accuracy_score(test_labels, test_predictions)

This gives a simple number for the accuracy of the train and test set predictions. 

You can get a more detailed report using classification_report:

print(classification_report(test_labels, test_predictions))

This gives a set of "correctness" measures: precision, recall, F1 score, and support for each class.

Here's a summary of the stages:

  • Test/train split
  • Scaling
  • Fit the model
  • Predict results
  • Check the accuracy of the prediction.
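
Putting the stages together, here's a minimal end-to-end sketch; the data is synthetic and the variable names are my own, not from the original project:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate a synthetic binary classification data set
features, labels = make_classification(n_samples=1000, n_features=5,
                                        n_informative=3, random_state=42)

# Test/train split
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Scaling
scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)

# Fit the model
model = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)
model.fit(train_features_scaled, train_labels)

# Predict results
test_predictions = model.predict(test_features_scaled)

# Check the accuracy of the prediction
print(accuracy_score(test_labels, test_predictions))
print(classification_report(test_labels, test_predictions))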

Some issues with the sigmoid

Logistic regression is core to neural nets (it's all in the activation function), and as you know, neural nets have exploded in popularity. So any issues with logistic regression take on an outsize importance. 

Sigmoids suffer from the "vanishing gradient" problem I hinted at earlier. As \(x\) becomes more positive or negative, the \(y\) value gets closer to 1 or 0, so the gradient (first derivative) gets smaller and smaller. In turn, this can make training deep neural nets harder.

Sigmoids aren't zero centered, which can cause problems for modeling some systems.

Exponential calculations cost more computation time than other, simpler functions. If you have thousands, or even millions, of these calculations, that soon adds up.

Fortunately, sigmoids aren't the only game in town. There are a number of alternatives to the sigmoid, but I won't go into them here. You should just know they exist.

Beyond binary

In this post, I've talked about simple binary classification. The formulas and examples I've given all revolve around simple binary splits. But what if you want to classify something into three or more buckets? Logistic regression can be extended to more than two possible outputs, and to the case where the outputs are ordered (ordinal).

In practice, we use more or less the same code we used for the binary classification case, but we make slightly different calls to the LogisticRegression function. The scikit-learn documentation has a really nice three-way classification demo you can see here: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_multinomial.html.
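
As a rough illustration, here's what a three-class fit looks like using the built-in iris data set; with recent versions of scikit-learn, the multinomial (softmax) handling happens automatically when you use a solver like lbfgs:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris has three classes, so this is no longer a binary problem
features, labels = load_iris(return_X_y=True)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# The lbfgs solver fits a multinomial model when there are more than two classes
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(train_features, train_labels)
print(model.score(test_features, test_labels))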

What did Manus say?

Previously, I asked Manus to give me a report on logistic regression. I thought its results were OK, but I thought I could do better. Here's what Manus did: https://blog.engora.com/2025/05/the-importance-of-logistic-regression.html, and of course, you're reading my take.

Manus got the main points of logistic regression, but over-emphasized some areas and glossed over others. It was a B+ effort, I thought. Digging into it, I can see Manus reported back the consensus of the blogs and articles out there on the web. That's fine (the "wisdom of the crowd"), but it's limited. There's a lot of repetition and low-quality content out there, and Manus reflected that. It missed nuances because most of the stuff out there did too.

The code Manus generated was good, and its explanation of the code was good. It did miss explaining some things I thought were important, but on the whole I was happy with it.

Overall, I'm still very bullish on Manus. It's a great place to start and may even be enough by itself for many people, but if you really want to know what's going on, you have to do the work.

References

[1] Sperandei S. Understanding logistic regression analysis. Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi: 10.11613/BM.2014.003. PMID: 24627710; PMCID: PMC3936971.

[2] Bishop, C.M. and Nasrabadi, N.M., 2006. Pattern recognition and machine learning (Vol. 4, No. 4, p. 738). New York: Springer.

[3] https://www.dailydoseofds.com/why-do-we-use-sigmoid-in-logistic-regression/

[4] Norton, E.C. and Dowd, B.E., 2018. Log odds and the interpretation of logit models. Health services research, 53(2), pp.859-878.

[5] https://www.geeksforgeeks.org/machine-learning/understanding-logistic-regression/

[6] https://www.countbayesie.com/blog/2019/6/12/logistic-regression-from-bayes-theorem

[7] https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d

[8] https://medium.com/data-science/introduction-to-logistic-regression-66248243c148

Monday, June 16, 2025

Tell me on a Sundai.club – something novel in Boston?

At several events in the Boston area, I heard talk of something called the Sundai Club, a weekly AI hackathon for MIT and Harvard students. At the AI Tinkerers group, I saw some of their projects and I was impressed. This blog post is about the club and what I’ve observed from their presentations and from their code.

(Canva)

What impressed me

During the AI Tinkerers event, I saw several demos of "products" created by small teams of Sundai Club undergraduate students in 12 hours. Of course, all of the demos used AI, either to do processing in the background and/or for code generation. These demos were good enough to clearly demonstrate a value proposition.

Let me repeat this because it’s important. A small group of undergraduate students are regularly building working prototypes in 12 hours. The impressive thing is the productivity and the quality coming from students.

Of course, the output is a prototype, but with AI, they’ve got a substantial productivity boost. All the UIs looked good and all the prototypes did something interesting.

I was impressed enough to dig deeper, hence this review.

How the club operates

This is a student club for MIT and Harvard students. It meets every Sunday from 10am to 10pm for a full day's hacking. Not all of the 12 hours is spent hacking; there's a sunset run and presentations. Some of the sessions are sponsored by AI companies or companies in adjacent spaces. Sponsorship often means providing free resources, for example computing power or hosting.

They have a website you can visit: https://www.sundai.club/ 

My review of their code

Most of the projects are posted on the website and of those, most have GitHub pages where you can view the code. I spent some time dissecting several projects to figure out what’s going on. Here are my thoughts.

Code quality is surprisingly good. It’s readable and well-structured. Is this because it’s at least partly AI generated? Probably. 

Code length is surprisingly short. You can read over all the code for one of these projects in less than 10 minutes.

Notably, they do use a lot of “new” services. This includes newer libraries and newer hosting services. This is a hidden benefit: their development speed isn’t just from AI, it’s from using the right (non-AI) tools.

LLM connections are simple. It’s just API calls and prompts. This was the surprise for me, I was expecting something more complicated.

Importantly, they use agentic AI IDEs. Cursor was the one I saw used the most, but I’ve heard of projects using Lovable and I’m sure there’s Windsurf usage too. In fact, a Sundai club presentation was the first time I saw people “vibe coding” using voice (via the Whisper add-on). Agentic IDEs seem to be key to the productivity gains I saw. 

Why is this so interesting?

  • They’re producing prototype “products” in less than 12 hours with a small team. This would have taken more than two or three weeks in the past.
  • The quality of the code is high. It’s at least as good as some professional code.
  • They’re using the latest libraries, IDEs, and tools. They really are on the cutting edge.

Next steps

The most obvious thing you can do is visit their website: https://www.sundai.club/ and view some of their projects.

If you’re in the Boston area, you can often catch Sundai Club presentations at the AI Tinkerers group, which is open to anyone: https://boston.aitinkerers.org/ 

Saturday, June 7, 2025

How to Thrive in an AI-Driven Future

On Wednesday, I went to a panel session in Boston on AI. I thought my notes might be useful to others, so here they are. The title of the panel was "How to Thrive in an AI-Driven Future"; the thing doing the thriving is the city of Boston and the surrounding area.

What was the panel about?

The panel was about the current state of AI in the Boston area, focused on how Boston might become a hub for AI in the near term. It discussed the Boston area's strengths and weaknesses, and along the way, it pointed out a number of great AI resources in the local area.

(King of Hearts, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

It was held in the Boston Museum of Science on Wednesday June 4th, 2025.

Who was on the panel?

  • Paul Baier, CEO GAI Insights - moderator 
  • Sabrina Mansur, new Executive Director of Mass AI Hub
  • Jay Ash, CEO of MACP, Massachusetts Competitive Partnership
  • Chloe Fang, President of MIT Sloan AI Club

What did the panel talk about?

The panel was at its most comfortable talking about universities and students. There was a lot of chatter about Harvard and MIT (the event was sponsored by Harvard Business School Alumni) and the 435,000 students in the state. There was also mention of the state's great educational record. Jay brought up the term "the brain state" for Massachusetts.

Apparently, there are about 100 business incubators in MA housing 7,500 companies, with 20,000 employees. These are bigger numbers than I would have expected.

Several panel members mentioned the Massachusetts Green High Performance Computing Center in Holyoke. I didn't know about it, and it's nice to hear about these kinds of initiatives, but no-one connected it to promoting or developing AI in the state.

Sabrina talked about the state releasing some data sets in the near future as data commons. The panel all agreed this would be a great step forward. It wasn't clear what data was going to be released, when, and how, but comments later on seemed to indicate this would be Massachusetts data only.

A good deal was made of the state's $100mn AI Hub initiative, but it seems like this money has been approved but not allocated and it's not clear what it will be spent on and when. There was a hint that there might be some focus on SMBs rather than large businesses.

Chloe talked about how AI and code gen have enabled new players. She said that a few years ago, MBA students didn't have the technical skills to build demo products, but now, with the rise of code gen, they can. She talked about MBA hackathons, something that would have been impossible until recently.

The whole panel seemed to have a love affair with the MIT and Harvard Sundai Club. This is a student club that meets on Sundays and produces complete apps in a 12-hour period, obviously focused on AI. (I agree, there are some very interesting things going on there.)

There was some discussion on making regulation in the state appropriate, but no discussion about what that might mean.

While there was a lot of discussion on problems, there were strikingly few ideas on how to resolve them. Two issues in particular came up:

  • Funding
  • Livability

The panel contrasted how "easy" it is to get funding in San Francisco compared to Boston, at both the early stage and the growth stage. There were some comments that this view is overblown and that it's easier than people think to get funding in the Boston area. Frankly, there were no real suggestions on how to change things. One idea was to tell students in Boston that it's possible to get funding here, but that's about the only suggestion the panel had.

There were a couple of questions around livability. An audience question pointed out that rents in the Boston area are high (though San Francisco and New York rents are probably higher), but the panel dodged the question. On the subject of "things to do for twenty-somethings", the panel deferred to the youngest panel member, but again, nothing substantive was said. The panelists did talk about Boston being an international city and how its downtown doesn't really live up to that right now; the view was, Boston city government needed to step up.

Boston AI Week, which is being held in the Fall, was heavily promoted. 

What were my take-aways?

While there was a lot of discussion of problems, there were strikingly few ideas on how to resolve them, and I'm not sure the panel had thought the issues through.

MIT and Harvard (in that order), dominate the intellectual landscape and mind share. They certainly dominated the panel's thinking. In my view, this is fair and the other universities only have themselves to blame for being left behind. While they don't have the resources of Harvard and MIT, they could run the equivalent of the Sundai Club, and they could put people up for panel sessions like this. They could also organize events etc. Yes, it's harder for them, and yes MIT and Harvard have more resources, but they could still do a lot more.

I was left with the feeling that there's no real coordination behind Boston's AI groups. While there are individuals doing great things (Paul Baier being one), I don't get the sense of an overarching and coordinated strategy. 


(Two of the three trains I had to catch to get home.)

The international city thing struck a chord with me. My trip in was easy: I parked up and got one train right to the door of the Museum of Science. On the way back, things went wrong. I had to get three trains and a shuttle bus (almost two hours door-to-door, shocking). Nothing about my return trip said "international city".

Friday, June 6, 2025

Google's ADK for agentic AI development - and some general thoughts

Some observations on agent development

On Tuesday, June 3rd, 2025, I spent the day at Google's Cambridge, MA site at their "Build with AI" event. It was a hands-on tutorial to make agentic AI systems using Google's technology. The event crystallized a few things for me and helped sharpen my thinking. 

In this blog post, I'm going to review the workshop and talk about my general thoughts.

(Kenneth C. Zirkel, CC BY 4.0 <https://creativecommons.org/licenses/by/4.0>, via Wikimedia Commons)

Workshop review

The goal of the workshop was to build a working agentic AI system using MCP, A2A, and the Google Agent Development Kit (ADK). Of course, this was all done using GCP.

The session started with an overview and some theory. Thankfully, this was done well; the team kept the introductions short and dived straight into the workshop. The theory was standard stuff, an introduction to the technologies used and some of the relevant history.

Something like 70% of the workshop was setting up various Google services, for example, a web server to serve the app, a server to serve the backend, and so on. Thankfully, this was all script based, but there were a lot of scripts. This really brought home to me the role of DevOps in AI. Someone asked about AI ops, and I can see why: it all felt like an outgrowth of DevOps.

The Python code we did use was pretty simple. Frankly, it was just a few API calls. The focus was on the API call arguments: making sure we had the right arguments in place for what we were trying to do. I'll go as far as saying that the Python coding piece was trivial; there was nothing that would cause problems even for an entry-level programmer. It was made even easier by being cut and paste; we didn't even have to figure out the right arguments ourselves.

The presenter was keen to point out the message passing between servers and how we could debug it through the Google environment. This was my concern. I've tried to debug message passing between independent systems before, and it wasn't a good experience. Having Google provide a "trace" is very helpful and reduces my concerns quite a bit.

The workshop took about four hours and I managed to build the complete system a little while before the end.

Overall, I enjoyed it and got a lot out of it. Could I build their demo system from scratch by myself? No. The reason is all the setup that needs to be done on the various servers; the "why" behind some of the config scripts isn't at all clear to me. But note the problem is not a data science one, or a software one, it's a DevOps problem. Do I feel I understand A2A and MCP better? Yes. Do I recommend the workshop? Yes.

The workshop is called "Build with AI" and it's going on the road soon.

General thoughts

Agentic systems are not the preserve of data scientists any more. In fact, it's hard to understand what benefits a data scientist would bring to the table.

Over the last year, the development of various abstractions, for example A2A, MCP, and LangChain, has made it much easier to build AI systems. We've got to the stage where these things are pretty much "off-the-shelf" APIs. With one glaring exception, AI, and agentic AI in particular, now looks like a software engineering problem, so it feels like the preserve of software engineers.

Because frameworks like MCP and A2A are all about inter-system communication, message passing is now key. Frameworks all use some form of JSON message passing underneath. This makes debugging much harder and means we need to see what messages have been passed between systems. To their credit, Google knows this and has produced software to let you trace messages. Debugging message passing is still new to many software engineers, and I expect some problems, even with Google's tools.

AI systems are all about calls from one system to another. This obviously means permissioning, but it also means cost. A poor setup can cost a company a great deal of money and/or give a poor user experience. These kinds of problems are usually associated with DevOps. In fact, my overall impression of the Google workshop was that it was mostly DevOps with some basic coding thrown in.

In mid-2025, what skills do you need to develop agentic AI systems?

  • Software engineering
  • DevOps.

There's no requirement for data science. In fact, you don't need to know how any of the LLMs work under the hood.

This is a brave new world.

Thursday, June 5, 2025

Cursor for data science - a scorecard

What is this scorecard?

I've been investigating how to use Cursor for data science. This means using it on a real project and finding out its strengths and weaknesses. This blog post is a summary of my experiences and I'm posting it as a guide to others.

(Gemini)

Things in this space are changing quickly. This post is up to date as of June 2025. I may update this post in the future, but if you're reading this six months in the future and it hasn't been updated, please contact me if you want to hear more (https://www.linkedin.com/in/mikewoodward/).

Cursor scorecard

General

Area                  Grade
Getting started       D
Usability             B
Debugging             C
Code generation       C
Code completion       A
Code commenting       A
Code tidying          D
PEP8 compliance       B
Documentation         A
GitHub integration    C
Error finding         B

Specific tasks

Area                             Grade
Pandas dataframe manipulation    C
Web scraping                     D
Data cleansing                   C
Prototyping                      A

Getting started with Cursor

Getting started is hard. This is very definitely an early adopter tool: 

  • Product documentation is sparse. 
  • There are very few online written tutorials. 
  • There are a handful of courses, but only on Udemy. 
  • Although there are many, many videos on YouTube, there are problems with them.

All of the YouTube videos I watched followed the same format: the development of a UI-based app. In all cases, the videos showed connections to LLMs to do some form of text processing, and in some cases, the videos went through the process of connecting to databases, but none of them showed any significant (data science) computation in Python. On reflection, pretty much every Cursor demo I’ve seen has been focused on prototyping. That's fine if your application is a prototype, but not so great otherwise.

I got started by watching videos, talking to people at Meetup groups, and working on this project. That’s great for me, but it’s not scalable.

Although the Cursor free tier is useful, you very quickly exhaust your free tokens. To do any form of evaluation, you need a subscription. It’s cheap enough for that not to be a problem, but you should be aware you’ll need to spend some money.

Usability

The obvious problem is that Cursor isn’t a notebook. Given that most data scientists are addicted to notebooks (with good reason), it’s a major stumbling block any data science roll-out will have to deal with. In fact, it may well stop data science adoption dead in its tracks in some organizations.

Once you get past the notebook issue, usability is mostly good, but it’s a mixed bag. There are settings, like rules, which should be easier and more obvious to set up; the fact you can specify rules in “natural” English feels like a benefit, but I’d rather have something more restrictive that’s less open to interpretation. Rules have a bit of a voodoo flavor right now.

Debugging

Frankly, I found debugging harder than other environments. I missed having notebook-like features. There’s a variable explorer, but it’s weaker than in an IDE like Spyder. On the plus side, you can set breakpoints and step through the code.

Code generation

Very, very mixed results here.

Bottom line: code generation often can’t be trusted for anything technical and requires manual review. However for commodity tasks, it does very well.

Positives

It did outstandingly well at generating a UI in Streamlit. The code was a little old-fashioned and didn’t use the latest features, but it got me to a working solution astonishingly fast.

It produces ‘framework’ code really well and saved a lot of time. For example, I wanted to save results to a CSV and save intermediate results. It generated that code for me in seconds. Similarly, I wanted to create ‘commodity’ functions to do relatively simple tasks, and it generated them very quickly. It can automate much of the ‘boring’ coding work.

It also did well on some low-level and obscure tasks that would otherwise have required some time on Stack Overflow, e.g. date conversion.

Negatives

Technical code generation is not a good story. With very careful prompting, it got me to an acceptable solution for statistics-oriented code. But I had to check the code carefully. Several times, it produced code that was either flat-out wrong or just a really bad implementation.

I found that code that required detailed instructions (e.g. specific dataframe joins) could be generated, but given how detailed the prompt needed to be, the cost savings from code generation were minimal.

On occasion, code generation gave overly complex solutions to simple tasks; for example, its solution for changing the text “an example” to “An Example” was a function using a loop.
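
For reference, the idiomatic one-liner Python already provides:

"an example".title()  # returns 'An Example'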

From a higher-level code structure perspective, code generation is not good. Persistently, it would create new functions rather than generalizing and re-using existing functions. For example, I had boiler-plate code to open a CSV and read it into a Pandas dataframe with error checking. Code generation created a new function to read in data rather than re-use the existing code. Once I told it to consolidate all the read functions, it did. Overall, it’s not good at generating well-structured code.

Although it’s a niche topic, it’s worth mentioning that code generation didn’t work at all well for web scraping.

Code completion

Excellent. Best I’ve come across.

There were several cases where code generation didn’t work very well, but code completion did. Code completion works well if the context is good, for example, if you create a clear comment, the system will offer code completion based on your comment, and almost all the time, it will do well.

I found code completion to be a very compelling feature.

Commenting code

This is definitely a Cursor superpower. It’s almost unbelievably good.

Code tidying

Some of the time, if you ask it to tidy your code, it will do the right thing. However, most of the time I found it introduces errors. 

PEP8 compliance

Surprisingly, generated and completed code isn’t PEP8 compliant ‘out of the box’; for example, it will happily give you lines that are way over 79 characters. Even asking the AI to make the code PEP8 compliant sometimes takes multiple attempts. I had set a rule for PEP8 compliance, but it still didn’t fully comply.

Documentation

This means creating markdown files that explain what the code is doing. It did a really great job here.

GitHub integration

Setup was really easy. Usage was mostly OK, but I ran into a few issues where Cursor needlessly tied itself in knots. More seriously, it deleted a bunch of data files. 

Contrasting the usability of GitHub in Cursor with the GitHub desktop app, the GitHub desktop app has the edge right now. 

GitHub integration needs some work.

Error finding

In most cases, it did really well finding and correcting run-time errors; however, I found a case where its error correction made the code much worse. This was processing a complex HTML table: code generation couldn’t give me a correct answer, and asking the engine (Claude) to correct the error just produced worse code.

Pandas dataframe manipulation

This means the ability to manipulate Pandas dataframes in non-trivial ways, for example, using groupby correctly.

Cursor can handle basic manipulations quite well, but it fails at even moderately complicated tasks. For example, I asked it to find cases where a club only appeared as an away team or a home team. The generated code looked as if it might be correct, but it wasn’t. In fact, the code didn’t work at all and I had to write it by hand. This was by no means a one-off; Cursor consistently failed to produce correct code for dataframe manipulations.
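
For what it's worth, here's the kind of hand-written check I ended up with; the column names and team names below are hypothetical, not from my actual project:

import pandas as pd

# Hypothetical results dataframe with made-up column and team names
matches = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Spurs"],
    "away_team": ["Chelsea", "Spurs", "Wrexham"],
})

home_teams = set(matches["home_team"])
away_teams = set(matches["away_team"])

away_only = away_teams - home_teams  # clubs that only appear as an away team
home_only = home_teams - away_teams  # clubs that only appear as a home team
print(away_only, home_only)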

Code generation for scraping data

On the plus side, it managed to give me the URLs for the pages I wanted to scrape purely on a prompt, which frankly felt a bit supernatural.

On the negative side, it really can’t generate code that works for anything other than a simple scrape. Even asking it to correct its errors doesn’t work very well. The general code structure was OK, but a little too restrictive and I had to remove some of its generated functions. It’s marginal to me whether it’s really worth using code generation here. However, code completion was helpful.

Data cleansing

Cleaning data with code generation ran into the Pandas dataframe problem I've discussed above. Code completion was helpful, but once the manipulations become more complex, I had to hand write them.

Prototyping

By prototyping, I mean creating a UI-based application, for example, a Streamlit app, or even a standalone web app using react.js with a Python backend.

The results were outstanding.

You can generate apps in a fraction of the time it takes to do it by hand.

There are some downsides:

  • Security is often not baked-in and has to be added later.
  • The code often uses structures that are a little behind the latest thinking, e.g. not using new features of libraries.

Wednesday, June 4, 2025

Recommendations for rolling out generative AI to data science and technical coding teams

Summary - proceed with caution

This report gives guidance for rolling out code generation to data science teams. One size doesn't fit all, so you should use the post as a guide to shape your thinking, not as a recipe that can't be changed.

There are substantial productivity gains to be had from rolling out generative AI for code generation to data science teams, but there are major issues to be managed and overcome. Without effective leadership, including expectation setting, roll-outs will fail. 

Replacing notebooks with an agentic AI IDE like Cursor will not succeed. The most successful strategy is likely the combined use of notebooks and an agentic AI IDE, which will give data scientists an understanding of the benefits of the technology and its limitations. This is in preparation for the probable appearance of agentic notebook products in the near future.

For groups that use IDEs (like software developers), I recommend immediate use of Cursor or one of its competitors. I'm covering this in a separate report.

(Perplexity.AI)

Introduction

Why, who, and how

This is a guide for rolling out generative AI (meaning code generation) for data science teams. It covers the benefits you might expect to see, the issues you'll encounter, and some suggestions for coping with them. 

My comments and recommendations are based on my use of Cursor (an agentic IDE) along with Claude, OpenAI, and other code generation LLMs. I'm using them on multiple data science projects.

As of June 2025, there are no data science agentic AI notebooks that have reached widespread adoption, however, in my opinion, that's likely to change later on in 2025. Data science teams that understand the use of agentic AI for code generation will have an advantage over teams that do not, so early adoption is important.

Although I'm focused on data science, all my comments apply to anyone doing technical coding, by which I mean code that's algorithmically complex or uses "advanced" statistics. This can include people with the job titles "Analyst" or "Software Engineer".

I'm aware that not everyone knows what Cursor and the other agentic AI-enabled IDEs are, so I'm writing a separate blog post about them.

(Gemini)

The situation for software engineers

For more traditional software engineering roles, agentic AI IDEs offer substantial advantages and don't suffer from the "not a notebook" problem. Despite some of the limitations and drawbacks of code generation, the gains are such that I recommend an immediate managed, and thoughtful roll-out. A managed and thoughtful roll-out means setting realistic goals, having proper training, and clear communications. 

  • Realistic goals cover productivity; promising productivity gains of 100% or more is unrealistic.
  • Proper training means educating the team on when to use code gen and when not to use it. 
  • Clear communications means the team must be able to share their experiences and learn from one another during the roll-out phase.

I have written a separate report for software engineering deployment.

Benefits for data science

Cursor can automate a lot of the "boring" stuff that consumes data scientists' time but isn't core algorithm development (the main thing they're paid to do). Here's a list:

  • Commenting code. This includes function commenting using, for example, the Google function documentation format.
  • Documentation. This means documenting how code works and how it's structured, e.g. create a markdown file explaining how the code base works.
  • Boilerplate code. This includes code like reading in data from a data source.
  • Test harnesses, test code, and test data. Code generation is excellent at generating regression test frameworks, including test data.
  • PEP8 compliance. Cursor can restructure code to meet PEP8 requirements.

There are other key advantages too:

  • Code completion. Given a comment or a specific prompt, Cursor can generate code blocks, including using the correct API parameters. This means less time looking up how to use APIs.
  • Code generation. Cursor can generate the outline of functions and much of the functionality, but this has to be well-managed.

Overall, if used correctly, Cursor can give a significant productivity boost for data science teams.

Problems for data science

It's not plain sailing; there are several issues to overcome to get the productivity benefits. You should be aware of them and have a plan to address them.

It's not a notebook

(Gemini)

On the whole, data scientists don't use IDEs, they use notebooks. Cursor, and all the other agentic IDEs, are not notebooks. This is the most important issue to deal with and it's probably going to be the biggest cause of roll-out failure.

Notebooks have features that IDEs don't, specifically the ability to do "data interactive" development and debugging, which is the key reason data scientists use them. Unfortunately, none of the agentic AI systems have anything that comes close to a notebook's power. Cursor's debugging is not AI-enabled and does not easily allow notebook cell-like data investigations.

Getting data scientists to abandon notebooks and move wholesale to an agentic IDE like Cursor is an uphill task and is unlikely to succeed. 

A realistic view of code generation for data science

Complex code is not a good match

Cursor, and LLMs in general, are bad at generating technically complex code, e.g. code using "advanced" statistical methods. For example, asking for code to demonstrate the convolution of random variables can sometimes yield weird and wrong answers. The correctness of the solution depends precisely on the prompt, and the data scientist needs to closely review the generated code. Given that you need to know the answer, and you need to experiment to get the right prompt, the productivity gain of using code generation in these cases is very low or even negative.

It's also worth pointing out that for Python code generation, code gen works very poorly for Pandas dataframe manipulation beyond simple transformations.

Code completion

Code completion is slightly different from code generation and suffers from fewer problems, but it can sometimes yield crazily wrong code.

Data scientists are not software engineers and neither is Cursor

Data scientists focus on building algorithms, not on complete systems. In my experience, data scientists are bad at structuring code (e.g. functional decomposition), a situation made worse by notebooks. Neither Cursor, nor any of its competitors or LLMs, will make up for this shortcoming. 

Refactoring is risky

Sometimes, code needs to be refactored. This means changing variable names, removing unused code, structuring code better, etc. From what I've seen, asking Cursor to do this can introduce serious errors. Although refactoring can be done successfully, it needs careful and limited AI prompting.

"Accept all" will lead to failure

I'm aware of real-world cases where junior staff have blindly accepted all generated code and it hasn't ended well. Bear in mind, generated code can sometimes be very wrong. All generated code (and code completion code) must be reviewed. 

Code generation roll-out recommendations

Run a pilot program first

A successful roll-out will require some experience, but where does this experience come from? There are two possibilities:

  • "Hidden" experience. It's likely that some staff have experimented with AI code gen, even if they're not data scientists. You can co-opt this experience.
  • Running a pilot program. Get a small number of staff to experiment intensively for a short period.

Where possible, I recommend a short pilot program prior to any widespread roll-out. The program should use a small number of staff and run for a month. Here are some guidelines for running a pilot program:

  • Goals:
    • To learn the strengths and weaknesses of agentic AI code generation for data science.
    • To learn enough to train others.
    • To produce a first-pass "rules of engagement".
  • Staff:
    • Use experienced/senior staff only. 
    • Use a small team, five people or less.
    • If you can, use people who have experimented with Cursor and/or code generation.
    • Don't use skeptics or people with a negative attitude.
  • Communication:
    • Frequent staff meetings to discuss learnings. Strong meeting leadership to ensure participation and sharing.
    • Slack (or the equivalent) channels.
  • Tasks:
    • Find a way of using agentic IDEs (e.g. Cursor) with notebooks. This is the most important task. The project will fail if you don't get a workable answer.
    • Work out "rules of engagement".
    • Work out how to train others.
  • Duration
    • Start to end, a month.

If you don't have any in-house experience, how do you "cold start" a pilot program? Here are my suggestions:

  • Go to local meetup.com events and see what others are doing.
  • Find people who have done this elsewhere (LinkedIn!) and pay them for advice.
  • Watch YouTube videos (but be aware, this is a low-productivity exercise).

Don't try to roll out AI code generation blind.

Expectation setting

There are some wild claims about the productivity benefits of code generation. In some cases they're true: you really can substantially reduce the time and cost of some projects. But for other projects (especially data science projects) the savings are smaller. Overstating the benefits has several consequences:

  • Loss of credibility with company leadership.
  • Loss of credibility with staff and harm to morale.

You need to have a realistic sense of the impact on your projects. You need to set realistic expectations right from the start.

How can you get that realistic sense? Through a pilot program.

Clear goals and measuring success

All projects need clear goals and some form of success metric. The overall goal here is to increase productivity using code generation while avoiding the implementation issues. Direct measures of success are hard because few organizations measure code productivity and data science projects vary wildly in complexity. Some measures might be:

  • Fraction of code with all functions documented correctly.
  • Fraction of projects with regression tests.
  • High levels of staff usage of agentic AI IDEs.

The ultimate measure is of course that projects are developed faster.

At an individual level, metrics might include:

  • Contributions to "rules of engagement".
  • Contributions to Slack channel (or the equivalent).

Initial briefing and on-going communications 


(Canva)

Everyone in the process must have a realistic sense of the benefits and the problems of this technology; this includes the staff doing the work, their managers, and all executive and C-level staff.

Here are my suggestions:

  • Written briefing on benefits and problems.
  • Briefing meetings for all stakeholders.
  • Written "rules of engagement" stating how code is to be used and not used. These rules will be updated as the project proceeds.
  • Regular feedback sessions for hands-on participants. These sessions are where people share their experiences.
  • Regular reports to executives on project progress.
  • On-going communications forum. This could be something like a Slack channel.
  • Documentation hub. This is a single known place where users can go to get relevant materials, e.g.
    • Set-up instructions
    • Cursor rules (or the equivalent)
    • "Rules of engagement"

Clear lines of responsibility

Assuming there are multiple people involved in an evaluation or roll-out, we need to define who does what. For this project, this means:

  • One person to act as the (Cursor) rules controller. The quality of generated code depends on the rules; if everyone uses wildly different rules, the results will be inconsistent. The rules controller will provide recommended rules that everyone should use. Participants can experiment with rules, but they must keep the controller informed.
  • One person to act as the recommendations controller. As I've explained, there are "dos" and "don'ts" for working with code generation; these are the "rules of engagement". One person should be responsible for continually keeping this document up to date. 

Limits on project scope

There are multiple IDEs on the market and there are multiple LLMs that will generate code. Evaluating all of them will take considerable time and be expensive. My recommendation is to choose one IDE (e.g. Cursor, Windsurf, Lovable, or one of the others) and one agentic AI. It's OK to have some experimentation at the boundaries, e.g. experimenting with a different agentic AI, but this needs to be managed - as always, project discipline is important.

Training

(Canva)

Just setting people up and telling them to get started won't work. Almost all data scientists won't be familiar with Cursor and the VS Code IDE it's based on. Cursor works differently from other IDEs, and there's little in the way of useful tutorials online. This raises the question: how do you get the expertise to train your team? 

The answer is a pilot program as I've explained. This should enable you to bootstrap your initial training needs using in-house experience.

You should record the training so everyone can access it later if they run into trouble. Training must include what not to do, including pointing out failure modes (e.g. blindly accepting generated code); this is the "rules of engagement".

It may also be worth re-training people partway through the project with the knowledge gained so far.

(Don't forget, data scientists mostly don't use IDEs, so part of your training must cover basic IDE usage.)

Notebook and Cursor working together

This is the core problem for data science. Figuring out a way of using an agentic IDE and a notebook together will be challenging. Here are my recommendations.

  1. Find a way of ensuring the agentic IDE and the notebook can use the same code file. Most notebooks can read in Python files, and there are sometimes ways of preserving cell boundaries in Python (e.g. using the "# %%" format); see the sketch after this list.
  2. Edit the same Python file in Cursor and in the notebook (this may mean refreshing the notebook so it picks up any changes; Cursor seems to pick up changes by itself).
  3. Use Cursor for comments, code completion etc. Use the notebook for live code development and debugging.
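
For illustration, here's a minimal sketch of a "percent format" Python file (the file name, data path, and column names are made up). Cursor treats it as an ordinary Python file, while notebook tools such as Jupytext or VS Code's interactive window treat each "# %%" marker as a cell boundary:

# analysis.py - one file shared between Cursor and the notebook

# %%
import pandas as pd

# %%
# Load the data (illustrative path)
sales = pd.read_csv("data/sales.csv")

# %%
# Exploratory check - run this cell in the notebook while Cursor
# edits the same file
print(sales.shape)
print(sales.head())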

(Canva)

Precisely how to do this will depend on the exact choice of agentic IDE and notebook.

This process is awkward, but it's the best of the options right now.

(Cursor) Rules

Agentic IDEs rely on a set of rules that guide code generation. These are like settings but expressed in English prose. These rules will help govern the style of the generated code. What these rules are called will vary from IDE to IDE but in Cursor, they're called "Rules".

I suggest you start with a minimal set of Rules, perhaps 10 or so. Here are three to get you started:

"Act as an experienced data scientist creating robust, re-usable, and readable code.

Use the latest Python features, including the walrus operator. Use list comprehensions rather than loops where it makes sense.

Use meaningful variable names. Do not use df as the name of a dataframe variable."

There are several sites online that suggest Rules. Most suggest long, verbose Rules. My experience is that shorter, more concise Rules work better.

Regression tests

As part of the development process, use Cursor to generate test cases for your code, which includes generating test data. This is one of Cursor's superpowers and one of the places where you can see big productivity improvements.

Cursor can occasionally introduce errors into existing code. Part of the "rules of engagement" must be running regression tests periodically or when the IDE has made substantial changes. In traditional development, this is expensive, but agentic IDEs substantially reduce the cost.
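
As an illustration, here's the kind of lightweight regression test an agent can generate; the function under test (remove_outliers) and the data are hypothetical, and I'm assuming pytest as the test runner:

import pandas as pd

from cleaning import remove_outliers  # hypothetical function under test

def test_remove_outliers_drops_extreme_values():
    # Small, fixed test data so the result is reproducible
    df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 1000.0]})
    cleaned = remove_outliers(df, column="value", upper_limit=100.0)
    assert 1000.0 not in cleaned["value"].values
    assert len(cleaned) == 3

def test_remove_outliers_keeps_normal_values():
    df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
    cleaned = remove_outliers(df, column="value", upper_limit=100.0)
    assert len(cleaned) == 3

The specific tests don't matter; what matters is that, because the agent can generate and regenerate tests like these cheaply, running them after every substantial change becomes practical.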

GitHub

Cursor integrates with GitHub and you can update Git repositories with a single prompt. However, it can occasionally mess things up. You should have a good set of tactics for GitHub integration, including having an in-house expert who can fix issues should they arise.

"Rules of engagement"

I've referred to this document a number of times. This is a written document that describes how to use code gen AI and how not to use it. Here are the kinds of things it should contain (there's a concrete dataframe example after the excerpt):

"Use code generation via the prompt to create function and code outlines, e.g. specifying that a file will contain 5 functions with a description of what the functions do. Most of the time, it's better to ask the agent to product code stubs. However, if a function is boilerplate, e.g. reading a CSV file into a dataframe, then you can prompt for full code generation for that function.
...
Do not use code generation or code completion for medium to complex dataframe manipulations. You can use it for simple dataframe manipulations. You can use code completion to get a hint, but you shouldn't trust it.
...
Use the prompt to comment your code, but be clear in your prompt that you want comments only and no other changes.
... 

Before running regression tests, prompt the AI to comment your code. 

"

You should periodically update the rules of engagement and make sure users know the rules have changed. As I stated earlier, one person should be responsible for maintaining and updating the rules of engagement.

Conclusions

Successfully rolling out agentic AI code generation to data scientists is not a trivial task. It will require a combination of business and technical savvy. As ever, there are political waters to navigate, both up and down the organization.

There are some key ideas I want to reiterate:

  • Agentic IDEs are not notebooks. You need to find a way of working that combines notebooks and IDEs. Success depends on this.
  • Pilot programs will let you bootstrap a roll-out, without them, you'll find roll-outs difficult to impossible.
  • Training, "rules of engagement", and communication are crucial.

Other resources

I'm in the process of developing a very detailed analysis of using Cursor for data science. This analysis would form the basis of the "rules of engagement". I'm also working on a document similar to this for more traditional software engineering. If you're interested in chatting, contact me on LinkedIn: https://www.linkedin.com/in/mikewoodward/.


Tuesday, May 27, 2025

What is Model Context Protocol?

Bottom line: MCP is an important technology, but as of May 2025, it's not ready for production deployment. It's immature, the documentation is poor, and it doesn't have the security features it needs. Unless your business has a compelling and immediate need for it, wait a while before starting experimentation.

I've been hearing a lot about MCP and how much of a game-changer it is, but there are three problems with most of the articles I've read:

  • They don't explain the what and the how very well.
  • They're either too technical or too high-level.
  • They smell too strongly of hype.

In this blog post, I'm going to dive into the why at a business level and do some of the how at a more technical level. This is going to be a hype-free zone. 

(Chat GPT generated)

What problem are we trying to solve?

AI systems need to access data, but data is accessed in a huge number of different ways, which makes it harder for an AI to connect to and use that data. MCP is a way of presenting the 'same' interface for all data types.

There are many different data sources, for example: JSON files, CSV files, XML files, text files, different APIs, different database types, and so on. In any computer language, there are different ways of connecting to these data sources. Here are two Python code snippets that illustrate what I mean:

import requests

res = requests.get(
    url="https://www.gutenberg.org/files/132/132-h/132-h.htm",
    timeout=(10, 5)
)

and:

import lxml.etree
...
# Open the XML file and parse it
tree = lxml.etree.parse(zip_file_names[0])
...
# Completely parse the first element
root = tree.getroot()
children = list(root[0])

There are a few important points here:
  • You use different Python libraries to access different data sources.
  • The API is different for each library.
  • In some cases, the way you use the API is different (e.g. some sources use paging, others don't).

In other words, it can be time-consuming and tricky to read in data from different sources.

This is bad enough if you're a programmer writing code to combine data from different sources, but it's even worse if you're an AI. An AI has to figure out what libraries to use, what data's available, whether or not to use paging, etc. In other words, different data source interfaces make life hard for people and for AIs.

There's a related problem, often called the NxM problem. Let's imagine there are M data sources and N LLMs. Each LLM has to create an interface to each data source, so we get a situation that looks like this:

(Claude generated)

This is a huge amount of duplication (NxM). What's worse, if a data source changes its API (e.g., an AWS API update), we have to change N LLM integrations. If we could find some way of standardizing the interface to the data sources, we would have one set of code for each LLM (N) and one set of code for each data source (M), transforming this into an N+M problem. In this new world, if a data source API is updated, this just means updating one wrapper. Can we find some way of standardizing the interfaces?
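
To put rough numbers on it: with, say, five LLMs (N = 5) and twenty data sources (M = 20), the point-to-point approach means 5 × 20 = 100 integrations to build and maintain, while the standardized approach needs only 5 + 20 = 25.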

(In the old days, this was a problem for hardware too. Desktop PCs would have a display port, an ethernet port, a printer port, and so on. These have pretty much all been replaced with USB-C ports. Can we do something similar in software?)

Some background

There has been a move to consolidate the interfaces to different sources, but it's been very limited. In the Python world, there's a standard database interface (the DB-API) that most database libraries follow, so you can connect to most databases using much the same code, but that's about it. Until now, there just hasn't been a strong enough motivation for the community to work out how to provide consistent data access.
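
To show what that consolidation looks like, here's a minimal sketch using the standard library's sqlite3 driver (the database file and query are made up); other DB-API drivers, such as psycopg2 for PostgreSQL, follow the same connect/cursor/execute pattern:

import sqlite3

# The DB-API pattern: connect, get a cursor, execute, fetch
conn = sqlite3.connect("example.db")
cursor = conn.cursor()
cursor.execute("SELECT name, value FROM measurements LIMIT 5")
rows = cursor.fetchall()
conn.close()

Swap in a different DB-API driver and, broadly, only the connect() call changes. MCP aims to bring that kind of consistency to every data source and service, not just databases.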

I want to go on two slight tangents to explain ideas that are important to MCP. Without these tangents, the choice of name is hard to understand, as are the core ideas.

At the end of the 1970s, Trygve Reenskaug was working at Xerox PARC on UI problems and came up with the Model-View-Controller abstraction. The idea is that a system can be divided into conceptual parts. The Model part represents the business data and the business logic. There's a code interface (API) to the Model that the View and Controller use to access data and get things done. 

The Model part of this abstraction corresponds to the data sources we've been talking about, but it generalizes them to include business logic (meaning, doing something like querying a database). This same abstraction is a feature of MCP too. Sadly, there's a naming conflict we have to discuss. Model means data in Model-View-Controller, but it's also part of the name "large language model" (LLM). In MCP, the M is Model, but it means LLM; the data and business logic is called Context. I'm going to use the word Context from now on to avoid confusion.

Let's introduce another key idea to understand MCP, that of the 'translation' or 'interface' layer. This is a well-known concept in software engineering and comes up a number of times. The best known example is the operating system (OS). An OS provides a standardized way of accessing the same functionality on different hardware. The diagram below shows a simple example. Different manufacturers make different disk drives, each with a slightly different way of controlling the drives. The operating system has a translation layer that offers the same set of commands to the user, regardless of who made the disk drive.

(Chat GPT generated)

Languages like Python rely on these translation layers to work on different hardware.

Let's summarize the three key ideas before we get to MCP:

  • There's been very little progress to standardize data access functionality.
  • The term Context refers to the underlying data and functionality related to that data.
  • Translation layer software allows the same operations to work on different machines.

What MCP is

MCP stands for Model Context Protocol. It's a translation layer on top of a data source that provides a consistent way of accessing different data sources and their associated tools. For example, you can access database data and text file data using the same interface.

  • The Model part of the acronym refers to the LLM. This could be Claude, Gemini, GPT, DeepSeek or one of the many other Large Language Models out there.
  • Context refers to the data and the tools to access it.
  • Protocol refers to the communication between the LLM and the data (Context).

Here's a diagram showing the idea.

What's interesting about this architecture is that the MCP translation layer is a server. More on this later.

In MCP terminology, users of the MCP are called Hosts (mostly LLMs and IDEs like Cursor or Windsurf, but it could be something else). Hosts have Clients that are connectors to Servers. A Host can have a number of Clients; it'll have one for each data source (Server) it connects to. A Server connects to a data source and uses the data source's API to collect data and perform tasks. A Server has functions the Client uses to identify the tasks the Server can perform. A Client communicates with the Server using a defined Protocol.

Here's an expanded diagram providing a bit more detail.

I've talked about data sources like XML files etc., but it's important to point out that a data source could be GitHub, Slack, Google Sheets, or indeed any service. Each of these data sources has its own API, and the MCP Server provides a standardized way of using it. Note that the MCP Server could do some compute-intensive tasks too, for example running a time-consuming SQL query on a database.

I'll give you an expanded example of how this all works. Let's say a user asks the LLM (either standalone or in a tool like Cursor) to create a GitHub repo:

  • The Model, via its MCP Client, will ask the MCP Server for a list of capabilities for the GitHub service. 
  • The MCP Server knows what it can do, so it will return a list of available actions, including the ability to create a repo. 
  • The MCP Client will pass this data to the LLM. 
  • Now the Model knows what GitHub actions it can perform, and it can check that it can do what the user asked (create a repo). 
  • The LLM instructs its MCP Client to create the repo, which in turn passes the request to the MCP Server, which in turn formats the request using the GitHub API. GitHub creates the repo and returns a status code to the MCP Server, which in turn informs the Client, which in turn informs the Host.

This is a lot of indirection, but it's needed for the whole stack to work.

This page: https://modelcontextprotocol.io/docs/concepts/architecture explains how the stack works in more detail.

How it works

How to set up the Host and Client

To understand the Host and Client setup, you need to understand that MCP is a communications standard (the Protocol part of the name). This means we only have to tell the Client a small amount of information about the Server, most importantly its location. Once the Client knows where the Server is, it can talk to it.

In Cursor (a Host), there's an MCP setting where we can tell Cursor about the MCP Servers we want to connect to. Here's the JSON to connect to the GitHub MCP Server:

{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e",
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "mcp/github"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<YOUR_TOKEN>"
      }
    }
  }
}

In this example, the line "mcp/github" is the location of the GitHub MCP server.

Setup is similar for other Hosts, for example the Claude desktop app. 

I'm not going to explain the above code in detail (you should look here for details of how the Client works). You should note a few things:

  • It's very short.
  • It's terse.
  • It has some security (the Personal Access Token).

How to set up the MCP Server

MCP Servers have several core concepts:
  • Resources. They expose data to your Host (e.g. the LLM) and are intended for light-weight and quick queries that don't have side effects, e.g. a simple data retrieval.
  • Tools. They let the Host tell the Server to take an action. They can be computationally expensive and can have side effects.
  • Prompts. These are templates that standardize common interactions.
  • Roots and Sampling. These are more advanced and I'm not going to discuss them here.

These are implemented in code using Python function decorators, an advanced feature of Python.
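
If decorators are unfamiliar, here's a minimal illustration that has nothing to do with MCP; a decorator is a function that wraps another function, and that's how the MCP server libraries register your functions as Resources, Tools, or Prompts:

def log_call(func):
    # Wrap func so every call is announced before it runs
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_call
def add(a, b):
    return a + b

add(2, 3)  # prints "calling add" and returns 5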

Regardless of whether it's Prompts, Tools, or Resources, the Client has to discover them, meaning, it has to know what functionality is available. This is done using discovery functions called list_resources, list_prompts, and of course list_tools. So the Client calls the discovery functions to find out what's available and then calls the appropriate functions when it needs to do something. 
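
Here's a rough client-side sketch of discovery followed by a tool call. The names follow the MCP Python SDK documentation at the time of writing; treat the exact imports, the server command, and the tool name as assumptions:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumes an MCP Server implemented in my_server.py (hypothetical)
server_params = StdioServerParameters(command="python", args=["my_server.py"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discovery: ask the Server what it can do
            tools = await session.list_tools()
            resources = await session.list_resources()
            # Use: call one of the discovered tools
            result = await session.call_tool("calculate_sum", {"a": 1, "b": 2})
            print(tools, resources, result)

asyncio.run(main())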

Resources

Here are two examples of resource functions. The first function lets the Client find out what resources are available, which in this case is a single resource: the application log. The second function is how the Client can access the application log contents.

@app.list_resources()
async def list_resources() -> list[types.Resource]:
    return [
        types.Resource(
            uri="file:///logs/app.log",
            name="Application Logs",
            mimeType="text/plain"
        )
    ]

@app.read_resource()
async def read_resource(uri: AnyUrl) -> str:
    if str(uri) == "file:///logs/app.log":
        log_contents = await read_log_file()
        return log_contents

    raise ValueError("Resource not found")

Note the use of async and the decorator.  The async allows us to write efficient code for tasks that may take some time to complete.

Tools

Here are two example tool functions. As you might expect by now, the first function lets the Client discover which tools it can call.

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="calculate_sum",
            description="Add two numbers together",
            inputSchema={
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"}
                },
                "required": ["a", "b"]
            }
        )
    ]

The second function is one the Client can call once it has discovered it. 

@mcp.tool()
async def fetch_weather(city: str) -> str:
    """Fetch current weather for a city"""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.text

Here, the code is calling out to an external API to retrieve the weather for a city. Because the external API might take some time, the code uses await and async. This is a tool rather than a resource because it may take some time to complete.

Prompts

This is a longer code snippet to give you the idea. The list_prompts function is key: this is how the Client finds out the available prompts.

PROMPTS = {
    "git-commit": types.Prompt(
        name="git-commit",
        description="Generate a Git commit message",
        arguments=[
            types.PromptArgument(
                name="changes",
                description="Git diff or description of changes",
                required=True
            )
        ],
    ),
    "explain-code": types.Prompt(
        name="explain-code",
        description="Explain how code works",
        arguments=[
            types.PromptArgument(
                name="code",
                description="Code to explain",
                required=True
            ),
            types.PromptArgument(
                name="language",
                description="Programming language",
                required=False
            )
        ],
    )
}
...
@app.list_prompts()
async def list_prompts() -> list[types.Prompt]:
    return list(PROMPTS.values())
...

@app.get_prompt()
async def get_prompt(
    name: str, arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
    if name not in PROMPTS:
        raise ValueError(f"Prompt not found: {name}")

    if name == "git-commit":
        changes = arguments.get("changes") if arguments else ""
        return types.GetPromptResult(
            messages=[
                types.PromptMessage(
                    role="user",
                    content=types.TextContent(
                        type="text",
                        text=f"Generate a concise but descriptive commit message "
                             f"for these changes:\n\n{changes}"
                    )
                )
            ]
        )

You can read more about how prompts work in the documentation: https://modelcontextprotocol.io/docs/concepts/prompts#python

Messages everywhere

The whole chain of indirection relies on JSON message passing between code running in different processes. This can be difficult to debug. You can read more about MCP's message passing here: https://modelcontextprotocol.io/docs/concepts/transports

Documents, tutorials, and YouTube

At the time of writing (May 2025), the documentation for MCP is very sparse and lacks a lot of detail. There are a few tutorials people have written, but they're quite basic and again lack detail. What this means is, you're likely to run into issues that may take time to resolve.

There are videos on YouTube, but most of them have little technical content and seem to be hyping the technology rather than offering a thoughtful critique or a guide to implementation. Frankly, don't bother with them.

Skills needed

This is something I've hinted at in this blog post, but I'm going to say it explicitly. The skill level needed to implement a non-trivial MCP Server is high. Here's why:

  • The default setup process involves using uv rather than the usual pip.
  • The MCP API makes extensive use of function decorators, an advanced Python feature.
  • The Tools API uses async and await, again more advanced features.
  • Debugging can be hard because MCP relies on message passing.

The engineer needs to know about function decorators, asynchronous Python, and message passing between processes.

Where did MCP come from?

MCP was released by Anthropic in November 2024. After a "slowish" start, it's been widely adopted and has now become the dominant standard. Anthropic have open-sourced the entire protocol and placed it on GitHub. Frankly, I don't see anything usurping it in the short term.

Security and cost

This is a major concern. Let's go back to this diagram:

There could be three separate companies involved in this process:

  • The company that wants to use the LLM and MCP, we'll call this the User company.
  • The company that hosts the LLM, we'll call this the LLM company.
  • The company that hosts the data source, we'll call this the Data company.

The User company starts a job that uses an LLM at the LLM company. The job uses computationally (and financially) expensive resources located at the Data company. Let's say something goes wrong, or the LLM misunderstands something. The LLM could make multiple expensive calls to the data source through the MCP Server, racking up large bills. Are there ways to stop this? Yes, but it takes some effort. 

The other concern is a hacked remote LLM. Remember, the LLM has the keys to the kingdom for your system, so hackers really could go to town, perhaps making rogue calls to burn up expensive computing resources or even writing malicious data.

There are a number of other concerns that you can read more about here: https://www.pillar.security/blog/the-security-risks-of-model-context-protocol-mcp and here: https://community.cisco.com/t5/security-blogs/ai-model-context-protocol-mcp-and-security/ba-p/5274394

The bottom line is, if you're running something unattended, you need to put guard rails around it.

Complexity - everything is a server?

As I've stated, this is a very complex beast under the hood. The LLM will run in its own process, the MCP Server will run in its own process, and maybe the underlying data sources will too (e.g. a web-based resource or a database). If any of these processes fail, the whole system fails, and the developers have to debug which of the servers failed first. Inter-process communication is harder than simple procedure calls, which means debugging is harder too. 

All of the examples I've seen on the web have been relatively simple. I'm left wondering how complex it would be to develop a robust system with full debugging for something like a large-scale database. I'm not sure I want to be first to find out.

How can I get started?

I couldn't find tutorials or articles that are good enough for me to recommend. That of itself is telling.

Where we stand today

MCP was released in November 2024 and it's still an immature standard. 

  • Security in particular is not where it needs to be; you need to put guard rails up. 
  • Documentation is also sorely lacking and there are very few good tutorials out there. 
  • Debugging can be very hard; the message-passing infrastructure is more difficult to work with than a simple call stack.

Sadly, the hype machine has really got going, and you would think that MCP is ready for prime time and immediate deployment - it's not. This is definitely an over-hyped technology for where we are now.

Should you experiment with MCP? Only if you have a specific reason to, and then only with supervision and risk management. If you have the right use case, this is a very compelling technology with a lot of promise for the future.