Summary - proceed with caution
This report gives guidance for rolling out code generation to data science teams. One size doesn't fit all, so use this report as a guide to shape your thinking, not as a fixed recipe.
There are substantial productivity gains to be had from rolling out generative AI for code generation to data science teams, but there are major issues to be managed and overcome. Without effective leadership, including expectation setting, roll-outs will fail.
Replacing notebooks wholesale with an agentic AI IDE like Cursor will not succeed. The most promising strategy is the combined use of notebooks and an agentic AI IDE, which will give data scientists an understanding of the technology's benefits and limitations. This prepares them for the agentic notebook products that will probably appear in the near future.
For groups that use IDEs (like software developers), I recommend immediate use of Cursor or one of its competitors. I'm covering this in a separate report.
Introduction
Why, who, and how
This is a guide for rolling out generative AI (meaning code generation) for data science teams. It covers the benefits you might expect to see, the issues you'll encounter, and some suggestions for coping with them.
My comments and recommendations are based on my use of Cursor (an agentic IDE) along with Claude, OpenAI, and other code generation LLMs. I'm using them on multiple data science projects.
As of June 2025, no agentic AI notebook for data science has reached widespread adoption; in my opinion, that's likely to change later in 2025. Data science teams that understand the use of agentic AI for code generation will have an advantage over teams that don't, so early adoption is important.
Although I'm focused on data science, all my comments apply to anyone doing technical coding, by which I mean code that's algorithmically complex or uses "advanced" statistics. This can include people with the job titles "Analyst" or "Software Engineer".
I'm aware that not everyone knows what Cursor and the other agentic AI-enabled IDEs are, so I'm writing a separate blog post about them.
The situation for software engineers
For more traditional software engineering roles, agentic AI IDEs offer substantial advantages and don't suffer from the "not a notebook" problem. Despite some of the limitations and drawbacks of code generation, the gains are such that I recommend an immediate managed, and thoughtful roll-out. A managed and thoughtful roll-out means setting realistic goals, having proper training, and clear communications.
- Realistic goals means setting honest productivity targets; promising gains of 100% or more is unrealistic.
- Proper training means educating the team on when to use code gen and when not to use it.
- Clear communications means the team must be able to share their experiences and learn from one another during the roll-out phase.
I have written a separate report for software engineering deployment.
Benefits for data science
Cursor can automate a lot of the "boring" stuff that consumes data scientists' time but isn't core algorithm development (the main thing they're paid to do). Here's a list:
- Commenting code. This includes function commenting using, for example, the Google function documentation format (see the sketch after this list).
- Documentation. This means documenting how code works and how it's structured, e.g. create a markdown file explaining how the code base works.
- Boilerplate code. This includes code like reading in data from a data source.
- Test harnesses, test code, and test data. Code generation is excellent at generating regression test frameworks, including test data.
- PEP8 compliance. Cursor can restructure code to meet PEP8 requirements.
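As a concrete illustration of the function commenting in the first bullet, here's a hypothetical helper with the kind of Google-style docstring Cursor can generate on request; the function name, parameters, and behaviour are invented for illustration.

```python
import pandas as pd


def winsorize_column(df: pd.DataFrame, column: str,
                     lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Clip extreme values in a dataframe column to given quantiles.

    Args:
        df: Input dataframe.
        column: Name of the column to winsorize.
        lower: Lower quantile used as the clipping floor.
        upper: Upper quantile used as the clipping ceiling.

    Returns:
        A copy of the dataframe with the column clipped to the quantiles.
    """
    low, high = df[column].quantile([lower, upper])
    result = df.copy()
    result[column] = result[column].clip(low, high)
    return result
```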
There are other key advantages too:
- Code completion. Given a comment or a specific prompt, Cursor can generate code blocks, including using the correct API parameters. This means less time spent looking up how to use APIs (see the sketch after this list).
- Code generation. Cursor can generate the outline of functions and much of the functionality, but this has to be well-managed.
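To give a feel for comment-driven completion, here's the kind of exchange I mean: the comment is what you type, and the code below it is the sort of completion you can get. The file name and column names are invented for illustration.

```python
import pandas as pd

# Read the orders extract, parse the order date, and keep only 2024 orders
orders = pd.read_csv("orders_extract.csv", parse_dates=["order_date"])
orders_2024 = orders[orders["order_date"].dt.year == 2024]
```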
Overall, if used correctly, Cursor can give a significant productivity boost for data science teams.
Problems for data science
It's not all plain sailing: there are several issues to overcome to get the productivity benefits. You should be aware of them and have a plan to address them.
It's not a notebook
On the whole, data scientists don't use IDEs, they use notebooks. Cursor, and all the other agentic IDEs, are not notebooks. This is the most important issue to deal with and it's probably going to be the biggest cause of roll-out failure.
Notebooks have features that IDEs don't, specifically the ability to do "data interactive" development and debugging, which is the key reason why data scientists use them. Unfortunately, none of the agentic AI systems have anything that comes close to a notebook's power. Cursor's debugging is not AI-enabled and does not easily allow notebook cell-like data investigations.
Getting data scientists to abandon notebooks and move wholesale to an agentic IDE like Cursor is an uphill task and is unlikely to succeed.
A realistic view of code generation for data science
Complex code is not a good match
Cursor, and LLMs in general, are bad at generating technically complex code, e.g. code using "advanced" statistical methods. For example, asking for code to demonstrate random variable convolution can sometimes yield weird and wrong answers. The correctness of the solution depends precisely on the prompt, and the data scientist has to closely review the generated code. Given that you need to know the answer and you need to experiment to get the right prompt, the productivity gain of using code generation in these cases is very low or even negative.
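To make the review burden concrete, here's a minimal sketch of what a correct answer to the convolution request looks like for the simplest case (the distribution of the sum of two independent dice rolls); a reviewer needs to know this well enough to spot a plausible-looking but wrong version.

```python
import numpy as np

# PMF of a fair six-sided die: P(X = 1), ..., P(X = 6)
die = np.full(6, 1 / 6)

# The PMF of the sum of two independent rolls is the convolution of the PMFs
pmf_sum = np.convolve(die, die)      # P(X + Y = 2), ..., P(X + Y = 12)
support = np.arange(2, 13)

for value, prob in zip(support, pmf_sum):
    print(f"P(sum = {value:2d}) = {prob:.4f}")

# Monte Carlo sanity check: P(sum = 7) should be close to 6/36
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(100_000, 2)).sum(axis=1)
print("Monte Carlo estimate of P(sum = 7):", (rolls == 7).mean())
```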
It's also worth pointing out that, for Python, code generation works very poorly for Pandas dataframe manipulation beyond simple transformations.
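For calibration, a manipulation like the grouped, time-ordered calculation below is the sort of thing that sits beyond "simple transformations"; the data and column names are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "week": pd.to_datetime(["2025-01-06", "2025-01-13", "2025-01-20"] * 2),
    "revenue": [100.0, 120.0, 90.0, 200.0, 210.0, 190.0],
})

# Per-store two-week trailing mean revenue, aligned back to the original rows.
# The sorting, grouping, and index alignment are the details that need review.
sales = sales.sort_values(["store", "week"])
sales["trailing_mean"] = (
    sales.groupby("store")["revenue"]
    .transform(lambda s: s.rolling(window=2, min_periods=1).mean())
)
print(sales)
```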
Code completion
Code completion is slightly different from code generation and suffers from fewer problems, but it can sometimes yield crazily wrong code.
Data scientists focus on building algorithms, not on complete systems. In my experience, data scientists are bad at structuring code (e.g. functional decomposition), a situation made worse by notebooks. Neither Cursor, nor any of its competitors or LLMs, will make up for this shortcoming.
Refactoring is risky
Sometimes, code needs to be refactored. This means changing variable names, removing unused code, structuring code better, etc. From what I've seen, asking Cursor to do this can introduce serious errors. Although refactoring can be done successfully, it needs careful and limited AI prompting.
I'm aware of real-world cases where junior staff have blindly accepted all generated code and it hasn't ended well. Bear in mind, generated code can sometimes be very wrong. All generated code (and code completion code) must be reviewed.
Code generation roll-out recommendations
Run a pilot program first
A successful roll-out will require some experience, but where does this experience come from? There are two possibilities:
- "Hidden" experience. It's likely that some staff have experimented with AI code gen, even if they're not data scientists. You can co-opt this experience.
- Running a pilot program. Get a small number of staff to experiment intensively for a short period.
- Goals:
- To learn the strengths and weaknesses of agentic AI code generation for data science.
- To learn enough to train others.
- To produce a first-pass "rules of engagement".
- Staff:
- Use experienced/senior staff only.
- Use a small team, five people or fewer.
- If you can, use people who have experimented with Cursor and/or code generation.
- Don't use skeptics or people with a negative attitude.
- Communication:
- Frequent staff meetings to discuss learnings. Strong meeting leadership to ensure participation and sharing.
- Slack (or the equivalent) channels.
- Tasks:
- Find a way of using agentic IDEs (e.g. Cursor) with notebooks. This is the most important task. The project will fail if you don't get a workable answer.
- Work out "rules of engagement".
- Work out how to train others.
- Duration:
- Start to end, a month.
If you don't have any in-house experience, how do you "cold start" a pilot program? Here are my suggestions:
- Go to local meetup.com events and see what others are doing.
- Find people who have done this elsewhere (LinkedIn!) and pay them for advice.
- Watch YouTube videos (but be aware, this is a low-productivity exercise).
Don't try to roll out AI code generation blind.
Expectation setting
There are some wild claims about productivity benefits for code generation. In some cases they're true: you really can substantially reduce the time and cost of some projects. But for other projects (especially data science projects) the savings are smaller. Overstating the benefits has several consequences:
- Loss of credibility with company leadership.
- Loss of credibility with staff and harm to morale.
You need to have a realistic sense of the impact on your projects. You need to set realistic expectations right from the start.
How can you get that realistic sense? Through a pilot program.
Clear goals and measuring success
All projects need clear goals and some form of success metric. The overall goal here is to increase productivity using code generation while avoiding the implementation issues. Direct measures of success here are hard as few organizations have measures of code productivity and data science projects vary wildly in complexity. Some measures might be:
- Fraction of code with all functions documented correctly.
- Fraction of projects with regression tests.
- High levels of staff usage of agentic AI IDEs.
At an individual level, metrics might include:
- Contributions to "rules of engagement".
- Contributions to Slack channel (or the equivalent).
Initial briefing and on-going communications
Here are my suggestions:
- Written briefing on benefits and problems.
- Briefing meetings for all stakeholders.
- Written "rules of engagement" stating how code is to be used and not used. These rules will be updated as the project proceeds.
- Regular feedback sessions for hands-on participants. These sessions are where people share their experiences.
- Regular reports to executives on project progress.
- On-going communications forum. This could be something like a Slack channel.
- Documentation hub. This is a single known place where users can go to get relevant materials, e.g.
- Set-up instructions
- Cursor rules (or the equivalent)
- "Rules of engagement"
Clear lines of responsibility
Assuming there are multiple people involved in an evaluation or roll-out, we need to define who does what. For this project, this means:
- One person to act as the (Cursor) rules controller. The quality of generated code depends on the rules; if everyone uses wildly different rules, the results will be inconsistent. The rules controller will provide recommended rules that everyone should use. Participants can experiment with rules, but they must keep the controller informed.
- One person to act as the recommendations controller. As I've explained, there are "dos" and "don'ts" for working with code generation; these are the "rules of engagement". One person should be responsible for continually keeping this document up to date.
Limits on project scope
There are multiple IDEs on the market and there are multiple LLMs that will generate code. Evaluating all of them will take considerable time and be expensive. My recommendation is to choose one IDE (e.g. Cursor, Windsurf, Lovable, or one of the others) and one agentic AI. It's OK to have some experimentation at the boundaries, e.g. experimenting with a different agentic AI, but this needs to be managed - as always, project discipline is important.
Training
Just setting people up and telling them to get started won't work. Almost all data scientists won't be familiar with Cursor or the VS Code IDE it's based on. Cursor works differently from other IDEs, and there's little in the way of useful tutorials online. This raises the question: how do you get the expertise to train your team?
The answer is a pilot program as I've explained. This should enable you to bootstrap your initial training needs using in-house experience.
You should record the training so everyone can access it later if they run into trouble. Training must include what not to do, including pointing out failure modes (e.g. blindly accepting generated code); this is the content of the "rules of engagement".
It may also be worth re-training people partway through the project with the knowledge gained so far.
(Don't forget, data scientists mostly don't use IDEs, so part of your training must cover basic IDE usage.)
Notebook and Cursor working together
This is the core problem for data science. Figuring out a way of using an agentic IDE and a notebook together will be challenging. Here are my recommendations.
- Find a way of ensuring the agentic IDE and the notebook can use the same code file. Most notebooks can read in Python files, and there are sometimes ways of preserving cell boundaries in a Python file (e.g. using the "# %%" cell markers shown in the sketch below).
- Edit the same Python file in Cursor and in the notebook (this may mean refreshing the notebook so it picks up any changes; Cursor seems to pick up changes by itself).
- Use Cursor for comments, code completion etc. Use the notebook for live code development and debugging.
Precisely how to do this will depend on the exact choice of agentic IDE and notebook.
This process is awkward, but it's the best of the options right now.
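As a sketch of what the shared-file approach can look like (assuming a notebook environment that understands "# %%" cell markers, such as VS Code's interactive window, Spyder, or a Jupytext-paired notebook; the file and column names are hypothetical):

```python
# analysis.py - one file, edited in Cursor and run cell-by-cell in a
# notebook-style environment. The "# %%" markers define the cell boundaries.

# %%
import pandas as pd

# %%
# Load the raw data (hypothetical file and column names)
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# %%
# Interactive exploration - re-run this cell repeatedly while debugging
print(orders.groupby("region")["revenue"].describe())

# %%
# Candidate feature: days since the customer's previous order
orders = orders.sort_values(["customer_id", "order_date"])
orders["days_since_prev"] = (
    orders.groupby("customer_id")["order_date"].diff().dt.days
)
```

If your notebook of choice can't open a file like this directly, a tool such as Jupytext can convert between the "# %%" script format and .ipynb.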
(Cursor) Rules
Agentic IDEs rely on a set of rules that guide code generation. These are like settings, but expressed in English prose, and they help govern the style of the generated code. What these rules are called varies from IDE to IDE, but in Cursor they're called "Rules".
I suggest you start with a minimal set of Rules, perhaps 10 or so. Here are three to get you started:
"Act as an experienced data scientist creating robust, re-usable, and readable code.
Use the latest Python features, including the walrus operator. Use list comprehensions rather than loops where it makes sense.
Use meaningful variable names. Do not use df as the name of a dataframe variable."
There are several sites online that suggest Rules. Most suggest verbose and long Rules. My experience is that shorter and more concise works better.
Regression tests
As part of the development process, use Cursor to generate test cases for your code, which includes generating test data. This is one of Cursor's superpowers and one of the places where you can see big productivity improvements.
Cursor can occasionally introduce errors into existing code. Part of the "rules of engagement" must be running regression tests periodically or when the IDE has made substantial changes. In traditional development, this is expensive, but agentic IDEs substantially reduce the cost.
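To make this concrete, here's the shape of regression test an agentic IDE can generate. The module and function under test (cleaning.clean_prices) and the expected values are hypothetical, so treat this as a sketch rather than a template.

```python
# test_cleaning.py - run with pytest
import pandas as pd
import pandas.testing as pdt

from cleaning import clean_prices  # hypothetical module and function under test


def test_clean_prices_known_input():
    # Known-good input/output pair captured when the code was last verified
    raw = pd.DataFrame({"price": ["$1,200", "$950", None]})
    expected = pd.DataFrame({"price": [1200.0, 950.0, None]})

    result = clean_prices(raw)

    pdt.assert_frame_equal(result, expected)
```

Generating the first version of a harness like this is cheap; the discipline is in re-running it whenever the IDE makes substantial changes.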
GitHub
Cursor integrates with GitHub and you can update Git repositories with a single prompt. However, it can occasionally mess things up. You should have a good set of tactics for GitHub integration, including having an in-house expert who can fix issues should they arise.
"Rules of engagement"
I've referred to this document a number of times. This is a written document that describes how to use code gen AI and how not to use it. Here are the kinds of things it should contain:
"Use code generation via the prompt to create function and code outlines, e.g. specifying that a file will contain 5 functions with a description of what the functions do. Most of the time, it's better to ask the agent to product code stubs. However, if a function is boilerplate, e.g. reading a CSV file into a dataframe, then you can prompt for full code generation for that function.
...
Do not use code generation or code completion for medium to complex dataframe manipulations. You can use it for simple dataframe manipulations. You can use code completion to get a hint, but you shouldn't trust it.
...
Use the prompt to comment your code, but be clear in your prompt that you want comments only and no other changes.
...
Before running regression tests, prompt the AI to comment your code.
"
You should periodically update the rules of engagement and make sure users know the rules have changed. As I stated earlier, one person should be responsible for maintaining and updating the rules of engagement.
Conclusions
Successfully rolling out agentic AI code generation to data scientists is not a trivial task. It will require a combination of business and technical savvy. As ever, there are political waters to navigate, both up and down the organization.
- Agentic IDEs are not notebooks. You need to find a way of working that combines notebooks and IDEs. Success depends on this.
- Pilot programs will let you bootstrap a roll-out, without them, you'll find roll-outs difficult to impossible.
- Training, "rules of engagement", and communication are crucial.
Other resources
I'm in the process of developing a very detailed analysis of using Cursor for data science. This analysis would form the basis of the "rules of engagement". I'm also working on a document similar to this for more traditional software engineering. If you're interested in chatting, contact me on LinkedIn: https://www.linkedin.com/in/mikewoodward/.