Cursor’s AI Agents Now Write 35% of Its Own Production Code. Your DevOps Pipeline Is Not Ready for What Comes Next


Sources: Cursor official blog (cursor.com/blog/agent-computer-use and cursor.com/blog/third-era), Cursor changelog (cursor.com/changelog/02-24-26), DevOps.com, NxCode, Medium/Daksh Bhardwaj, MorphLLM comparison. All statistics are from Cursor’s own published posts unless noted.


This is not a benchmark claim. It is not a demo number or a controlled experiment result. It is a statement from Cursor’s own blog about code that is running in their live product right now: “Thirty-five percent of the PRs we merge internally at Cursor are now created by agents operating autonomously in cloud VMs.”

Cursor is a $29.3 billion company with millions of active users. Its product is the code editor that a large chunk of professional developers use every day. When 35 percent of the pull requests shipping to that product come from autonomous AI agents running on their own virtual machines in the cloud, it is a data point about the direction of the entire industry, not just one company’s internal tooling experiment.

The question worth asking is not whether this is impressive. It clearly is. The more useful question is what it actually means for the cloud and DevOps practices that organizations have been building for the last decade, and what needs to change in how software delivery is structured to account for agents as first-class contributors rather than tools that assist contributors.

What Actually Changed on February 24

Cursor has had some form of AI agent capability for a while. What changed on February 24, 2026, when the company launched Cloud Agents with Computer Use, was not the existence of agents but the environment they operate in and what they can do within it.

Before cloud agents, Cursor's agents ran locally. They ran on your machine, competing with your other processes for CPU and memory, sharing your file system, and constrained by whatever environment happened to be set up on your development machine at that moment. You could run one, maybe two in parallel before things started conflicting. And critically, the agent could write code but had no way to actually run the software it was building and verify that it worked. It produced a diff and handed it to you. What the diff actually did was your problem to figure out.

Cloud agents eliminate both constraints. Each agent spins up in its own isolated Linux virtual machine in the cloud, with a full development environment configured for your specific repository. It does not share resources with anything else. You can run ten agents in parallel, each working on a different task in complete isolation. And because each agent has its own full environment, it can do something that local agents could not: it can actually run the software it builds, interact with it through a browser, and validate that the code works before it ever hands anything back to you.

Cursor had been running this system internally for several months before the public launch, which is where the 35 percent figure comes from. That number is not from the first week. It reflects months of production use by Cursor’s engineering team on a codebase that ships to millions of users.

The Part That Matters: Giving the Agent Its Own Machine

The framing Cursor uses in their blog is worth quoting directly: “Agents are only as capable as the environment they run in. Without the ability to use the software they are creating, agents hit a ceiling.” That sentence describes the core limitation of every AI coding tool that came before this.

A developer working on a feature does not just write code into a void. They run it. They break it. They look at what the actual running behavior is versus what they intended, adjust, and run it again. That iteration loop between writing and seeing is fundamental to how software gets built. AI agents that could only do the writing half of that loop were fundamentally limited in the kinds of tasks they could take end-to-end. They were good at generating plausible code. They were much less good at generating correct code for anything complex enough to require seeing the actual behavior.

Giving the agent its own VM with a full development environment, a browser, and the ability to navigate that browser changes the equation. The agent can now observe the behavior of the software it is building and iterate based on what it actually sees, not just on whether the code syntactically compiles and passes static tests. For UI work in particular, this is a qualitative difference in capability. An agent that can click through a UI flow, observe what breaks, and fix it based on the visual output is doing something categorically different from an agent that writes the code for the flow and trusts the tests to verify it.


The long-running agent capability that Cursor shipped on February 12, two weeks before the cloud agent launch, adds another dimension. These agents run autonomously for 25 to 52-plus hours. In the few weeks they had been available before the cloud agent announcement, they had already produced pull requests containing over 151,000 lines of code. The combination of long-running execution and an environment where the agent can actually use the software it builds means the scope of what can be delegated has expanded substantially beyond what any previous AI coding tool could handle.

The Video Proof Problem and Why It Changes Code Review

Every cloud agent run produces a set of artifacts: video recordings, screenshots, and logs. These are attached to the pull request the agent submits. This is the change to code review that I think will prove more significant than it sounds.

Code review in its current form requires the reviewer to read a diff and mentally simulate what the change does. For simple changes, this works fine. For anything involving UI behavior, interaction flows, timing, or any state that is not obvious from reading the code, reviewers either run the code themselves, trust the test coverage, or make their best guess. None of these options scales well as the number of pull requests increases or as the complexity of individual changes grows.

An agent-generated PR that comes with a 30-second video showing the feature being built, tested, and demonstrated working is a different kind of artifact to review. Instead of reading a diff and simulating behavior, the reviewer watches the agent’s evidence and evaluates whether it demonstrates what was asked for. That shifts what code review is: less verification of whether the code is correct, more judgment about whether the correct thing was built. Those are related but different cognitive tasks, and the second one is arguably the more important one that human reviewers should be spending their time on.

Cursor’s team described a specific example that illustrates this clearly. They kicked off a cloud agent from Slack with a security vulnerability description. The agent built a complete exploit demonstration, started a backend server, loaded the exploit page in its browser, executed the clipboard exfiltration attack against a test UUID, recorded the complete attack flow as video, and committed the demo to the repository. The summary appeared in the original Slack thread. A human security engineer could then watch the recorded attack to evaluate whether it correctly reproduced the vulnerability before deciding how to respond. That is qualitatively different from asking a human to read code and mentally reconstruct whether the exploit works.

Three Things Cursor’s Own Team Is Using Agents For Right Now

Rather than speculating about what cloud agents might eventually be used for, it is more useful to look at the three main use cases Cursor's engineering team has actually settled on after months of internal use. These are the tasks where agents proved reliable enough to trust in production.

Feature development is the first and most obvious category. The specific example in Cursor’s blog is building source code links for the Cursor Marketplace. The agent implemented the feature end-to-end, then navigated to the imported Prisma plugin in the actual running application, clicked through each component to verify the GitHub links were working correctly, identified an issue, fixed it, and then rebased onto the main branch, resolved the merge conflicts, and squashed everything into a single clean commit. That is a complete feature development cycle, including the testing and cleanup steps that developers often rush, done autonomously.

Security vulnerability triage is the second use case, and the clipboard exfiltration example described above is representative. The pattern is: describe the vulnerability in Slack, the agent builds a proof of concept, records it, and returns evidence. The security engineer evaluates the evidence rather than building the proof of concept themselves. This compresses the triage time significantly and creates a recorded artifact that can be shared with the team and referenced later.

Full UI walkthroughs are the third use case that has seen significant adoption. Cursor assigned an agent to do a complete walkthrough of cursor.com/docs. The agent spent 45 minutes navigating every major UI element: the sidebar navigation, search functionality, copy buttons, the feedback dialog, table of contents anchors, and theme switching. It delivered a structured summary of everything tested, including anything that did not behave as expected. This is the kind of thorough QA work that often gets done incompletely by humans who are under time pressure. An agent running autonomously for 45 minutes with no time pressure and no temptation to skip the boring parts produces more complete coverage.

Tab Autocomplete Is Basically Dead Already

There is a data point in Cursor’s February blog post that deserves more attention than it has gotten. In March 2025, Cursor had approximately 2.5 times as many Tab autocomplete users as agent users. By February 2026, that ratio had completely inverted: they now have twice as many agent users as Tab users. The shift took roughly eleven months.

Cursor describes their current internal experience this way: “Most Cursor users never touch the tab key.” That is a striking statement about a feature that was, until fairly recently, the primary value proposition of AI coding tools. GitHub Copilot’s entire original premise was predictive tab completion. Cursor built much of its initial reputation on being better at tab completion than Copilot. Now the company is saying that most of their users have essentially stopped using it.

The reason is not that tab completion got worse. It is that agents got better. When an agent can handle the entire task of implementing a feature, the incremental value of having the tab key complete the next line of code becomes negligible. You are not writing the code line by line anymore. You are reviewing what the agent built. In that context, tab completion is solving a problem you no longer have.

This is the pattern that tends to happen with technology transitions: the new capability does not just add to the old one, it renders the old one irrelevant by operating at a different level of abstraction. The developers who have adopted agents most fully at Cursor are described as spending their time breaking down problems, reviewing artifacts, and giving feedback. Agents write almost 100 percent of their code. The developers are not coding anymore in the traditional sense. They are directing and evaluating.

In eleven months, the dominant mode of using an AI coding tool went from “complete my current line” to “go build this feature and come back with a PR.” That is not an incremental improvement. It is a different product being used in a different way.

What This Does to Your CI/CD Pipeline, Review Workflow, and Governance

This is the section that matters most for anyone building or maintaining DevOps infrastructure. The question is not whether cloud agents are impressive. The question is what happens to existing engineering processes when agents become first-class contributors to your repositories alongside humans.

Your CI/CD pipeline needs to be treated as the final word, not a rubber stamp. Cloud agents produce merge-ready PRs with video proof that the code worked in the agent’s sandbox. That is valuable evidence but it is evidence about the agent’s environment, not your production environment. Your pipeline still needs to run its own tests, apply your security scans, enforce your coding standards, and validate against your specific infrastructure configuration. Teams with strong, comprehensive CI benefit immediately because the agents produce better-quality PRs than many human contributors. Teams with weak or spotty CI will find that agents can introduce problems faster than those problems would have arrived before, because the volume of PRs increases when agents are running in parallel.
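The "pipeline is the final word" rule can be made concrete as a merge gate that treats agent sandbox evidence as advisory context and never as a substitute for checks run in your own environment. The sketch below is illustrative: the check names, PR field names, and `merge_gate` function are hypothetical, not any real CI system's API.

```python
# Hypothetical merge-gate logic: agent sandbox evidence never substitutes
# for the pipeline's own checks. All names and fields are illustrative.

REQUIRED_CHECKS = {"unit-tests", "security-scan", "lint"}

def merge_gate(pr: dict) -> tuple[bool, list[str]]:
    """Return (mergeable, reasons). A PR is mergeable only if every
    required pipeline check passed in *our* environment, regardless of
    whether the agent attached passing evidence from its sandbox."""
    passed = {c["name"] for c in pr.get("checks", []) if c.get("status") == "passed"}
    missing = sorted(REQUIRED_CHECKS - passed)
    if missing:
        return False, [f"required check not passing: {m}" for m in missing]
    # Agent artifacts (video, logs) remain useful for reviewers, but they
    # describe the agent's sandbox, not this pipeline's results.
    return True, []

agent_pr = {
    "author": "cloud-agent",
    "sandbox_evidence": {"video": "run.mp4", "logs": "agent.log"},
    "checks": [
        {"name": "unit-tests", "status": "passed"},
        {"name": "lint", "status": "passed"},
        # security-scan has not yet run in our environment
    ],
}

print(merge_gate(agent_pr))  # (False, ['required check not passing: security-scan'])
```

The point of the sketch is the asymmetry: the agent's video can persuade a reviewer, but only the pipeline's own checks can unblock the merge.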

Code review needs a rethink at the process level. Human-generated PRs and agent-generated PRs are fundamentally different artifacts even when they look similar in the diff view. The agent’s PR comes with video, screenshots, and logs. The review question shifts from “did someone write this correctly?” to “did the agent understand what was being asked?” Those require different review skills and different review checklists. Organizations that do not update their review processes to account for agent-generated code will end up with processes that were designed for one kind of artifact being applied to a different kind, which produces inconsistent results.
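One low-cost process change is to triage PRs by author type before a human ever opens the diff: agent PRs that arrive without their evidence artifacts get bounced automatically, and complete ones are routed to an evidence-first checklist. This is a hypothetical sketch; the field names and checklist labels are invented for illustration.

```python
# Hypothetical pre-review triage for a mixed human/agent PR queue.
# Agent PRs are reviewed against evidence artifacts, so routing should
# verify those artifacts exist before a human's time is spent.

AGENT_ARTIFACTS = ("video", "screenshots", "logs")

def review_route(pr: dict) -> str:
    """Pick a review path: human PRs keep the traditional diff-first
    checklist; agent PRs missing evidence are blocked; complete agent
    PRs get the evidence-first checklist."""
    if pr.get("author_type") != "agent":
        return "diff-first-review"
    evidence = pr.get("artifacts", {})
    missing = [a for a in AGENT_ARTIFACTS if a not in evidence]
    if missing:
        return "blocked: missing " + ", ".join(missing)
    return "evidence-first-review"

print(review_route({"author_type": "human"}))  # diff-first-review
print(review_route({"author_type": "agent",
                    "artifacts": {"video": "run.mp4"}}))  # blocked: missing screenshots, logs
```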

Governance and attribution need clear policies before agents are running at scale. Who is responsible when an agent-generated PR introduces a regression? What is the audit trail for understanding why an agent made a particular architectural decision? How do you handle security reviews when the author of the code cannot be interviewed? These are not hypothetical future concerns. They are operational questions that teams deploying cloud agents at scale in 2026 are navigating right now, and the organizations that think through them before deployment fare better than those that figure it out after a production incident.
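One concrete way to answer the attribution and audit questions is to record, for every agent run, who initiated it, the exact task specification, and which human owns the merge decision. The schema below is a sketch of that idea, not Cursor's actual record format; every field name is an assumption.

```python
# Hypothetical audit record for an agent run. The point is that "who is
# accountable" and "what exactly was the agent asked to do" are captured
# at run time, not reconstructed after an incident.
from dataclasses import asdict, dataclass, field
import datetime

@dataclass
class AgentRunRecord:
    task_spec: str             # the prompt/issue the agent was given, verbatim
    initiated_by: str          # human who kicked off the run
    accountable_reviewer: str  # human who owns the merge decision
    model: str                 # underlying model used for the run
    pr_url: str
    artifacts: dict            # video/screenshot/log URIs from the run
    started_at: str = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc).isoformat()
    )

record = AgentRunRecord(
    task_spec="Fix mobile Safari form validation on /auth/signin",
    initiated_by="alice",
    accountable_reviewer="bob",
    model="(model name)",
    pr_url="https://example.com/pr/123",
    artifacts={"video": "run.mp4", "logs": "agent.log"},
)
print(asdict(record)["accountable_reviewer"])  # bob
```

The design choice worth noting is that `accountable_reviewer` is a required field: a run cannot be recorded without naming the human who answers for the merge.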

From a cloud infrastructure perspective, the cost model changes when agents are active. Each cloud agent runs in its own VM, which means parallel agent runs translate directly into parallel cloud compute costs. A developer spinning up ten agents simultaneously is consuming ten VM-hours of compute for the duration of those runs. For individuals this is manageable. For teams where dozens of developers are each running multiple parallel agents, the aggregate cloud compute cost is a new line item that needs to be tracked and budgeted explicitly. FinOps tooling that was designed around human usage patterns will not automatically account for agent usage patterns without configuration.
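Because parallel agent runs map one-to-one onto VMs, the budget math is multiplicative rather than per-seat. A back-of-envelope sketch, with an assumed placeholder VM rate that you should replace with your provider's actual per-VM-hour price:

```python
# Back-of-envelope agent compute budgeting. The $0.10/VM-hour rate below
# is an assumed placeholder, not a quoted price from any provider.

def monthly_agent_vm_cost(developers: int, parallel_agents: int,
                          hours_per_run: float, runs_per_day: float,
                          workdays: int, vm_rate_per_hour: float):
    """Each parallel agent occupies its own VM for the duration of a run,
    so VM-hours scale multiplicatively with team size and parallelism."""
    vm_hours = developers * parallel_agents * hours_per_run * runs_per_day * workdays
    return vm_hours, vm_hours * vm_rate_per_hour

# 30 developers, 3 parallel agents each, 2-hour runs, twice a day,
# 21 workdays, at an assumed $0.10/VM-hour:
hours, cost = monthly_agent_vm_cost(30, 3, 2.0, 2.0, 21, 0.10)
print(hours, cost)  # 7560.0 756.0
```

Even at a modest assumed rate, 7,560 VM-hours a month is a line item that existing per-developer FinOps dashboards will not surface unless agent runs are tagged and tracked explicitly.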

GitHub Copilot Shipped the Same Week: How They Compare

Cursor was not the only major tool to ship cloud agent capabilities in late February. GitHub Copilot launched its coding agent for all paid users on February 26, two days after Cursor’s cloud agent release. The timing was either coincidence or competitive response. Either way, the week of February 24 to 28, 2026, will probably be looked back on as the week cloud agents went from an experimental feature at one company to a standard expectation across the industry.

Copilot’s cloud agent works differently from Cursor’s. It spins up a GitHub Actions VM, clones your repository, makes the requested changes, pushes commits to a draft PR, and iterates on CI failures. You assign tasks through GitHub issues or Copilot Chat. Since the February 26 launch, all paid users can choose Claude, Codex, or GitHub Copilot as the underlying model for their agent, and you can assign the same issue to all three simultaneously and compare the outputs. That multi-model assignment capability is genuinely useful for evaluating which model handles specific task types best for your codebase.

The key difference between the two implementations is computer use. Cursor’s agents can open a browser, navigate to your running application, interact with UI elements, and verify visually that the code they wrote does what it was supposed to do. Copilot’s agent works at the code and CI level but does not have the ability to interact with running software through a browser. For backend work and well-tested code changes, Copilot’s agent is a strong option at a lower price point. For frontend work, UI flows, and anything where visual verification of the running application matters, Cursor’s computer use capability is a meaningful differentiator. The right choice depends on what you are actually building and what your test coverage looks like.

What Agents Cannot Do Yet, and What That Means for Junior Developers

Cursor is direct about the limitations in their own blog post, which is worth noting because vendor communications about AI capabilities tend toward overselling rather than honest constraint disclosure. They write: “There is a lot of work left before this approach becomes standard in software development. At industrial scale, a flaky test or broken environment that a single developer can work around turns into a failure that interrupts every agent run.”

That is a real and important limitation. Agents are sensitive to the quality of the environment they are working in. A human developer who finds a flaky test will mentally note it, rerun, and continue. An agent that hits a flaky test has to either retry until it passes, escalate for human intervention, or produce a PR that fails CI for a reason unrelated to the code change. At the scale where multiple agents are running simultaneously, environmental quality problems that humans absorb through judgment and experience become systematic blockers that require explicit engineering attention.
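The human habit of "rerun and move on" can be replaced, for agent pipelines, with an explicit classification step: a test that passes only on retry is flagged as flaky and becomes a quarantine candidate instead of silently passing or hard-failing every agent run. A minimal sketch of that idea, with an invented `classify_test` helper:

```python
# Sketch of flake classification for agent CI runs. Instead of letting a
# flaky test silently pass on retry (the human workaround) or fail every
# agent run, classify the outcome so flaky tests can be quarantined.

def classify_test(test_fn, retries: int = 2) -> str:
    """Run a zero-argument test up to `retries + 1` times.
    'pass'  - passed on the first attempt
    'flaky' - failed at least once, then passed (quarantine candidate)
    'fail'  - never passed"""
    for attempt in range(retries + 1):
        if test_fn():
            return "flaky" if attempt > 0 else "pass"
    return "fail"

# A deterministic stand-in for a flaky test: fails once, then passes.
calls = {"n": 0}
def sometimes_passes():
    calls["n"] += 1
    return calls["n"] > 1

print(classify_test(lambda: True))       # pass
print(classify_test(sometimes_passes))   # flaky
print(classify_test(lambda: False))      # fail
```

The 'flaky' bucket is the one that matters at agent scale: it turns an environmental problem that humans absorb through judgment into a tracked work item.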

Agents also perform best on well-defined, bounded tasks with clear success criteria. “Fix the login button” is too vague. “The login button on /auth/signin does not trigger form validation on mobile Safari when the form has been partially filled and then cleared” is the kind of specification that agents handle well. The difference is not a limitation of the agent’s intelligence so much as a reflection of how all task delegation works: the quality of the output is bounded by the quality of the specification. Agents make that constraint more visible because they cannot ask clarifying questions as naturally as a human colleague would.

The harder question that the developer community is actively debating is what happens to junior developers in a world where agents handle most implementation work. The traditional path into software engineering involves writing a lot of code, making a lot of mistakes, getting code reviewed, learning from the feedback, and gradually developing intuition about system design and trade-offs through accumulated experience. If agents write most of the code, junior developers have fewer opportunities to accumulate that experience. Whether new patterns of learning emerge that are as effective as the old ones, or whether the career path into senior engineering becomes significantly harder to navigate, is genuinely uncertain. The people raising this concern are not being reflexively anti-AI. They are identifying a real structural change in how the skills that define good senior engineers have historically been developed.

The Honest Assessment: Where This Leaves DevOps in 2026

Thirty-five percent is the number that anchors this story, and it is worth being precise about what it does and does not mean. It means that at one specific company, in its own internal engineering workflow, over a period of months, autonomous cloud agents created more than one-third of the pull requests that were merged into production. That company has strong CI/CD, a mature code review culture, and deep familiarity with their own agents. The 35 percent figure is not transferable to every team that buys a Cursor subscription tomorrow.

What it is is a credible proof point that autonomous agents operating in cloud environments are capable of handling a significant fraction of real production software work, not toy examples or benchmarks, but the actual code running in the product that millions of people use. That proof point matters because it removes the “this is theoretical future technology” objection from the conversation. The question is no longer whether agents can ship production code. The question is what percentage of your specific team’s work can be delegated to agents given your codebase quality, your test coverage, your task definition practices, and your review processes.

The DevOps community’s job in 2026 is to answer that question for specific organizations and to build the infrastructure that makes delegating to agents safe and auditable at scale. That means CI/CD pipelines that treat agent-generated code with the same rigor as human-generated code but are also updated to process the artifact evidence that agents produce. It means governance frameworks that assign accountability clearly when agents are contributors. It means cost models that account for agent compute usage alongside developer compute usage. And it means being honest about the tasks where agents produce reliable results and the tasks where they still require significant human direction to avoid introducing more problems than they solve.

Cursor saying “a year from now, we think the vast majority of development work will be done by agents” is aspirational and should be treated as such. But the gap between “35 percent today” and “majority within a year” is narrower than most organizational roadmaps have accounted for. The teams that are thinking about what their engineering processes look like when that happens are better positioned than the teams that are waiting to see if it does.

Are you already using cloud agents in your workflow, or is your team still evaluating? Specifically interested in hearing from anyone who has run into CI/CD or governance issues at scale. That is where the real operational learning is happening right now and it is less documented than the capability announcements.

References (March 18, 2026):
Cursor Cloud Agents with Computer Use announcement (February 24, 2026): cursor.com/blog/agent-computer-use
Cursor “Third Era of Coding” blog (35% figure, Tab vs agent shift): cursor.com/blog/third-era
Cursor changelog February 24, 2026: cursor.com/changelog/02-24-26
DevOps.com analysis (Mitch Ashley, The Futurum Group quote): devops.com
NxCode deep dive on Cloud Agents: nxcode.io
GitHub Copilot coding agent for all paid users (February 26, 2026): github.blog
MorphLLM Cursor vs GitHub Copilot 2026 comparison (pricing, benchmarks): morphllm.com
Medium analysis by Daksh Bhardwaj: medium.com

One year ago, the question was whether AI could write useful code.
Today the question is what percentage of your CI/CD pipeline is ready for code that AI already wrote.
