The Second Software Crisis
Arlo Gilbert
I've been reviewing more code in the last six months than in the previous five years combined. Not because our team is bigger. Not because we're shipping more features. Because the ratio changed. The amount of code being produced per engineer is up dramatically, and almost none of the systems we built to catch mistakes have scaled with it.
I don't think we're alone in this. The data that's been piling up over the last few months suggests something uncomfortable: the tools got faster, but the process didn't follow, and the gap between those two things is where the bugs live.
The numbers
CodeRabbit published a report in December analyzing 470 open-source GitHub pull requests. AI-generated code produced 1.7 times more issues than human-written code. Not marginal stuff. Logic and correctness errors were up 75%. Security vulnerabilities appeared at 1.5 to 2 times the rate. Performance inefficiencies showed up nearly eight times more often.
Across the broader dataset, pull requests per author went up 20% year over year. Incidents per pull request went up 23.5%. More code, and problems growing even faster than the code itself: if quality had merely held steady, the incident rate per PR would have stayed flat.
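To see how those two rates compound, here's a back-of-envelope calculation. The baseline figures are arbitrary placeholders I've chosen for illustration; only the two growth rates come from the data above.

```python
# Back-of-envelope: how the two reported growth rates compound.
# Baseline values are hypothetical placeholders; only the growth
# rates (+20% PRs per author, +23.5% incidents per PR) come from
# the report.
prs_per_author = 100          # hypothetical baseline, per year
incidents_per_pr = 0.10       # hypothetical baseline rate

pr_growth = 1.20              # PRs per author, year over year
incident_rate_growth = 1.235  # incidents per PR, year over year

baseline_incidents = prs_per_author * incidents_per_pr
new_incidents = (prs_per_author * pr_growth) * (incidents_per_pr * incident_rate_growth)

total_growth = new_incidents / baseline_incidents - 1
print(f"Total incident growth per author: {total_growth:.1%}")
# → Total incident growth per author: 48.2%
```

If code volume were the only thing changing, incidents per author would have grown 20%. The rising per-PR incident rate compounds on top of that to nearly 50%.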
Then there's the METR study. This one stung. METR ran a randomized controlled trial with 16 experienced open-source developers. These were people working on tasks in codebases they'd contributed to for an average of five years. Before starting, the developers predicted AI tools would make them 24% faster. After finishing, they estimated they'd been about 20% faster.
The actual measurement: 19% slower.
Not beginners. Not toy projects. Experienced developers on their own codebases, using Cursor Pro and Claude 3.5 Sonnet, ended up taking longer with AI tools than without them. The follow-up study in early 2026 with a larger cohort narrowed the gap to roughly a 4% slowdown. But even the updated numbers don't show a statistically significant speedup. The perceived gains are real. The measured gains aren't there yet for complex work in mature systems.
PR sizes, meanwhile, have ballooned 154%. Code review time has roughly tripled. Gartner is projecting a 2,500% increase in AI-related software defects and says 75% of technology leaders will face moderate to severe technical debt from AI-generated code by the end of this year.
Where the bottleneck moved
The story being told in most boardrooms is simple: AI coding tools make developers 40% more productive. Ship more, hire less, margins improve. It's a clean story. It also misses what actually happened.
AI tools shifted the bottleneck. The constraint used to be writing code. Now it's reviewing code. And the uncomfortable truth is that most engineering organizations are still staffed, structured, and incentivized around the old bottleneck.
We built decades of process around the assumption that writing code is slow and expensive. Code review, testing, CI/CD, design reviews, architecture discussions. All of it exists because writing code was the hard part, and you wanted to make sure the slow, expensive thing was done right.
When writing code got fast and cheap, the review layer just got buried. A senior engineer who used to review three pull requests a day is now looking at eight, each one 154% larger. The quality of those reviews is suffering, because human attention doesn't scale the way token generation does.
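Those two shifts multiply. A rough sketch of the reviewer's workload using the figures above (three PRs a day then, eight now, each 154% larger — the specific PR counts are the illustrative ones from this paragraph, not measured data):

```python
# Rough review-load multiplier from the figures in the text.
# Units are arbitrary; only the ratios matter.
prs_before, prs_after = 3, 8  # illustrative PRs reviewed per day
size_growth = 2.54            # "154% larger" means 2.54x the size

load_before = prs_before * 1.0
load_after = prs_after * size_growth

print(f"Review load multiplier: {load_after / load_before:.1f}x")
# → Review load multiplier: 6.8x
```

Nearly a sevenfold jump in material to review, with no change in the number of attentive hours a reviewer has in a day.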
This is the part I keep coming back to as someone who builds AI products for a living. The problem isn't that AI writes bad code. Sometimes it does, sometimes it doesn't. The problem is that we have no scalable process for verifying the output. We're generating code at machine speed and reviewing it at human speed. The math doesn't work.
Garmisch, 1968
In October 1968, NATO convened about 50 computer scientists in Garmisch, Germany. The topic was a problem that didn't have a name yet. Computers had become powerful enough that people were trying to build systems far beyond their ability to manage. Projects were failing, budgets were exploding, and the software being delivered was brittle and late.
A German computer scientist named Fritz Bauer proposed the title for the conference: "Software Engineering." It was deliberately provocative. Everyone in the room knew there was no such discipline. The name described what they needed to invent, not what already existed.
The attendees gave the underlying problem a name too: the software crisis. The crisis wasn't the hardware. The hardware was fine. The crisis was that the tools for building software had outrun the processes for managing it. Writing code was finally easy enough that you could attempt things you couldn't finish correctly.
They spent the next two decades inventing the answer. Structured programming. Version control. Formal code review. Testing frameworks. Design patterns. The entire discipline of software engineering grew out of that conference, built to close the gap between what we could produce and what we could verify.
What's familiar
I'm not trying to be dramatic about this. But the structural parallel is hard to ignore.
In 1968, the problem was that computers got powerful enough to attempt large systems, and we had no process for managing the complexity. In 2026, AI tools got powerful enough to generate code at volume, and we have no process for reviewing it at that volume.
The first software crisis gave us code review as a discipline. This one might need to give us something for AI-generated code specifically. Not just better linters or smarter test suites, though those help. Something more fundamental about how we verify software that's being produced faster than any human team can read it.
Part of the answer is AI reviewing AI. Automated code review tools are getting better. But anyone who's used them knows they catch syntax and pattern violations, not architectural mistakes or subtle business logic errors. The hard stuff still needs a human, and the human is now the bottleneck.
The organizational piece matters too. If your eng team's performance metric is still "PRs merged per sprint," you're incentivizing exactly the wrong thing. You're measuring the part that got cheap and ignoring the part that got expensive.
But honestly, before any of that, there's a conversation most teams haven't had. The 40% productivity gain is real for a specific definition of productivity: lines of code generated per hour. By that metric, every developer on your team is a superhero. By the metric that actually matters (working software shipped without breaking things), the picture is murkier.
The name we need
The attendees at Garmisch in 1968 did something smart. Before they tried to solve the software crisis, they named it. Naming a problem makes it legible. It turns a vague anxiety into something you can staff, budget, and build process around.
We don't have a name yet for what's happening in 2026. "Quality collapse" is close but sounds alarmist. "AI code debt" is accurate but bland. Whatever we end up calling it, the shape of the problem is clear. Our tools for producing software have, once again, outrun our tools for verifying it. The last time this happened, it took a conference in the Bavarian Alps and twenty years of work to close the gap.
I'd like to think we can move faster this time. We have better infrastructure, smarter tooling, and the advantage of knowing that this pattern has a precedent. But the first step is the same as it was in 1968. Admit that the speed isn't free, and start building the process to pay for it.