The Model Didn't Change

Arlo Gilbert

This morning Anthropic released Claude Opus 4.7. It's faster. It scores higher on coding benchmarks. SWE-bench Pro went from 53.4% to 64.3%. It processes images at triple the resolution of its predecessor. There's a new "xhigh" effort level for people who want more reasoning depth without going to max. The reviews will be positive. By most measures, it's a genuine step forward.

It's also arriving after six weeks of users asking what happened to the last model.

February and March

On February 9, Anthropic shipped a feature called adaptive thinking. Instead of a fixed reasoning budget, Claude would decide on its own how much to think about each response. On March 3, they changed the default effort level from high to medium. Both changes appeared in changelogs.

Together, those two updates transformed what Claude's heaviest users experienced. Not the model weights. Not the training data. Not the architecture. The defaults.

On April 2, Stella Laurenzo filed a GitHub issue. Laurenzo is Senior Director of AI at AMD, leading the compiler team that builds open-source AI infrastructure (IREE, Torch-MLIR). She was previously a VP at Nod.ai before AMD acquired it, and before that a director at Google. Not a casual user. She brought 6,852 Claude Code sessions, 17,871 thinking blocks, and 234,760 tool calls spanning three months of stable internal engineering work.

Her findings were specific. Median visible thinking depth dropped from roughly 2,200 characters in January to 600 by March, a 73% collapse. The ratio of code reads to code edits fell from 6.6 to 2.0, meaning Claude was reading about a third as much code before making changes. "Should I continue?" bail-outs appeared 173 times in 17 days; before March 8, the count was zero. Her team's estimated monthly API costs went from $345 to $42,121. The model was burning tokens on restarts and rework instead of getting it right the first time. Her conclusion: Claude Code "cannot be trusted for complex engineering work."
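Metrics like these are straightforward to reproduce once you have structured session logs. A minimal sketch, assuming a hypothetical list of session records with `thinking_chars`, `reads`, `edits`, and `bailout` fields (the field names and sample values are illustrative, not Claude Code's actual log format):

```python
from statistics import median

# Hypothetical session records; field names and values are
# illustrative, not Claude Code's real log schema.
sessions = [
    {"thinking_chars": 2400, "reads": 33, "edits": 5, "bailout": False},
    {"thinking_chars": 2100, "reads": 26, "edits": 4, "bailout": False},
    {"thinking_chars": 650,  "reads": 8,  "edits": 4, "bailout": True},
    {"thinking_chars": 550,  "reads": 6,  "edits": 3, "bailout": True},
]

def session_metrics(sessions):
    """Median visible thinking depth, aggregate read/edit ratio,
    and bail-out count across a set of sessions."""
    depth = median(s["thinking_chars"] for s in sessions)
    ratio = sum(s["reads"] for s in sessions) / sum(s["edits"] for s in sessions)
    bailouts = sum(1 for s in sessions if s["bailout"])
    return depth, ratio, bailouts

depth, ratio, bailouts = session_metrics(sessions)
print(f"median thinking depth: {depth} chars")
print(f"read/edit ratio: {ratio:.1f}")
print(f"bail-outs: {bailouts}")
```

The point of tracking a ratio rather than raw counts is that it survives changes in workload volume: fewer reads per edit is a behavioral shift even if total usage grows.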

The defense

Anthropic's response was technically precise. They pointed to the two configuration changes. They noted that the effort level had been lowered in response to user feedback about excessive token consumption. They suggested /effort high or /effort max as workarounds.

They were careful to say: we didn't change the model.

This is accurate. The model weights for Opus 4.6 didn't change. What changed was how much the model was allowed to think before answering. At medium effort, Claude sometimes allocated zero reasoning tokens to tasks it judged simple. Zero tokens of reasoning. Boris Cherny documented cases where this produced precise hallucinations: fabricated commit SHAs, nonexistent API versions, apt packages that have never existed. The model didn't degrade. It just stopped thinking.

The BridgeBench leaderboard retested Opus 4.6 and claimed accuracy dropped from 83.3% to 68.3%. Critics called the methodology flawed, and they had a point. But the volume of complaints kept growing across GitHub, Reddit, and X regardless. Marginlab's independent evaluations showed a baseline pass rate slipping from 56% to 50% by April 10. The benchmarks were noisy. The user frustration was not.

The lights flickering

While this played out on social media and in GitHub threads, Anthropic's infrastructure was having its own problems. Outages hit on April 3, 6, 7, 8, 13, and 15. The April 15 incident took down Claude.ai, the API, and Claude Code simultaneously. First confirmed at 10:53 AM ET, briefly recovered, went back down at 11:40 AM, and wasn't fully resolved until 1:42 PM. March 2026 alone had 14 product releases and 5 outages.

For users already questioning whether their model had been degraded, repeated service failures amplified the doubt. When the model feels different and the service keeps going down, the two problems blur into one. People stop parsing whether this is a configuration issue or a capacity issue or a model issue. They just know the thing they're paying for isn't working the way it did last month.

The standard that doesn't exist

There's a useful comparison from an older industry. When a pharmaceutical company wants to sell a generic version of a brand-name drug, the FDA requires bioequivalence testing. You prove that the generic produces the same clinical outcome as the original. Not similar. Equivalent. The statistical framework is specific: the 90% confidence interval for key pharmacokinetic parameters must fall within 80-125% of the original. An entire regulatory process exists around one idea: when you change how a product is delivered, you owe the user proof that the outcome doesn't change.
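The acceptance criterion itself reduces to a simple interval check. A minimal sketch, assuming the 90% confidence interval for the generic-to-reference ratio of a pharmacokinetic parameter has already been computed upstream (the function name and sample values here illustrate the rule, not any FDA software):

```python
def is_bioequivalent(ci_lower, ci_upper, lo=0.80, hi=1.25):
    """FDA-style rule: the ENTIRE 90% confidence interval for the
    generic/reference ratio must fall inside the 80-125% window.
    A passing point estimate is not enough."""
    return lo <= ci_lower and ci_upper <= hi

# A CI of 0.92-1.08 passes; a CI that dips to 0.78 fails even
# though its point estimate may sit comfortably near 1.0.
print(is_bioequivalent(0.92, 1.08))  # True
print(is_bioequivalent(0.78, 1.02))  # False
```

Note the asymmetry of the window (0.80 to 1.25): the bounds are reciprocals, so the test treats a 20% shortfall and a 25% excess symmetrically on a ratio scale.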

AI has nothing like this. Anthropic changed the reasoning depth, the effort default, and the visibility of thinking tokens over a few weeks in February and March. Each change was arguably defensible on its own. Together, they produced something that felt like a different product to the people who used it hardest. The disclosure was a changelog entry among fourteen others.

No standard requires an AI company to prove that a configuration change preserves the user experience. No bioequivalence test for model defaults. The expectation from Anthropic was that users would read the changelog, understand the implications, and adjust their settings. The reality is that an AMD director with 17,871 thinking blocks of data had to reverse-engineer what happened.

What 4.7 fixes

Opus 4.7 looks good. The coding improvements are real. Multi-step task performance is up 14% over 4.6 with fewer tool errors. The new xhigh effort level gives users more control over the reasoning-speed tradeoff. Vision capabilities tripled. These aren't marketing numbers. The benchmark gains are meaningful.

But 4.7 doesn't address the thing that made 4.6 a trust problem. The issue was never that the model was bad. The experience changed without adequate notice, the disclosure was buried, and when users pushed back, the first response was that they should adjust their settings. Shipping a better model is the right move. It doesn't establish a standard for how future changes get communicated.

The next time any AI company optimizes for latency or cost by adjusting reasoning defaults, the same cycle will play out unless there's a standard in place that prevents it. And right now there isn't one.

Writing this on the product in question

I should be direct about something. I'm writing this post with Opus 4.7. The product I'm critiquing is the product helping to produce these words. I chose to write about Anthropic's transparency record using Anthropic's tool. You can read that as ironic or as the whole point.

I use Claude every day. We use it at Osano. I depend on it for real work. When something changes about how it performs, I need to know clearly, in advance, with enough context to assess the impact. Not from a changelog I have to go looking for. Not from a workaround posted in a GitHub thread after the community reverse-engineers the problem.

This is the same standard we advocate for in privacy. When a company changes how it processes your data, you deserve clear notice. When a company changes how its AI reasons through your problems, you deserve the same.

Opus 4.7 is probably a good model. I'll know as I use it. The question that sits underneath the benchmarks is simpler: six months from now, will I be able to tell if it changed?
