Anthropic built a secret trap door into its most powerful publicly available AI model and quietly used it to feed users degraded, altered responses without telling the people on the receiving end. That is not a hypothetical concern or a bug report. That was company policy, at least until the backlash got loud enough to force a reversal.
Here is what actually happened. Claude Fable 5, the first widely available model from Anthropic's Mythos class of AI systems, launched with a hidden guardrail targeting something called distillation — the practice of training smaller, cheaper AI models by feeding them the outputs of larger, more expensive ones. Anthropic did not want competitors or researchers doing that with Fable. Fair enough. But instead of telling users their queries had been flagged, the company quietly changed what the model said back to them. Worse responses, no warning, no explanation.
The policy was buried inside Fable's system card, a public but deeply unsexy document that most people never read. Anthropic did disclose there that it planned to silently degrade outputs for suspected distillation attempts, which means it was technically transparent, just in the most obscure way imaginable. The company is now admitting that was the wrong call.
After significant blowback from the AI research community, Anthropic announced it is changing course. Queries that trigger the distillation guardrail will now be rerouted to Claude Opus 4.8, the previous flagship, and users will be explicitly told when that happens. Every single time. That is a meaningful shift — it is the same approach Fable already uses for other sensitive domains like biology, chemistry, and cybersecurity, where flagged queries get bumped to Opus 4.8 rather than silently mangled.
In a statement, Anthropic offered a candid explanation for why it went invisible in the first place. Its reasoning: visible safeguards can be probed and gamed, so building robust ones takes real time, while invisible ones can be deployed quickly and tuned narrowly. That logic is not crazy, but the company acknowledged it chose speed over honesty and got the balance wrong.
The broader context makes this more interesting, and more complicated. Fable belongs to a class of models Anthropic had previously said were too dangerous to release publicly at all. The company spent months flagging frontier AI risks, then launched anyway with guardrails it believed were strong enough. Some of those guardrails, particularly around biology, ended up being calibrated so aggressively that even routine scientific queries were blocked — a problem Anthropic has acknowledged.
What Anthropic is really wrestling with is a tension that every frontier AI lab faces right now: how do you release genuinely powerful, genuinely risky technology responsibly without either crippling it or hiding what you are doing to limit it? Covert restrictions feel like a neat solution until users and researchers discover them. Then it just looks like deception.
Transparency is not just a PR move here. Researchers need to know what a model will and will not do to actually evaluate it. If the model is secretly substituting answers, the entire evaluation is compromised. That is a real scientific problem, not just a vibe issue — and it is why the AI community reacted so sharply.