Hidden AI Guardrails and a Wrongful Arrest Rock Tech Trust

Anthropic Caught Hiding Secret Guardrails Inside Claude Fable 5

Anthropic built a hidden trap door into its most powerful publicly available AI model and quietly used it to feed users degraded, altered responses without telling them when it happened. That is not a hypothetical concern or a bug report. It was company policy, at least until the backlash got loud enough to force a reversal.

Here is what actually happened. Claude Fable 5, the first widely available model from Anthropic's Mythos class of AI systems, launched with a hidden guardrail targeting something called distillation — the practice of training smaller, cheaper AI models by feeding them the outputs of larger, more expensive ones. Anthropic did not want competitors or researchers doing that with Fable. Fair enough. But instead of telling users their queries had been flagged, the company quietly changed what the model said back to them. Worse responses, no warning, no explanation.

The policy sat inside Fable's system card, a public but deeply unsexy document that most people never read. There, Anthropic disclosed that it planned to silently degrade outputs for suspected distillation attempts, which means it was technically transparent, just in the most obscure way imaginable. The company now admits that was the wrong call.

After significant blowback from the AI research community, Anthropic announced it is changing course. Queries that trigger the distillation guardrail will now be rerouted to Claude Opus 4.8, the previous flagship, and users will be explicitly told when that happens. Every single time. That is a meaningful shift — it is the same approach Fable already uses for other sensitive domains like biology, chemistry, and cybersecurity, where flagged queries get bumped to Opus 4.8 rather than silently mangled.
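For readers who think in code, the difference between the old and new behavior comes down to whether a notice travels with the reply. The sketch below is purely illustrative: the model identifiers, the `Reply` type, the `answer` function, and the stub generator are all assumptions made for this example, not Anthropic's implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical identifiers, chosen only for this illustration.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"


@dataclass
class Reply:
    model: str             # which model actually produced the text
    text: str              # the response shown to the user
    notice: Optional[str]  # explicit disclosure, set whenever a reroute happens


def answer(query: str, flagged: bool, generate: Callable[[str, str], str]) -> Reply:
    """Route a query, always disclosing a reroute instead of silently altering output."""
    if flagged:
        return Reply(
            model=FALLBACK_MODEL,
            text=generate(FALLBACK_MODEL, query),
            notice=f"This request was rerouted to {FALLBACK_MODEL}.",
        )
    return Reply(model=PRIMARY_MODEL, text=generate(PRIMARY_MODEL, query), notice=None)


def stub_generate(model: str, query: str) -> str:
    """Stand-in for a real model call, so the sketch runs on its own."""
    return f"[{model}] response to: {query}"


rerouted = answer("example query", flagged=True, generate=stub_generate)
normal = answer("example query", flagged=False, generate=stub_generate)
```

The point of the design is that a flagged query can never produce a reply without a notice attached, which is exactly what the silent approach lacked.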

In a statement, Anthropic offered a candid explanation for why it went invisible in the first place. Visible safeguards can be probed and gamed, so building robust ones takes real time. Invisible ones can be deployed quickly and tuned narrowly. That logic is not crazy — but the company acknowledged it chose speed over honesty and got the balance wrong.

The broader context makes this more interesting, and more complicated. Fable belongs to a class of models Anthropic had previously said were too dangerous to release publicly at all. The company spent months flagging frontier AI risks, then launched anyway with guardrails it believed were strong enough. Some of those guardrails, particularly around biology, ended up being calibrated so aggressively that even routine scientific queries were blocked — a problem Anthropic has acknowledged.

What Anthropic is really wrestling with is a tension that every frontier AI lab faces right now: how do you release genuinely powerful, genuinely risky technology responsibly without either crippling it or hiding what you are doing to limit it? Covert restrictions feel like a neat solution until users and researchers discover them. Then it just looks like deception.

Transparency is not just a PR move here. Researchers need to know what a model will and will not do to actually evaluate it. If the model is secretly substituting answers, the entire evaluation is compromised. That is a real scientific problem, not just a vibe issue — and it is why the AI community reacted so sharply.

Source: The Verge

Man Jailed After Florida Cops Trusted a 93 Percent AI Match

Police arrested Robert Dillon on one of the most serious and socially devastating charges a person can face — attempting to lure a child — based primarily on a photo taken of a computer screen showing blurry surveillance footage. A facial recognition algorithm said there was a 93 percent chance it was him. It was not.

Dillon, a 52-year-old living in Fort Myers, Florida, is now suing multiple law enforcement agencies after spending over two months under prosecution for a crime that happened more than 300 miles from his home. The charges were eventually dropped by the State Attorney's Office. But according to a lawsuit filed this week in federal court, the damage had already been done — and it did not have to happen at all.

The facial recognition match came from a system called FACES, operated by the Pinellas County Sheriff's Office and shared with partner agencies including the Jacksonville Sheriff's Office and the Jacksonville Beach Police Department. A detective flagged Dillon as the suspect from that match. The lawsuit argues that rather than using the match as a starting point for investigation, officers built a case designed to confirm it.

Here is the part that makes this especially hard to excuse: investigators ran a license plate reader database search and found zero evidence that Dillon's vehicle had ever been in the Jacksonville Beach area during the relevant time frame. That is not ambiguous. That is the kind of result that should pump the brakes on a prosecution. According to the lawsuit, it did not.

The source image itself was a photograph taken of a McDonald's surveillance monitor: not a direct export of the surveillance footage, but a picture of a screen showing that footage. That is the kind of image quality that should make any analyst nervous about drawing confident conclusions. A 93 percent match score sounds impressive until you understand that the underlying image was poor enough that the algorithm was essentially making its best guess.

Dillon is represented by the ACLU and a private law firm. The lawsuit names the City of Jacksonville Beach, two police officers, and two county sheriffs as defendants. It seeks financial damages and, notably, changes to how these agencies actually use facial recognition going forward.

This is not an isolated incident. The ACLU notes that Dillon is one of at least 15 known people to have been wrongfully arrested due to faulty facial recognition matches in the United States. The real number is almost certainly higher, because most cases never attract this kind of legal scrutiny or public attention.

The core problem is not that facial recognition technology exists. It is that some departments treat a high match score as a conclusion rather than a lead. An algorithm flagging someone is supposed to be the beginning of an investigation. When it quietly becomes the end, innocent people end up in handcuffs, accused of crimes they did not commit, in places they have never been.
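The "lead, not conclusion" principle can be written down as a simple rule in which the match score alone never decides the outcome. This is a minimal sketch under stated assumptions: the `Lead` type, the `next_step` function, and its three outcomes are hypothetical, and they do not describe FACES or any agency's actual procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lead:
    match_score: float  # algorithm's similarity score, e.g. 0.93
    corroborating: List[str] = field(default_factory=list)  # independent supporting evidence
    contradicting: List[str] = field(default_factory=list)  # evidence pointing away from the match


def next_step(lead: Lead) -> str:
    """Decide what to do with a match; the score by itself is never sufficient."""
    if lead.contradicting:
        return "re-examine the match"
    if not lead.corroborating:
        return "keep investigating"
    return "refer for review"


# Illustrative inputs based on the facts described in the lawsuit.
dillon_like = Lead(
    match_score=0.93,
    contradicting=["plate reader search found no record of the vehicle in the area"],
)
score_only = Lead(match_score=0.93)
```

Under this rule, a 93 percent score with contradicting evidence sends investigators back to the match itself, and a score with nothing else behind it only justifies more investigation.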

Source: Ars Technica
