Anthropic: 'We made the wrong tradeoff' in new model guardrails

Anthropic just flip-flopped on a policy that was silently limiting what some AI researchers could do with its new Claude Fable 5 model.

Earlier this week, the AI lab released Claude Fable 5, a public version of the Mythos model designed with extra safety measures to prevent misuse.

At release, Anthropic said that it took precautions like rerouting questions about cybersecurity, biology, and chemistry to less capable models to ensure people cannot use the advanced model to plan cyberattacks or build a bioweapon.

The lab also said that for those trying to use Fable 5 for AI development, the company would degrade the model’s performance without explaining the change to the user. Some in the developer community saw the move as a quiet way to prevent others from creating rival AI systems, Business Insider previously reported.

Wednesday’s statement reverses that move: Fable 5 will now tell users whether their prompt is being refused or rerouted.

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” an Anthropic spokesperson said in a statement to Business Insider on Wednesday. “Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal.”

The company added, “We made the wrong tradeoff, and we apologize for not getting the balance right.”

Anthropic said its set of safeguards is in place to address national security issues, so that “foreign adversaries” cannot get ahead in developing frontier chips and large language models. It added that a vast majority of coding and machine learning work is unaffected by these safeguards.

Read more about Mythos

What smart people are saying about the 2 most controversial parts of Anthropic’s new models

Why Anthropic’s ‘safe’ Mythos-class model won’t answer questions about cancer

Alistair Barr

Anthropic purposely made its new Mythos-based models bad at AI research, and developers are fuming

Anthropic releases Claude Fable 5, a ‘Mythos-class’ AI model with safeguards

Announced in April, Anthropic’s Mythos is considered one of the most powerful AI systems ever developed, with governments and security agencies warning that its capabilities surpass current public models in advanced reasoning, cybersecurity, and scientific research.

Researchers and Anthropic itself have flagged that Mythos may be used to accelerate cyberattacks, aid biological or chemical weapons research, and give foreign actors a dangerous new tool.The model is being released to a limited group of government and other approved users, not to the public.