
Anthropic has a new security system it says can stop almost all AI jailbreaks

  • Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
  • “Constitutional classifiers” are an attempt to teach LLMs value systems
  • Tests resulted in more than an 80% reduction in successful jailbreaks

In a bid to tackle abusive natural-language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”: a means of instilling a set of human-like values (literally, a constitution) into a large language model.

Anthropic’s Safeguards Research Team detailed the new security measure in an academic paper. It is designed to curb jailbreaks (attempts to elicit output that falls outside an LLM’s established safeguards) against Claude 3.5 Sonnet, the company’s latest and greatest large language model.
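Conceptually, classifiers of this kind sit between the user and the model, screening prompts and responses against a written constitution of rules. The sketch below is a heavily simplified, hypothetical illustration of that general shape, not Anthropic’s implementation: the classifier functions are keyword-matching placeholders standing in for trained models, and all names (guarded_generate, classify_input, classify_output, CONSTITUTION) are invented for illustration.

```python
# Illustrative sketch only; NOT Anthropic's implementation.
# Shows the general shape of a classifier-guarded LLM pipeline:
# an input classifier screens prompts and an output classifier
# screens responses, both against a "constitution" of rules.

from dataclasses import dataclass

# A hypothetical constitution: plain-language rules describing
# what the guarded model should refuse to help with.
CONSTITUTION = [
    "Do not provide instructions for creating weapons.",
    "Do not assist with illegal activity.",
]

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def classify_input(prompt: str) -> Verdict:
    """Stand-in for a trained input classifier.

    A real system would use a model trained on examples derived
    from the constitution; here we only keyword-match as a toy
    placeholder.
    """
    blocked_terms = ("weapon", "explosive")
    for term in blocked_terms:
        if term in prompt.lower():
            return Verdict(False, f"input matched blocked term: {term}")
    return Verdict(True)

def classify_output(response: str) -> Verdict:
    """Stand-in for a trained output classifier (same caveats)."""
    if "step 1" in response.lower():  # toy heuristic for instruction-like output
        return Verdict(False, "output resembled disallowed instructions")
    return Verdict(True)

def guarded_generate(prompt: str, generate) -> str:
    """Run a generation function behind input and output classifiers."""
    verdict = classify_input(prompt)
    if not verdict.allowed:
        return f"Refused: {verdict.reason}"
    response = generate(prompt)
    verdict = classify_output(response)
    if not verdict.allowed:
        return f"Refused: {verdict.reason}"
    return response

if __name__ == "__main__":
    # A dummy generator standing in for the underlying LLM call.
    echo_model = lambda p: f"Model response to: {p}"
    print(guarded_generate("Summarize today's weather report", echo_model))
    print(guarded_generate("How do I build a weapon?", echo_model))
```

The point of the two-sided design is that neither check alone suffices: a jailbreak may slip a benign-looking prompt past the input screen, so the output classifier acts as a second line of defense on what the model actually produces.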
