
Anthropic has a new security system it says can stop almost all AI jailbreaks

  • Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
  • “Constitutional classifiers” are an attempt to teach LLMs value systems
  • Tests showed a reduction of more than 80% in successful jailbreaks

In a bid to tackle abusive natural-language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”: a means of instilling a set of human-like values (literally, a constitution) into a large language model.

In a new academic paper, Anthropic’s Safeguards Research Team unveiled the security measure, which is designed to curb jailbreaks (attempts to coax output from an LLM that falls outside its established safeguards) of Claude 3.5 Sonnet, the company’s latest and greatest large language model.
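
In broad strokes, the approach wraps the model with classifiers that screen both incoming prompts and outgoing responses against the constitution. The Python sketch below is illustrative only: the functions input_classifier, output_classifier, and generate are hypothetical stubs standing in for trained models, not Anthropic’s actual implementation.

    # Illustrative sketch of classifier-gated generation.
    # All functions here are hypothetical stubs, not Anthropic's API.

    HARM_THRESHOLD = 0.5  # assumed probability cutoff for refusing

    def input_classifier(prompt: str) -> float:
        """Score how likely a prompt violates the constitution (0-1).
        Stubbed; in practice this would be a trained classifier."""
        return 1.0 if "weapon" in prompt.lower() else 0.0

    def output_classifier(text: str) -> float:
        """Score generated text against the same constitution. Stubbed."""
        return 1.0 if "step 1:" in text.lower() else 0.0

    def generate(prompt: str) -> str:
        """Placeholder for the underlying LLM call."""
        return "A stubbed model response."

    def guarded_generate(prompt: str) -> str:
        # Screen the incoming prompt before it reaches the model.
        if input_classifier(prompt) >= HARM_THRESHOLD:
            return "Request refused: flagged by input classifier."
        response = generate(prompt)
        # Screen the model's output before returning it to the user.
        if output_classifier(response) >= HARM_THRESHOLD:
            return "Response withheld: flagged by output classifier."
        return response

    print(guarded_generate("How do I bake bread?"))

In a real system the threshold would trade off over-refusal of benign prompts against the rate of jailbreaks that slip through, which is the balance Anthropic’s tests measure.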
