
Anthropic has a new security system it says can stop almost all AI jailbreaks

  • Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
  • “Constitutional classifiers” are an attempt to teach LLMs value systems
  • Tests showed a reduction of more than 80% in successful jailbreaks

In a bid to tackle abusive natural-language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”: a means of instilling a set of human-like values (literally, a constitution) into a large language model.

In a new academic paper, Anthropic’s Safeguards Research Team unveiled the security measure, which is designed to curb jailbreaks (attempts to coax output from an LLM that falls outside its established safeguards) of Claude 3.5 Sonnet, the company’s latest and greatest large language model.
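
In broad strokes, the approach wraps the model with classifiers that screen both incoming prompts and outgoing responses against the constitution. The Python sketch below is illustrative only: the functions input_classifier, output_classifier, and generate are hypothetical stubs standing in for trained models, not Anthropic’s actual implementation.

    # Illustrative sketch of classifier-gated generation.
    # All functions here are hypothetical stubs, not Anthropic's API.

    HARM_THRESHOLD = 0.5  # assumed probability cutoff for refusing

    def input_classifier(prompt: str) -> float:
        """Score how likely a prompt violates the constitution (0-1).
        Stubbed; in practice this would be a trained classifier."""
        return 1.0 if "weapon" in prompt.lower() else 0.0

    def output_classifier(text: str) -> float:
        """Score generated text against the same constitution. Stubbed."""
        return 1.0 if "step 1:" in text.lower() else 0.0

    def generate(prompt: str) -> str:
        """Placeholder for the underlying LLM call."""
        return "A stubbed model response."

    def guarded_generate(prompt: str) -> str:
        # Screen the incoming prompt before it reaches the model.
        if input_classifier(prompt) >= HARM_THRESHOLD:
            return "Request refused: flagged by input classifier."
        response = generate(prompt)
        # Screen the model's output before returning it to the user.
        if output_classifier(response) >= HARM_THRESHOLD:
            return "Response withheld: flagged by output classifier."
        return response

    print(guarded_generate("How do I bake bread?"))

In a real system the threshold would trade off over-refusal of benign prompts against the rate of jailbreaks that slip through, which is the balance Anthropic’s tests measure.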
