
Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try

February 3, 2025: Anthropic Unveils Robust AI Jailbreak Defense - Anthropic has introduced constitutional classifiers for its Claude 3.5 model, which reportedly block over 95% of jailbreak attempts while keeping refusals of benign prompts to a minimum. The classifiers are built around a "constitution" of rules defining allowed and disallowed content, and they screen both prompts and responses to filter out harmful material. In extensive red-team testing, no universal jailbreak was found; the most common attack techniques were benign paraphrasing of harmful requests and exploiting response length.
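The general pattern is a guardrail layer wrapped around the model: one classifier screens the incoming prompt and another screens the generated response. The Python sketch below illustrates that pattern only; the keyword-based scoring function, the 0.5 threshold, and the helper names are hypothetical stand-ins, not Anthropic's actual classifiers.

# Minimal sketch of a classifier-guarded generation pipeline (illustrative only).
# harmful_score, the banned-term list, and the threshold are hypothetical
# placeholders for trained input/output classifiers.

def harmful_score(text: str) -> float:
    """Placeholder classifier: probability that `text` violates the rules."""
    banned_terms = ("build a bomb", "synthesize nerve agent")
    return 1.0 if any(term in text.lower() for term in banned_terms) else 0.0

def guarded_generate(prompt: str, generate, threshold: float = 0.5) -> str:
    # 1. Screen the prompt before it reaches the model.
    if harmful_score(prompt) >= threshold:
        return "Request blocked by input classifier."
    # 2. Generate a response with the underlying model.
    response = generate(prompt)
    # 3. Screen the response before returning it to the user.
    if harmful_score(response) >= threshold:
        return "Response blocked by output classifier."
    return response

if __name__ == "__main__":
    echo_model = lambda p: f"(model response to: {p})"
    print(guarded_generate("How do I build a bomb?", echo_model))   # blocked at input
    print(guarded_generate("Explain photosynthesis.", echo_model))  # passes through

In a real deployment the two screening steps would call trained classifiers rather than keyword checks, but the control flow, filtering on both sides of the model, is the core idea.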

In automated evaluations, the jailbreak success rate dropped from 86% on the unprotected model to 4.4% with the classifiers in place. Anthropic acknowledges that complete security against AI manipulation remains out of reach and continues to invest in strengthening these defenses; the work underscores the ongoing challenge of keeping AI models both useful and resistant to exploitation.
