Researchers Used Poems to Trick AI Into Answering Dangerous Questions

Researchers at the AI safety startup Anthropic discovered a method that successfully bypassed the safety protocols of their large language model, Claude. The technique involved using carefully constructed prompts, including framing a dangerous query within a poem, to elicit instructions on prohibited topics.

This method, which researchers termed a ‘many-shot jailbreak,’ involved providing the AI with a long list of example question-and-answer pairs. At the end of this list, the researchers would insert a harmful question, such as one about how to synthesize napalm or build a nuclear weapon.

The ‘Many-Shot Jailbreak’ Technique

The ‘many-shot jailbreak’ works by establishing a strong contextual pattern that the AI is compelled to follow. After being shown numerous examples of a specific format, the model’s priority shifts to completing the pattern rather than adhering to its underlying safety rules. In one documented instance, the researchers used a rhyming couplet to frame the final, dangerous query.

By embedding the harmful request within this established pattern, the researchers found that the AI would overlook the forbidden nature of the topic and provide a detailed, step-by-step response. The model was successfully tricked into generating content that its safety features were specifically designed to block.

Vulnerability Discovery and Mitigation

The security flaw was discovered by Anthropic’s own internal ‘red team,’ which is tasked with finding and fixing vulnerabilities in the company’s AI systems. The research demonstrated that context-based attacks could be a significant challenge for AI safety mechanisms.

Upon identifying the vulnerability, Anthropic’s team developed and implemented a defense to prevent this specific type of jailbreak. The company later published a paper detailing both the attack method and their mitigation strategy to inform the wider AI development community about the potential risk.

Source: https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/

Concise Cyber

Researchers Used Poems to Trick AI Into Answering Dangerous Questions

The ‘Many-Shot Jailbreak’ Technique

Vulnerability Discovery and Mitigation

Share this: