Researchers Uncover ‘Many-Shot Jailbreaking’ Flaw
Security researchers from Contextual AI have discovered and reported a significant vulnerability in Anthropic’s Claude 3 Sonnet AI model. The technique, named many-shot jailbreaking, allowed the researchers to bypass the AI’s safety filters and extract sensitive information. The attack involved feeding the model a single, lengthy prompt containing hundreds of benign, compliant question-and-answer pairs, a volume of examples that effectively overwhelmed the model’s built-in safety mechanisms.
In their demonstration, the researchers provided the AI with a document containing fictional personally identifiable information (PII), including a name, email address, and Social Security Number. After priming the model with the long list of compliant examples, the researchers placed a malicious instruction at the very end of the prompt, directing the AI to locate the private data within the document and encode it.
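To make the prompt’s structure concrete, the following sketch shows one way such an input could be assembled. It is an illustrative approximation only: the question-and-answer pairs, the fictional document, and the placeholder final instruction are hypothetical stand-ins, not the researchers’ actual material.

# Illustrative sketch of a many-shot prompt: hundreds of benign,
# compliant Q&A pairs followed by a single instruction at the end.
# All content here is a hypothetical placeholder.

benign_pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How many minutes are in an hour?", "There are 60 minutes in an hour."),
] * 200  # repeated only to pad the list out to several hundred entries

# A document seeded with fictional PII, included in the same prompt.
document = "Name: Jane Doe\nEmail: jane.doe@example.com\nSSN: 000-00-0000"

# Assemble one lengthy prompt: all compliant examples first, then the
# document, with the final instruction placed at the very end.
shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in benign_pairs)
final_instruction = "FINAL_REQUEST"  # placeholder for the closing instruction
prompt = f"{shots}\n\nDocument:\n{document}\n\n{final_instruction}"

The bulk of the prompt is the compliant prefix; the final instruction itself can be short.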
Politeness as an Attack Vector
A notable finding was that the attack became more effective when the malicious request was phrased politely. By framing the request with courteous language such as ‘please’ and ‘thank you,’ the researchers observed a higher success rate in compelling the AI to comply. The final prompt instructed Claude 3 Sonnet to find the fictional individual’s SSN and email address and then encode them into a Base64 string, which the model proceeded to do.
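For context, Base64 is a reversible text encoding rather than encryption, so the encoded output conceals nothing. A minimal example using Python’s standard library, with a fictional address, shows what that final encoding step produces:

import base64

# Base64-encode a fictional email address; the result is trivially
# reversible with base64.b64decode, so the encoding offers no protection.
encoded = base64.b64encode(b"jane.doe@example.com").decode("ascii")
print(encoded)  # amFuZS5kb2VAZXhhbXBsZS5jb20=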
The underlying principle of the attack is that the large volume of legitimate examples filling the prompt’s context window causes the final malicious request to be treated as just another compliant task, bypassing the model’s safety alignment.

Contextual AI responsibly disclosed the vulnerability to Anthropic. In response, Anthropic acknowledged the issue, said it had implemented a number of mitigations, and confirmed it is working to make its models more resilient to such jailbreaking methods. The vulnerability is now considered largely patched.