Large Language Model Jailbreaks via Adversarial Poetry
Researchers have discovered that phrasing prompts as poetry can effectively bypass safety mechanisms in large language models (LLMs), enabling users to elicit harmful or restricted outputs. In a recent study, adversarial poetic prompts were tested across 25 proprietary and open-weight LLMs, including those from major providers such as OpenAI, Meta, and Anthropic. The poetic approach achieved an average jailbreak success rate of 62% for hand-crafted poems and 43% for meta-prompt conversions, significantly outperforming non-poetic baselines. The technique proved effective across a range of sensitive topics, including instructions for creating nuclear weapons, malware, and other high-risk content, highlighting a systematic vulnerability in current AI safety and alignment protocols.
The research involved converting over a thousand known harmful prompts into verse using a standardized meta-prompt, then evaluating the models' responses with both automated and human-labeled safety assessments. The findings suggest that stylistic variations such as poetic framing can systematically circumvent existing guardrails, raising concerns about the robustness of current LLM safety measures. The researchers have notified major AI vendors of their results but have withheld specific prompt examples for security reasons. This vulnerability underscores the need for more resilient alignment strategies and evaluation methods in AI safety engineering.

How this story unfolded
Three events, listed from the most recent confirmed update back to the earliest known activity.
Study on poetry-based prompt injection is publicly reported
Wired and Schneier on Security publicly reported the findings of the study, highlighting that stylistic variations such as poetry can evade existing AI safety filters. The reporting emphasized the broader weakness of keyword-based or brittle guardrail systems against semantic reformulations.
Researchers notify affected AI companies of the poetry jailbreak issue
After identifying the vulnerability, the researchers informed the affected AI companies about the guardrail bypass technique. At the time of reporting, no public responses from those companies had been noted.
Researchers demonstrate poetry-based jailbreaks against major AI chatbots
A study by Icaro Lab researchers from Sapienza University of Rome and the DexAI think tank found that prompts written as poems could bypass safety guardrails in large language models from vendors including OpenAI, Meta, and Anthropic. The research reported a 62% average success rate for hand-crafted poetic jailbreaks, reaching as high as 90% on some models, with successful bypasses even for highly dangerous requests such as guidance on building nuclear weapons.


