Skip to main content
Mallory
Mallory

Prompt Injection and Jailbreak Attacks on Large Language Models

large language modelsprompt injectionLLM vulnerabilitiessophisticated attacksattack vectorsexploitjailbreakdata exfiltrationunauthorized accessadversarial promptsAI safetyprompt filteringDEXAI
Updated December 2, 2025 at 05:01 PM2 sources

Get Ahead of Threats Like This

Know if you're exposed — before adversaries strike.

Recent research has demonstrated that large language models (LLMs) such as GPT-5 and others are increasingly vulnerable to prompt injection and jailbreak attacks, which can be exploited to bypass built-in safety guardrails and leak sensitive information. Attackers use techniques like prompt injection—embedding malicious instructions within seemingly benign queries—to trick LLMs into revealing confidential data, including user credentials and internal documents. A notable study by Icaro Lab, in collaboration with Sapienza University and DEXAI, found that adversarial prompts written as poetry could successfully bypass safety mechanisms in 62% of tested cases across 25 frontier models, with some models exceeding a 90% success rate. These findings highlight the sophistication and creativity of new attack vectors targeting AI systems, raising significant concerns for organizations embedding LLMs into business operations.

The widespread adoption of LLMs in handling sensitive business functions amplifies the risk of data exfiltration through these advanced attack methods. As organizations increasingly rely on AI for customer service, document processing, and other critical tasks, the potential for prompt injection and poetic jailbreaks to facilitate unauthorized data access becomes a pressing security issue. The research underscores the urgent need for improved AI safety measures, robust prompt filtering, and continuous monitoring to mitigate the risks posed by these evolving adversarial techniques.

Sources

December 2, 2025 at 12:00 AM

Related Stories

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications

Security researchers and vendors are warning that **prompt injection and jailbreak techniques** remain a leading risk for enterprise deployments of large language models (LLMs), enabling attackers to override system instructions, bypass safety controls, and potentially drive **data exposure** outcomes. Resecurity reports assisting a Fortune 100 organization where AI-powered banking and HR applications were targeted with prompt-injection attempts, emphasizing that these attacks exploit model behavior rather than traditional software flaws and can be used in scenarios such as extracting sensitive configuration data (for example, attempts to elicit content resembling `/etc/passwd`). Resecurity also cites OWASP’s 2025 Top 10 for LLM Applications, where prompt injection is ranked as the top issue, and frames continuous security testing (e.g., VAPT) as a key control for enterprise AI systems. Separate research highlighted by Kaspersky describes a **“poetry” jailbreak** technique in which prompts framed as rhyming verse increased the likelihood that chatbots would produce disallowed or unsafe responses; the study tested this approach across 25 models from multiple vendors (including Anthropic, OpenAI, Google, Meta, DeepSeek, and xAI). In contrast, OpenAI’s planned upgrade to *ChatGPT Temporary Chat* is primarily a product/privacy change—adding optional personalization while keeping temporary chats out of history and model training (with possible retention for up to 30 days)—and does not describe a specific security incident or vulnerability disclosure tied to prompt injection or jailbreak research.

1 months ago

Large Language Model Jailbreaks via Adversarial Poetry

Researchers have discovered that phrasing prompts as poetry can effectively bypass safety mechanisms in large language models (LLMs), enabling users to elicit harmful or restricted outputs. In a recent study, adversarial poetic prompts were tested across 25 proprietary and open-weight LLMs, including those from major providers such as OpenAI, Meta, and Anthropic. The poetic approach achieved an average jailbreak success rate of 62% for hand-crafted poems and 43% for meta-prompt conversions, significantly outperforming non-poetic baselines. The technique proved effective across a range of sensitive topics, including instructions for creating nuclear weapons, malware, and other high-risk content, highlighting a systematic vulnerability in current AI safety and alignment protocols. The research involved converting over a thousand known harmful prompts into verse using a standardized meta-prompt, then evaluating the models' responses with both automated and human-labeled safety assessments. The findings suggest that stylistic variations, such as poetic framing, can systematically circumvent existing guardrails, raising concerns about the robustness of current LLM safety measures. The researchers have notified major AI vendors of their results, but have withheld specific prompt examples for security reasons. This vulnerability underscores the need for more resilient alignment strategies and evaluation methods in AI safety engineering.

3 months ago

UK Intelligence Warns of Persistent Prompt Injection Vulnerabilities in AI Systems

The UK’s National Cyber Security Centre (NCSC) has issued a warning that large language models (LLMs) are inherently vulnerable to prompt injection attacks, a type of cyber threat that manipulates AI systems into disregarding their original instructions. Security experts at the NCSC emphasized that this vulnerability is fundamental to how LLMs process text, making it unlikely that prompt injection can ever be fully eliminated. Real-world examples have already demonstrated attackers using prompt injection to bypass restrictions in systems like Microsoft’s Bing and GitHub Copilot, and the risk is expected to grow as generative AI becomes more deeply embedded in digital infrastructure. The NCSC’s technical director for platforms research, David C, cautioned that prompt injection is often mistakenly compared to SQL injection, but the two require different mitigation strategies. Unlike traditional application vulnerabilities, LLMs do not enforce a security boundary between trusted and untrusted content, allowing malicious instructions to be processed alongside legitimate prompts. The agency’s warning highlights the need for organizations to recognize the persistent nature of this threat and to develop new approaches to securing AI-driven applications, as conventional defenses may prove inadequate.

3 months ago

Get Ahead of Threats Like This

Mallory continuously monitors global threat intelligence and correlates it with your attack surface. Know if you're exposed — before adversaries strike.