Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques

EVENT TIMELINE

How this story unfolded

2 events from the most recent confirmed update back to the earliest known activity.

2 EVENTS

Jan 13, 20265mo ago

SentinelOne details how modern LLM attacks exploit transformer internals

SentinelOne published an analysis explaining how attacks on large language models exploit tokenization, embeddings, context windows, and self-attention to bypass safeguards. The post described attack classes including prompt injection, jailbreaking, adversarial suffixes, and gradient-based methods such as GCG, and reviewed mitigations like randomized smoothing, suffix filtering, and adversarial training.

Researchers develop HoneyTrap to counter LLM jailbreak attacks

Researchers from Shanghai Jiao Tong University, the University of Illinois at Urbana-Champaign, and Zhejiang University proposed HoneyTrap, a multi-agent defense framework designed to deceive and mislead jailbreak attackers rather than only block requests. Reported testing across GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 showed reduced attack success rates and increased attacker effort.

LINKED ENTITIES

Related entities

Vulnerabilities, threat actors, malware, products, organizations, and breaches Mallory has linked to this story.

6 LINKEDOpen in app

Organizations

6 linked

Zhejiang UniversityShanghai Jiao Tong UniversityUniversity of Illinois Urbana-ChampaignAnthropicOpenaiGoogle

SOURCE COVERAGE

Sources

2 references tracked. Mallory keeps watching after this page renders.

2 SOURCESView all

Cyber Security NewsNews

Jan 13, 2026

HoneyTrap - A New LLM Defense Framework to Counter Jailbreak Attacks

cybersecuritynews.com

Open source

Sentinelone LabsNews

Jan 13, 2026

Inside the LLM | Understanding AI & the Mechanics of Modern Attacks | SentinelOne

sentinelone.com

Open source

ON THE SAME THREAD

Multiple writeups describe how **LLM safety controls can be bypassed through prompt-based attacks**, arguing that jailbreaks and prompt injection are a practical security problem rather than a novelty. The reporting highlights common defense layers—training-time alignment, system prompts, input classifiers, and output filters—and says each can fail because the same model that follows instructions is also asked to interpret and enforce them. One article frames jailbreaks as an attack on the trust architecture of enterprise AI deployments, while the other demonstrates the issue through Lakera’s *Gandalf* challenge, where progressively stronger controls are still defeated by prompt manipulation. The material is **not fluff** because it provides substantive security analysis of an emerging attack class affecting AI systems. Both references focus on the same topic: how prompts can subvert LLM defenses, expose protected information, and reveal architectural weaknesses in current guardrail designs. The practical takeaway for defenders is that natural-language controls alone are brittle, especially when secrets, policy enforcement, and user-controlled input share the same inference path, making prompt injection and jailbreak resistance a core application security concern for enterprise AI deployments.

Mar 22, 2026

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications

Security researchers and vendors are warning that **prompt injection and jailbreak techniques** remain a leading risk for enterprise deployments of large language models (LLMs), enabling attackers to override system instructions, bypass safety controls, and potentially drive **data exposure** outcomes. Resecurity reports assisting a Fortune 100 organization where AI-powered banking and HR applications were targeted with prompt-injection attempts, emphasizing that these attacks exploit model behavior rather than traditional software flaws and can be used in scenarios such as extracting sensitive configuration data (for example, attempts to elicit content resembling `/etc/passwd`). Resecurity also cites OWASP’s 2025 Top 10 for LLM Applications, where prompt injection is ranked as the top issue, and frames continuous security testing (e.g., VAPT) as a key control for enterprise AI systems. Separate research highlighted by Kaspersky describes a **“poetry” jailbreak** technique in which prompts framed as rhyming verse increased the likelihood that chatbots would produce disallowed or unsafe responses; the study tested this approach across 25 models from multiple vendors (including Anthropic, OpenAI, Google, Meta, DeepSeek, and xAI). In contrast, OpenAI’s planned upgrade to *ChatGPT Temporary Chat* is primarily a product/privacy change—adding optional personalization while keeping temporary chats out of history and model training (with possible retention for up to 30 days)—and does not describe a specific security incident or vulnerability disclosure tied to prompt injection or jailbreak research.

Jun 2, 2026

Prompt Injection and Jailbreak Attacks on Large Language Models

Recent research has demonstrated that large language models (LLMs) such as GPT-5 and others are increasingly vulnerable to prompt injection and jailbreak attacks, which can be exploited to bypass built-in safety guardrails and leak sensitive information. Attackers use techniques like prompt injection—embedding malicious instructions within seemingly benign queries—to trick LLMs into revealing confidential data, including user credentials and internal documents. A notable study by Icaro Lab, in collaboration with Sapienza University and DEXAI, found that adversarial prompts written as poetry could successfully bypass safety mechanisms in 62% of tested cases across 25 frontier models, with some models exceeding a 90% success rate. These findings highlight the sophistication and creativity of new attack vectors targeting AI systems, raising significant concerns for organizations embedding LLMs into business operations. The widespread adoption of LLMs in handling sensitive business functions amplifies the risk of data exfiltration through these advanced attack methods. As organizations increasingly rely on AI for customer service, document processing, and other critical tasks, the potential for prompt injection and poetic jailbreaks to facilitate unauthorized data access becomes a pressing security issue. The research underscores the urgent need for improved AI safety measures, robust prompt filtering, and continuous monitoring to mitigate the risks posed by these evolving adversarial techniques.

Mar 21, 2026

Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques

Get ahead of threats like this

How this story unfolded

SentinelOne details how modern LLM attacks exploit transformer internals

Researchers develop HoneyTrap to counter LLM jailbreak attacks

Related entities

Sources

HoneyTrap - A New LLM Defense Framework to Counter Jailbreak Attacks

Inside the LLM | Understanding AI & the Mechanics of Modern Attacks | SentinelOne

See the full picture, correlated to your attack surface.

Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques

Get ahead of threats like this

How this story unfolded

SentinelOne details how modern LLM attacks exploit transformer internals

Researchers develop HoneyTrap to counter LLM jailbreak attacks

Related entities

Sources

HoneyTrap - A New LLM Defense Framework to Counter Jailbreak Attacks

Inside the LLM | Understanding AI & the Mechanics of Modern Attacks | SentinelOne

See the full picture, correlated to your attack surface.

Related stories

LLM Guardrail Bypass and Prompt Injection Weaknesses

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications

Prompt Injection and Jailbreak Attacks on Large Language Models