LLM Guardrail Bypass and Prompt Injection Weaknesses

EVENT TIMELINE

How this story unfolded

4 events from the most recent confirmed update back to the earliest known activity.

4 EVENTS

Mar 22, 20263mo ago

Article frames system prompt leakage as a distinct enterprise AI security risk

A CyberThrone article argued that system prompt leakage is a fundamental weakness in enterprise AI because hidden instructions can be extracted from the shared model context, exposing proprietary logic, guardrails, and even embedded credentials. It cited examples and research including Bing Chat 'Sydney,' the PLeak prompt-extraction framework, and OWASP’s LLM07:2025 classification, while recommending that organizations avoid storing secrets in prompts and externalize controls.

The Script Behind the Stage: Prompt Leaking and the Secrets Your AI Holds - TheCyberThrone

Mar 17, 20263mo ago

Enterprise jailbreak risks framed as a security governance issue

A CyberThrone article described LLM jailbreaking as an enterprise security threat that can enable data exfiltration, unauthorized actions, and insider abuse in Copilot and agentic AI deployments. It outlined common jailbreak techniques and recommended defense-in-depth measures such as semantic classifiers, context monitoring, output validation, and least-privilege access.

Mar 16, 20263mo ago

Analysis details structural failures in layered LLM defenses

An InfoSec Write-ups article analyzed Gandalf across eight levels and concluded that prompt-based defenses fail structurally because attackers can re-express the same malicious intent in new linguistic forms. The piece highlighted bypass methods including format manipulation, base64 input, deception probing, indirect extraction, and semantic reframing.

Lakera launches Gandalf prompt-injection challenge

Lakera’s Gandalf challenge was made available as a practical environment for testing prompt-injection and jailbreak techniques against layered LLM defenses. The challenge’s multi-level structure demonstrated how protections such as system prompts, filters, and judge models could be bypassed.

LINKED ENTITIES

Related entities

Vulnerabilities, threat actors, malware, products, organizations, and breaches Mallory has linked to this story.

13 LINKEDOpen in app

Threat actors

1 linked

AridViper

Affected products

3 linked

ChatgptMicrosoft 365 CopilotFacebook

Organizations

9 linked

OpenaiMicrosoft CorporationSamsung ElectronicsLayerXxAIGooglePoeLakeraMedium

SOURCE COVERAGE

Sources

3 references tracked. Mallory keeps watching after this page renders.

3 SOURCESView all

CyberthroneNews

Mar 22, 2026

The Script Behind the Stage: Prompt Leaking and the Secrets Your AI Holds - TheCyberThrone

thecyberthrone.in

Open source

CyberthroneNews

Mar 17, 2026

Politely Ask Your AI to Misbehave - It will Jailbreak the GuardRail - TheCyberThrone

thecyberthrone.in

Open source

Infosec WriteupsNews

Mar 16, 2026

How Prompts Break Systems: A Practical Analysis of LLM Defense Architecture | by Irem Bezci | Mar, 2026 | InfoSec Write-ups

infosecwriteups.com

Open source

ON THE SAME THREAD

Security researchers and commentators warned that attacks on **LLM-based systems** are evolving beyond simple “prompt injection” into a broader execution mechanism dubbed **promptware**, with a proposed seven-step **promptware kill chain** to describe how malicious instructions enter and propagate through AI-enabled applications. The core risk highlighted is architectural: LLMs treat system instructions, user input, and retrieved content as a single token stream, enabling **indirect prompt injection** where hostile instructions are embedded in external data sources (web pages, emails, shared documents) that an LLM ingests at inference time; the attack surface expands further as models become **multimodal**, allowing instructions to be hidden in images or audio. Related academic work demonstrated a concrete multimodal variant against **embodied AI** using large vision-language models: **CHAI (Command Hijacking Against Embodied AI)**, which embeds deceptive natural-language instructions into visual inputs (e.g., road signs) to influence agent behavior in scenarios including drone emergency landing, autonomous driving, and object tracking, reportedly outperforming prior attacks in evaluations. Separately, reporting on a viral “AI caricature” social-media trend framed the risk as downstream **social engineering** and potential **LLM account takeover** leading to exposure of prompt histories and employer-sensitive data; while largely hypothetical, it underscores how widespread consumer LLM use and public oversharing can increase the likelihood and impact of prompt-driven compromise paths.

Mar 21, 2026

Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques

Recent research highlights how **LLM jailbreak and prompt-manipulation attacks** can bypass safety controls, especially in *multi-turn* conversations where adversaries gradually escalate requests to elicit harmful or policy-violating output. A proposed defense framework, **HoneyTrap**, aims to counter these attacks with a *multi-agent* approach that goes beyond static filtering or supervised fine-tuning by using **adaptive, deceptive responses** intended to slow attackers and deny actionable information rather than simply refusing requests. Separately, technical analysis of the **LLM input-processing pipeline** (tokenization, embeddings, attention, and context-window behavior) explains why common guardrails like keyword filters can fail and how attackers can exploit architectural properties (including **Query-Key-Value attention dynamics**) to steer model behavior. The research describes common offensive techniques—**prompt injection, jailbreaking, and adversarial suffixes**—and frames them as practical risks for enterprise deployments, particularly **public-facing chatbots** and other systems where organizations cannot fully control user input.

Mar 21, 2026

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications

Security researchers and vendors are warning that **prompt injection and jailbreak techniques** remain a leading risk for enterprise deployments of large language models (LLMs), enabling attackers to override system instructions, bypass safety controls, and potentially drive **data exposure** outcomes. Resecurity reports assisting a Fortune 100 organization where AI-powered banking and HR applications were targeted with prompt-injection attempts, emphasizing that these attacks exploit model behavior rather than traditional software flaws and can be used in scenarios such as extracting sensitive configuration data (for example, attempts to elicit content resembling `/etc/passwd`). Resecurity also cites OWASP’s 2025 Top 10 for LLM Applications, where prompt injection is ranked as the top issue, and frames continuous security testing (e.g., VAPT) as a key control for enterprise AI systems. Separate research highlighted by Kaspersky describes a **“poetry” jailbreak** technique in which prompts framed as rhyming verse increased the likelihood that chatbots would produce disallowed or unsafe responses; the study tested this approach across 25 models from multiple vendors (including Anthropic, OpenAI, Google, Meta, DeepSeek, and xAI). In contrast, OpenAI’s planned upgrade to *ChatGPT Temporary Chat* is primarily a product/privacy change—adding optional personalization while keeping temporary chats out of history and model training (with possible retention for up to 30 days)—and does not describe a specific security incident or vulnerability disclosure tied to prompt injection or jailbreak research.

Jun 2, 2026

LLM Guardrail Bypass and Prompt Injection Weaknesses

Get ahead of threats like this

How this story unfolded

Article frames system prompt leakage as a distinct enterprise AI security risk

Enterprise jailbreak risks framed as a security governance issue

Analysis details structural failures in layered LLM defenses

Lakera launches Gandalf prompt-injection challenge

Related entities

Sources

The Script Behind the Stage: Prompt Leaking and the Secrets Your AI Holds - TheCyberThrone

Politely Ask Your AI to Misbehave - It will Jailbreak the GuardRail - TheCyberThrone

How Prompts Break Systems: A Practical Analysis of LLM Defense Architecture | by Irem Bezci | Mar, 2026 | InfoSec Write-ups

See the full picture, correlated to your attack surface.

LLM Guardrail Bypass and Prompt Injection Weaknesses

Get ahead of threats like this

How this story unfolded

Article frames system prompt leakage as a distinct enterprise AI security risk

Enterprise jailbreak risks framed as a security governance issue

Analysis details structural failures in layered LLM defenses

Lakera launches Gandalf prompt-injection challenge

Related entities

Sources

The Script Behind the Stage: Prompt Leaking and the Secrets Your AI Holds - TheCyberThrone

Politely Ask Your AI to Misbehave - It will Jailbreak the GuardRail - TheCyberThrone

How Prompts Break Systems: A Practical Analysis of LLM Defense Architecture | by Irem Bezci | Mar, 2026 | InfoSec Write-ups

See the full picture, correlated to your attack surface.

Related stories

Prompt injection and multimodal 'promptware' attacks against LLM-based systems

Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications