Research on Defending and Exploiting LLMs via Jailbreak and Prompt-Manipulation Techniques
Recent research highlights how LLM jailbreak and prompt-manipulation attacks can bypass safety controls, especially in multi-turn conversations where adversaries gradually escalate requests to elicit harmful or policy-violating output. A proposed defense framework, HoneyTrap, aims to counter these attacks with a multi-agent approach that goes beyond static filtering or supervised fine-tuning by using adaptive, deceptive responses intended to slow attackers and deny actionable information rather than simply refusing requests.
Separately, technical analysis of the LLM input-processing pipeline (tokenization, embeddings, attention, and context-window behavior) explains why common guardrails like keyword filters can fail and how attackers can exploit architectural properties (including Query-Key-Value attention dynamics) to steer model behavior. The research describes common offensive techniques—prompt injection, jailbreaking, and adversarial suffixes—and frames them as practical risks for enterprise deployments, particularly public-facing chatbots and other systems where organizations cannot fully control user input.
Related Entities
Organizations
Sources
Related Stories

Prompt Injection and Jailbreak Techniques Targeting LLM-Powered Applications
Security researchers and vendors are warning that **prompt injection and jailbreak techniques** remain a leading risk for enterprise deployments of large language models (LLMs), enabling attackers to override system instructions, bypass safety controls, and potentially drive **data exposure** outcomes. Resecurity reports assisting a Fortune 100 organization where AI-powered banking and HR applications were targeted with prompt-injection attempts, emphasizing that these attacks exploit model behavior rather than traditional software flaws and can be used in scenarios such as extracting sensitive configuration data (for example, attempts to elicit content resembling `/etc/passwd`). Resecurity also cites OWASP’s 2025 Top 10 for LLM Applications, where prompt injection is ranked as the top issue, and frames continuous security testing (e.g., VAPT) as a key control for enterprise AI systems. Separate research highlighted by Kaspersky describes a **“poetry” jailbreak** technique in which prompts framed as rhyming verse increased the likelihood that chatbots would produce disallowed or unsafe responses; the study tested this approach across 25 models from multiple vendors (including Anthropic, OpenAI, Google, Meta, DeepSeek, and xAI). In contrast, OpenAI’s planned upgrade to *ChatGPT Temporary Chat* is primarily a product/privacy change—adding optional personalization while keeping temporary chats out of history and model training (with possible retention for up to 30 days)—and does not describe a specific security incident or vulnerability disclosure tied to prompt injection or jailbreak research.
1 months agoPrompt Injection and Jailbreak Attacks on Large Language Models
Recent research has demonstrated that large language models (LLMs) such as GPT-5 and others are increasingly vulnerable to prompt injection and jailbreak attacks, which can be exploited to bypass built-in safety guardrails and leak sensitive information. Attackers use techniques like prompt injection—embedding malicious instructions within seemingly benign queries—to trick LLMs into revealing confidential data, including user credentials and internal documents. A notable study by Icaro Lab, in collaboration with Sapienza University and DEXAI, found that adversarial prompts written as poetry could successfully bypass safety mechanisms in 62% of tested cases across 25 frontier models, with some models exceeding a 90% success rate. These findings highlight the sophistication and creativity of new attack vectors targeting AI systems, raising significant concerns for organizations embedding LLMs into business operations. The widespread adoption of LLMs in handling sensitive business functions amplifies the risk of data exfiltration through these advanced attack methods. As organizations increasingly rely on AI for customer service, document processing, and other critical tasks, the potential for prompt injection and poetic jailbreaks to facilitate unauthorized data access becomes a pressing security issue. The research underscores the urgent need for improved AI safety measures, robust prompt filtering, and continuous monitoring to mitigate the risks posed by these evolving adversarial techniques.
3 months ago
Prompt injection and multimodal 'promptware' attacks against LLM-based systems
Security researchers and commentators warned that attacks on **LLM-based systems** are evolving beyond simple “prompt injection” into a broader execution mechanism dubbed **promptware**, with a proposed seven-step **promptware kill chain** to describe how malicious instructions enter and propagate through AI-enabled applications. The core risk highlighted is architectural: LLMs treat system instructions, user input, and retrieved content as a single token stream, enabling **indirect prompt injection** where hostile instructions are embedded in external data sources (web pages, emails, shared documents) that an LLM ingests at inference time; the attack surface expands further as models become **multimodal**, allowing instructions to be hidden in images or audio. Related academic work demonstrated a concrete multimodal variant against **embodied AI** using large vision-language models: **CHAI (Command Hijacking Against Embodied AI)**, which embeds deceptive natural-language instructions into visual inputs (e.g., road signs) to influence agent behavior in scenarios including drone emergency landing, autonomous driving, and object tracking, reportedly outperforming prior attacks in evaluations. Separately, reporting on a viral “AI caricature” social-media trend framed the risk as downstream **social engineering** and potential **LLM account takeover** leading to exposure of prompt histories and employer-sensitive data; while largely hypothetical, it underscores how widespread consumer LLM use and public oversharing can increase the likelihood and impact of prompt-driven compromise paths.
1 months ago