AI Safety Filters in Large Language Models Fail During Extended Conversations

conversationsAImulti-turnfiltersmodelsMeta

Updated November 6, 2025 at 05:01 PM2 sources

Get Ahead of Threats Like This

Know if you're exposed — before adversaries strike.

Researchers have identified that safety filters in large language models (LLMs) can be bypassed during extended, multi-turn conversations, significantly increasing the risk of adversarial prompt success. Cisco's research demonstrated that attack success rates jump from an average of 13% for single prompts to 64% in longer chats, with some models like Meta's Llama 3.3-70B-Instruct and Alibaba’s Qwen3-32B reaching nearly 93% failure rates. These vulnerabilities are attributed to the models' architectural design, which processes dialogue through sliding context windows and does not consistently reapply safety judgments across conversation turns.

The most capable and open models are the most susceptible to these failures, while more conservatively aligned models such as Google's Gemma 3-1B-IT show smaller gaps between single- and multi-turn failures. Attackers can exploit these weaknesses by gradually shifting the context or rephrasing requests, eventually eliciting responses that bypass initial safety mechanisms and may include the generation of malicious code. The findings highlight a critical challenge in evaluating and securing LLMs, as traditional one-shot prompt testing fails to capture these multi-turn vulnerabilities.

Sources

govinfosecurity

Longer Conversations Can Break AI Safety Filters

November 6, 2025 at 12:00 AM

bank info security

Longer Conversations Can Break AI Safety Filters

November 6, 2025 at 12:00 AM

Cisco reported that **multi-turn jailbreak** techniques—iterative, conversational prompt sequences designed to erode safety guardrails—successfully bypassed protections in eight major **open-weight** large language models **92.78%** of the time, while single-turn prompt attempts were notably less effective. The findings, published in Cisco’s *State of AI Security* research and covered by multiple outlets, highlight that many enterprise AI deployments using downloadable, self-hosted models may be more vulnerable to sustained adversarial prompting than organizations assume. The report’s risk framing is amplified by broader concerns that model misuse and capability leakage can scale quickly: Anthropic separately alleged coordinated **model distillation** activity by Chinese AI labs using large volumes of fraudulent accounts and proxy infrastructure to extract advanced behaviors from *Claude*, warning that copied models may lack comparable safety controls and could be repurposed for malicious use. Related research coverage also notes that LLMs can sometimes be induced—via specialized prompting/jailbreaking methods—to reproduce near-verbatim copyrighted text from training data, underscoring that prompt-based attacks can drive both **policy bypass** and **data/content extraction** outcomes, particularly when guardrails are tested over extended interactions.

3 weeks ago

Prompt Injection and Jailbreak Attacks on Large Language Models

Recent research has demonstrated that large language models (LLMs) such as GPT-5 and others are increasingly vulnerable to prompt injection and jailbreak attacks, which can be exploited to bypass built-in safety guardrails and leak sensitive information. Attackers use techniques like prompt injection—embedding malicious instructions within seemingly benign queries—to trick LLMs into revealing confidential data, including user credentials and internal documents. A notable study by Icaro Lab, in collaboration with Sapienza University and DEXAI, found that adversarial prompts written as poetry could successfully bypass safety mechanisms in 62% of tested cases across 25 frontier models, with some models exceeding a 90% success rate. These findings highlight the sophistication and creativity of new attack vectors targeting AI systems, raising significant concerns for organizations embedding LLMs into business operations. The widespread adoption of LLMs in handling sensitive business functions amplifies the risk of data exfiltration through these advanced attack methods. As organizations increasingly rely on AI for customer service, document processing, and other critical tasks, the potential for prompt injection and poetic jailbreaks to facilitate unauthorized data access becomes a pressing security issue. The research underscores the urgent need for improved AI safety measures, robust prompt filtering, and continuous monitoring to mitigate the risks posed by these evolving adversarial techniques.

3 months ago

Security Risks and Privacy Challenges of Large Language Models in AI Systems

Large language models (LLMs) present a dual-use dilemma in cybersecurity, as their capabilities can be leveraged for both defensive and offensive purposes. Security researchers have identified purpose-built malicious LLMs, such as WormGPT and KawaiiGPT, which are designed to facilitate cybercrime by generating convincing phishing content and rapidly producing or modifying malicious code. The thin line between beneficial and harmful use of LLMs is defined largely by developer intent and the presence or absence of ethical safeguards, raising concerns about the proliferation of offensive AI tools in the threat landscape. In addition to malicious use, LLMs face significant challenges in maintaining privacy and security due to contextual integrity failures and regulatory-driven censorship. Research from Microsoft highlights the need for AI agents to respect contextual privacy norms, as current models may inadvertently leak sensitive information. Meanwhile, the DeepSeek-R1 model demonstrates how geopolitical censorship mechanisms can introduce security flaws, such as insecure code generation and broken authentication, especially when handling politically sensitive prompts. These issues underscore the urgent need for robust privacy controls and security-aware development practices in the deployment of LLM-powered systems.

3 months ago

Get Ahead of Threats Like This

Mallory continuously monitors global threat intelligence and correlates it with your attack surface. Know if you're exposed — before adversaries strike.

AI Safety Filters in Large Language Models Fail During Extended Conversations

Get Ahead of Threats Like This

Sources

Related Stories

Cisco Testing Finds Open-Weight LLMs Highly Susceptible to Multi-Turn Jailbreaks

Prompt Injection and Jailbreak Attacks on Large Language Models

Security Risks and Privacy Challenges of Large Language Models in AI Systems

Get Ahead of Threats Like This