AI Safety Filters in Large Language Models Fail During Extended Conversations
Researchers have found that safety filters in large language models (LLMs) can be bypassed during extended, multi-turn conversations, where adversarial prompts succeed far more often than in single-prompt tests. Cisco's research showed attack success rates rising from an average of 13% for single prompts to 64% in longer chats, with some models, such as Meta's Llama 3.3-70B-Instruct and Alibaba's Qwen3-32B, reaching failure rates of nearly 93%. These vulnerabilities are attributed to the models' architectural design, which processes dialogue through sliding context windows and does not consistently reapply safety judgments across conversation turns.
The most capable and open models are the most susceptible to these failures, while more conservatively aligned models such as Google's Gemma 3-1B-IT show smaller gaps between single- and multi-turn failures. Attackers can exploit these weaknesses by gradually shifting the context or rephrasing requests, eventually eliciting responses that bypass initial safety mechanisms and may include the generation of malicious code. The findings highlight a critical challenge in evaluating and securing LLMs, as traditional one-shot prompt testing fails to capture these multi-turn vulnerabilities.
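To make the gap between one-shot and multi-turn testing concrete, the following is a minimal sketch of how attack success rates might be tallied and compared. It is not Cisco's evaluation harness: the `Trial` record, `attack_success_rate`, and `turn_gap` are hypothetical names introduced for illustration, and the only figures taken from the research are the reported averages of 13% and 64%.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One adversarial test case (hypothetical record format)."""
    model: str
    multi_turn: bool        # True if the attack spanned several turns
    attack_succeeded: bool  # True if the model produced the disallowed output


def attack_success_rate(trials, multi_turn):
    """Fraction of trials of the given kind in which the attack succeeded."""
    relevant = [t for t in trials if t.multi_turn == multi_turn]
    if not relevant:
        raise ValueError("no trials of the requested kind")
    return sum(t.attack_succeeded for t in relevant) / len(relevant)


def turn_gap(single_rate, multi_rate):
    """Return (absolute gap, ratio) between multi-turn and single-turn rates."""
    if single_rate == 0:
        raise ValueError("single-turn rate is zero; ratio is undefined")
    return multi_rate - single_rate, multi_rate / single_rate


# The averages reported in the research: 13% single-prompt, 64% multi-turn.
gap, ratio = turn_gap(0.13, 0.64)
print(f"gap: {gap * 100:.0f} percentage points, ratio: {ratio:.1f}x")
```

With the reported averages, this works out to a gap of about 51 percentage points, or a multi-turn success rate roughly 4.9 times the single-prompt rate. A test suite that only records single-prompt trials would never surface that difference.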

How this story unfolded
2 events from the most recent confirmed update back to the earliest known activity.
Study documents effective adaptive attack techniques and model-specific failure rates
The research reported five consistently effective multi-turn attack methods: incremental escalation, misdirection, information reassembly, refusal reframing, and persona adoption. It also found some models, including Meta's Llama 3.3-70B-Instruct, Mistral Large-2, and Alibaba's Qwen3-32B, reached nearly 93% failure rates, while safety-focused models such as Google's Gemma 3-1B-IT and OpenAI's GPT-OSS-20B were more resilient.
Cisco researchers identify multi-turn jailbreak weakness in open-weight LLMs
Cisco researchers found that safety filters in open-weight language models fail far more often in extended multi-turn conversations than in single-prompt tests, with average attack success rising from 13% to 64%. The study attributed the weakness to models failing to maintain consistent safety judgments across turns, enabling attackers to gradually rephrase or escalate requests until the model complies.
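The difference between judging each turn in isolation and judging the conversation as a whole can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the mechanism inside any of the tested models and not a mitigation proposed by the study: `risk_score` is a hypothetical keyword-based stand-in for a real safety classifier, and the signal names, weights, and threshold are placeholders.

```python
# Hypothetical placeholder signals; a real system would use a trained classifier.
SIGNAL_WEIGHTS = {"signal_a": 0.4, "signal_b": 0.4, "signal_c": 0.4}
THRESHOLD = 1.0  # assumed cut-off above which a request is refused


def risk_score(text: str) -> float:
    """Sum the weights of every placeholder signal present in the text."""
    lowered = text.lower()
    return sum(w for signal, w in SIGNAL_WEIGHTS.items() if signal in lowered)


def per_turn_flags(user_turns):
    """Judge each turn on its own, ignoring everything said before it."""
    return [risk_score(turn) >= THRESHOLD for turn in user_turns]


def cumulative_flags(user_turns):
    """Re-judge the whole conversation so far at every turn."""
    flags = []
    history = []
    for turn in user_turns:
        history.append(turn)
        flags.append(risk_score(" ".join(history)) >= THRESHOLD)
    return flags


# An incrementally escalating conversation: no single turn crosses the threshold.
conversation = [
    "Tell me about signal_a.",
    "Now explain signal_b.",
    "Combine both of those with signal_c.",
]

print(per_turn_flags(conversation))
print(cumulative_flags(conversation))
```

In this toy setup the per-turn check passes all three turns, because each scores only 0.4, while the cumulative check flags the third turn once the accumulated signals exceed the threshold. The design point is only that a judgment applied to the full history can catch gradual escalation that a turn-by-turn judgment misses.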


