Skip to main content
Live Webinar with SANS (June 25)— Agentic CTI Automation for Fun & ProfitRegister Free
Mallory
Back to intelligence
ai-platform-securityinitial-access-method

AI Safety Filters in Large Language Models Fail During Extended Conversations

Updated 3mo agoFirst seen Nov 6, 20252 sources

Researchers have identified that safety filters in large language models (LLMs) can be bypassed during extended, multi-turn conversations, significantly increasing the risk of adversarial prompt success. Cisco's research demonstrated that attack success rates jump from an average of 13% for single prompts to 64% in longer chats, with some models like Meta's Llama 3.3-70B-Instruct and Alibaba’s Qwen3-32B reaching nearly 93% failure rates. These vulnerabilities are attributed to the models' architectural design, which processes dialogue through sliding context windows and does not consistently reapply safety judgments across conversation turns.

The most capable and open models are the most susceptible to these failures, while more conservatively aligned models such as Google's Gemma 3-1B-IT show smaller gaps between single- and multi-turn failures. Attackers can exploit these weaknesses by gradually shifting the context or rephrasing requests, eventually eliciting responses that bypass initial safety mechanisms and may include the generation of malicious code. The findings highlight a critical challenge in evaluating and securing LLMs, as traditional one-shot prompt testing fails to capture these multi-turn vulnerabilities.

Share:
AI Safety Filters in Large Language Models Fail During Extended Conversations
Stay ahead

Get ahead of threats like this

Mallory correlates global threat intelligence with your attack surface — know if you’re exposed before adversaries strike.

EVENT TIMELINE

How this story unfolded

2 events from the most recent confirmed update back to the earliest known activity.

2 EVENTS
Nov 6, 20258mo ago

Study documents effective adaptive attack techniques and model-specific failure rates

The research reported five consistently effective multi-turn attack methods: incremental escalation, misdirection, information reassembly, refusal reframing, and persona adoption. It also found some models, including Meta's Llama 3.3-70B-Instruct, Mistral Large-2, and Alibaba's Qwen3-32B, reached nearly 93% failure rates, while safety-focused models such as Google's Gemma 3-1B-IT and OpenAI's GPT-OSS-20B were more resilient.

Cisco researchers identify multi-turn jailbreak weakness in open-weight LLMs

Cisco researchers found that safety filters in open-weight language models fail far more often in extended multi-turn conversations than in single-prompt tests, with average attack success rising from 13% to 64%. The study attributed the weakness to models failing to maintain consistent safety judgments across turns, enabling attackers to gradually rephrase or escalate requests until the model complies.

LINKED ENTITIES

Related entities

Vulnerabilities, threat actors, malware, products, organizations, and breaches Mallory has linked to this story.

6 LINKEDOpen in app
Organizations
6 linked
Alibaba CloudCisco SystemsMistral AIMeta PlatformsOpenaiGoogle
The operational view lives in Mallory

See the full picture, correlated to your attack surface.

This page covers what’s public. Mallory adds the parts that aren’t — which of your assets are affected, which threat actors are using it right now, which detections to deploy, and what to do next.
Exposure mapping

Map indicators from this story to your assets and identify affected systems in minutes.

Threat actor evidence

Every observed campaign, victim, and pivot linked to actors named in this story.

Associated malware

Malware, exploits, and IOCs connected to the activity described here.

Detection signatures

YARA, Sigma, and Snort rules deployed to your SIEM as soon as they’re published.

Scheduled alerts

Get matching new stories delivered to your team as they break — not the next morning.

AI threads

Ask questions about this story and take action on the answers.