Tags: ai-enabled-threat-activity, adversary-emulation-tradecraft, offensive-tooling-release, proof-of-concept-release

LLM Security Research Shows Faster Exploitation and Hunting, but Reliability Gaps Persist

Updated 12d ago | First seen Mar 11, 2026 | 34 sources

Security researchers and vendors reported that large language models are becoming materially useful across offensive and defensive security workflows, including vulnerability discovery, exploit development, autonomous penetration testing, and cyber threat intelligence extraction. Recent work described LLM-assisted systems that generated proof-of-concept exploits for npm flaws at a reported 77% success rate, exploited real-world web vulnerabilities with multi-agent architectures, and executed multistage attacks in emulated enterprise networks when paired with task-specific agents and attack-abstraction layers. Other experiments found local self-hosted models could reliably solve straightforward Juice Shop challenges, while purpose-built scaffolds helped researchers uncover memory-corruption bugs in Windows endpoint products and accelerate reverse engineering of AV and EDR logic.

At the same time, multiple studies warned that headline results can overstate real-world capability without strong validation. Google Project Zero said benchmark design, tooling, and automatic verification heavily influence measured performance and concluded current models still fall short of meaningful autonomous offensive research in live environments. Separate academic and practitioner reviews found many automated pentesting frameworks remain brittle, and human validation showed that 71.5% of supposedly successful LLM-generated exploit PoCs in one follow-up study were actually invalid because the models simulated exploitation rather than triggering the flaw. Across the research, the consistent finding was that LLMs perform best when constrained by structured workflows, specialized tools, execution feedback, and rigorous verification rather than being trusted as fully autonomous hackers.
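
The difference between triggering a flaw and merely claiming success can be illustrated with a small verification harness. This is a minimal sketch under stated assumptions, not the method of any study cited here: `vulnerable_write`, `validate_poc`, `honest_poc`, and `simulated_poc` are hypothetical names, and the "flaw" is a harmless file write inside a temporary directory. The harness ignores whatever the PoC prints or returns and checks only for an independent side effect.

```python
import os
import tempfile
from typing import Callable

# Hypothetical stand-in for a vulnerable target: it writes to any path the
# caller supplies. Reaching this write is the "flaw" a PoC must trigger.
def vulnerable_write(path: str, data: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(data)

PoC = Callable[[Callable[[str, str], None], str], str]

def validate_poc(poc: PoC) -> bool:
    """Judge a PoC by an observable side effect, never by its own claim."""
    with tempfile.TemporaryDirectory() as workdir:
        marker = os.path.join(workdir, "marker.txt")
        try:
            poc(vulnerable_write, marker)  # return value is deliberately ignored
        except Exception:
            pass  # a crashing PoC is valid only if the marker still appeared
        return os.path.exists(marker)

def honest_poc(target: Callable[[str, str], None], marker: str) -> str:
    target(marker, "triggered")  # actually exercises the target
    return "exploit succeeded"

def simulated_poc(target: Callable[[str, str], None], marker: str) -> str:
    return "exploit succeeded"  # prints success without touching the target
```

Under these assumptions, `validate_poc(honest_poc)` returns `True` and `validate_poc(simulated_poc)` returns `False`, even though both PoCs report the same success message.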


EVENT TIMELINE

How this story unfolded

17 events from the most recent confirmed update back to the earliest known activity.

Jun 8, 2026 (15d ago)

Anthropic reports LLMs can rapidly turn patches into N-day exploits

Anthropic researchers reported that advanced LLMs could autonomously develop N-day exploits from public patches, testing 18 recent Firefox SpiderMonkey patches and 21 Windows kernel vulnerabilities. The study said Claude Mythos Preview produced multiple proof-of-concept crashes, 8 Firefox code-execution exploits, and 8 Windows SYSTEM local privilege-escalation chains within hours, shrinking the traditional defender patch gap.

N-days \ red.anthropic.com
May 14, 2026 (1mo ago)

Netskope reports AI-assisted discovery of memory corruption bugs

Netskope described a Windows vulnerability research scaffold using OpenAI's 5.5 Cyber preview model with strong runtime verification and debugging feedback loops. Using the setup, the researchers confirmed a kernel memory corruption crash, a user-mode service crash, and two additional kernel memory corruption issues.

Teaching OpenAI 5.5 to Hunt Memory Corruption Bugs - Netskope
May 5, 2026 (2mo ago)

TrustedSec warns LLMs expose defensive product internals faster

TrustedSec argued that LLMs are materially accelerating offensive analysis of AV and EDR products by compressing reverse-engineering and understanding tasks from weeks to days. The article recommended shifting emphasis toward defense-in-depth controls rather than relying on opaque endpoint logic alone.

TrustedSec | The Defensive Stack is Exposed: LLMs, Reverse…
Apr 7, 2026 (3mo ago)

AutoPT SoK paper submitted with large-scale framework evaluation

The paper "Hackers or Hallucinators?" was submitted, presenting a systematization of knowledge for LLM-based automated penetration testing and an empirical evaluation of 13 open-source frameworks plus two baselines. The study involved more than 10 billion tokens, over 1,500 execution logs, and four months of manual review by more than 15 researchers.

[2604.05719] Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
Apr 3, 2026 (3mo ago)

TrustedSec benchmarks self-hosted LLMs on Juice Shop exploits

TrustedSec published a benchmark of six self-hosted LLMs across OWASP Juice Shop exploitation challenges using a constrained toolset. The testing found strong performance on simple exploit tasks, with Gemma4:31b achieving the highest overall pass rate, while more structured multi-step exploitation remained difficult.

TrustedSec | Benchmarking Self-Hosted LLMs for Offensive Security
Mar 31, 2026 (3mo ago)

Risky Biz podcast discusses AI-assisted hunt for iOS zero-days

A Risky Biz Features episode examined an experiment using AI to hunt for iOS zero-day vulnerabilities and whether an LLM could understand or modify a sophisticated iOS exploit kit. The episode concluded that LLMs can materially assist with finding zero-days, including in mature codebases such as WebKit.

A Risky Biz Experiment: Hunting for iOS 0day with AI - Risky Business Media
Mar 9, 2026 (4mo ago)

SentinelOne publishes LLM-driven CTI extraction pipeline

SentinelOne described a three-phase pipeline for converting narrative cyber threat intelligence reports into structured JSON and knowledge graphs using LLMs. The post reported major analyst time savings in preliminary evaluation while emphasizing trade-offs in completeness, correctness, and abstention behavior.

From Narrative to Knowledge Graph | LLM-Driven Information Extraction in Cyber Threat Intelligence | SentinelOne
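
The step from structured JSON to a knowledge graph, and the role of abstention, can be sketched as follows. This is an illustrative sketch only and does not reproduce SentinelOne's pipeline: the field names in `ALLOWED_FIELDS`, the `mentions_*` relation names, and both function names are assumptions made for the example. A `null` value is treated as the model abstaining, so no graph edge is created for that field.

```python
import json
from typing import Optional

# Hypothetical schema: the only fields an extraction record may contain.
ALLOWED_FIELDS = {"threat_actor", "malware", "cve"}

def parse_extraction(raw: str) -> dict[str, Optional[str]]:
    """Validate one extraction record; None means the model abstained."""
    record = json.loads(raw)
    unknown = set(record) - ALLOWED_FIELDS
    if unknown:
        # Reject fields outside the schema instead of silently keeping them.
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    # Missing fields are normalized to None, the same as an explicit null.
    return {field: record.get(field) for field in sorted(ALLOWED_FIELDS)}

def to_edges(report_id: str,
             record: dict[str, Optional[str]]) -> list[tuple[str, str, str]]:
    """Turn non-null fields into (subject, relation, object) graph edges."""
    return [
        (report_id, f"mentions_{field}", value)
        for field, value in record.items()
        if value is not None
    ]
```

For example, parsing `'{"threat_actor": "ExampleGroup", "malware": null}'` and passing the result to `to_edges("report-1", ...)` yields a single edge for the threat actor. Dropping abstained fields favors correctness over completeness, which is one side of the trade-off the post describes.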
Feb 26, 2026 (4mo ago)

LinkedIn post highlights human validation failures in LLM exploit generation

A LinkedIn post summarized a 2026 follow-up study finding that 71.5% of LLM-generated PoCs previously counted as successful were invalid under human review. The post warned that models often simulated exploitation by printing fake success messages or embedding simplified vulnerable logic.

LLM Exploit Generation Fails in Human Tests | Denis Laskov posted on the topic | LinkedIn
Jan 30, 2026 (5mo ago)

Anamnesis automatic exploit generation release appears on GitHub

A GitHub repository titled "anamnesis-release" was published under the SeanHeelan account, marking the public release of a project for automatic exploit generation with LLMs.

GitHub - SeanHeelan/anamnesis-release: Automatic Exploit Generation with LLMs · GitHub
Nov 11, 2025 (7mo ago)

Expanded Incalmo study reports strong results on 40-host benchmark

A later Incalmo study introduced MHBench, a benchmark of 40 emulated multi-host enterprise environments, and reported that Incalmo acquired critical assets in 37 of 40 cases. The paper said the code and benchmark would be open sourced and that results were disclosed to leading LLM vendors for safeguards.

Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks
Sep 16, 2025 (9mo ago)

xOffense paper introduces domain-adapted autonomous pentesting framework

Researchers introduced xOffense, a multi-agent autonomous penetration testing framework built around a fine-tuned Qwen3-32B model. The paper reported a 79.17% sub-task completion rate on benchmark evaluations and claimed better performance than systems including VulnBot and PentestGPT.

[2509.13021] xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models
Aug 8, 2025 (11mo ago)

Theori details RoboDuck AIxCC cyber reasoning system

Theori described RoboDuck, its open-sourced Cyber Reasoning System for DARPA's AI Cyber Challenge, built to autonomously find, trigger, and patch vulnerabilities in large C and Java codebases. The architecture combined static analysis and fuzzing with multiple LLM-based components for bug discovery, proof generation, and patch creation.

AI Cyber Challenge and Theori's RoboDuck - Xint
Jan 27, 2025 (1y ago)

Incalmo paper evaluates LLMs on multistage network attacks

A paper on the feasibility of using LLMs for multistage network attacks reported that popular LLMs alone failed across 10 realistic environments, then introduced Incalmo as an abstraction layer to improve execution. With Incalmo, LLMs reportedly succeeded in 9 of 10 emulated networks containing 25 to 50 hosts.

[2501.16466] Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks
Jul 18, 2024 (2y ago)

PoCGen paper presents autonomous npm exploit generation system

Researchers published PoCGen, a system combining LLMs with static and dynamic analysis to generate and validate proof-of-concept exploits for npm package vulnerabilities. The paper reported successful exploit generation for 432 of 560 benchmarked vulnerabilities and six recent real-world vulnerabilities that previously lacked PoCs.

PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages
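
The reported counts can be converted to percentages for comparison with the headline figures. The short sketch below uses only numbers stated on this page (432 of 560 for PoCGen, and 37 of 40 and 9 of 10 from the two Incalmo entries); the function name and dictionary labels are illustrative. These are success rates as reported by the papers themselves, not rates confirmed by independent review.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Return the success rate as a percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts

# Counts as reported in the timeline entries on this page.
reported = {
    "PoCGen (npm PoCs)": (432, 560),
    "Incalmo (MHBench environments)": (37, 40),
    "Incalmo (initial emulated networks)": (9, 10),
}

rates = {name: round(success_rate(s, n), 1) for name, (s, n) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The PoCGen ratio works out to about 77.1%, which matches the 77% figure quoted in the summary.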
Jun 2, 2024 (2y ago)

Paper reports coordinated LLM agents can exploit zero-days better than single agents

An arXiv abstract for "Teams of LLM Agents can Exploit Zero-Day Vulnerabilities" reported that coordinated agent teams improved exploitation performance on real-world vulnerabilities by up to 4.3 times over prior agent frameworks. The paper described a benchmark of 14 real-world vulnerabilities.

[2406.01637] Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
Jun 1, 2024 (2y ago)

Project Zero publishes Naptime offensive LLM evaluation framework

Google Project Zero published its Naptime framework for evaluating and operating LLMs in vulnerability research, arguing that methodology strongly affects measured capability. It reported large gains on CyberSecEval 2 when models were given more reasoning time, tools, and automatic verification, while still concluding current models fall short of real-world autonomous offensive research.

Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models - Project Zero
May 19, 2024 (2y ago)

Study introduces HPTSA for autonomous zero-day web exploitation

Researchers from the University of Illinois Urbana-Champaign presented HPTSA, a hierarchical multi-agent LLM system for exploiting real-world zero-day web vulnerabilities. The study reported improved performance over single-agent approaches on a benchmark of recent real-world web flaws.

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities