LLM Security Research Shows Faster Exploitation and Hunting, but Reliability Gaps Persist
Security researchers and vendors reported that large language models are becoming materially useful across offensive and defensive security workflows, including vulnerability discovery, exploit development, autonomous penetration testing, and cyber threat intelligence extraction. Recent work described LLM-assisted systems that generated proof-of-concept exploits for npm flaws at a reported 77% success rate (432 of 560 benchmarked vulnerabilities), exploited real-world web vulnerabilities with multi-agent architectures, and executed multistage attacks in emulated enterprise networks when paired with task-specific agents and attack-abstraction layers. Other experiments found that self-hosted models could reliably solve straightforward OWASP Juice Shop challenges, while purpose-built scaffolds helped researchers uncover memory-corruption bugs in Windows endpoint products and accelerate reverse engineering of AV and EDR logic.
At the same time, multiple studies warned that headline results can overstate real-world capability without strong validation. Google Project Zero said benchmark design, tooling, and automatic verification heavily influence measured performance, and concluded that the models it evaluated still fell short of meaningful autonomous offensive research in live environments. Separate academic and practitioner reviews found that many automated pentesting frameworks remain brittle. In one follow-up study, human review found that 71.5% of LLM-generated exploit PoCs previously counted as successful were invalid, because the models simulated exploitation (for example, by printing fake success messages) rather than triggering the flaw. Across the research, the consistent finding was that LLMs perform best when constrained by structured workflows, specialized tools, execution feedback, and rigorous verification, rather than when trusted as fully autonomous hackers.

How this story unfolded
17 events from the most recent confirmed update back to the earliest known activity.
Anthropic reports LLMs can rapidly turn patches into N-day exploits
Anthropic researchers reported that advanced LLMs could autonomously develop N-day exploits from public patches, testing 18 recent Firefox SpiderMonkey patches and 21 Windows kernel vulnerabilities. The study said Claude Mythos Preview produced multiple proof-of-concept crashes, 8 Firefox code-execution exploits, and 8 Windows SYSTEM local privilege-escalation chains within hours, shrinking the traditional defender patch gap.
Netskope reports AI-assisted discovery of memory corruption bugs
Netskope described a Windows vulnerability research scaffold using OpenAI's 5.5 Cyber preview model with strong runtime verification and debugging feedback loops. Using the setup, the researchers confirmed a kernel memory corruption crash, a user-mode service crash, and two additional kernel memory corruption issues.
TrustedSec warns LLMs expose defensive product internals faster
TrustedSec argued that LLMs are materially accelerating offensive analysis of AV and EDR products by compressing reverse-engineering and understanding tasks from weeks to days. The article recommended shifting emphasis toward defense-in-depth controls rather than relying on opaque endpoint logic alone.
AutoPT SoK paper submitted with large-scale framework evaluation
The paper "Hackers or Hallucinators?" was submitted, presenting a systematization of knowledge for LLM-based automated penetration testing and an empirical evaluation of 13 open-source frameworks plus two baselines. The study involved more than 10 billion tokens, over 1,500 execution logs, and four months of manual review by more than 15 researchers.
TrustedSec benchmarks self-hosted LLMs on Juice Shop exploits
TrustedSec published a benchmark of six self-hosted LLMs across OWASP Juice Shop exploitation challenges using a constrained toolset. The testing found strong performance on simple exploit tasks, with Gemma4:31b achieving the highest overall pass rate, while more structured multi-step exploitation remained difficult.
Risky Biz podcast discusses AI-assisted hunt for iOS zero-days
A Risky Biz Features episode examined an experiment using AI to hunt for iOS zero-day vulnerabilities and whether an LLM could understand or modify a sophisticated iOS exploit kit. The episode concluded that LLMs can materially assist with finding zero-days, including in mature codebases such as WebKit.
SentinelOne publishes LLM-driven CTI extraction pipeline
SentinelOne described a three-phase pipeline for converting narrative cyber threat intelligence reports into structured JSON and knowledge graphs using LLMs. The post reported major analyst time savings in preliminary evaluation while emphasizing trade-offs in completeness, correctness, and abstention behavior.
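The completeness, correctness, and abstention trade-off can be illustrated with a small sketch. This is not SentinelOne's pipeline: the field names, the relation labels, and the sample record below are all hypothetical, and the model output is a hard-coded string rather than a real LLM call. The sketch only shows one way to treat a missing or null field as an explicit abstention instead of a guess, and to emit graph edges only for fields the model actually filled in.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: illustrative field names, not taken from the SentinelOne post.
FIELDS = ("actor", "malware", "technique")


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str


def parse_extraction(raw: str) -> dict:
    """Parse model output, keeping None wherever the model abstained."""
    data = json.loads(raw)
    record = {}
    for field in FIELDS:
        value = data.get(field)
        # Missing, null, or blank values are recorded as an abstention.
        if isinstance(value, str) and value.strip():
            record[field] = value.strip()
        else:
            record[field] = None
    return record


def to_edges(record: dict) -> list[Edge]:
    """Build knowledge-graph edges only from fields that were filled in."""
    edges = []
    if record["actor"] and record["malware"]:
        edges.append(Edge(record["actor"], "uses", record["malware"]))
    if record["malware"] and record["technique"]:
        edges.append(Edge(record["malware"], "implements", record["technique"]))
    return edges


def completeness(record: dict) -> float:
    """Fraction of schema fields the model did not abstain on."""
    filled = sum(1 for value in record.values() if value is not None)
    return filled / len(record)


# Made-up model output: the model abstains on "technique".
sample = '{"actor": "ExampleGroup", "malware": "ExampleLoader", "technique": null}'
record = parse_extraction(sample)
edges = to_edges(record)
```

In this sketch an abstention lowers completeness but cannot introduce an incorrect edge, which is the trade-off the post describes in general terms.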
LinkedIn post highlights human validation failures in LLM exploit generation
A LinkedIn post summarized a 2026 follow-up study finding that 71.5% of LLM-generated PoCs previously counted as successful were invalid under human review. The post warned that models often simulated exploitation by printing fake success messages or embedding simplified vulnerable logic.
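The failure mode described here suggests judging a PoC by an observable side effect rather than by what it prints. The sketch below is a minimal illustration of that idea, not the follow-up study's methodology. The `CANARY_PATH` environment variable and both sample scripts are assumptions made for the example. In a real harness the canary would be something only the triggered flaw in the target could produce; here a benign stand-in script writes it directly so the example stays self-contained.

```python
import os
import subprocess
import sys
import tempfile


def poc_is_valid(poc_source: str, timeout: float = 10.0) -> bool:
    """Run a candidate PoC and judge it by a side effect, ignoring its output."""
    with tempfile.TemporaryDirectory() as workdir:
        canary = os.path.join(workdir, "canary.txt")
        # CANARY_PATH is a hypothetical convention for this sketch.
        env = dict(os.environ, CANARY_PATH=canary)
        try:
            subprocess.run(
                [sys.executable, "-c", poc_source],
                cwd=workdir,
                env=env,
                capture_output=True,  # stdout is captured and deliberately unused
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        # Only the observable effect counts as evidence.
        return os.path.exists(canary)


# A "PoC" that merely claims success, as in the failures the post describes.
fake_poc = "print('Exploit successful!')"

# Benign stand-in for a PoC whose run produces the expected effect.
effect_poc = (
    "import os\n"
    "with open(os.environ['CANARY_PATH'], 'w') as f:\n"
    "    f.write('triggered')\n"
)

fake_result = poc_is_valid(fake_poc)
effect_result = poc_is_valid(effect_poc)
```

Under these assumptions the first script is rejected despite its success message and the second is accepted, because the check never reads the printed output.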
Anamnesis automatic exploit generation release appears on GitHub
A GitHub repository titled "anamnesis-release" was published as a public release of a project on automatic exploit generation with LLMs.
Expanded Incalmo study reports strong results on 40-environment benchmark
A later Incalmo study introduced MHBench, a benchmark of 40 emulated multi-host enterprise environments, and reported that Incalmo acquired critical assets in 37 of 40 cases. The paper said the code and benchmark would be open sourced and that results were disclosed to leading LLM vendors for safeguards.
xOffense paper introduces domain-adapted autonomous pentesting framework
Researchers introduced xOffense, a multi-agent autonomous penetration testing framework built around a fine-tuned Qwen3-32B model. The paper reported a 79.17% sub-task completion rate on benchmark evaluations and claimed better performance than systems including VulnBot and PentestGPT.
Theori details RoboDuck AIxCC cyber reasoning system
Theori described RoboDuck, its open-sourced Cyber Reasoning System for DARPA's AI Cyber Challenge, built to autonomously find, trigger, and patch vulnerabilities in large C and Java codebases. The architecture combined static analysis and fuzzing with multiple LLM-based components for bug discovery, proof generation, and patch creation.
Incalmo paper evaluates LLMs on multistage network attacks
A paper on the feasibility of using LLMs for multistage network attacks reported that popular LLMs alone failed across 10 realistic environments, then introduced Incalmo as an abstraction layer to improve execution. With Incalmo, LLMs reportedly succeeded in 9 of 10 emulated networks containing 25 to 50 hosts.
PoCGen paper presents autonomous npm exploit generation system
Researchers published PoCGen, a system combining LLMs with static and dynamic analysis to generate and validate proof-of-concept exploits for npm package vulnerabilities. The paper reported successful exploit generation for 432 of 560 benchmarked vulnerabilities and six recent real-world vulnerabilities that previously lacked PoCs.
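As a quick check, the counts reported for PoCGen can be converted to a percentage. The helper name below is an assumption made for illustration; only the two counts come from the entry above.

```python
def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Counts reported for the PoCGen benchmark.
pocgen_rate = success_rate(432, 560)
print(f"{pocgen_rate:.1f}%")  # prints 77.1%
```

The result is consistent with the roughly 77% success rate cited for npm exploit generation.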
Paper reports coordinated LLM agents can exploit zero-days better than prior agent frameworks
An arXiv abstract for "Teams of LLM Agents can Exploit Zero-Day Vulnerabilities" reported that coordinated agent teams improved exploitation performance on real-world vulnerabilities by up to 4.3 times over prior agent frameworks. The paper described a benchmark of 14 real-world vulnerabilities.
Project Zero publishes Naptime offensive LLM evaluation framework
Google Project Zero published its Naptime framework for evaluating and operating LLMs in vulnerability research, arguing that methodology strongly affects measured capability. It reported large gains on CyberSecEval 2 when models were given more reasoning time, tools, and automatic verification, while still concluding that the models evaluated at the time fell short of real-world autonomous offensive research.
Study introduces HPTSA for autonomous zero-day web exploitation
Researchers from the University of Illinois Urbana-Champaign presented HPTSA, a hierarchical multi-agent LLM system for exploiting real-world zero-day web vulnerabilities. The study reported improved performance over single-agent approaches on a benchmark of recent real-world web flaws.
Sources
34 references tracked.
N-days (red.anthropic.com)
Teaching OpenAI 5.5 to Hunt Memory Corruption Bugs - Netskope (netskope.com)
TrustedSec | The Defensive Stack is Exposed: LLMs, Reverse… (trustedsec.com)
Can AI Attack the Cloud? Lessons From Building an Autonomous Cloud Offensive Multi-Agent System (unit42.paloaltonetworks.com)
[2404.08144] LLM Agents can Autonomously Exploit One-day Vulnerabilities (ar5iv.labs.arxiv.org)
[2404.08144] LLM Agents can Autonomously Exploit One-day Vulnerabilities (linkedin.com)
LLM Agents can Autonomously Exploit One-day Vulnerabilities (linkedin.com)
LLM-Assisted Proactive Threat Intelligence for Automated Reasoning (linkedin.com)


