Anthropic Tightens AI Cyber Controls With Safeguards and Jailbreak Bounty

EVENT TIMELINE

How this story unfolded

3 events from the most recent confirmed update back to the earliest known activity.

3 EVENTS

Jul 2, 2026

IBM and Red Hat launch Project Lightwell

IBM and Red Hat launched Project Lightwell, a $5 billion subscription service intended to identify vulnerabilities in deployed open-source versions, create backported fixes, and deliver signed validated patches under SLAs. IBM said 20,000 engineers were assigned to the effort, with 11 major financial institutions as design partners and Deloitte supporting regulated supply-chain and patch-deployment work.

Anthropic's AI Finds Bugs. IBM Bets $5B It Can Fix Them.

Jul 1, 2026

Anthropic publishes HackerOne cyber jailbreak program for Claude Fable 5

Anthropic published a HackerOne program focused on identifying high-impact jailbreaks against Claude Fable 5 that materially increase offensive cyber capability. The policy defined scope, exclusions, severity criteria, reporting requirements, and safe-harbor expectations for researchers.

Anthropic Cyber Jailbreak | Response Policy | HackerOne

Jun 30, 2026

Anthropic announces real-time cyber safeguards for Claude Opus and Sonnet

Anthropic announced new real-time cyber safeguards for Claude Opus and Sonnet that automatically detect and block requests associated with prohibited or high-risk cybersecurity use. The company said certain high-risk dual-use activities may be reviewed through its Cyber Verification Program for legitimate defensive professionals.

Real-time cyber safeguards on Claude Opus and Sonnet | Claude Help Center

LINKED ENTITIES

Related entities

Vulnerabilities, threat actors, malware, products, organizations, and breaches Mallory has linked to this story.

41 LINKED

Affected products

4 linked

Claude Code, Azure Portal, Red Hat Enterprise Linux, Microsoft 365

Organizations

37 linked

Anthropic, HackerOne, Wells Fargo, Morgan Stanley, Cisco Systems, Red Hat, Nvidia, Amazon Web Services, Citigroup, Securonix, Palo Alto Networks, Chainguard, State Street, International Business Machines, RBC, Visa, Sonar, Cloud Security Alliance, Bank of America, Omdia, CrowdStrike, ActiveState, Apple, Deloitte, Mastercard, Broadcom, Microsoft Corporation, JPMorgan Chase, Endor Labs, Google, International Data Corporation, O'Reilly Media, BNY, Tidelift, The Goldman Sachs Group, VulNow, Seal Security

SOURCE COVERAGE

Sources

4 references tracked. Mallory keeps watching after this page renders.

4 SOURCES

Dark Reading (News)

Jul 2, 2026

Anthropic's AI Finds Bugs. IBM Bets $5B It Can Fix Them.

darkreading.com

SecuritySenses Blog (News)

Jul 2, 2026

Anthropic and The Monster Outside the Fable | SecuritySenses

securitysenses.com

HackerOne (News)

Jul 1, 2026

Anthropic Cyber Jailbreak | Response Policy | HackerOne

hackerone.com

Support Claude (News)

Jun 30, 2026

Real-time cyber safeguards on Claude Opus and Sonnet | Claude Help Center

support.claude.com

ON THE SAME THREAD

Anthropic Publishes Claude Fable 5 Cyber Safeguards and Jailbreak Severity Framework

Anthropic said it has globally re-deployed **Claude Fable 5** with updated cybersecurity controls and released new technical details on how the model handles cyber-related prompts. The company said the system uses safety classifiers to sort requests into **prohibited**, **high-risk dual-use**, **low-risk dual-use**, and **benign** categories rather than blocking all security activity, allowing some defensive and educational use while aiming to stop harmful assistance. Anthropic said prohibited requests include malware development, ransomware, wipers, data exfiltration, defense evasion, offensive infrastructure such as `C2`, destructive attacks, and cyber-physical sabotage, and that the model applies a larger safety margin than earlier versions to reduce dangerous outputs even if that increases false positives. Anthropic also published an early draft **Cyber Jailbreak Severity (CJS)** framework, developed with **Glasswing**, to rate AI jailbreaks from `CJS-0` to `CJS-4` based on capability gain, breadth of impact, ease of weaponization, and discoverability. The company said the framework is intended to create a shared vocabulary for assessing jailbreak risk across industry and government, particularly for cases involving high-uplift vulnerability discovery or exploit generation. Anthropic invited external feedback through a dedicated contact channel and launched a **HackerOne** bug bounty program for researchers to report potential cyber jailbreaks affecting Fable 5.

Jul 3, 2026
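
The four-way triage described in the Fable 5 summary above can be illustrated with a minimal Python sketch. The category names come from that summary. Everything else is an assumption made for illustration: the `Action` values, the `POLICY` table, the `decide` function, and the `verified_defender` flag are hypothetical and do not describe Anthropic's actual implementation.

```python
from enum import Enum


class CyberCategory(Enum):
    """The four request categories named in the summary."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual-use"
    LOW_RISK_DUAL_USE = "low-risk dual-use"
    BENIGN = "benign"


class Action(Enum):
    """Hypothetical handling outcomes (assumed, not from the source)."""
    BLOCK = "block"
    REQUIRE_VERIFICATION = "require_verification"
    ALLOW = "allow"


# Hypothetical policy table: which action each category maps to is an
# assumption chosen to make the idea of tiered handling concrete.
POLICY = {
    CyberCategory.PROHIBITED: Action.BLOCK,
    CyberCategory.HIGH_RISK_DUAL_USE: Action.REQUIRE_VERIFICATION,
    CyberCategory.LOW_RISK_DUAL_USE: Action.ALLOW,
    CyberCategory.BENIGN: Action.ALLOW,
}


def decide(category: CyberCategory, verified_defender: bool = False) -> Action:
    """Return the handling action for an already-classified request.

    A verified defender (hypothetical flag) is allowed through the
    high-risk dual-use tier; prohibited requests are always blocked.
    """
    action = POLICY[category]
    if action is Action.REQUIRE_VERIFICATION and verified_defender:
        return Action.ALLOW
    return action
```

The point of the tiered table is that only one category is refused outright, which matches the summary's statement that the system does not block all security activity.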

Anthropic Restricts Mythos AI After It Finds and Exploits Thousands of Software Flaws

Anthropic unveiled **Claude Mythos Preview**, an unreleased AI model it says can autonomously discover and exploit severe software vulnerabilities across major operating systems, browsers, open-source projects, and some closed-source targets. The company said the model uncovered thousands of high-severity flaws, including long-lived bugs in **OpenBSD**, **FFmpeg**, Linux kernel privilege-escalation chains, and **`CVE-2026-4747`**, a FreeBSD NFS remote code execution flaw that could enable unauthenticated root access. Anthropic withheld broad release, citing offensive cyber risk, and instead launched **Project Glasswing**, a gated program for roughly 40 to 50 partners such as AWS, Apple, Cisco, Cloudflare, Google, JPMorgan Chase, Microsoft, Mozilla, NVIDIA, and Palo Alto Networks to validate findings, patch affected software, and study defensive uses. Independent and industry assessments broadly agreed Mythos marks a significant advance in AI-enabled cyber capability, though several researchers questioned how much of Anthropic’s headline claims can yet be verified through public CVEs and warned that similar results may be reproducible with cheaper or open models plus strong tooling. The UK AI Security Institute found Mythos achieved a **73%** success rate on expert capture-the-flag tasks and completed a full 32-step simulated enterprise attack in 3 of 10 runs, while Anthropic later reported coordinated disclosure activity spanning **1,596 vulnerabilities across 281 open-source projects** and partners identifying more than **10,000** high- or critical-severity candidates. Governments, financial regulators, and CISO groups in the US, UK, Europe, Canada, and Japan responded with briefings and warnings that AI is compressing the gap between vulnerability discovery and weaponization, leaving remediation, patch governance, and defensive automation as the main bottlenecks.

Jun 29, 2026

Claude Fable 5 Jailbreak Exposes Limits of Anthropic’s Cyber Guardrails

Anthropic launched **Claude Fable 5** as the public, safeguarded version of its new **Mythos 5** model, describing both as configurations of the same underlying system and positioning Mythos 5 as its strongest cyber-capable model to date. In its system card, Anthropic said Fable 5 uses classifiers and fallback behavior to divert high-risk requests to **Claude Opus 4.8**, keeping its cyber performance roughly in line with the older model while Mythos 5 remains restricted to vetted partners through **Project Glasswing**. The company assessed Mythos 5 as a **Tier 1 cyber offense risk**, meaning it can materially assist offensive operations but does not autonomously conduct adaptive cyber campaigns. Soon after release, researcher **Pliny the Liberator** reportedly bypassed Fable 5’s safeguards using a multi-agent jailbreak that combined Unicode obfuscation, long-context manipulation, narrative framing, and decomposition of harmful requests, allegedly eliciting exploit-development and chemistry-related guidance and exposing a leaked system prompt on GitHub. The incident undercut Anthropic’s claim that more than 1,000 hours of external testing had found no universal jailbreaks, while also intensifying criticism from security researchers who said Fable 5’s guardrails were already so broad that they blocked benign work such as reading security blogs, reviewing code, and writing secure software. Anthropic had not publicly responded to the jailbreak claims at the time of reporting.

Jul 2, 2026
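
The fallback behavior described in the summary above, in which requests flagged as high-risk are diverted to an older model, might look roughly like the following Python sketch. The model identifier strings, the `route` function, and the `keyword_stub` classifier are all hypothetical placeholders. A production classifier would be a trained model, not a keyword match.

```python
from typing import Callable

# Hypothetical identifiers used only for this sketch.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"


def route(prompt: str, is_high_risk: Callable[[str], bool]) -> str:
    """Pick which model should answer, given a risk classifier."""
    return FALLBACK_MODEL if is_high_risk(prompt) else PRIMARY_MODEL


def keyword_stub(prompt: str) -> bool:
    """Placeholder classifier (assumption): flags prompts mentioning 'exploit'."""
    return "exploit" in prompt.lower()


if __name__ == "__main__":
    for text in ("Explain how TLS handshakes work",
                 "Write an exploit for this bug"):
        print(text, "->", route(text, keyword_stub))
```

A stub this crude would also flag harmless prompts that merely mention the word, which is the same over-blocking trade-off that critics raised about broad guardrails.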

Related stories

Anthropic Publishes Claude Fable 5 Cyber Safeguards and Jailbreak Severity Framework

Anthropic Restricts Mythos AI After It Finds and Exploits Thousands of Software Flaws

Claude Fable 5 Jailbreak Exposes Limits of Anthropic’s Cyber Guardrails
