ai-platform-security · privacy-surveillance-policy

Copyright and Training-Data Extraction Risks in Large Language Models

Updated 3mo ago · First seen Feb 24, 2026 · 3 sources

Research and legal reporting highlighted that LLMs can be induced to reproduce near-verbatim copyrighted text from their training data, raising both copyright and data-governance risk. Coverage cited a prior US court outcome involving Anthropic, in which training on copyrighted works was treated as potentially transformative (fair use) but retaining pirated copies was characterized as infringing and reportedly contributed to a $1.5B settlement. Separate reporting referenced a German decision that found infringement where a model had memorized song lyrics. Anthropic disputed the practical exploitability of the demonstrated “jailbreaking” approach and reiterated that its models learn statistical patterns rather than storing exact dataset copies. Researchers and legal experts argued that full-book reproduction without special access controls would likely constitute copyright violation and could create liability depending on scale and safeguards.

Separately, Elon Musk publicly accused Anthropic of “stealing training data” at massive scale and claimed the company paid multi-billion-dollar settlements, an allegation framed as part of the broader industry dispute over web scraping, copyrighted inputs, and the ethics and legality of training data collection. A third report, about the “Stargate” initiative, described business and governance disputes delaying planned OpenAI/Oracle/SoftBank AI data centers; it does not materially address training-data theft, model memorization, or copyright-extraction risk.


EVENT TIMELINE

How this story unfolded

3 events from the most recent confirmed update back to the earliest known activity.

Mar 17, 2026 · 3mo ago

Chicken Soup for the Soul LLC sues Anthropic over training data

A lawsuit titled "Chicken Soup for the Soul LLC v. Anthropic" was filed, alleging Anthropic improperly used copyrighted material for AI training. The filing reflects the broader wave of legal challenges over how AI companies source training data.

Feb 24, 2026 · 4mo ago

Elon Musk publicly accuses Anthropic of large-scale data theft

On X, Elon Musk accused Anthropic of stealing data at massive scale to train its AI models and claimed the conduct led to multi-billion-dollar settlements. The comments amplified existing controversy over scraping, copyright, and the legality of AI training data collection.

Feb 23, 2026 · 4mo ago

Ars Technica reports AIs can reproduce near-verbatim novel text

Ars Technica reported that AI systems can generate near-verbatim copies of novels from their training data, highlighting evidence that copyrighted works may be memorized and reproduced. The report added technical and public context to ongoing disputes over AI training practices.

LINKED ENTITIES

Related entities

Vulnerabilities, threat actors, malware, products, organizations, and breaches Mallory has linked to this story.

Organizations
7 linked
LinkedIn, Tesla, Anthropic, OpenAI, X, xAI, Google News