Sarah Silverman and Authors Sue Meta and OpenAI for Copyright Infringement Over Book Training Data
High
Comedian Sarah Silverman and authors sued Meta and OpenAI in July 2023, alleging their AI models were trained on pirated copies of their copyrighted books from Library Genesis without permission.
Category
Copyright Violation
Industry
Technology
Status
Litigation Pending
Date Occurred
Jun 28, 2023
Date Reported
Jul 10, 2023
Jurisdiction
US
AI Provider
OpenAI and Meta
Model
ChatGPT and LLaMA
Application Type
chatbot
Harm Type
financial
People Affected
3
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
pending
copyright, training_data, fair_use, class_action, books, authors, library_genesis, piracy, legal_precedent
Full Description
In July 2023, comedian and author Sarah Silverman, along with writers Christopher Golden and Richard Kadrey, filed separate but related class-action lawsuits against Meta Platforms and OpenAI in the U.S. District Court for the Northern District of California. The lawsuits alleged that both companies' AI language models - ChatGPT/GPT-3.5/GPT-4 from OpenAI and LLaMA from Meta - were trained on copyrighted books without permission from the authors or publishers.
The plaintiffs claimed that their works were obtained from Library Genesis (LibGen), a notorious "shadow library" website that hosts millions of pirated books and academic papers. The lawsuits alleged that the defendants used these unauthorized copies as training data for their large language models, enabling the AI systems to generate summaries and analyses of the copyrighted works. As evidence, the plaintiffs demonstrated that ChatGPT could produce accurate summaries of their books when prompted, suggesting the models had been trained on the full text of their works.
The legal filings argued that this constituted direct copyright infringement, vicarious copyright infringement, and violations of the Digital Millennium Copyright Act (DMCA). The authors sought monetary damages, including profits derived from the alleged infringement, as well as injunctive relief to prevent further unauthorized use of their works. The cases were structured as class actions, potentially representing thousands of authors whose works may have been similarly used without permission.
Both Meta and OpenAI filed motions to dismiss the lawsuits, arguing that their use of copyrighted material constituted fair use under copyright law. The companies contended that training AI models on copyrighted texts for the purpose of learning patterns and generating new content fell within the bounds of transformative fair use. They also argued that the plaintiffs failed to demonstrate concrete harm or that the AI outputs constituted infringing derivative works.
The litigation represents a landmark test case for how copyright law applies to AI training data, with implications for the entire generative AI industry. The outcomes could establish important precedents regarding whether using copyrighted works to train AI models requires explicit permission from rights holders, or whether such use qualifies as fair use. As of late 2023, the cases remained in the discovery phase, with both sides gathering evidence about training data sources and methodologies.
Root Cause
AI companies allegedly used copyrighted books from Library Genesis and other sources to train large language models without obtaining proper licenses or permissions from copyright holders.
Mitigation Analysis
Proper data provenance tracking and licensing compliance could have prevented this issue. Companies should implement robust copyright clearance processes before using training data, maintain detailed records of data sources, and establish legal review procedures for training datasets. Automated content filtering to exclude known copyrighted works and partnerships with publishers for legitimate licensing would reduce infringement risks.
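As a rough illustration of the provenance tracking and filtering described above, the sketch below records where each candidate document came from and keeps only those that pass a clearance check. Every name in it (SourceRecord, ALLOWED_LICENSES, BLOCKED_ORIGINS, partition_by_clearance) and the license labels are hypothetical, chosen for this example; it is not a description of how any company involved actually handles training data.

```python
from dataclasses import dataclass

# Hypothetical license labels a legal review might approve (assumption, not a standard list).
ALLOWED_LICENSES = {"public-domain", "cc-by", "publisher-licensed"}

# Hypothetical blocklist of origins with pirated or unclear provenance (assumption).
BLOCKED_ORIGINS = {"libgen", "unknown"}


@dataclass(frozen=True)
class SourceRecord:
    """Provenance entry for one candidate training document."""
    doc_id: str
    origin: str     # where the text was obtained
    license: str    # license label recorded at ingestion time
    reviewed: bool  # True once legal review has signed off


def partition_by_clearance(records):
    """Split records into (cleared, rejected) lists.

    A record is cleared only if its origin is not blocked, its license
    is on the allow-list, and legal review has signed off.
    """
    cleared, rejected = [], []
    for rec in records:
        ok = (
            rec.origin.lower() not in BLOCKED_ORIGINS
            and rec.license.lower() in ALLOWED_LICENSES
            and rec.reviewed
        )
        (cleared if ok else rejected).append(rec)
    return cleared, rejected


if __name__ == "__main__":
    sample = [
        SourceRecord("a", "publisher-feed", "publisher-licensed", True),
        SourceRecord("b", "LibGen", "unknown", False),
    ]
    cleared, rejected = partition_by_clearance(sample)
    print("cleared:", [r.doc_id for r in cleared])
    print("rejected:", [r.doc_id for r in rejected])
```

Keeping the rejected list, rather than silently dropping those records, preserves the kind of detailed source record the analysis calls for and gives legal reviewers something to audit.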
Lessons Learned
This case highlights the urgent need for clear legal frameworks governing AI training data and the intersection of copyright law with machine learning. It demonstrates the importance of proactive data licensing and the potential risks of using datasets with unclear provenance.
Sources
Sarah Silverman sues OpenAI and Meta for copyright infringement
The Guardian · Jul 10, 2023 · news
Sarah Silverman, other authors sue OpenAI, Meta for copyright infringement
Reuters · Jul 10, 2023 · news