Authors Guild and Major Authors Sue OpenAI for Copyright Infringement Over ChatGPT Training

High

The Authors Guild and 17 prominent authors sued OpenAI in September 2023, alleging ChatGPT was trained on their copyrighted books without permission. The class action lawsuit seeks damages and injunctive relief, raising fundamental questions about fair use in AI training.

Full Description

On September 20, 2023, the Authors Guild filed a class action lawsuit in the Southern District of New York against OpenAI, alleging systematic copyright infringement in the training of ChatGPT. The lawsuit was brought on behalf of 17 prominent authors, including bestselling novelists John Grisham, George R.R. Martin, Jodi Picoult, and Michael Connelly. The plaintiffs alleged that OpenAI copied and used their copyrighted works without permission, consent, or compensation to train the large language models behind ChatGPT.

The complaint detailed how OpenAI allegedly used millions of copyrighted books, including the plaintiffs' works, obtained from 'shadow libraries' and other unauthorized sources. The authors argued that ChatGPT could produce summaries, analyses, and derivative content based on their copyrighted works, demonstrating that the system had ingested and processed their intellectual property. The lawsuit cited specific examples in which ChatGPT accurately summarized plots and characters from the authors' books, which the plaintiffs argued proved their works were used in training.

The Authors Guild sought both monetary damages and injunctive relief, requesting that the court order OpenAI to cease using copyrighted works for training and to destroy models trained on infringing material. The plaintiffs argued that OpenAI's use of their works did not qualify as fair use under copyright law because it was commercial in nature and could harm the market for their original works.

OpenAI's defense centered on fair use, contending that training AI models on copyrighted works is a transformative use protected under copyright law. The company argued that the training process does not create copies but instead uses the works to teach the system patterns and language structures, and that ChatGPT does not reproduce copyrighted works verbatim but generates original content based on learned patterns.

The lawsuit was one of the most significant legal challenges to the AI industry's practice of training models on vast datasets that include copyrighted content, and it became a bellwether for the broader legal questions surrounding AI training data. Its implications extended beyond the immediate parties: many major AI companies, including Google, Meta, and Anthropic, rely on similar training methodologies using large datasets that likely include copyrighted material. The outcome could establish precedents affecting how AI companies acquire and use training data, potentially requiring extensive licensing agreements with content creators and publishers. Legal experts noted that the case could either validate current AI training practices under the fair use doctrine or force the industry to fundamentally restructure how models are trained.

The litigation remained pending as of late 2023, drawing significant attention from both the tech industry and creative communities. The case was part of a broader wave of copyright challenges against AI companies, including similar lawsuits by visual artists and other content creators. The Authors Guild emphasized that the lawsuit was not anti-technology but sought to ensure that creators are compensated when their works are used to build profitable AI systems.

Root Cause

OpenAI allegedly trained ChatGPT on copyrighted literary works without obtaining proper licensing or permission from authors, potentially using pirated copies from shadow libraries. The training process involved ingesting and processing millions of copyrighted texts to develop the language model's capabilities.

Mitigation Analysis

Robust content provenance tracking could have identified copyrighted materials in training datasets, and pre-training legal review and licensing protocols could have reduced the risk of using copyrighted works without authorization. Content filters that exclude copyrighted material lacking explicit permission, combined with automated detection of potentially infringing training data, might have avoided this litigation.
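
As a rough illustration of the provenance-based filtering described above, the following Python sketch separates documents cleared for training from those held for legal review. It is a minimal example under stated assumptions, not a description of any company's actual pipeline: the Document fields, the ALLOWED_LICENSES labels, and the partition_by_provenance function are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical license labels treated as cleared for training. A real policy
# would come from legal review, not from a hard-coded set like this one.
ALLOWED_LICENSES = {"public-domain", "cc0", "licensed"}


@dataclass(frozen=True)
class Document:
    doc_id: str
    source: str   # where the text was obtained (hypothetical field)
    license: str  # license label recorded at ingestion; "" if unknown


def partition_by_provenance(docs: Iterable[Document], blocked_sources: set[str]):
    """Split documents into (cleared, held) lists.

    A document is held back if its source is on the blocklist or if its
    license label is missing or not in ALLOWED_LICENSES.
    """
    cleared, held = [], []
    for doc in docs:
        blocked = doc.source in blocked_sources
        unlicensed = doc.license.lower() not in ALLOWED_LICENSES
        if blocked or unlicensed:
            held.append(doc)
        else:
            cleared.append(doc)
    return cleared, held
```

The sketch defaults to exclusion: a document with an unknown license is held rather than cleared, which mirrors the "explicit permission" requirement in the paragraph above. It depends on provenance metadata being recorded at ingestion time, and it does not attempt the harder problem of detecting copyrighted text that arrives without such metadata.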

Regulatory Framework References

EU AI Act

Art. 53—Obligations for General-Purpose AI Models

ISO/IEC 42001

A.5.4—Legal Compliance

NIST AI RMF

GOVERN 1.2—Legal Compliance

Lessons Learned

The case highlights the urgent need for clear legal frameworks governing AI training on copyrighted content and demonstrates the importance of proactive licensing strategies for AI companies. It underscores the tension between technological innovation and intellectual property rights in the AI era.

Sources

Authors including Grisham, Martin sue OpenAI for copyright infringement

Reuters · Sep 20, 2023 · news

Authors file class action suit against OpenAI for 'systematic theft'

The Guardian · Sep 20, 2023 · news