
New York Times Sues OpenAI and Microsoft for Copyright Infringement in AI Training Data

High

The New York Times filed a landmark federal lawsuit against OpenAI and Microsoft in December 2023, alleging copyright infringement for using millions of NYT articles to train GPT models without permission, potentially setting precedent for AI training data rights.

Category
Copyright Violation
Industry
Media
Status
Litigation Pending
Date Occurred
Dec 27, 2023
Date Reported
Dec 27, 2023
Jurisdiction
US
AI Provider
OpenAI
Model
GPT models including GPT-4
Application Type
other
Harm Type
financial
Human Review in Place
Unknown
Litigation Filed
Yes
Litigation Status
pending
copyright, training_data, media, licensing, fair_use, journalism, precedent

Full Description

On December 27, 2023, The New York Times filed a federal lawsuit in the Southern District of New York against OpenAI and Microsoft, alleging widespread copyright infringement in the training of artificial intelligence models. The complaint argues that both companies used millions of published articles from The Times without authorization to develop their large language models, including GPT-3.5, GPT-4, and other systems integrated into Microsoft's products such as Bing Chat and Microsoft Copilot. The lawsuit seeks unspecified monetary damages and injunctive relief to prevent further unauthorized use of Times content in AI training and output.

The lawsuit centers on allegations that OpenAI's GPT models and Microsoft's AI systems were trained on extensive datasets containing copyrighted New York Times articles spanning decades of journalism. The Times alleges that the defendants' AI systems can reproduce Times content verbatim, recite Times articles at length, and closely mimic the newspaper's distinctive writing style and journalistic methodology. The complaint includes specific examples in which ChatGPT allegedly reproduced entire paragraphs and sections from identifiable Times articles when prompted with certain queries, demonstrating what the plaintiff characterizes as systematic memorization of copyrighted content.

The New York Times argues that this unauthorized use directly undermines its subscription-based business model by creating a substitute for its journalism without compensating the publisher. The lawsuit contends that users increasingly rely on AI-generated summaries and responses instead of visiting The Times' website or subscribing to its digital services, potentially causing measurable harm to traffic, subscriber acquisition, and advertising revenue.
The complaint emphasizes that The Times invests hundreds of millions of dollars annually in original journalism and that the defendants are essentially free-riding on this investment while competing directly with the newspaper's core business.

In response to the lawsuit, OpenAI issued public statements expressing disappointment with the legal action and asserting that its training practices constitute fair use under copyright law. The company maintained that it had been engaged in productive conversations with The New York Times about potential partnership opportunities and licensing arrangements. Microsoft similarly defended its AI development practices and expressed confidence in its legal position, and both companies indicated they would vigorously contest the allegations in court.

The case represents a landmark legal challenge that could fundamentally reshape how AI companies approach training data acquisition and usage rights. Legal experts widely view the lawsuit as a critical test of whether existing fair use doctrines adequately address the novel challenges posed by large-scale AI training on copyrighted materials. The litigation has prompted other major news organizations and content creators to reassess their own potential claims against AI companies, with several publishers reportedly considering similar legal action. The broader implications extend beyond the immediate parties: the outcome could establish binding precedent for the entire AI industry's reliance on publicly available internet content for model training. If The New York Times prevails, AI companies may be required to negotiate extensive licensing agreements with content creators and publishers, significantly increasing development costs and altering the competitive landscape for AI development.
The case also raises questions about the future of journalism funding models and whether AI companies should be required to compensate content creators whose work contributes to AI model capabilities.

Root Cause

OpenAI and Microsoft allegedly used extensive copyrighted content from The New York Times to train their large language models without obtaining proper licensing agreements or permission from the copyright holder.

Mitigation Analysis

Content provenance tracking and licensing verification systems during training data collection could have prevented this issue. Implementation of opt-out mechanisms for publishers and proactive licensing agreements with content creators would address copyright concerns. Regular audits of training datasets for copyrighted material and establishment of clear content usage policies could reduce legal exposure.
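As a rough illustration of the opt-out and audit mechanisms described above, the following Python sketch filters a training corpus against a publisher opt-out registry and records an audit trail for each document. All names here (`Document`, `OPTED_OUT_DOMAINS`, `filter_training_corpus`) are hypothetical; a production system would instead honor machine-readable signals such as robots.txt directives or licensing metadata, and would verify provenance far more rigorously.

```python
from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

# Hypothetical opt-out registry. Real pipelines might consult robots.txt
# (RFC 9309), publisher licensing terms, or a shared opt-out database.
OPTED_OUT_DOMAINS = {"nytimes.com"}

def source_domain(url: str) -> str:
    # Naive host extraction, sufficient for this sketch only.
    host = url.split("//")[-1].split("/")[0]
    return host[4:] if host.startswith("www.") else host

def filter_training_corpus(docs):
    """Drop documents from opted-out publishers and log every decision,
    producing the kind of audit trail the mitigation analysis calls for."""
    kept, audit_log = [], []
    for doc in docs:
        domain = source_domain(doc.url)
        if domain in OPTED_OUT_DOMAINS:
            audit_log.append((doc.url, "excluded: publisher opt-out"))
        else:
            kept.append(doc)
            audit_log.append((doc.url, "included"))
    return kept, audit_log

docs = [
    Document("https://www.nytimes.com/2023/some-article", "..."),
    Document("https://blog.example.org/post", "..."),
]
kept, log = filter_training_corpus(docs)
```

The default-allow policy in this sketch is itself a design choice; a licensing-first pipeline would invert it and include only sources with affirmative permission.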

Lessons Learned

This case highlights the urgent need for AI companies to establish clear legal frameworks for training data acquisition and to proactively engage with content creators on licensing terms. The lawsuit demonstrates that existing fair use defenses may not adequately protect AI companies from copyright claims, particularly when their models can reproduce substantial portions of copyrighted works.