GitHub Copilot Reproduced Copyrighted Code Verbatim Leading to Class Action Lawsuit

High

GitHub Copilot reproduced copyrighted code verbatim from its training data, leading to a class action lawsuit alleging widespread copyright infringement and violation of open-source licenses.

Category
Copyright Violation
Industry
Technology
Status
Litigation Pending
Date Occurred
Jun 1, 2022
Date Reported
Nov 3, 2022
Jurisdiction
US
AI Provider
Other/Unknown
Model
GitHub Copilot
Application Type
copilot
Harm Type
legal
People Affected
100,000
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
pending
copyright, open-source, licensing, verbatim-reproduction, fair-use, DMCA, code-generation

Full Description

GitHub Copilot, made generally available in June 2022 as a commercial AI-powered coding assistant after a technical preview that began in June 2021, was trained on billions of lines of public source code from GitHub repositories. The system was designed to suggest code completions and generate functions from natural language prompts. Researchers and developers soon found that Copilot could reproduce substantial portions of copyrighted code verbatim, including exact function implementations, comments, and even license headers from the original source.

On November 3, 2022, lawyer Matthew Butterick, together with the Joseph Saveri Law Firm, filed a class action lawsuit against GitHub, Microsoft, and OpenAI in the U.S. District Court for the Northern District of California. The lawsuit, Doe v. GitHub, Inc., alleged that Copilot committed copyright infringement on a massive scale by reproducing licensed code without attribution or compliance with license terms. The plaintiffs argued that Copilot violated section 1202 of the Digital Millennium Copyright Act (DMCA) by removing copyright management information, and that it breached open-source licenses that require attribution, including the GPL, Apache, and MIT licenses.

Specific documented instances included Copilot reproducing the entire fast inverse square root function (Q_rsqrt) from the Quake III Arena source code, complete with original comments, and generating substantial portions of code from projects such as NumPy with identical variable names and structure. Security researchers demonstrated that carefully crafted prompts could induce Copilot to reproduce hundreds of lines of copyrighted code from well-known projects.

The Electronic Frontier Foundation and other digital rights organizations raised concerns about the broader implications for fair use in AI training. The lawsuit also exposed deep divisions within the open-source community: some developers viewed Copilot as beneficial for productivity and code discovery, while others argued that it fundamentally violated the social contract of open-source licensing.

GitHub and Microsoft defended Copilot under the fair use doctrine, arguing that its suggestions constituted transformative use and that most suggestions were not substantially similar to the training data. They implemented filters to reduce verbatim reproduction and added features to detect potential matches with public code, though critics argued that these measures were insufficient and reactive rather than preventive.

Root Cause

GitHub Copilot was trained on billions of lines of public code from repositories without filtering by license. The model memorized parts of that training data and could reproduce substantial portions of copyrighted code, including exact function implementations and comments, with no attribution or license information attached.

Mitigation Analysis

Provenance tracking systems could identify when suggestions match existing copyrighted code. License-aware filtering during training could exclude restrictively licensed code. Real-time code similarity detection could flag potential copyright violations before code is suggested to users.
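
Two of the mitigations above, license-aware filtering and similarity detection, can be illustrated with a minimal Python sketch. It is not how GitHub's filters actually work: the function names, the allowlist, the 5-token window, and the 0.8 threshold are all hypothetical choices made for illustration. A production system would need an indexed corpus rather than a linear scan.

```python
import re

# Hypothetical allowlist of SPDX license identifiers. Which licenses are
# acceptable is a policy decision; permissive licenses such as MIT and
# Apache-2.0 still require attribution, so filtering alone does not
# resolve the attribution problem.
PERMITTED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def filter_by_license(repos):
    """Keep only repositories whose declared license is on the allowlist."""
    return [r for r in repos if r.get("license") in PERMITTED_LICENSES]


def tokenize(code):
    """Split code into identifiers, digit runs, and single symbols.

    Whitespace is discarded, so reformatted copies still match.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(tokens, n=5):
    """Return the set of all n-token windows in the token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(suggestion, reference, n=5):
    """Fraction of the suggestion's n-token windows also found in the reference."""
    s = shingles(tokenize(suggestion), n)
    if not s:
        return 0.0  # suggestion is too short to compare
    r = shingles(tokenize(reference), n)
    return len(s & r) / len(s)


def flag_suggestion(suggestion, corpus, threshold=0.8, n=5):
    """Return the names of corpus entries that the suggestion largely duplicates."""
    return [
        name
        for name, source in corpus.items()
        if overlap_ratio(suggestion, source, n) >= threshold
    ]
```

Here `corpus` is assumed to be a dictionary mapping a source name to its text. A suggestion flagged by `flag_suggestion` could be suppressed or shown with the matching source and its license, which is one way to approach the provenance tracking described above.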

Lessons Learned

The incident highlighted fundamental tensions between AI training practices and intellectual property law, particularly in open-source software where licensing terms require attribution and compliance. It demonstrated the need for AI companies to develop sophisticated content filtering and attribution systems when training on licensed materials.

Sources

GitHub Copilot Litigation
Joseph Saveri Law Firm · Nov 3, 2022 · court filing
GitHub Copilot and the Problem of FOSS License Compliance
Electronic Frontier Foundation · Feb 16, 2022 · academic paper
GitHub's response to copyright concerns
GitHub · Dec 1, 2022 · company statement