GitHub Copilot Code Generation Reproduces Copyrighted Code Verbatim
Severity
High
GitHub Copilot was found to reproduce copyrighted code verbatim from its training data, leading to a class action lawsuit alleging copyright infringement by the AI coding assistant.
Category
Copyright Violation
Industry
Technology
Status
Litigation Pending
Date Occurred
Jan 1, 2022
Date Reported
Nov 3, 2022
Jurisdiction
US
AI Provider
GitHub / OpenAI
Model
GitHub Copilot
Application Type
copilot
Harm Type
legal
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
pending
copyright · code_generation · fair_use · training_data · intellectual_property · class_action · developer_tools
Full Description
GitHub Copilot, launched in June 2021 as an AI-powered coding assistant developed by GitHub in partnership with OpenAI, faced significant legal challenges when researchers and developers discovered it could reproduce substantial portions of copyrighted code from its training dataset. The tool, built on OpenAI's Codex model, was trained on billions of lines of public source code from GitHub repositories, including code under various licenses with different terms and restrictions.
In late 2021 and early 2022, multiple instances emerged where GitHub Copilot generated code suggestions that were nearly identical to existing copyrighted works, including complete functions with original comments, variable names, and in some cases even license headers. Notably, researchers found that Copilot could reproduce the fast inverse square root algorithm from Quake III Arena, complete with the original comments, and sections of code from various open-source projects protected under GPL and other copyleft licenses.
On November 3, 2022, programmer and lawyer Matthew Butterick, along with the Joseph Saveri Law Firm, filed a class action lawsuit against GitHub, Microsoft, and OpenAI in the U.S. District Court for the Northern District of California. The lawsuit, Doe v. GitHub Inc. et al., alleges that the defendants violated the Digital Millennium Copyright Act (DMCA), breached open-source licenses, and engaged in unfair competition by using copyrighted code without permission and without preserving required copyright notices and license terms.
The plaintiffs argue that GitHub Copilot effectively launders copyrighted code by stripping away license requirements and attribution, potentially exposing users to copyright infringement liability. The lawsuit seeks damages and injunctive relief to prevent further copyright violations. GitHub and Microsoft have defended their position, arguing that training AI models on publicly available code constitutes fair use and that Copilot generates original code inspired by patterns rather than copying existing works. The case remains pending and could establish important precedents for AI training on copyrighted materials and the liability of AI service providers for copyright infringement by their models.
Root Cause
GitHub Copilot's training dataset included copyrighted source code from public repositories, and the model learned to reproduce substantial portions of this code verbatim rather than generating original code inspired by patterns.
Mitigation Analysis
Code provenance tracking to identify when suggestions match existing copyrighted works, filtering mechanisms to exclude verbatim reproduction of substantial code blocks, and human review workflows requiring developers to verify the originality of AI-generated code could have reduced the risk of copyright violations.
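The verbatim-reproduction filter described above could take many forms. One minimal sketch is fingerprinting fixed-length token runs from known licensed code and flagging any suggestion that reproduces such a run. All names and the 30-token threshold here are illustrative assumptions, not GitHub's actual implementation:

```python
# Hypothetical sketch of a verbatim-reproduction filter.
# The function names and the NGRAM_TOKENS threshold are assumptions
# for illustration; they do not describe GitHub's real system.
import hashlib

NGRAM_TOKENS = 30  # flag suggestions sharing a 30-token run with the corpus


def _token_ngrams(code: str, n: int = NGRAM_TOKENS):
    """Yield a hash fingerprint for every n-token window in the code."""
    tokens = code.split()
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield hashlib.sha256(window.encode()).hexdigest()


def build_corpus_index(licensed_files):
    """Index fingerprints of all token windows in known licensed code."""
    index = set()
    for text in licensed_files:
        index.update(_token_ngrams(text))
    return index


def is_verbatim_match(suggestion: str, index) -> bool:
    """True if the suggestion reproduces a substantial run from the corpus."""
    return any(h in index for h in _token_ngrams(suggestion))
```

A production system would need license-aware tokenization and normalization (whitespace, identifier renaming) to avoid trivial evasion, but even this window-hashing approach would have caught cases like the verbatim fast inverse square root reproduction, where comments and variable names matched exactly.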
Lessons Learned
The case highlights critical questions about fair use in AI training, the need for robust filtering mechanisms to prevent verbatim reproduction of copyrighted materials, and the importance of clear legal frameworks governing AI-generated content that may infringe on existing intellectual property rights.
Sources
GitHub Copilot hit with copyright lawsuit from programmer claiming AI stole his code
The Verge · Nov 8, 2022 · news
GitHub Copilot Litigation - Class Action Lawsuit
Joseph Saveri Law Firm · Nov 3, 2022 · court filing