GitHub Copilot Code Generation Reproduces Copyrighted Code Verbatim
Severity
High
GitHub Copilot was found to reproduce copyrighted code verbatim from its training data, leading to a class action lawsuit alleging copyright infringement by the AI coding assistant.
Category
Copyright Violation
Industry
Technology
Status
Litigation Pending
Date Occurred
Jan 1, 2022
Date Reported
Nov 3, 2022
Jurisdiction
US
AI Provider
GitHub / OpenAI
Model
GitHub Copilot
Application Type
copilot
Harm Type
legal
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
pending
copyright · code_generation · fair_use · training_data · intellectual_property · class_action · developer_tools
Full Description
GitHub Copilot, launched in June 2021 as an AI-powered coding assistant developed by GitHub in partnership with OpenAI, faced significant legal challenges when researchers and developers discovered it could reproduce substantial portions of copyrighted code from its training dataset. The tool, built on OpenAI's Codex model, was trained on billions of lines of public source code from GitHub repositories, including code under various licenses with different terms and restrictions.
In late 2021 and early 2022, multiple instances emerged where GitHub Copilot generated code suggestions that were nearly identical to existing copyrighted works, including complete functions with original comments, variable names, and in some cases even license headers. Notably, researchers found that Copilot could reproduce the fast inverse square root algorithm from Quake III Arena, complete with the original comments, and sections of code from various open-source projects protected under GPL and other copyleft licenses.
On November 3, 2022, programmer and lawyer Matthew Butterick, along with the Joseph Saveri Law Firm, filed a class action lawsuit against GitHub, Microsoft, and OpenAI in the U.S. District Court for the Northern District of California. The lawsuit, Doe v. GitHub Inc. et al., alleges that the defendants violated the Digital Millennium Copyright Act (DMCA), breached open-source licenses, and engaged in unfair competition by using copyrighted code without permission and without preserving required copyright notices and license terms.
The plaintiffs argue that GitHub Copilot effectively launders copyrighted code by stripping away license requirements and attribution, potentially exposing users to copyright infringement liability. The lawsuit seeks damages and injunctive relief to prevent further copyright violations. GitHub and Microsoft have defended their position, arguing that training AI models on publicly available code constitutes fair use and that Copilot generates original code inspired by patterns rather than copying existing works. The case remains pending and could establish important precedents for AI training on copyrighted materials and the liability of AI service providers for copyright infringement by their models.
Root Cause
GitHub Copilot's training dataset included copyrighted source code from public repositories, and the model learned to reproduce substantial portions of this code verbatim rather than generating original code inspired by patterns.
Mitigation Analysis
Code provenance tracking to identify when suggestions match existing copyrighted works, filtering mechanisms to exclude verbatim reproduction of substantial code blocks, and human review workflows requiring developers to verify the originality of AI-generated code could have reduced the risk of copyright violations.
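The verbatim-reproduction filter described above could take many forms. One minimal sketch is fingerprinting fixed-length token runs from known licensed code and flagging any suggestion that reproduces such a run. All names and the 30-token threshold here are illustrative assumptions, not GitHub's actual implementation:

```python
# Hypothetical sketch of a verbatim-reproduction filter.
# The function names and the NGRAM_TOKENS threshold are assumptions
# for illustration; they do not describe GitHub's real system.
import hashlib

NGRAM_TOKENS = 30  # flag suggestions sharing a 30-token run with the corpus


def _token_ngrams(code: str, n: int = NGRAM_TOKENS):
    """Yield a hash fingerprint for every n-token window in the code."""
    tokens = code.split()
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield hashlib.sha256(window.encode()).hexdigest()


def build_corpus_index(licensed_files):
    """Index fingerprints of all token windows in known licensed code."""
    index = set()
    for text in licensed_files:
        index.update(_token_ngrams(text))
    return index


def is_verbatim_match(suggestion: str, index) -> bool:
    """True if the suggestion reproduces a substantial run from the corpus."""
    return any(h in index for h in _token_ngrams(suggestion))
```

A production system would need license-aware tokenization and normalization (whitespace, identifier renaming) to avoid trivial evasion, but even this window-hashing approach would have caught cases like the verbatim fast inverse square root reproduction, where comments and variable names matched exactly.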
Lessons Learned
The case highlights critical questions about fair use in AI training, the need for robust filtering mechanisms to prevent verbatim reproduction of copyrighted materials, and the importance of clear legal frameworks governing AI-generated content that may infringe on existing intellectual property rights.
Sources
GitHub Copilot hit with copyright lawsuit from programmer claiming AI stole his code
The Verge · Nov 8, 2022 · news
GitHub Copilot Litigation - Class Action Lawsuit
Joseph Saveri Law Firm · Nov 3, 2022 · court filing