OpenAI Faces Class Action Lawsuit for Training Models on Private Medical Records Without Consent

High

A 2023 class action lawsuit alleged OpenAI trained its language models on private medical records and therapy notes scraped from the internet without patient consent. The case highlights significant privacy risks in AI training data practices within healthcare contexts.

Category
Privacy Leak
Industry
Healthcare
Status
Litigation Pending
Date Occurred
Jan 1, 2023
Date Reported
Jun 28, 2023
Jurisdiction
US
AI Provider
OpenAI
Model
GPT-3.5
Application Type
API integration
Harm Type
privacy
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
Pending
medical_records, privacy_violation, HIPAA, training_data, class_action, healthcare_privacy, consent, data_scraping

Full Description

In June 2023, a class action lawsuit was filed against OpenAI in federal court, alleging the company violated privacy laws by training its artificial intelligence models on private medical records, therapy session notes, and other sensitive health information without obtaining proper consent from patients. The lawsuit claimed that OpenAI's data collection practices included scraping protected health information from various internet sources, including medical websites, patient forums, and healthcare platforms.

The plaintiffs argued that OpenAI's training methodology violated the Health Insurance Portability and Accountability Act (HIPAA) and various state privacy laws by using individually identifiable health information without authorization. The complaint alleged that the company failed to implement adequate safeguards to identify and exclude protected health information from its massive training datasets, which were used to develop models including GPT-3.5 and potentially GPT-4.

The lawsuit highlighted concerns about the permanence of this privacy violation, noting that once private medical information is incorporated into an AI model's training data, it becomes extremely difficult or impossible to remove. This creates ongoing risks of sensitive health information being inadvertently exposed through model outputs or being exploited by malicious actors who might attempt to extract training data through sophisticated prompting techniques.

OpenAI faced significant reputational damage from the allegations, as the case drew attention to broader issues of consent and transparency in AI training practices. The lawsuit sought damages for affected individuals and injunctive relief requiring OpenAI to implement stronger privacy protections and data governance practices. The case also raised questions about the responsibility of AI companies to verify the legal status of training data, particularly when dealing with sensitive categories of information like health records.
The litigation underscored the growing tension between AI companies' need for vast amounts of training data and individuals' privacy rights, particularly in sensitive domains like healthcare. Legal experts noted that the case could set important precedents for how courts interpret existing privacy laws in the context of AI training practices and whether current regulations are adequate to protect individuals from unauthorized use of their personal information in machine learning systems.

Root Cause

OpenAI allegedly scraped and used private medical data from internet sources for model training without implementing adequate safeguards to identify and exclude protected health information.

Mitigation Analysis

Comprehensive data provenance tracking and automated detection systems could have identified protected health information before inclusion in training datasets. Implementing strict data governance policies with healthcare-specific privacy review processes and obtaining explicit consent for any medical data usage would have prevented this violation. Regular auditing of training data sources and compliance with healthcare privacy regulations like HIPAA should be mandatory for AI companies processing potentially sensitive information.
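As an illustration of the "automated detection" step described above, the sketch below shows a minimal pre-ingestion PHI screen for a training-data pipeline. The pattern names, regexes, and function names are hypothetical, not taken from any actual OpenAI pipeline; a production system would rely on trained de-identification / NER models and human review rather than regexes alone.

```python
import re

# Hypothetical indicator patterns for screening candidate training documents.
# Illustrative only: real PHI detection requires trained de-identification
# models and expert review, not a handful of regexes.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "clinical_context": re.compile(
        r"\b(diagnosed with|therapy notes?|patient presents)\b", re.IGNORECASE
    ),
}

def screen_document(text: str) -> list[str]:
    """Return the names of PHI indicators found in a candidate document."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]

def filter_corpus(docs):
    """Yield only documents with no PHI indicators.

    Flagged documents are excluded here; in practice they would be routed
    to a quarantine queue with provenance metadata for human privacy review.
    """
    for doc in docs:
        if screen_document(doc):
            continue  # exclude and log, rather than silently ingest
        yield doc
```

The key design point is that screening happens before ingestion, while exclusion is still possible; as the incident description notes, removal after training is effectively infeasible.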

Lessons Learned

The case demonstrates that AI companies must implement robust data governance frameworks specifically designed to identify and exclude protected categories of information from training datasets. It also highlights the need for clearer regulatory guidance on consent requirements for AI training data, particularly in sensitive domains like healthcare.