OpenAI Whisper Hallucinations in Medical Transcriptions Pose Patient Safety Risks

Severity
High

University of Michigan researchers found that OpenAI's Whisper speech recognition tool hallucinated in roughly 1% of transcriptions, fabricating medical content in healthcare settings where it was used for clinical documentation.

Category
Hallucination
Industry
Healthcare
Status
Reported
Date Occurred
Jan 1, 2024
Date Reported
Oct 13, 2024
Jurisdiction
US
AI Provider
OpenAI
Model
Whisper
Application Type
API integration
Harm Type
Operational
Human Review in Place
No
Litigation Filed
No
Tags
medical_ai, speech_recognition, clinical_documentation, patient_safety, transcription_accuracy

Full Description

In October 2024, researchers from the University of Michigan published findings showing that OpenAI's Whisper speech recognition tool systematically generated hallucinated content in approximately 1% of audio transcriptions. The study, led by computer scientist Allison Koenecke, analyzed over 13,000 audio samples and found that Whisper fabricated entire phrases, sentences, and sometimes whole paragraphs when faced with unclear or silent audio, rather than indicating uncertainty or transcription failure.

The hallucinations were particularly concerning because they often included violent imagery, racial commentary, and fabricated medical terminology. In healthcare settings, where multiple companies had deployed Whisper for clinical documentation and patient note transcription, these hallucinations posed direct patient safety risks: plausible-sounding fabricated content could influence treatment decisions, medication prescriptions, or diagnostic assessments if clinicians relied on the transcribed text without verification.

The researchers identified specific patterns, noting that the model was more likely to fabricate content when processing audio with background noise, multiple speakers, or technical terminology. Because the hallucinated text was typically grammatically correct and contextually plausible, users had difficulty recognizing it as artificially generated, a problem that is especially dangerous in medical contexts where documentation accuracy is critical for patient care and legal compliance.

OpenAI's response was limited: the company acknowledged the research but did not implement immediate safeguards or warnings for medical applications. Its documentation included general warnings about potential inaccuracies, but it offered no specific guidance for high-risk applications like healthcare.

Several companies that had integrated Whisper into medical workflow systems were forced to reassess their quality control procedures and add verification steps. The incident highlighted broader concerns about deploying AI transcription tools in critical applications without adequate safeguards. Medical professionals and healthcare technology vendors began implementing additional review processes, but Whisper's widespread deployment in healthcare meant that potentially inaccurate transcriptions had already entered patient records across numerous health systems.
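One way to surface these failure modes is to inspect the per-segment confidence metadata that the open-source `whisper` Python package already returns. The sketch below flags segments where the model emitted text over likely non-speech, or reported low confidence in its own output; the threshold values are illustrative assumptions, not figures from the study.

```python
# Sketch: flagging Whisper segments that warrant human review, using the
# open-source `openai-whisper` package (pip install openai-whisper).
# Threshold values below are illustrative assumptions, not from the study.
import whisper

model = whisper.load_model("base")
result = model.transcribe("visit_recording.wav")  # hypothetical file name

NO_SPEECH_THRESHOLD = 0.5   # assumed cutoff: segment is probably silence/noise
LOGPROB_THRESHOLD = -1.0    # assumed cutoff: model was unsure of its output

for seg in result["segments"]:
    risky = (
        seg["no_speech_prob"] > NO_SPEECH_THRESHOLD  # text emitted over likely non-speech
        or seg["avg_logprob"] < LOGPROB_THRESHOLD    # low token-level confidence
    )
    if risky:
        print(f"[REVIEW] {seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']!r}")
```

A flag like this does not prove a segment was hallucinated, but silence paired with confidently emitted text is exactly the pattern the researchers described, so routing such segments to a human reviewer is a cheap first line of defense.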

Root Cause

Whisper's speech recognition model exhibited systematic hallucination patterns when processing unclear or silent audio, generating plausible-sounding but entirely fabricated content, including violent imagery and medical terminology, instead of signaling transcription uncertainty.

Mitigation Analysis

Mandatory human review of all AI transcriptions, confidence scoring with thresholds for flagging uncertain segments, and audio quality validation before processing could prevent fabricated medical content. Real-time uncertainty detection and explicit marking of low-confidence transcriptions would alert clinicians to potential inaccuracies.
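As a concrete illustration of the "audio quality validation before processing" step, here is a minimal sketch that rejects near-silent recordings before they reach the transcription model. It assumes `numpy` and `soundfile`, and the RMS floor is a hypothetical value; a real deployment would tune it against its own recording conditions.

```python
# Sketch: rejecting low-quality audio before transcription, one possible
# form of pre-processing audio quality validation. The RMS floor is a
# hypothetical assumption, not a validated clinical threshold.
import numpy as np
import soundfile as sf

MIN_RMS_DBFS = -45.0  # assumed floor: quieter audio is mostly silence

def audio_acceptable(path: str) -> bool:
    samples, _rate = sf.read(path, dtype="float32")
    if samples.ndim > 1:              # mix multichannel audio down to mono
        samples = samples.mean(axis=1)
    rms = float(np.sqrt(np.mean(samples ** 2)))
    rms_dbfs = 20 * np.log10(max(rms, 1e-10))  # guard against log(0)
    return rms_dbfs > MIN_RMS_DBFS

if not audio_acceptable("visit_recording.wav"):  # hypothetical file name
    raise ValueError("Audio too quiet or silent; route to manual transcription")
```

Screening out silent or near-silent input targets the specific trigger identified in the research: Whisper hallucinating over audio that contains little or no actual speech.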

Lessons Learned

AI transcription tools require specialized validation and human oversight in high-stakes applications like healthcare, where fabricated content can directly impact patient safety and care quality. The incident demonstrates the need for confidence scoring, uncertainty detection, and mandatory review processes in medical AI deployments.
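To make the "mandatory review" lesson concrete, one hedged sketch of a record type that blocks unreviewed AI output from entering the chart; all names here are hypothetical, chosen only to illustrate the gating pattern.

```python
# Sketch: a draft-transcript type that cannot be filed to the patient
# record until a clinician signs off. All names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DraftTranscript:
    text: str
    flagged_segments: list[str]          # segments flagged by confidence checks
    reviewed_by: Optional[str] = None    # clinician who signed off, if any

    def finalize(self) -> str:
        # Refuse to release the text until a human reviewer is recorded.
        if self.reviewed_by is None:
            raise PermissionError("Transcript requires clinician review before filing")
        return self.text
```

Enforcing the sign-off in the data model itself, rather than in workflow documentation, means an integration cannot quietly skip the human-review step the incident showed to be necessary.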