OpenAI Whisper Speech Recognition Model Hallucinated False Content Including Racial Slurs
Severity
Medium
OpenAI's Whisper speech-to-text model was found to hallucinate entire phrases, including racial slurs and violent content that were never spoken, affecting transcriptions used in hospitals and courts.
Category
Hallucination
Industry
Healthcare
Status
Reported
Date Occurred
Jan 1, 2024
Date Reported
Oct 14, 2024
Jurisdiction
US
AI Provider
OpenAI
Model
Whisper
Application Type
API integration
Harm Type
Reputational
Human Review in Place
No
Litigation Filed
No
Tags
legal, speech_recognition, transcription, healthcare, hallucination, profanity, racial_slurs
Full Description
In October 2024, the Associated Press reported on findings by researchers at Cornell University and other institutions revealing systematic hallucination issues in OpenAI's Whisper automatic speech recognition model. The study found that Whisper frequently generated entirely fabricated text, including racial slurs, violent language, and other harmful content that was never actually spoken in the source audio.
The hallucinations occurred most frequently when Whisper processed silent segments, unclear speech, or audio with background noise. Instead of indicating uncertainty or producing blank output, the model confidently generated plausible-sounding but completely false transcriptions. Researchers found that approximately 1% of audio samples contained hallucinated content, with some samples producing multiple fabricated phrases.
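This failure mode can be observed directly. Below is a minimal sketch, assuming the open-source `whisper` Python package and a placeholder audio path, that prints the per-segment metadata the model already exposes; non-empty text over a segment the model itself scores as probable non-speech is the hallucination signature described above:

```python
import whisper  # open-source openai-whisper package

# Model size is illustrative; the study found the problem across sizes.
model = whisper.load_model("base")

# "silent_clip.wav" is a placeholder path for a recording with long
# silent or noisy stretches, the conditions under which hallucinations
# were most frequent.
result = model.transcribe("silent_clip.wav")

# Whisper returns metadata alongside each transcribed segment. Non-empty
# text paired with a high no_speech_prob means the model "heard" little
# or nothing but produced a transcription anyway.
for seg in result["segments"]:
    print(
        f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] "
        f"no_speech_prob={seg['no_speech_prob']:.2f} "
        f"avg_logprob={seg['avg_logprob']:.2f} "
        f"text={seg['text']!r}"
    )
```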
The issue was particularly concerning because Whisper has been widely deployed in sensitive applications including hospitals for medical transcription and courts for legal proceedings. In healthcare settings, false transcriptions could potentially contaminate patient records with inappropriate or misleading content. In legal contexts, fabricated statements could affect case outcomes or create liability issues for court reporting services.
Researchers tested multiple versions of Whisper and found the hallucination problem persisted across different model sizes and configurations. The fabricated content often included coherent sentences that appeared realistic, making the errors difficult to detect without careful human review. Some hallucinations contained profanity, racial slurs, or references to violence that could be particularly damaging if included in official transcripts.
OpenAI acknowledged the issue and stated they were working on improvements, but did not provide a timeline for fixes. The company recommended users implement human review processes for critical applications, though many deployments lacked such safeguards. Industry experts called for better uncertainty quantification and content filtering to prevent similar issues in speech recognition systems used for sensitive applications.
Root Cause
Whisper's encoder-decoder model is trained to predict fluent text, so when it processed unclear or silent audio segments it generated plausible-sounding but entirely fabricated transcriptions rather than signaling uncertainty or producing blank output.
Mitigation Analysis
Implementation of confidence scoring thresholds, mandatory human review for critical applications like medical records and legal proceedings, and automated detection of potentially offensive content could have prevented harmful false transcriptions. Real-time uncertainty indicators and audio quality assessment would help flag problematic segments.
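To make the first two mitigations concrete, here is a minimal sketch of segment triage. This is not an OpenAI-recommended implementation: the thresholds, the `triage_segment` helper, and the tiny blocklist are illustrative assumptions, and a production system would tune the thresholds on domain audio and use a maintained moderation model instead of a wordlist.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative thresholds; a real deployment would tune these on labeled
# audio from its own domain (medical dictation, courtroom audio, etc.).
NO_SPEECH_THRESHOLD = 0.5     # segment is likely silence or noise
AVG_LOGPROB_THRESHOLD = -1.0  # decoder confidence was unusually low

# Tiny stand-in for an offensive-content filter.
BLOCKLIST = {"placeholder_slur", "placeholder_threat"}


@dataclass
class ReviewDecision:
    text: str
    needs_human_review: bool
    reasons: List[str] = field(default_factory=list)


def triage_segment(segment: dict) -> ReviewDecision:
    """Decide whether one Whisper segment can bypass human review.

    `segment` is a single entry from result["segments"] as returned by
    the open-source whisper package (keys include text, no_speech_prob,
    and avg_logprob).
    """
    reasons = []
    if segment["text"].strip() and segment["no_speech_prob"] > NO_SPEECH_THRESHOLD:
        reasons.append("text produced over probable non-speech audio")
    if segment["avg_logprob"] < AVG_LOGPROB_THRESHOLD:
        reasons.append("low decoder confidence")
    if any(term in segment["text"].lower() for term in BLOCKLIST):
        reasons.append("matched offensive-content filter")
    return ReviewDecision(segment["text"], bool(reasons), reasons)


# Example: a fabricated-looking segment gets routed to a reviewer.
decision = triage_segment(
    {"text": " Thank you for watching.", "no_speech_prob": 0.92, "avg_logprob": -1.4}
)
assert decision.needs_human_review
```

Segments flagged this way would be queued for a transcriptionist rather than written directly into a patient record or court transcript.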
Lessons Learned
AI systems used in high-stakes environments like healthcare and legal proceedings require robust uncertainty quantification and mandatory human oversight. Speech recognition models need better handling of ambiguous audio inputs to prevent confident but false outputs.
Sources
Researchers find OpenAI's Whisper tool creates fabricated text in medical transcripts
Associated Press · Oct 14, 2024 · news
Towards Robust Speech Representation Learning for Thousands of Languages
arXiv · Sep 26, 2024 · academic paper