OpenAI, a leading technology company, has promoted its AI-driven transcription tool, Whisper, as approaching human-level robustness and accuracy. However, interviews with more than a dozen software engineers, developers, and academic researchers point to a significant flaw: Whisper tends to fabricate chunks of text or even entire sentences. These fabrications, known in the industry as “hallucinations,” can include racial commentary, violent rhetoric, or imaginary medical treatments.
Experts say these fabrications are a problem because Whisper is widely used across industries to translate and transcribe interviews, generate text for consumer technology, and create video subtitles. Of particular concern is the rapid adoption of Whisper-based tools by healthcare facilities to transcribe doctors’ consultations with patients. This practice persists despite OpenAI’s warnings against using the tool in “high-risk domains.”
Although the full extent of the problem is hard to quantify, researchers and engineers say they frequently encounter hallucinations in their work with Whisper. A University of Michigan researcher studying public meetings reported finding hallucinations in 8 out of every 10 audio transcriptions he examined before he began trying to improve the model. A machine learning engineer found hallucinations in about half of the more than 100 hours of Whisper transcriptions he analyzed, and another developer found them in nearly all of the 26,000 transcripts he produced using Whisper.
The problems persist even in short, high-quality audio recordings. A recent study by computer scientists found 187 hallucinations in more than 13,000 clear audio snippets. At that rate, researchers warn, millions of recordings would yield tens of thousands of erroneous transcriptions. Such errors could have dire consequences, particularly in healthcare settings, according to Alondra Nelson, who served as the head of the White House Office of Science and Technology Policy under the Biden administration.
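The researchers' projection can be checked with simple arithmetic. The sketch below is only illustrative: it assumes exactly 13,000 snippets (the study reports "over 13,000", so the true rate is slightly lower), at most one hallucination counted per snippet, and a rate that carries over unchanged to other recordings. The function names are hypothetical and do not come from the study.

```python
def hallucination_rate(hallucinations: int, snippets: int) -> float:
    """Share of snippets containing a hallucination (assumes one per affected snippet)."""
    return hallucinations / snippets


def projected_errors(rate: float, recordings: int) -> int:
    """Expected number of faulty transcriptions if the same rate held at scale."""
    return round(rate * recordings)


# Figures reported in the study; 13,000 is used as a stand-in for "over 13,000".
rate = hallucination_rate(187, 13_000)
print(f"Observed rate: {rate:.2%}")

# Illustrative volumes only; the article says "millions of recordings".
for recordings in (1_000_000, 5_000_000):
    print(f"{recordings:,} recordings -> about {projected_errors(rate, recordings):,} faulty transcriptions")
```

Under these assumptions the rate is roughly 1.4%, which works out to about 14,000 faulty transcriptions per million recordings, consistent with the "tens of thousands" figure once volumes reach several million.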
Whisper is also used to create closed captioning for people who are Deaf or hard of hearing. That application puts this population at particular risk because the inaccuracies may be undetectable to them, noted Christian Vogler, a director at Gallaudet University’s Technology Access Program.
In light of these issues, experts, advocates, and former OpenAI staff have suggested that regulatory measures for AI might be necessary, and they urge OpenAI to address the problem. William Saunders, a research engineer who left OpenAI over concerns about the company’s direction, believes it is critical to address these flaws to avoid misplaced confidence in the tool’s reliability.
An OpenAI spokesperson said the company is actively researching ways to reduce hallucinations and values input from researchers, adding that it incorporates such feedback into model updates. Experts expect transcription tools to make errors such as misspellings, but they say they have never encountered another AI-powered transcription tool that hallucinates as much as Whisper.
Whisper is built into some versions of OpenAI’s flagship chatbot, ChatGPT, and is offered in Oracle’s and Microsoft’s cloud services, which are used by thousands of companies worldwide. It is used to transcribe speech and translate it into various languages, and it remains highly popular on platforms such as Hugging Face. Machine-learning engineer Sanchit Gandhi described it as a widely used open-source speech recognition model, built into applications ranging from call centers to voice assistants.
Researchers Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia analyzed thousands of short snippets from the TalkBank repository, hosted by Carnegie Mellon University. They found that almost 40% of Whisper’s hallucinations were concerning because the speaker could be misinterpreted. In some instances, Whisper invented violent actions, racial commentary, or non-existent medications.
The precise cause of these hallucinations remains unclear, but software developers say they tend to occur amid pauses, background sounds, or music. OpenAI advises against using Whisper in settings where inaccuracies could have damaging outcomes. Despite these cautions, some hospitals and medical facilities continue to use Whisper to transcribe doctor-patient interactions and reduce the time clinicians spend on documentation.
More than 30,000 clinicians and 40 health systems, including Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have adopted a Whisper-based tool developed by Nabla. The tool is designed for medical language transcription, and Nabla says it is aware of potential hallucinations and is working on mitigation strategies. However, AI-generated transcripts cannot be compared with the original recordings, because Nabla’s tool deletes the audio for data security reasons. Saunders warns that erasing the original audio makes errors harder to catch, since transcripts can no longer be checked against the recording.
Privacy concerns also arise when AI is used to transcribe medical appointments. A California state legislator, Rebecca Bauer-Kahan, declined to allow audio of her child’s consultation to be shared with outside vendors, including Microsoft Azure’s cloud services. A spokesman for John Muir Health, Ben Drew, responded that the health system complies with state and federal privacy laws.