OpenAI recently introduced its o3 and o4-mini AI models, which are state-of-the-art in many respects. However, the new models hallucinate, that is, fabricate information, more often than several of OpenAI’s earlier models.
Hallucinations remain one of the most stubborn problems in AI, affecting even today’s most capable systems. Historically, each new model has hallucinated somewhat less than its predecessor, but that trend does not appear to hold for o3 and o4-mini.
Internal testing by OpenAI shows that o3 and o4-mini, both reasoning models, hallucinate more frequently than the company’s previous reasoning models, o1, o1-mini, and o3-mini, as well as its traditional, non-reasoning models such as GPT-4o.
OpenAI does not fully understand why this is happening. Its technical report says more research is needed to determine why hallucinations are getting worse as reasoning models are scaled up. The o3 and o4-mini models perform better in certain areas, including coding and math, but because they make more claims overall, they also produce more inaccurate, hallucinated claims alongside the accurate ones, according to the report.
OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That is roughly double the rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. o4-mini did even worse, hallucinating on 48% of PersonQA questions.
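For readers wondering what these percentages represent: a hallucination rate is simply the share of benchmark questions whose answer is judged to contain at least one fabricated claim. The Python sketch below illustrates that arithmetic only; PersonQA is not publicly available, so the graded answers here are a hypothetical stand-in rather than OpenAI’s actual evaluation pipeline.

```python
# Minimal sketch of the arithmetic behind a benchmark hallucination rate:
# the fraction of answers judged to contain at least one fabricated claim.
# The graded data below is hypothetical; PersonQA and OpenAI's grading
# pipeline are not public.

def hallucination_rate(graded_answers: list[bool]) -> float:
    """graded_answers[i] is True if answer i was judged to hallucinate."""
    if not graded_answers:
        return 0.0
    return sum(graded_answers) / len(graded_answers)

# Example: 33 hallucinated answers out of 100 questions -> 0.33, i.e. 33%.
print(hallucination_rate([True] * 33 + [False] * 67))
```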
Third-party testing by Transluce, a nonprofit AI research lab, similarly found that o3 tends to fabricate the actions it took while arriving at an answer. In one instance, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the results into its answer, something o3 cannot actually do.
Neil Chowdhury, a Transluce researcher and former OpenAI employee, suggested in an email to TechCrunch that the kind of reinforcement learning used for the o-series models may amplify issues that are usually mitigated, but not fully eliminated, by standard post-training pipelines.
Sarah Schwettmann, co-founder of Transluce, noted that o3’s hallucination rate may limit its usefulness. Meanwhile, Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, said his team is already testing o3 in its coding workflows and has found it superior to the competition, though it sometimes produces broken website links.
While hallucinations can help models arrive at creative ideas, they make the models a harder sell in fields where accuracy is paramount, such as law, where firms demand precision in their documents.
One promising way to improve accuracy is to give models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another of the company’s accuracy benchmarks. Search could potentially reduce hallucination rates in reasoning models as well, at least in cases where users are willing to expose their prompts to a third-party search provider.
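To make the grounding idea concrete, here is a minimal sketch of the common retrieve-then-answer pattern: fetch search results first, then ask the model to answer strictly from them. The search_web helper is a hypothetical placeholder for a third-party search service, and this is a generic illustration rather than OpenAI’s own search-enabled GPT-4o or SimpleQA setup.

```python
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> list[str]:
    """Hypothetical helper: return text snippets from a third-party search service."""
    raise NotImplementedError("plug in your search provider here")

def grounded_answer(question: str) -> str:
    # Retrieve supporting snippets, then constrain the model to answer from them,
    # which leaves less room for fabricated details.
    snippets = search_web(question)
    context = "\n\n".join(snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided search results. "
                           "If they do not contain the answer, say you don't know.",
            },
            {
                "role": "user",
                "content": f"Search results:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
```

The trade-off noted above applies here too: every question is sent to the external search service before the model ever sees it.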
If scaling up reasoning models continues to worsen hallucinations, finding a solution will become all the more urgent. OpenAI spokesperson Niko Felix told TechCrunch that addressing hallucinations across all of its models is an ongoing area of research and that the company is continually working to improve their accuracy and reliability.
Over the past year, the broader AI industry has pivoted toward reasoning models after techniques for improving traditional models began showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of compute and data during training. Yet it appears reasoning may also lead to more hallucinations, presenting a challenge.