A discrepancy between OpenAI's own benchmark results for its o3 AI model and independent evaluations has drawn scrutiny of the company's transparency and model-testing practices. When OpenAI announced o3 in December, it claimed the model could answer more than 25% of questions on FrontierMath, a challenging set of math problems, far ahead of the next-best competitor, which solved only about 2%.
Mark Chen, OpenAI's chief research officer, said during a livestream that while most current offerings score below 2% on FrontierMath, o3 exceeded 25% under aggressive test-time compute settings. That figure, however, likely represents an upper bound, achieved with a version of o3 backed by more computing power than the model OpenAI released publicly last week.
Epoch AI, the organization behind FrontierMath, has since released independent benchmark results for o3 showing a score of roughly 10%, well below OpenAI's headline claim. That said, OpenAI's originally published results did include a lower-bound score consistent with Epoch's findings, and differences in the testing setup and the updated version of the FrontierMath problem set Epoch used could account for the gap.
The ARC Prize Foundation, which tested a pre-release version of o3, has said the publicly available model is a different one, tuned for chat and product use. Wenda Zhou, a member of OpenAI's technical staff, backed this up in a recent livestream, explaining that the production version of o3 is optimized for real-world applications and speed, which could explain the benchmark differences.
While the publicly released o3 falls short of OpenAI's original testing claims, the company's o3-mini-high and o4-mini models reportedly outperform it on FrontierMath, and OpenAI plans to introduce a more powerful variant, o3-pro, soon.
The episode illustrates a broader point: AI benchmarks, particularly those published by the companies selling the models, should be read with caution. The industry has seen similar controversies before, including Epoch's delay in disclosing its funding from OpenAI until after the o3 announcement, and accusations that Elon Musk's xAI and Meta made misleading benchmark claims for their respective models.
Ultimately, the case underscores the need for transparency and consistency in AI benchmarking as companies race for market leadership.