Debates surrounding AI benchmarks and their reporting by AI laboratories have increasingly entered public discourse. Recently, an employee from OpenAI accused Elon Musk’s AI company, xAI, of disseminating misleading benchmark results for its latest model, Grok 3. However, Igor Babushkin, a co-founder of xAI, asserted that the company acted appropriately.
The truth appears to lie somewhere in the middle. On xAI’s blog, the company presented a graph illustrating Grok 3’s performance on AIME 2025, a set of challenging math questions taken from a recent invitational mathematics exam. Although some experts have questioned the validity of AIME as an AI benchmark, it remains widely used to assess a model’s mathematical capabilities.
The graph provided by xAI displayed two versions of Grok 3, namely Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI’s top available model, o3-mini-high, on AIME 2025. Nonetheless, OpenAI employees promptly noted that xAI’s graph did not account for o3-mini-high’s score at “cons@64.”
“Cons@64,” short for “consensus@64,” gives a model 64 attempts at each problem in a benchmark and takes the most frequently generated answer as the model’s final answer. This method tends to boost benchmark scores considerably, so omitting it from a graph can make one model appear to surpass another when, under matched conditions, it does not.
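For readers curious what this means in practice, here is a minimal Python sketch of the two scoring rules at issue: a single-attempt (“@1”) score and a consensus (majority-vote) score over k attempts. The function names and toy data are purely illustrative and are not taken from any lab’s actual evaluation code.

```python
from collections import Counter

def consensus_answer(attempts: list[str]) -> str:
    # Majority vote: the most frequently generated answer across all attempts.
    return Counter(attempts).most_common(1)[0][0]

def score_cons_at_k(all_attempts: list[list[str]], references: list[str]) -> float:
    # Fraction of problems where the majority-vote answer matches the reference.
    hits = sum(consensus_answer(a) == ref for a, ref in zip(all_attempts, references))
    return hits / len(references)

def score_at_1(all_attempts: list[list[str]], references: list[str]) -> float:
    # Fraction of problems where the model's first attempt matches the reference.
    hits = sum(a[0] == ref for a, ref in zip(all_attempts, references))
    return hits / len(references)

# Toy example: 3 problems, 5 attempts each (a real cons@64 run would use 64).
attempts = [
    ["70", "70", "68", "70", "70"],    # first try correct, consensus correct
    ["12", "19", "19", "19", "19"],    # first try wrong, consensus correct
    ["507", "501", "507", "500", "3"], # first try correct, consensus correct
]
references = ["70", "19", "507"]

print(score_at_1(attempts, references))      # 0.67
print(score_cons_at_k(attempts, references)) # 1.0
```

As the toy numbers show, the same model can look markedly stronger under consensus scoring than under single-attempt scoring, which is why mixing the two in one chart invites a misleading comparison.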
On AIME 2025 at “@1,” which counts only each model’s first attempt at each problem, both Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored below o3-mini-high. Grok 3 Reasoning Beta also trailed slightly behind OpenAI’s o1 model at its “medium” compute setting. Nevertheless, xAI is promoting Grok 3 as the “world’s smartest AI.”
Babushkin contended on social media that OpenAI has previously published similarly misleading benchmark charts, though those charts compared OpenAI’s own models against one another. A neutral observer put together a more “accurate” graph showing nearly every model’s performance at cons@64, highlighting the complexity of the debate.
AI researcher Nathan Lambert pointed to a crucial, yet often overlooked, metric: the computational and monetary cost each model incurred to reach its best score. That gap underscores how little most AI benchmarks reveal about models’ limitations, as well as their strengths.