In December, OpenAI unveiled its o3 “reasoning” AI model and partnered with the creators of the ARC-AGI benchmark to showcase the model’s capabilities. Months later, the results have been revised, and the model’s performance now looks less impressive than initially reported.
The Arc Prize Foundation, which maintains and administers ARC-AGI, recently updated its estimated computing costs for o3. It originally estimated that the best-performing configuration it tested, o3 high, cost around $3,000 to solve a single ARC-AGI problem. The foundation now believes the cost is much higher, possibly around $30,000 per task.
The revision illustrates how expensive today’s most advanced AI models may be for certain tasks, at least early on. OpenAI has not yet released or priced o3, but the Arc Prize Foundation believes the pricing of OpenAI’s o1-pro, the company’s most expensive model to date, is a reasonable proxy.
Mike Knoop, co-founder of the Arc Prize Foundation, told TechCrunch that o1-pro is likely a closer comparison to o3’s true cost, given the amount of test-time compute o3 uses. Because this remains an estimate, the foundation will continue to label o3 as a preview on its leaderboard to reflect the uncertainty until official pricing is announced.
The higher estimate for o3 high is plausible given the large amount of computing resources the model reportedly consumes. According to the Arc Prize Foundation, o3 high used 172 times more compute than o3 low, the lowest-compute configuration of o3, to tackle ARC-AGI tasks.
Rumors have also circulated for some time about pricey plans OpenAI may introduce for enterprise customers. In early March, reports suggested the company might charge as much as $20,000 per month for specialized AI agents, such as a software developer agent.
Some might argue that even OpenAI’s priciest models will cost less than a typical human contractor or employee. However, as AI researcher Toby Ord pointed out, the models may not be as efficient as that comparison implies. For example, o3 high needed 1,024 attempts at each ARC-AGI task to achieve its best score.