A discrepancy between the benchmark scores OpenAI reported for its o3 AI model and independent evaluations of that model has raised questions about the company's transparency and testing practices.
When OpenAI introduced the o3 model in December, it claimed the model could correctly answer more than 25% of challenging mathematical questions in the FrontierMath benchmark suite. At the time, OpenAI Chief Research Officer Mark Chen highlighted the substantial performance gap, noting that other models on the market could answer fewer than 2% of the questions correctly.
However, recent independent benchmark results published by Epoch AI—the creators of FrontierMath—indicate a much lower performance by the publicly released version of the o3 model. According to Epoch’s evaluations, the publicly available o3 achieved a correct answer rate of about 10%, far below the initial score touted by OpenAI.
Epoch AI pointed out that the discrepancy could stem from differences in evaluation methods or test setups. Its latest benchmark used an updated version of FrontierMath with more problems than the version OpenAI originally used. Epoch also suggested that OpenAI may have run its internal tests with more computing power than the released model uses, producing results that are not representative of the public version.
Supporting this notion, the ARC Prize Foundation, which assessed an early pre-release variant of the o3 model, stated publicly that the final consumer-release model was indeed a different version—fine-tuned specifically for practical chatbot and product-oriented deployments rather than optimized exclusively for benchmark tests. According to ARC Prize, all publicly available o3 compute tiers utilize less computational power than the highly optimized internal version previously benchmarked by OpenAI.
The discrepancy does not necessarily imply deliberate dishonesty on OpenAI's part: the company's December disclosures included both an upper-bound and a lower-bound score, and the lower bound matches the figure Epoch has now reported. Moreover, OpenAI has since released models such as o3-mini-high and o4-mini, both of which reportedly outperform the original o3 on FrontierMath, and it plans to launch an enhanced version, o3-pro, in the near future.
Nevertheless, the differing results are a reminder to treat AI benchmark figures with caution, particularly when they are produced or promoted by a company with a commercial stake in the outcome. Benchmarking controversies have become increasingly common across the industry as fierce competition pushes firms to highlight the most favorable assessments of their models.
Earlier this year, Epoch itself drew criticism for waiting to disclose funding it had received from OpenAI until after the company announced o3. More recently, other high-profile industry players have faced similar scrutiny: Elon Musk's xAI was accused of publishing misleading benchmarks for its Grok 3 model, and Meta acknowledged that the benchmark scores it promoted came from a version of a model different from the one it released to developers.