A new research paper authored by teams from Cohere, Stanford, MIT, and Ai2 has accused LM Arena, the organization behind the well-known AI benchmarking site Chatbot Arena, of giving preferential treatment to leading tech companies and helping them artificially inflate their performance scores.
The analysis claims that LM Arena allowed selected industry leaders—including Meta, OpenAI, Google, and Amazon—to privately test multiple variations of their artificial intelligence models without requiring disclosure of all resulting scores. The researchers argue that this practice permitted those companies to reveal only their best-performing models, effectively gaming Chatbot Arena’s rankings and disadvantaging competitors who were unaware of or lacked access to similar opportunities.
Sara Hooker, Cohere’s VP of AI research and a co-author of the paper, described this selective private testing as “gamification.” She stated in an interview that only a handful of companies were told that extensive private model evaluation was available and permitted, and that those companies benefited disproportionately from the arrangement.
Initially developed as an academic research project at UC Berkeley in 2023, Chatbot Arena quickly rose in popularity within the AI sector. The benchmark functions by comparing responses from two competing AI models side-by-side, with users voting for preferred outputs in match-ups referred to as “battles.” The accumulated votes then inform each model’s standing on Chatbot Arena’s competitive leaderboard.
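To make the mechanism concrete, the sketch below shows one simple way pairwise votes can be turned into a ranking, using a basic Elo-style update, a common choice for this kind of pairwise preference data. It is an illustration only, not LM Arena’s actual scoring code; the model names and the K constant are made up for the example.

```python
# Simplified Elo-style update from pairwise "battle" votes.
# Illustrative only: not LM Arena's actual scoring implementation.

K = 4  # update step size; an arbitrary illustrative constant

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(ratings: dict, model_a: str, model_b: str, winner: str) -> None:
    """Apply one vote: winner is model_a, model_b, or 'tie'."""
    ra, rb = ratings[model_a], ratings[model_b]
    ea = expected_score(ra, rb)
    score_a = 1.0 if winner == model_a else 0.5 if winner == "tie" else 0.0
    ratings[model_a] = ra + K * (score_a - ea)
    ratings[model_b] = rb + K * ((1.0 - score_a) - (1.0 - ea))

# Example: three hypothetical models and a handful of votes
ratings = {"model_x": 1000.0, "model_y": 1000.0, "model_z": 1000.0}
for a, b, w in [("model_x", "model_y", "model_x"),
                ("model_y", "model_z", "tie"),
                ("model_x", "model_z", "model_x")]:
    update_ratings(ratings, a, b, w)

# Leaderboard standing falls out of the accumulated votes
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```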
The authors documented that during the three months preceding the launch of Meta’s Llama 4 model, Meta privately tested as many as 27 different model variants. When the model launched publicly, only one variant—ranking notably high on the leaderboard—had its score released, according to the researchers’ findings.
When confronted with these allegations, LM Arena co-founder and UC Berkeley professor Ion Stoica categorically dismissed the study’s conclusions. He argued that the researchers’ methodology and analysis were flawed, and emphasized LM Arena’s commitment to fair and open benchmarking processes. The organization also noted publicly that it does not limit how frequently companies can test their models, and it stated that discrepancies in test volumes alone do not inherently constitute unfair treatment.
Separately, Armand Joulin, a principal researcher at Google DeepMind, disputed specific figures in the study. He said Google had tested only a single model variant on LM Arena’s platform prior to its release, contradicting the paper’s claims. In response, Hooker acknowledged the possible misstatement and pledged a prompt correction.
According to the researchers, LM Arena’s favoritism extended beyond private testing: models from certain large companies also appeared disproportionately often in public battles, giving those companies an advantage in collecting interaction data they could use to refine performance. The study estimated that this extra exposure, and the data it yields, could boost a model’s performance significantly.
However, LM Arena contested these findings, publishing a response that called the researchers’ analysis incomplete and maintained that smaller AI developers also test extensively on the site. The organization also underscored the difficulty of verifying claims of unfairness, since the researchers relied on prompting models to identify their own developers in order to attribute them to companies.
The authors urged LM Arena to revise its current practices to promote transparency and fairness, suggesting specific adjustments such as capping and publicly disclosing the number of private tests each company may run, and adopting a sampling algorithm that gives every model roughly equal exposure in public battles.
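As a rough sketch of what a balanced sampling policy could look like, the example below always pairs the models that have appeared in the fewest public battles so far. This is a hypothetical illustration of the uniform-exposure idea, not the specific mechanism the paper proposes; the model names and selection rule are assumptions made for the example.

```python
import random

# Hypothetical exposure-balancing sampler: draw the next battle from the
# models with the fewest appearances so far. An illustration of the idea,
# not the adjustment the paper actually specifies.

def sample_battle(appearances: dict) -> tuple:
    """Pick two distinct models, favoring the least-exposed ones."""
    # Sort by appearance count, with a random tiebreak so equally exposed
    # models are chosen evenly over time.
    ranked = sorted(appearances, key=lambda m: (appearances[m], random.random()))
    model_a, model_b = ranked[0], ranked[1]
    appearances[model_a] += 1
    appearances[model_b] += 1
    return model_a, model_b

appearances = {"model_x": 0, "model_y": 0, "model_z": 0}
for _ in range(9):
    print(sample_battle(appearances))
print(appearances)  # exposure stays roughly uniform across models
```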
LM Arena expressed openness to refining its sampling algorithm but pushed back against disclosing private test scores, arguing that publishing scores for unreleased, unavailable models offers little benefit because the community cannot independently validate those results.
These allegations arrive shortly after a related controversy involving Meta, whose benchmark-leading, conversationally tuned Llama 4 variant was never publicly released. The version Meta did ship performed worse on the benchmark, prompting criticism from LM Arena itself over transparency.
With LM Arena recently announcing plans to transition from a research organization into a private company, concerns like those outlined by the authors have drawn increased attention to the broader role that privately managed benchmarking organizations play, and to the potential for conflict between accurate industry-wide assessment and corporate interests.