AI research labs are increasingly turning to crowdsourced benchmarking platforms, such as Chatbot Arena, to assess the performance of their latest models. However, several experts warn that relying heavily on crowdsourced benchmarks poses significant ethical and methodological challenges.
In recent years, prominent labs—including OpenAI, Google, and Meta—have incorporated crowdsourced evaluations into their testing strategies, asking volunteers to probe new models’ capabilities. These labs then frequently cite strong benchmark results as evidence of technological progress.
But linguistics professor Emily Bender of the University of Washington argues this method is fundamentally flawed. Bender, co-author of the book “The AI Con,” emphasized that for a benchmark to be credible, it must clearly define what exactly is being measured. In her view, crowdsourced platforms like Chatbot Arena have not demonstrated that participants’ choices meaningfully reflect genuine or well-defined preferences.
Asmelash Teka Hadgu, co-founder of AI company Lesan and a fellow at the Distributed AI Research Institute, contends that platforms like Chatbot Arena risk being manipulated by AI labs to inflate performance claims. He pointed specifically to a recent controversy involving Meta’s Llama 4 Maverick model: after fine-tuning one version of Maverick to score highly on Chatbot Arena, Meta declined to release that optimized model publicly, shipping instead a variant that performed worse on the benchmark. According to Hadgu, effective benchmarks should be dynamic rather than static, run by multiple independent organizations, and tailored to concrete use cases in sectors such as healthcare and education. He also argues that evaluators who contribute their labor should be fairly compensated.
Kristine Gloria, formerly of the Aspen Institute’s Emergent and Intelligent Technologies Initiative, agreed, underscoring the need to learn from the data-labeling industry’s troubled history, in which labor exploitation raised major ethical concerns. Gloria sees crowdsourcing as a positive way to gather diverse and valuable feedback, but insists it cannot stand alone as the definitive evaluation method given how rapidly the technology evolves.
Matt Frederikson, CEO of Gray Swan AI—a firm that runs crowdsourced AI “red-teaming” for security and robustness testing—said his volunteers participate for a range of reasons, often out of personal interest or to build specific technical skills, and Gray Swan also awards cash prizes as incentives. Still, Frederikson noted that while crowdsourced platforms provide considerable value, model developers also need rigorous internal assessments and paid, specialized evaluators capable of deep testing and domain expertise.
OpenRouter CEO Alex Atallah echoed that sentiment, suggesting that open benchmarks and public testing, while valuable, cannot replace comprehensive internal evaluations and rigorous external reviews. Wei-Lin Chiang, a UC Berkeley AI doctoral student and co-founder of LM Arena—the organization behind Chatbot Arena—acknowledged that recent controversies around the platform stemmed from labs misinterpreting its benchmarking objectives rather than from flaws in its methodology. According to Chiang, recent policy updates aim to clarify the platform’s commitment to transparent and reproducible evaluations and to prevent future misunderstandings.
Chiang explained that LM Arena sees itself primarily as an open, community-driven initiative offering an accessible and transparent space for collective feedback, rather than as an authoritative, standalone evaluation tool. Despite the criticism, he insists crowdsourced benchmarks yield valuable insights so long as their results genuinely reflect the community’s perspectives.