Earlier this week, Meta found itself embroiled in controversy after using an unreleased, experimental variant of its Llama 4 Maverick AI model to secure a high rank on the widely followed LM Arena benchmark. The revelation led the benchmark’s administrators to issue an apology, revise their entry policies, and promptly score the officially released “vanilla” Maverick. The results have painted a decidedly less flattering picture.
The unmodified Llama 4 Maverick, officially designated "Llama-4-Maverick-17B-128E-Instruct," landed well below rival models such as OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, some of which were released months earlier. As of late this week, Meta's model sat in a disappointing 32nd place, far below initial expectations.
Meta attributed the stark gap in scores between the experimental and official releases to optimizations specific to the experimental version. The company noted that the earlier variant, "Llama-4-Maverick-03-26-Experimental," had been tuned specifically for conversational style, a trait that LM Arena's methodology tends to reward because it relies heavily on human raters judging conversational quality and engagement.
Though LM Arena remains a popular way to compare rival models, industry experts have long questioned how robust its evaluation criteria are and how well its rankings generalize. Tailoring a model specifically to score well on such a benchmark, critics argue, offers a narrow view of its capabilities and can mislead developers about how it will perform across diverse, real-world scenarios.
In response to the controversy, Meta released a statement emphasizing its commitment to experimentation and openness. "We regularly experiment with a range of specialized model variants," a Meta spokesperson explained. "'Llama-4-Maverick-03-26-Experimental' represents one such conversationally optimized model tested internally, which happened to perform particularly well in LM Arena's settings. We've since made the vanilla model publicly accessible, and we're eager to observe how developers adjust and evolve Llama 4 to excel in their own specific applications. We also look forward greatly to their ongoing feedback and innovation."