Why AI leaderboards are inaccurate and how to fix them

Methods for evaluating chess players and athletes do not always translate to AI. U-M researchers identify and outline best practices.

Four wrestlers, each representing a genAI model, face off in a wrestling ring. A green wrestler, Claude, grapples with GPT-4 in blue. A purple wrestler, Gemini, lunges at Mistral in orange.
Online leaderboards assess AI models by asking people to judge the generated content in head-to-head comparisons, in a process the researchers term an ‘LLM Smackdown.’ An inaccurate ranking system could crown the wrong model champion. Image credit: Generated by Google Gemini 2.5 Flash and modified by Derek Smith

Inaccurate ranking systems common in AI leaderboards can be fixed with strategies evaluated at the University of Michigan.

In their study, U-M researchers examined the effectiveness of four ranking approaches used in popular online AI leaderboards, including Chatbot Arena, as well as in sports and gaming leaderboards. They found that the choice of ranking approach, and how it is applied, can produce different outcomes even on the same crowdsourced dataset of model performance. Based on their findings, the researchers developed recommendations to help leaderboards accurately reflect the true performance of AI models.

Lingjia Tang

“Major companies keep releasing newer and bigger gen AI models, but how do you know which model is truly the best if the evaluation methods are inaccurate or understudied?” said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.

“The public is increasingly eager to adopt this technology. To do that well, we need robust methods for assessing AI across many different applications. Our study identifies what makes an effective AI ranking system and offers guidance on when and how to use them.”

Evaluating gen AI models is difficult because judgments of AI-generated content are subjective. Some leaderboards measure how well AI models perform specific tasks, such as answering multiple-choice questions, but these do not capture how well an AI produces open-ended content that has no single correct answer.

To assess more open-ended outputs, other leaderboards, such as the popular Chatbot Arena, ask people to judge the generated content in head-to-head comparisons, a process the researchers refer to as an “LLM Smackdown.” Human participants anonymously submit a prompt to two random AI models, then record their preferred response in the leaderboard’s database, which the ranking system later uses.
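
Under the hood, each of those votes boils down to a small pairwise record that a ranking system can consume. Here is a minimal sketch, in Python, of what such a record might look like; the field names and model labels are illustrative assumptions, not Chatbot Arena’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One crowdsourced head-to-head vote, as a leaderboard might store it."""
    prompt: str
    model_a: str
    model_b: str
    winner: str  # "model_a", "model_b", or "tie"

# Hypothetical example votes.
votes = [
    Comparison("Summarize this article.", "gpt-4", "claude", winner="model_b"),
    Comparison("Write a haiku about rain.", "gemini", "mistral", winner="model_a"),
]
```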

However, the rankings can differ depending on how the systems are implemented. Chatbot Arena previously used a ranking algorithm called Elo, which is widely used to rate chess players and athletes. Elo has settings that let users decide how much a single win or loss shifts the leaderboard’s standings and how that effect changes with the age of the player or model. In theory, these features make a ranking system more adaptable, but choosing the right settings for evaluating AI is not always straightforward.
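
To make the role of those settings concrete, here is a minimal sketch of a single Elo update in Python; the K-factor value is a hypothetical choice for illustration, not the setting of any particular leaderboard.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Apply one Elo update after a head-to-head comparison.

    score_a is 1.0 if model A's response was preferred, 0.0 if model B's was,
    and 0.5 for a tie. The K-factor controls how strongly a single result
    moves the ratings; the value 32 here is only an illustrative assumption.
    """
    # Expected score of A under Elo's logistic model of relative strength.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1500; model A's answer is preferred once.
print(elo_update(1500.0, 1500.0, score_a=1.0))  # -> (1516.0, 1484.0)
```

Because each update depends on K and on the order in which comparisons arrive, two leaderboards fed the same votes can end up with different standings, which is the kind of sensitivity the researchers flag.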

“In chess and sports matches, there’s a sequential order of games that evolves as players’ skills develop throughout their careers. But AI models do not change between versions, and they can simultaneously and instantly engage in numerous games,” noted Roland Daynauth, a doctoral student in computer science and engineering at U-M and the study’s lead author.

To help prevent such accidental misuse, the researchers tested each ranking system by feeding it a portion of two crowdsourced datasets of AI model performance: one from Chatbot Arena and one the researchers had previously collected. They then measured how well each system’s rankings matched the win rates in a held-out portion of the datasets. They also assessed how sensitive each system’s rankings were to user-defined settings, and whether the rankings respected the logic of all pairwise comparisons: if A beats B, and B beats C, then A must be ranked above C.
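
As a rough illustration of the two checks described above, the hypothetical functions below score a ranking against held-out votes and count transitivity violations; the data layout and function names are assumptions for this sketch, not the study’s actual evaluation code.

```python
from itertools import permutations

def holdout_accuracy(rank, holdout_votes):
    """Fraction of held-out comparisons in which the higher-ranked model won.

    rank maps model name -> position (1 is best); holdout_votes is a list of
    (model_a, model_b, winner) tuples with winner in {"a", "b"}.
    """
    correct = 0
    for a, b, winner in holdout_votes:
        predicted = "a" if rank[a] < rank[b] else "b"
        correct += predicted == winner
    return correct / len(holdout_votes)

def transitivity_violations(rank, pairwise_winrate):
    """Count triples where A beats B and B beats C head-to-head,
    yet the ranking fails to place A above C."""
    violations = 0
    for a, b, c in permutations(rank, 3):
        if (pairwise_winrate.get((a, b), 0.5) > 0.5
                and pairwise_winrate.get((b, c), 0.5) > 0.5
                and not rank[a] < rank[c]):
            violations += 1
    return violations

# Tiny example with a hypothetical ranking and two held-out votes.
rank = {"claude": 1, "gpt-4": 2, "mistral": 3}
holdout = [("claude", "gpt-4", "a"), ("gpt-4", "mistral", "b")]
print(holdout_accuracy(rank, holdout))  # -> 0.5
```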

The researchers found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, particularly when the numbers of comparisons are uneven. Other ranking systems, such as the Bradley-Terry system that Chatbot Arena adopted in December 2023, could also produce accurate results, but only when each model undergoes an equal number of comparisons. Without that balance, a newer model could appear stronger than it really is.
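
Glicko extends Elo with a rating deviation that tracks how uncertain each rating is and shrinks as a player, or model, accumulates games, which is part of why it copes better with uneven numbers of comparisons. Below is a minimal sketch of a single-game Glicko-1 update using the standard published constants; it is illustrative only, not any leaderboard’s implementation, and it omits the between-period step that inflates the deviation of inactive players.

```python
import math

Q = math.log(10) / 400.0

def g(rd: float) -> float:
    """Attenuation factor: results against uncertain opponents count for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd / math.pi) ** 2)

def glicko1_update(r, rd, r_opp, rd_opp, score):
    """One Glicko-1 update after a single game.

    r, rd          -- player's rating and rating deviation (uncertainty)
    r_opp, rd_opp  -- opponent's rating and rating deviation
    score          -- 1.0 win, 0.5 draw, 0.0 loss
    """
    g_opp = g(rd_opp)
    expected = 1.0 / (1.0 + 10 ** (-g_opp * (r - r_opp) / 400.0))
    d_squared = 1.0 / (Q ** 2 * g_opp ** 2 * expected * (1.0 - expected))
    new_rd = math.sqrt(1.0 / (1.0 / rd ** 2 + 1.0 / d_squared))
    new_r = r + Q * new_rd ** 2 * g_opp * (score - expected)
    return new_r, new_rd

# A model with an uncertain rating (RD=350) beats an established one (RD=50):
# its rating jumps, and its deviation shrinks now that there is more evidence.
print(glicko1_update(1500, 350, 1600, 50, score=1.0))
```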

Jason Mars

“Just because a model beats a grandmaster doesn’t automatically mean it’s the best model available. It takes an extensive number of games to discern the truth,” said Jason Mars, associate professor of computer science and engineering at U-M and a co-corresponding author of the study.

Conversely, the rankings produced by the Elo system, and by the Markov chain method Google uses to rank web pages in search results, were heavily influenced by how users configured the system. The Bradley-Terry system has no user-defined settings, which can make it the best choice for large datasets with an even number of comparisons for each AI.
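
Bradley-Terry instead fits one strength parameter per model to the full set of pairwise outcomes at once, with nothing for the user to tune. Here is a minimal sketch of the classic iterative (minorization-maximization) fit, assuming wins have already been tallied per model pair; it is a sketch under those assumptions, not Chatbot Arena’s production code.

```python
def bradley_terry(models, wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    wins[(i, j)] is the number of head-to-head comparisons model i won
    against model j. Returns a dict of strengths; higher means stronger,
    and only ratios between strengths are meaningful. Models with zero
    wins would need smoothing in a real fit.
    """
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            w_i = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = w_i / denom if denom > 0 else p[i]
        # Rescale each pass so the strengths stay on a stable scale.
        mean = sum(new_p.values()) / len(new_p)
        p = {m: v / mean for m, v in new_p.items()}
    return p

# Toy example with deliberately unequal numbers of comparisons per pair.
print(bradley_terry(
    ["claude", "gpt-4", "mistral"],
    {("claude", "gpt-4"): 6, ("gpt-4", "claude"): 4,
     ("gpt-4", "mistral"): 9, ("mistral", "gpt-4"): 1},
))
```

The trade-off the study highlights is that such a fit is only as balanced as its input: a model with far fewer comparisons gets a strength estimate that rests on thin evidence.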

“There’s no single right answer, so ideally our analysis will help shape how we evaluate the AI industry going forward,” Tang said.

This research received funding from the National Science Foundation.
