Introduction to AI Evaluation
As a tech reporter, I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?” If I don’t feel like turning it into an hour-long seminar, I’ll usually give the diplomatic answer: “They’re both solid in different ways.” Most people asking aren’t defining “good” in any precise way, and that’s fair. It’s human to want to make sense of something new and seemingly powerful. But that simple question—Is this model good?—is really just the everyday version of a much more complicated technical problem.
The Limitations of Benchmarks
So far, the way we’ve tried to answer that question is through benchmarks. These give models a fixed set of questions to answer and grade them on how many they get right. But much like standardized exams such as the SAT (an admissions test used by many US colleges), these benchmarks don’t always reflect deeper abilities. Lately, it feels as if a new AI model drops every week, and every launch comes with fresh scores showing it beating its predecessors. On paper, everything appears to be getting better all the time.
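To make the mechanics concrete, here is a minimal sketch of how a static benchmark harness works: ask every question once, grade by exact match, report accuracy. The question set and the `model_answer` callable are hypothetical stand-ins of my own, not any particular benchmark’s API.

```python
# Minimal sketch of a static benchmark harness (hypothetical data and model call).
from typing import Callable

def score_benchmark(questions: list[dict], model_answer: Callable[[str], str]) -> float:
    """Ask the model every question once; return the fraction answered correctly."""
    correct = 0
    for item in questions:
        prediction = model_answer(item["question"])  # query the model once per item
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1  # exact-match grading, the simplest common scheme
    return correct / len(questions)

# Toy question set and a stub "model" that always answers "4".
toy_set = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
print(score_benchmark(toy_set, lambda q: "4"))  # 0.5: one right, one wrong
```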
The Evaluation Crisis
In practice, it’s not so simple. Just as grinding for the SAT might boost your score without improving your critical thinking, models can be trained to optimize for benchmark results without actually getting smarter. As OpenAI and Tesla AI veteran Andrej Karpathy recently put it, we’re living through an evaluation crisis—our scoreboard for AI no longer reflects what we really want to measure. Benchmarks have grown stale for a few key reasons. First, the industry has learned to “teach to the test,” training AI models to score well rather than genuinely improve. Second, widespread data contamination means models may have already seen the benchmark questions, or even the answers, somewhere in their training data. And finally, many benchmarks are simply maxed out. On popular tests like SuperGLUE, models have already reached or surpassed 90% accuracy, making further gains feel more like statistical noise than meaningful improvement. At that point, the scores stop telling us anything useful.
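One common way labs probe for the contamination problem described above is to check whether benchmark text appears near-verbatim in the training corpus, for instance via long n-gram overlap. The sketch below is a deliberately simplified, in-memory version of that idea; the 13-token default and the helper names are illustrative assumptions on my part, not any lab’s actual pipeline.

```python
# Rough sketch of an n-gram overlap contamination check. Real pipelines hash
# n-grams over web-scale corpora; this in-memory version is illustrative only.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
    """Flag a benchmark item if it shares any long n-gram with a training document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Example with a short n for demonstration purposes.
doc = "training text that happens to contain the exact benchmark question word for word"
item = "contain the exact benchmark question word for word"
print(looks_contaminated(item, doc, n=5))  # True
```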
New Approaches to AI Evaluation
A growing number of teams around the world, however, are trying to address the AI evaluation crisis. One result is a new benchmark called LiveCodeBench Pro. It draws its problems from international algorithmic olympiads, competitions in which elite high school and university programmers solve challenging problems without external tools. The top AI models currently solve only about 53% of the medium-difficulty problems on their first attempt, and none of the hardest ones. These are tasks where human experts routinely excel.
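That “first attempt” figure is what the code-generation evaluation literature calls pass@1. For reference, the standard unbiased pass@k estimator from that literature fits in a few lines of Python; whether LiveCodeBench Pro computes its numbers exactly this way is an assumption on my part.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts at a problem, 3 of which pass the judge's tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3, the expected first-attempt success rate
```

With k = 1 the formula reduces to c/n, so a “53% on the first attempt” figure is simply the average fraction of problems a model solves on its first try.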
LiveCodeBench Pro and Other Initiatives
Zihan Zheng, a junior at NYU and a world finalist in competitive coding, led the project to develop LiveCodeBench Pro with a team of olympiad medalists. They’ve published both the benchmark and a detailed study showing that top-tier models like GPT-4o mini and Google’s Gemini 2.5 perform at a level comparable to the top 10% of human competitors. Across the board, Zheng observed a pattern: AI excels at making plans and executing tasks, but it struggles with nuanced algorithmic reasoning. Other initiatives, such as Xbench, a Chinese benchmark project, and LiveBench, a dynamic benchmark created by Meta’s Yann LeCun, aim to evaluate models not just on knowledge but on adaptability and practical usefulness.
The Future of AI Evaluation
AI researchers are beginning to realize, and to admit, that the status quo of AI testing cannot continue. At the recent CVPR conference, NYU professor Saining Xie drew on James Carse’s Finite and Infinite Games to critique the hypercompetitive culture of AI research. An infinite game, he noted, is open-ended; the goal is to keep playing. A finite game is played to win, and much of AI research is currently played as a finite game: a dominant player drops a big result, and a wave of follow-up papers chases the same narrow topic. This race-to-publish culture puts enormous pressure on researchers and rewards speed over depth, short-term wins over long-term insight.
Conclusion
So, do we have a truly comprehensive scoreboard for how good a model is? Not really. Many dimensions of ability, from the social and emotional to the interdisciplinary, still evade assessment. But the wave of new benchmarks hints at a shift. As the field evolves, a bit of skepticism is probably healthy: new evaluation methods, and an honest accounting of what current benchmarks miss, are steps toward a more nuanced understanding of what these models can actually do.
FAQs
- Q: What is the main issue with current AI benchmarks?
A: Current benchmarks have grown stale and no longer accurately reflect the true abilities of AI models. They can be gamed by training models to optimize for benchmark scores rather than for genuine improvement.
- Q: What are some new approaches to AI evaluation?
A: New approaches include LiveCodeBench Pro, Xbench, and LiveBench, which aim to evaluate models on practical usefulness, adaptability, and nuanced algorithmic reasoning.
- Q: Why is there a need for new evaluation methods?
A: Current benchmarks have been maxed out by top models and are vulnerable to data contamination, so their scores no longer reflect what we actually want to measure in AI systems.
- Q: What is the future of AI evaluation?
A: The future of AI evaluation involves moving beyond current benchmarks to more comprehensive and nuanced methods that assess models across many dimensions, including social, emotional, and interdisciplinary abilities.