The AI industry is at a turning point. While rapid progress continues in developing powerful new models, the methods we use to evaluate these models are struggling to keep pace. Traditional benchmarks, once the gold standard, are increasingly recognized as limited, even misleading, indicators of real-world performance. This article explores the growing "benchmark crisis" and the rise of new, user-centric approaches to AI evaluation.
Why Benchmarks Are Falling Short for Both Researchers and Startups
Benchmarks, the seemingly objective yardsticks of AI progress, are showing their limitations. A researcher involved in creating the CodeQueries dataset explains that existing coding benchmarks are becoming "saturated," with models achieving high scores without necessarily demonstrating true semantic understanding. This sentiment is echoed across AI domains. The paper "Towards Generalist Biomedical AI" addresses this directly, stating, "to the best of our knowledge, there have been limited attempts to curate benchmarks for training and evaluating generalist biomedical AI models." The authors then "curate MultiMedBench, a new multimodal biomedical benchmark" comprising "14 diverse tasks" (Moor et al., 2023).
Saturation isn't the only issue. Benchmarks can be "gamed"—models can be optimized for specific metrics without generalizing well to real-world scenarios. For example, a model might excel at a question-answering benchmark by learning to exploit statistical biases in the dataset, rather than truly understanding the underlying text.
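The "exploiting statistical biases" failure mode can be made concrete with a toy example. The dataset below is entirely hypothetical: a yes/no QA benchmark whose labels are heavily skewed, so a degenerate "model" that never reads the question still posts a high accuracy score.

```python
from collections import Counter

# Hypothetical benchmark: 90% of the questions happen to have the answer "yes".
benchmark = [("Is the sky blue?", "yes")] * 90 + [("Is fire cold?", "no")] * 10

# Degenerate model: ignores the question and always predicts the most
# frequent answer observed in the data.
majority_label = Counter(ans for _, ans in benchmark).most_common(1)[0][0]

def majority_model(question: str) -> str:
    return majority_label  # no "understanding" involved

accuracy = sum(majority_model(q) == a for q, a in benchmark) / len(benchmark)
print(f"Majority-baseline accuracy: {accuracy:.0%}")  # 90% with zero comprehension
```

Real benchmark gaming is subtler than a majority-class baseline, but the mechanism is the same: the metric rewards any statistical regularity in the dataset, whether or not it reflects the capability the benchmark claims to measure.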
Evaluating generative AI adds another layer of complexity. As one interviewee from the "Understanding the Dataset Practitioners" study noted, "If you're doing simple classification, it's easy to measure... But with generative models, evaluation is very subjective." The inherent subjectivity of the output makes traditional metrics like accuracy far less applicable. Another researcher mentions that "For researchers: benchmarks aren’t sufficient and don’t keep up with new model capabilities." The paper "Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis" adds that benchmark datasets lag behind in evaluating the accuracy of character localization (Long et al., 2024).
For startups deploying AI models, benchmark performance often fails to correlate with actual product quality or user satisfaction. As the head of Customer Operations at one AI startup put it, customers ultimately care whether the technology works and meets their specific needs, not about abstract scores. At the time of the interview, context length, not benchmark scores, was a key differentiator.
Lastly, as one interviewee puts it, the rising "open source secrecy trend makes benchmarks less of a widely respected thing."
The Rise of User-Centric Evaluation
The limitations of benchmarks are driving a shift towards prioritizing user experience and real-world impact. As one interviewee noted, evaluations today often rely on "intuition or crowdworkers" rather than the needs of end-users. Achieving high scores on academic benchmarks doesn't always translate to effective or reliable performance in real-world applications such as coding in a company setting. Beyond that, AI models that make it out of the research lab may perform worse in practice because customer expectations and preferences may not align with those of researchers or the crowdworkers who evaluated the model.
If benchmarks aren't the be-all and end-all, what is? The resounding answer from the field is the user experience. "Good performance" is inextricably linked to the specific context and the user's needs, and alignment with end-user preferences becomes paramount. While researchers, working through crowdsourced raters, may align more with consumer-centric tastes, enterprise deployments of AI for use cases such as customer service would benefit from greater alignment with business-centric definitions of quality.
Embracing Subjectivity and Human Eval
We cannot yet fully automate the assessment of "intelligence," especially when it comes to nuanced, subjective, or context-dependent capabilities. Even in relatively mature fields like Optical Character Recognition (OCR), where seemingly objective metrics like character recognition accuracy exist, the push towards finer-grained evaluation (e.g., character localization in the "Hierarchical Text Spotter" paper) reveals the limitations of purely automated assessments. The ultimate goal is not just recognizing characters, but understanding their meaning and relationships within a larger context, a task that often requires human judgment.
This challenge is even more pronounced in emerging areas like 3D modeling and generation, where we are rapidly approaching the "uncanny valley." Creating a 3D model that looks like a real person is becoming increasingly feasible, but evaluating whether it feels real, natural, and believable requires subjective human perception. This applies equally to the burgeoning field of spatial computing, where AI will need to interact with and understand complex, dynamic 3D environments.
Human judgment remains irreplaceable, especially for evaluating subjective aspects of AI. One researcher, discussing speech models, highlighted that while automated metrics exist, "Ideally we want a human rater to score... because there isn't a perfect computational model of human perception." This need for human-in-the-loop evaluation is particularly acute in areas with ethical implications, as highlighted by studies on disability representation, where automated metrics often fail to capture harmful biases.
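When human raters do the scoring, a basic sanity check is whether raters agree with each other beyond chance. A minimal sketch using Cohen's kappa follows; the two raters, the binary "natural"/"unnatural" labels, and the ratings themselves are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 speech samples: 1 = "natural", 0 = "unnatural".
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # -> kappa = 0.52
```

A kappa near zero means the raters' "agreement" is what chance alone would produce, a signal that the rating rubric, not the model, needs work before the human scores can be trusted.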
Open Source: A Double-Edged Sword for Evaluation
Open source plays a dual role in this evolving landscape. It's a powerful catalyst for community-driven evaluation, allowing broader scrutiny of models and architectures. AI21 Labs' intention to let the community experiment with their Jamba architecture exemplifies this. The open-source nature allows for more transparent and reproducible evaluations. Researchers and developers can inspect the code, the training data (if available), and the evaluation procedures, fostering trust and allowing for independent verification of results.
However, open sourcing is not without its complexities. Companies must strategically balance the benefits of open evaluation with the need to protect trade secrets, which sometimes results in "bad engineering" released to open-source communities.
Specific Examples of New Evaluation Approaches
The limitations of traditional benchmarks are driving the creation of new, more targeted evaluation methods:
MultiMedBench (Moor et al., 2023): A multimodal benchmark for biomedical AI, encompassing 14 diverse tasks, including generative tasks like radiology report generation.
CodeQueries (Maniatis et al., 2024): Focuses on semantic understanding of code through queries, going beyond simple code completion accuracy.
ScreenQA: Evaluates AI's ability to answer questions about mobile app screenshots, with metrics shifting from bounding box accuracy to textual answer quality.
Hierarchical Text Spotter (Long et al., 2024): Highlights the need for character-level evaluation in OCR, going beyond word-level recognition.
QuoteSum: A question-answering benchmark that incorporates noisy data, reflecting real-world information retrieval challenges.
MiniWob++: Specifically designed to evaluate multimodal (visual and textual) interaction capabilities for web navigation tasks, showing the need for interactive evaluation environments.
Recommendations
The clear trend is a move towards incorporating user preferences and real-world data into the evaluation process. Customers care about whether a model works for them, not about abstract benchmark scores. This underscores the importance of focusing on user-centric metrics and gathering feedback from real-world deployments. The success of techniques like crowdsourcing, used by OpenAI to create InstructGPT and to guide which prompts to prioritize, demonstrates the power of leveraging human feedback at scale. InstructGPT's ability to generate more helpful and aligned responses was a direct result of training on data generated through human evaluation and preference ranking. This "human-in-the-loop" approach is becoming increasingly essential.
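The preference-ranking idea behind InstructGPT-style training can be illustrated with a Bradley-Terry model, which converts pairwise human preferences into scalar quality scores. This is a simplified sketch with made-up comparison counts, not OpenAI's actual pipeline:

```python
# wins[(i, j)] = number of times raters preferred response i over response j.
# All counts below are hypothetical.
wins = {("A", "B"): 8, ("B", "A"): 2,
        ("A", "C"): 6, ("C", "A"): 4,
        ("B", "C"): 7, ("C", "B"): 3}

items = {"A", "B", "C"}
scores = {i: 1.0 for i in items}  # Bradley-Terry strengths, initialized equal

# Standard minorization-maximization updates: each item's strength is its
# total wins divided by its strength-weighted number of comparisons.
for _ in range(100):
    new = {}
    for i in items:
        w_i = sum(w for (a, b), w in wins.items() if a == i)
        denom = sum((wins.get((i, j), 0) + wins.get((j, i), 0)) / (scores[i] + scores[j])
                    for j in items if j != i)
        new[i] = w_i / denom
    total = sum(new.values())
    scores = {i: s / total for i, s in new.items()}  # normalize for stability

ranking = sorted(scores, key=scores.get, reverse=True)
print("Preference ranking:", ranking)  # -> ['A', 'B', 'C']
```

The resulting scalar scores are exactly the kind of signal a reward model learns to predict, which is what lets human preference judgments scale beyond the examples raters actually saw.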
In this evolving landscape, startups, with their often tighter feedback loops with customers, may possess a significant advantage. They are often better positioned to gather real-world usage data and rapidly iterate on their models and evaluation methods based on direct user feedback. This agility can be a crucial differentiator in a field where benchmarks quickly become outdated, and the definition of "good performance" is constantly evolving. Ultimately, the most successful AI systems will be those that are not just technically proficient, but also demonstrably useful, reliable, and aligned with human values – a judgment that requires a blend of quantitative metrics and qualitative human insight.
In conclusion, we recommend more useful paths forward beyond AI leaderboards and the limited benchmarks they rely on:
Prioritize User-Centric Metrics: Focus on how well a company's AI solutions address real-world user needs and deliver tangible value, rather than on benchmark scores.
Invest in Evaluation Infrastructure: This includes solutions for continuous evaluation and live monitoring. Unlike static benchmarks, real-world AI operates in a dynamic environment. Models must be continuously monitored and re-evaluated as data distributions shift, user needs evolve, and new challenges arise. This necessitates a shift from one-off evaluations to ongoing assessment and adaptation. MLOps platforms that facilitate continuous integration, continuous deployment, and continuous training (CI/CD/CT) are becoming increasingly crucial for maintaining AI system performance and reliability. The growing need for better tools and frameworks for evaluating AI, especially generative models, represents a significant investment opportunity in areas such as:
Platforms for human-in-the-loop evaluation.
Synthetic data generation for robust testing.
Red-teaming platforms for identifying vulnerabilities.
Explainability and interpretability tools.
Development of new metrics beyond accuracy.
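The "continuous evaluation and live monitoring" piece above can be sketched concretely. One common drift check is the Population Stability Index (PSI), which compares the distribution a model saw at training time against what it sees in production; the data, the binning scheme, and the 0.2 alert threshold below are illustrative assumptions (0.2 is a common rule of thumb, not a standard).

```python
import math

def psi(baseline, live, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(baseline), max(baseline)

    def frequencies(values):
        counts = [0] * bins
        for v in values:
            # Map each value to a bin over the baseline's range; clamp outliers.
            idx = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[max(0, min(idx, bins - 1))] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [(c + 1e-6) / len(values) for c in counts]

    p, q = frequencies(baseline), frequencies(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical prediction scores: training-time distribution vs. a shifted
# distribution observed in production.
baseline = [i / 100 for i in range(100)]              # roughly uniform on [0, 1)
live = [min(1.0, i / 100 + 0.3) for i in range(100)]  # shifted upward

drift = psi(baseline, live)
print(f"PSI = {drift:.2f}", "-> retrain/re-evaluate" if drift > 0.2 else "-> stable")
```

A check like this running on every batch of production traffic is what turns evaluation from a one-off benchmark run into the ongoing assessment the recommendation calls for.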
Seek Vertical Expertise: Companies focusing on specific vertical applications, where "good performance" can be more concretely defined and measured, may present more attractive and less risky investment opportunities. For example:
In healthcare, prioritize solutions that demonstrably improve patient outcomes or diagnostic accuracy.
In customer service, focus on companies that improve CSAT scores and reduce resolution times.
In autonomous driving, prioritize safety metrics and real-world performance data over simulated benchmarks.
Value Human-in-the-Loop Expertise: Companies providing human evaluation services, particularly for complex or subjective tasks, will become increasingly valuable.