The AI community is actively grappling with the limitations of traditional evaluation methods and is forging ahead with increasingly sophisticated approaches that strive to capture the nuanced capabilities, potential pitfalls, and true utility of AI systems. This evolution, driven by insights from research papers and the experiences of AI practitioners, signifies a maturing understanding of how we should judge the “intelligence” we are creating.

For Medical AI

"Towards Generalist Biomedical AI" directly addresses the lack of adequate benchmarks for evaluating generalist biomedical AI models. The authors state that "to the best of our knowledge, there have been limited attempts to curate benchmarks for training and evaluating generalist biomedical AI models". While acknowledging the existence of benchmarks like BenchMD, they point out that these are "primarily focused on classification whereas our benchmark also includes generative tasks such as medical (visual) question answering, radiology report generation and summarization". Furthermore, they note that "there is currently no implementation of a generalist biomedical AI system that can competently handle all these tasks simultaneously" on existing benchmarks.

To overcome these limitations, the authors "curate MultiMedBench, a new multimodal biomedical benchmark". MultiMedBench encompasses "14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling". This benchmark comprises "12 de-identified open source datasets and 14 individual tasks" designed to measure the capability of a general-purpose biomedical AI to perform a variety of clinically relevant tasks across data sources like medical questions, radiology reports, pathology, dermatology, chest X-ray, mammography, and genomics.

The motivation behind creating MultiMedBench was to enable the development and evaluation of generalist biomedical AI, addressing the "unmet need" for unified benchmarks in this domain. However, the authors also acknowledge that MultiMedBench has limitations, including the "limited size of the individual datasets (a cumulative size of ~1 million samples) and limited modality and task diversity (e.g., lacking life sciences such as transcriptomics and proteomics)". They also point to the "lack of large scale multimodal datasets" as a barrier to developing models for a wider variety of biomedical data types.

For 3D Modeling

Similarly, the authors of "LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals" encountered limitations with existing feature matching approaches when dealing with image pairs from wide baselines. They report that SuperGlue, a graph neural network-based learned feature matcher, "failed with wider baseline" scenarios, specifically on object categories such as shoes with rotation baselines as large as 120 degrees. This suggests that methods and benchmarks designed for general feature matching might not adequately capture the challenges posed by significant viewpoint changes in image pairs relevant to tasks like 3D object pose estimation.

To address this, they developed their own methodology for learnable feature matching across wide baselines, implicitly creating a new evaluation setting for their specific problem domain. Their work utilized the Objectron dataset, although they noted its lack of metric depth data, implying a limitation in its suitability for certain aspects of 3D understanding.

For NLP

In question answering (QA), the QuoteSum dataset introduced in "Towards Verifiable Generation: A Benchmark for Knowledge-Aware Language Model Attribution" was created because many existing QA datasets, such as SQuAD and Natural Questions, primarily focused on reading comprehension and information consolidation within a closed-domain setting using provided retrieved sources.

The authors of QuoteSum aimed to extend this by including a setup that mimics noisy retriever systems, where models need to filter out irrelevant information from the retrieved sources to perform well. This highlights a perceived insufficiency in the generalization capabilities tested by earlier QA benchmarks when faced with real-world complexities like noisy information retrieval. While QuoteSum itself is a dataset for QA, its design and focus on handling noisy sources implicitly serve as a benchmark for evaluating a model's ability to perform robust knowledge-aware reasoning beyond clean, curated data.

For Computer Vision

The "Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis" paper also points to limitations in existing Optical Character Recognition (OCR) benchmark datasets, particularly concerning the evaluation of character localization. Interviews with AI researchers in this space reveals benchmark datasets are "kinda behind" in evaluating the accuracy of character localization. This suggests that while existing benchmarks might focus on word-level or line-level text recognition, they may not provide sufficient granularity for evaluating the precise localization of individual characters, which is crucial for advanced OCR tasks and understanding the underlying structure of documents.

The paper introduces a new architecture with novel components, a Unified-Detector-Polygon (UDP) and a Line-to-Char-to-Word (L2C2W) recognizer, implicitly suggesting a need for benchmarks that can better assess these fine-grained capabilities. Furthermore, the mention of HierText as "a new dataset with more than 100 words per image" indicates an effort to create datasets that pose different challenges compared to existing ones, possibly focusing on more complex document layouts.

For Web Navigation and AI Agents

"Multimodal Web Navigation with Instruction-Finetuned Foundation Models" explains the creation and use of the MiniWoB++ benchmark. This suggests a need for environments that can evaluate multimodal (visual and textual) interaction capabilities for web navigation tasks. The fact that a smaller model fine-tuned on this domain outperformed much larger general-purpose models like GPT-4 and PaLM 540B on MiniWoB++ indicates that task-specific benchmarks are crucial for evaluating performance in specialized domains, and general benchmarks might not be sufficient to capture the nuances required for successful interaction.

The development of Mind2Web as another benchmark for web interaction further underscores the ongoing effort to create more challenging and relevant evaluation environments for AI agents.

Finally, the ChartQA benchmark, mentioned in the context of "Fixed Grid Pix2struct", highlights a situation where the original fine-tuning dataset was found to be insufficiently rich for learning to solve complex reasoning tasks. To overcome this limitation, concurrent work [Carbune et al., 2024] created synthetic examples and rationales to augment the dataset. This illustrates that even when datasets exist for specific tasks (like question answering over charts), their richness and complexity may need to be enhanced to effectively train and evaluate models on more challenging aspects like complex reasoning.

As models advance and tackle more specialized or complex problems, the limitations of existing benchmarks often become apparent. Researchers respond to these limitations by creating new datasets and benchmarks tailored to better evaluate the specific capabilities and challenges of their work, driving progress in more targeted and meaningful ways. This iterative process of identifying benchmark insufficiencies and developing new evaluation tools is crucial for the continued advancement and reliable assessment of AI technologies.

We provide an analysis of AI model evaluations below, using primary and secondary sources.

Pain Points

Limitations of Traditional Benchmarks: Many established benchmarks are proving to have limitations in reflecting the complexities of real-world applications. They may focus on narrow tasks, lack sufficient diversity, or be susceptible to gaming by models that learn superficial correlations rather than genuine understanding. The disconnect between performance on these benchmarks and real-world utility is a significant concern.

Difficulty in Defining Quality and Desired Behavior: As one researcher put it, even defining data quality is an ambiguous task. The efforts to evaluate code review assistants highlight this challenge, as simple text differences or automated metrics often fail to capture the semantic correctness and overall quality of suggested code edits. The manual review of code examples was deemed necessary to provide a more meaningful evaluation. This underscores the ongoing need for more sophisticated methods to assess the quality of AI-generated content and behaviors, potentially involving more nuanced human rating scales and protocols.
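The gap between surface-level diffs and semantic correctness can be illustrated with a minimal sketch. The snippet below is not from any of the cited evaluation efforts; it simply contrasts a hypothetical exact-match metric with a structural comparison using Python's standard `ast` module, showing how two formatting-only variants of the same code edit are scored differently.

```python
import ast

def exact_match(a: str, b: str) -> bool:
    """Surface-level comparison: any whitespace change counts as a mismatch."""
    return a.strip() == b.strip()

def ast_equivalent(a: str, b: str) -> bool:
    """Structural comparison: parse both snippets and compare their ASTs,
    ignoring formatting (though not renames or reordering)."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

# Two suggested edits that mean the same thing but differ textually.
suggestion = "result = [x * 2 for x in items]\n"
reference = "result = [ x*2   for x in items ]\n"

print(exact_match(suggestion, reference))     # False: the text diff flags a mismatch
print(ast_equivalent(suggestion, reference))  # True: structurally identical
```

Even this structural check misses paraphrases such as variable renames or equivalent algorithms, which is why the researchers above fell back on manual review.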

Manual and Resource-Intensive Evaluation Processes: Evaluating complex AI models, especially through human-centric approaches, can be manual, time-consuming, and resource-intensive. Researchers often find themselves manually reviewing thousands of model outputs or meticulously curating evaluation datasets, particularly in the medical field. This lack of scalable and efficient evaluation methods can hinder the pace of progress.

Lack of Standardized Evaluation Protocols: The field often lacks standardized evaluation protocols and metrics, making it difficult to compare results across different research groups and models. The variety of evaluation approaches used can lead to inconsistencies and challenges in determining the true state-of-the-art. As participants in the "Understanding the Dataset Practitioners Behind Large Language Models" study noted, there is a lack of alignment across objectives, metrics, and benchmarks.

The "Black Box" Nature of Models: The difficulty in interpreting the decision-making processes of many advanced AI models, particularly deep neural networks, poses a challenge for evaluation. Understanding why a model performs well or poorly on a given task is crucial for identifying its limitations and areas for improvement. The lack of transparency in these "black box" models makes comprehensive evaluation more difficult.

Trends

Holistic evaluation. This involves assessing AI models across a broader spectrum of capabilities and characteristics beyond mere accuracy. The development of benchmarks like HELM, which includes MMLU alongside other evaluations and aims for a more comprehensive assessment, exemplifies this trend. Instead of a single accuracy score, holistic evaluation frameworks consider factors such as robustness to adversarial attacks, fairness across different demographic groups, efficiency in terms of computational resources, explainability of model decisions, and alignment with human values and ethical considerations. The increasing awareness of bias in AI models necessitates evaluation frameworks that can effectively identify and quantify these issues, moving beyond simple performance metrics.
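A holistic scorecard of this kind can be sketched in a few lines. The toy classifier, perturbation, and group labels below are illustrative stand-ins, not the definitions used by HELM or any other framework; the point is the shape of the report, which combines accuracy with a robustness probe and a fairness gap.

```python
from statistics import mean

# Toy records: (input_text, demographic_group, label). All names here are
# illustrative, not taken from any published benchmark.
records = [
    ("good product", "A", 1), ("bad product", "A", 0),
    ("great item", "B", 1), ("terrible item", "B", 0),
]

def model(text: str) -> int:
    """Trivial keyword classifier standing in for the system under test."""
    return 1 if any(w in text for w in ("good", "great")) else 0

def perturb(text: str) -> str:
    """Simple robustness probe: uppercase the input."""
    return text.upper()

accuracy = mean(model(t) == y for t, _, y in records)
robust_acc = mean(model(perturb(t)) == y for t, _, y in records)
groups = {g for _, g, _ in records}
per_group = {g: mean(model(t) == y for t, gg, y in records if gg == g) for g in groups}
fairness_gap = max(per_group.values()) - min(per_group.values())

scorecard = {"accuracy": accuracy, "robustness": robust_acc, "fairness_gap": fairness_gap}
print(scorecard)
```

Here the model scores perfect accuracy yet fails half the uppercased inputs, exactly the kind of discrepancy a single accuracy number would hide.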

Application-specific benchmarks that are tailored to the unique demands and contexts of particular use cases. The MiniWoB++ benchmark, used to evaluate multimodal web navigation with instruction-finetuned foundation models, provides a concrete example. Instead of assessing general language or vision capabilities, it focuses on the ability of AI agents to interact with web interfaces to achieve specific goals, a far more relevant evaluation for autonomous agents designed for web-based tasks. Similarly, ScreenQA and ScreenAI focus on understanding mobile app screenshots and user interfaces, addressing the specific challenges of visual and textual understanding in that domain. In the medical field, the evaluation of Med-PaLM M across the 14 tasks and 12 datasets of MultiMedBench shows the move toward judging AI models on tasks that directly mirror clinical applications, with medical professionals as evaluators. The preference of radiologists for Med-PaLM M-generated reports over human-generated ones in a significant percentage of cases underscores the importance of domain-expert evaluation in these specialized benchmarks, marking a clear shift away from outsourcing evaluation to generalist crowd workers.

Dynamic and adversarial testing. Static datasets, while valuable, may not adequately probe the limitations and failure modes of AI models. Dynamic evaluation involves generating new test cases or scenarios on the fly, often tailored to challenge specific aspects of a model's capabilities. Adversarial attacks, which involve crafting subtle perturbations to inputs that can fool even highly accurate models, are becoming increasingly important for assessing robustness, particularly in safety-critical applications. The research into the susceptibilities of image generation models to create gender-biased image results for certain professions, as highlighted by researchers probing Stable Diffusion, underscores the need for evaluations that go beyond standard benchmark datasets to uncover unexpected failure modes and biases.
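A minimal version of dynamic testing can be sketched as follows. Rather than scoring a fixed dataset, the snippet generates perturbed variants of an input on the fly (here, every adjacent-character swap, a crude stand-in for real adversarial attacks) and counts how often the prediction flips. The keyword classifier is a toy stand-in for the model under test, not a method from the cited work.

```python
def model(text: str) -> int:
    """Toy sentiment classifier standing in for the system under test."""
    return 1 if "excellent" in text.lower() else 0

def typo_variants(text: str) -> list[str]:
    """Dynamically generate every adjacent-character-swap perturbation."""
    variants = []
    for i in range(len(text) - 1):
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.append("".join(chars))
    return variants

original = "excellent service"
base_pred = model(original)
variants = typo_variants(original)
flips = sum(model(v) != base_pred for v in variants)
print(f"{flips}/{len(variants)} single-typo perturbations flipped the prediction")
```

A high flip rate under such trivial perturbations signals brittleness that a clean, static test set would never surface.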

Experts-only evaluation. The evaluation of radiology reports generated by Med-PaLM M by radiologists is a prime example of the critical role of human expertise in assessing the clinical utility and accuracy of AI in specialized domains. Similarly, research on alt text generation for AI-generated images highlights the importance of involving screen reader users (SRUs) in the evaluation process to ensure accessibility and align with their specific needs and preferences. The finding that creators' alt text, while highly ranked, still misaligned with SRUs' preferences in aspects like subjective language underscores the necessity of diverse human perspectives in evaluation.

LLMs for Evaluation: There is a growing trend of using AI models themselves to assist in the evaluation process. Large language models, for instance, are being explored for tasks such as generating evaluation questions, assessing the quality of generated text, and even identifying potential biases in other models. This approach offers the potential for more scalable and automated evaluation. Synthetic data is also emerging as a valuable tool in advancing AI evaluation. By creating controlled and labeled datasets, synthetic data can be used to test specific capabilities or edge cases that might be rare or difficult to capture in real-world data. The use of synthetic data in robotics and 3D modeling demonstrates its potential for creating challenging evaluation environments. Furthermore, LLMs themselves are being explored for their ability to generate synthetic evaluation data for training smaller, cheaper evaluation models, potentially offering a more scalable and cost-effective approach to assessing AI performance.
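The synthetic-data idea can be made concrete with a small sketch. The generator below programmatically produces a controlled evaluation set with known ground-truth answers (simple arithmetic questions, chosen purely for illustration; none of the names or the task come from the cited work), then scores a model by exact match. Because the labels are constructed rather than annotated, difficulty and coverage are fully controllable.

```python
import random

def make_arithmetic_eval(n: int, seed: int = 7) -> list[dict]:
    """Generate n synthetic QA pairs with known ground-truth answers."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        examples.append({
            "question": f"What is {a} + {b}?",
            "answer": str(a + b),
        })
    return examples

def evaluate(model_fn, dataset: list[dict]) -> float:
    """Exact-match accuracy of model_fn over the synthetic set."""
    correct = sum(model_fn(ex["question"]) == ex["answer"] for ex in dataset)
    return correct / len(dataset)

def toy_model(question: str) -> str:
    """Trivial 'model' that actually computes the sum, for demonstration."""
    a, b = (int(t) for t in question.rstrip("?").split() if t.isdigit())
    return str(a + b)

dataset = make_arithmetic_eval(100)
print(evaluate(toy_model, dataset))  # 1.0
```

In practice the generator would be an LLM producing harder, rubric-labeled cases, but the evaluation loop keeps the same shape.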

Opportunities: Towards More Reliable and Trustworthy AI

Developing More Sophisticated Metrics: Future research can focus on developing more sophisticated holistic metrics and evaluation frameworks that can effectively capture the multifaceted nature of AI capabilities and societal impact. This may involve integrating insights from various disciplines, including ethics, social sciences, and human-computer interaction. As new AI paradigms emerge, such as more advanced reasoning models, agents, and multimodal systems, the development of appropriate evaluation methods will be crucial for guiding their progress and ensuring their responsible development and deployment.

Creation of Standardized, Shareable, and Living Benchmarks: Efforts to create standardized, shareable, and continuously updated ("living") benchmarks that can adapt to the rapidly evolving capabilities of AI models would be highly valuable. This could facilitate more consistent and meaningful comparisons across different research efforts. Future evaluation efforts should increasingly focus on assessing the performance of AI systems in real-world deployment scenarios and measuring their actual impact on users and society. This will require developing new methodologies for in-the-wild evaluation and impact assessment.

Integration of AI-Assisted Evaluation Tools and Platforms: Further development and adoption of AI-assisted evaluation tools and platforms can help to automate and scale certain aspects of the evaluation process, making it more efficient and cost-effective. Moreover, research can explore best practices for effective collaboration between humans and AI in the evaluation process, leveraging the strengths of both to achieve more comprehensive and reliable assessments.

Conclusion

In conclusion, the field of AI research benchmarks and model evaluation is undergoing a significant transformation. Driven by the limitations of traditional methods and the increasing demands of real-world AI deployments, researchers and practitioners are developing increasingly advanced approaches. These include holistic evaluation frameworks, application-specific benchmarks, dynamic and adversarial testing, a renewed emphasis on human-centric evaluation, and the innovative use of synthetic data. While significant challenges remain, particularly in defining quality and ensuring widespread adoption of new benchmarks, this evolution signifies a crucial step towards a more mature and nuanced understanding of AI capabilities and limitations, ultimately paving the way for more reliable, trustworthy, and impactful AI systems.