Several factors appear to be driving innovation and breakthroughs in the field of AI, as highlighted by the interviewees:

  • Advancements in Foundational Models and Architectures: The development of more powerful foundational models, such as large language models (LLMs) and diffusion models, is a significant driver. The improved capabilities of these models are enabling research in areas previously considered intractable. For instance, the progress in LLMs has shifted the focus of the ScreenAI team towards adapting these models for screen understanding. AI21 Labs emphasizes its unique architecture, Jamba, a combination of Mamba and Transformer, as a key differentiator enabling longer context windows. The development of the Transformer architecture itself was a breakthrough achieved by a small team within Google Research.

  • Focus on Data Quality and Availability: Interviewees repeatedly stressed the importance of high-quality and diverse data for training effective AI models. The shift from a focus on data quantity to data quality is a noted trend. However, defining and acquiring such data remains a significant challenge. One researcher mentioned that a lack of sufficient high-quality datasets is a major bottleneck in achieving better research results. Researchers are employing creative methods to obtain data, including collaborations with hospitals and leveraging internal datasets. The use of synthetic data generation, particularly for areas like 3D and for training smaller inference models, is also an evolving area of innovation (a code sketch of this pattern follows this list).

  • Increased Computational Power: Access to and the cost of computational resources are crucial factors influencing the pace of research. Training and running inference on large models require significant compute, making it more expensive to conduct research. Overcoming compute limitations and developing more efficient training methods are areas of ongoing effort.

  • Addressing Specific User Needs and Use Cases: Innovation is often driven by the need to solve real-world problems and address specific user requirements. For example, the development of StreamVC at Google was motivated by the need for on-device, real-time voice conversion with feature parity to existing applications. AI21 Labs focuses on task-specific models to meet enterprise needs for functionalities like summarization. The AI-powered patching project originated from the memory safety team's need to address a backlog of bug fixes.

  • Collaboration and Knowledge Sharing: Collaboration within research teams and sometimes with external institutions is essential for driving progress. However, navigating collaborations with external organizations can involve hurdles. The open-source AI landscape also plays a role, though academia may face disadvantages compared to well-funded industry labs.

  • Focus on Evaluation and Benchmarking: The recognition that current evaluation methods are a significant bottleneck is driving innovation in this area. There is a push to develop more robust and relevant evaluation metrics to bridge the gap between research and production. The process of gaining buy-in for new benchmarks often involves using public datasets to ensure credibility within the research community.
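
As a concrete illustration of the synthetic-data point above, the snippet below sketches the common pattern of using a large "teacher" model to generate labeled examples for training a smaller inference model. This is a minimal sketch: the GPT-2 teacher, the prompt format, and the filtering step are illustrative assumptions, not the interviewees' actual setup.

```python
# Minimal sketch: a large "teacher" model generates synthetic training
# examples for a smaller "student" model. GPT-2 and the prompt format
# are illustrative stand-ins only.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")

seed_topics = ["photosynthesis", "gravity"]
synthetic_examples = []
for topic in seed_topics:
    prompt = f"Write one question and answer about {topic}.\nQ:"
    generated = teacher(prompt, max_new_tokens=40, do_sample=True)
    synthetic_examples.append({"topic": topic, "text": generated[0]["generated_text"]})

# In practice, the generated examples would be filtered for quality before
# being used as training data for the smaller inference model.
print(synthetic_examples[0]["text"])
```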

Essential Datasets Powering AI Progress:

The advancements in AI research and capabilities are intrinsically linked to the availability and quality of training data. This report dedicates a section to highlighting notable datasets because they represent a critical infrastructure underpinning progress across diverse AI domains. These datasets, often domain-specific and meticulously curated, enable the development and benchmarking of new models, drive innovation in algorithmic design, and ultimately determine the scope and limitations of AI applications. Understanding the landscape of these datasets is therefore essential for assessing the current state and future trajectory of AI research.

I. Natural Language Processing (NLP) Datasets

NLP datasets are essential for training models that can understand, interpret, and generate human language.

  1. The Pile: This massive 800GB dataset of diverse text is a go-to resource for pre-training large language models, thanks to its size and wide range of content.

  2. SQuAD (100,000+ Questions for Machine Comprehension of Text): A long-standing benchmark dataset for question answering, SQuAD helps researchers evaluate how well models can understand and answer questions based on provided text.

  3. NQ (Natural Questions) dataset: This dataset focuses on reading comprehension and information consolidation, often used as a foundation for more complex, open-domain question answering systems.

  4. HotpotQA: Designed to test multi-hop reasoning, HotpotQA challenges models to synthesize information from multiple sources to answer questions.

  5. CNN/DM dataset: Paired with human-written summaries, this dataset of news articles from CNN and the Daily Mail is a standard resource for training and evaluating news summarization models.

  6. WikiQA: A key resource for open-domain question answering research.

  7. PAQ (Probably Asked Questions) dataset: A large collection of automatically generated question-answer pairs, PAQ has been used in the creation of other datasets and is valuable for NLP research related to question generation.

  8. WikiText-103: A collection of over 100 million tokens drawn from verified Wikipedia articles, used for pre-training language models.

  9. TBC (Toronto Book Corpus) dataset: Another resource used for pre-training language models.
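
To make the role of these resources concrete, the snippet below sketches how two of them can be loaded with the Hugging Face `datasets` library. The Hub identifiers ("squad" and "wikitext"/"wikitext-103-raw-v1") are the commonly used ones, but treat them as assumptions here rather than canonical references.

```python
# A minimal sketch of loading two of the NLP datasets above with the
# Hugging Face `datasets` library; the Hub IDs are assumptions.
from datasets import load_dataset

# SQuAD: ~100k question-answer pairs grounded in Wikipedia paragraphs.
squad = load_dataset("squad", split="validation")
print(squad[0]["question"], squad[0]["answers"]["text"])

# WikiText-103: long-form Wikipedia text for language-model pre-training.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(wikitext[0]["text"][:200])
```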

II. Computer Vision Datasets

Computer vision datasets enable AI models to "see" and interpret the visual world, driving advancements in image recognition, object detection, and more.

  1. ImageNet (and its variants ImageNet-Sketch and ImageNet-R, commonly used for robustness evaluation): A foundational dataset in computer vision, ImageNet's scale makes it ideal for training and benchmarking image classification models.

  2. HierText: Averaging over 100 words per image, this dataset is crucial for research on hierarchical text spotting, enabling AI to understand text within complex visual scenes.

  3. Total-Text: This dataset supports the development of advanced scene text detection and recognition systems.

  4. PubLayNet: The largest dataset ever assembled for document layout analysis, PubLayNet is a key resource for document understanding within computer vision.

  5. AVA Active Speaker dataset: This dataset is valuable for training models that can identify active speakers in video.

  6. KITTI dataset: Essential for research combining computer vision with robotics and autonomous systems.

  7. LVIS: This dataset is valuable for large vocabulary instance segmentation.

  8. Open Images: A broad source of images, suitable for self-labeling experiments.

  9. The Oxford-IIIT Pet Dataset: A collection of cat and dog images spanning 37 breeds, designed for image classification.

  10. Dataset for the Challenge on Learned Image Compression 2020 (CLIC): This dataset is used for research on image compression.
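
As an example of how such datasets enter a training pipeline, the sketch below loads the Oxford-IIIT Pet Dataset through torchvision and applies a standard preprocessing transform. The 224x224 resize reflects the input size many ImageNet-trained classifiers expect; the exact setup is illustrative, not prescriptive.

```python
# A minimal sketch of loading a vision dataset via torchvision.
import torchvision
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # input size expected by many ImageNet-trained models
    transforms.ToTensor(),
])

# Oxford-IIIT Pet: cat and dog images across 37 breeds.
pets = torchvision.datasets.OxfordIIITPet(
    root="data", split="trainval", transform=transform, download=True
)
image, label = pets[0]
print(image.shape, pets.classes[label])
```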

III. Document Understanding Datasets

Document understanding datasets are used to train AI to process and extract information from various types of documents.

  1. ScreenQA: A large-scale dataset of question-answer pairs over mobile app screenshots, ScreenQA helps AI models understand the content and structure of mobile app interfaces.

  2. RICO dataset: The largest publicly available collection of mobile app screenshots, RICO is crucial for understanding mobile app interfaces.

  3. DocVQA: This dataset drives research in Visual Question Answering (VQA) on document images, enabling AI to answer questions based on the content of visual documents.

  4. FUNSD: This dataset is relevant to understanding forms in scanned documents.

  5. CORD: This dataset focuses on parsing information from receipts after Optical Character Recognition (OCR).

  6. LabelDroid: A dataset of UI screenshots from top-downloaded Android apps.

  7. ERICA: A foundational dataset in the field of UI understanding.
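
To illustrate what these document-understanding corpora contain, the sketch below inspects a FUNSD example via a community mirror on the Hugging Face Hub. Both the Hub ID "nielsr/funsd" and the field names are assumptions based on that mirror, not the dataset's official distribution.

```python
# A minimal sketch of inspecting a form-understanding example; the Hub ID
# "nielsr/funsd" (a community mirror of FUNSD) is an assumption.
from datasets import load_dataset

funsd = load_dataset("nielsr/funsd", split="train")
example = funsd[0]

# Each example pairs OCR'd words with bounding boxes and integer-encoded
# entity tags (question, answer, header, other) for form understanding.
print(example["words"][:10])
print(example["ner_tags"][:10])
```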

IV. Biomedical/Healthcare Datasets

Biomedical and healthcare datasets are revolutionizing medicine, enabling AI to diagnose diseases, personalize treatments, and accelerate drug discovery.

  1. MultiMedBench: This benchmark comprises open-source, de-identified datasets for various biomedical tasks, facilitating the evaluation of AI models in the medical domain.

  2. MultiMedQA: A collection of multiple-choice medical question-answering datasets, MultiMedQA is designed to assess the medical knowledge of AI models.

  3. MIMIC-CXR and MIMIC-CXR-JPG datasets: These large datasets of chest radiographs with free-text radiology reports are used for radiology report generation.

  4. SCIN dataset: A newly created dataset focused on skin conditions, SCIN addresses issues of representation and data quality in dermatological AI.

  5. MIMIC-III: A large, publicly available medical database containing intensive care unit records.

  6. TCGA (The Cancer Genome Atlas) whole-slide data and labels: Crucial for research in computational pathology, specifically for cancer genomics.

  7. DHMC LUAD whole-slide data and labels: Relevant for lung adenocarcinoma histologic pattern classification in pathology.

  8. SICAP whole-slide and tile data with corresponding labels: Used for Gleason pattern classification in prostate cancer.

  9. CheXpert: A chest radiograph dataset whose automated labeler is used to annotate the reports in MIMIC-CXR.

  10. CRC100k tile data and labels: Used for colorectal cancer tissue classification.

  11. WSSS4LUAD image tiles and labels: Relevant for weakly supervised semantic segmentation in lung adenocarcinoma.

  12. EBRAINS WSIs (Whole Slide Images): Valuable for research on brain tumors using histopathology images.

  13. AGGC (Automated Gleason Grading Challenge) WSIs: Relevant for prostate cancer Gleason grading research using whole slide imaging.

  14. PANDA (Prostate cANcer graDe Assessment) WSIs: Crucial for prostate cancer diagnosis using AI on whole slide images.

  15. Dataset of image–caption pairs from educational resources and PubMed: Relevant for developing foundational models for medical image understanding and report generation.

  16. Source A: Relevant for tasks combining medical images and text.

  17. Internal pathology images, corresponding reports and electronic records from Mass General Brigham: Highlighting the use of real-world clinical data in medical AI research.

  18. Dataset of clinically generated visual questions and answers about radiology images: Relevant for visual question answering in the medical imaging domain.

  19. Fitzpatrick 17k Dataset: Used for evaluating deep neural networks trained on clinical images in dermatology.

  20. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones.

  21. Additional dermatology research datasets: Including those for melanoma detection and general skin disease diagnosis.
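
Since several of the resources above (e.g., MultiMedQA) are multiple-choice question-answering suites, the snippet below sketches how such benchmarks are typically scored. The item and the `answer_question` stub are hypothetical placeholders, not actual benchmark content.

```python
# A minimal, self-contained sketch of multiple-choice medical QA scoring;
# the item and model stub below are hypothetical placeholders.
items = [
    {
        "question": "Deficiency of which vitamin causes scurvy?",
        "options": ["A: Vitamin B12", "B: Vitamin C", "C: Vitamin D", "D: Vitamin K"],
        "answer": "B",
    },
    # ... more items ...
]

def answer_question(question: str, options: list[str]) -> str:
    """Stand-in for a model call; a real system would query an LLM here."""
    return "B"

correct = sum(
    answer_question(item["question"], item["options"]) == item["answer"]
    for item in items
)
print(f"accuracy: {correct / len(items):.2%}")
```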

V. Code Analysis Datasets

Code analysis datasets are used to train AI to understand, generate, and analyze source code.

  1. CodeQueries dataset: Its specific focus on semantic code understanding and question answering makes it highly relevant to the field of AI for code analysis.

  2. ETH Py150 Open dataset: Used as the source of Python code for the CodeQueries dataset.

  3. CodeQL queries benchmark: A standard suite of queries used in preparing the CodeQueries dataset.
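
To give a feel for what a semantic code-understanding dataset pairs together, here is a hypothetical example in the spirit of CodeQueries: a named query, a code snippet, and the answer spans within it. The field names are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical example in the spirit of a code-QA dataset: a query,
# a code snippet, and answer spans. Field names are assumptions.
example = {
    "query": "Unused import",
    "code": "import os\nimport sys\n\nprint(sys.argv)\n",
    "answer_spans": [{"start_line": 1, "end_line": 1}],  # `import os` is unused
}

# Recover the flagged line from the span (1-indexed line numbers assumed).
span = example["answer_spans"][0]
print(example["code"].splitlines()[span["start_line"] - 1])
```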

VI. Table Understanding Datasets

Table understanding datasets are used to train AI to process and extract information from structured tables.

  1. WikiTQ dataset: Its role as a benchmark for table-based question answering makes it highly relevant to research in this area.

  2. Binder: A method used as a baseline for comparison on the WikiTQ dataset.

  3. Dater: A method used as a baseline for comparison on the WikiTQ dataset.
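
The sketch below loads WikiTQ via the Hugging Face Hub to show the shape of a table-QA example: a natural-language question, a source table, and gold answers. The Hub ID "wikitablequestions" and the field names are assumptions.

```python
# A minimal sketch of inspecting a WikiTQ example; the Hub ID and field
# names are assumptions.
from datasets import load_dataset

wikitq = load_dataset("wikitablequestions", split="validation")
example = wikitq[0]

print(example["question"])          # natural-language question
print(example["table"]["header"])   # column names of the source table
print(example["answers"])           # gold answer strings
```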

VII. Benchmarking/Evaluation Datasets

Many datasets serve as benchmarks for evaluating AI models.

  • SQuAD, NQ, WikiQA (for Question Answering)

  • ImageNet (for Image Classification)

  • WikiTQ (for Table Question Answering)

  • MultiMedBench, the MultiMedQA suite (for Biomedical AI)

  • ScreenQA (for UI Understanding)
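
Benchmarks like these are usually consumed through standard metric implementations. As one hedged example, the snippet below scores a made-up prediction against a reference in SQuAD format using the Hugging Face `evaluate` library.

```python
# A minimal sketch of benchmark-style scoring with the `evaluate` library;
# the prediction and reference below are made-up examples in SQuAD format.
import evaluate

squad_metric = evaluate.load("squad")
predictions = [{"id": "q1", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "q1",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}
```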

VIII. Federated Learning Datasets

  1. Stack Overflow federated dataset: This dataset supports research in distributed and privacy-preserving machine learning.
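
The federated setting is easiest to see in code: the data is partitioned by client (here, Stack Overflow users), and each client's examples are accessed separately. The sketch below uses the simulation API that TensorFlow Federated documents for this dataset.

```python
# A minimal sketch of loading the federated Stack Overflow dataset with
# TensorFlow Federated's simulation API.
import tensorflow_federated as tff

train, held_out, test = tff.simulation.datasets.stackoverflow.load_data()

# Data is keyed by client (user), matching the federated setting in which
# each client's examples stay local.
client_id = train.client_ids[0]
client_dataset = train.create_tf_dataset_for_client(client_id)
print(client_id, client_dataset.element_spec)
```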

IX. Other Datasets

  1. OSM (OpenStreetMap) graph files (Washington and Baden-Württemberg): Road-network graphs used in experiments.

  2. Facebook Post Reactions dataset.