Data Acquisition: Why It Still Matters
The appetite of modern AI models, particularly large language models (LLMs) and other foundation models, for data is a well-documented phenomenon. However, acquiring the right data, meaning data that is high quality, relevant, and ethically sourced, remains fraught with challenges despite rapid advances in AI technology. Our interviews reveal a world of AI research that is pushing boundaries yet still struggling with these fundamental building blocks.
Enduring Pain Points in Data Acquisition
The medical field starkly illustrates the enduring difficulties of data acquisition. Researchers working on cutting-edge biomedical AI models consistently cite the lack of suitable datasets as a major impediment. As one interviewee put it, "Not having enough data" is the "biggest bottleneck in achieving even better research results." The scarcity is not just about volume but also about the specific types of data needed. For instance, the same researcher lamented that "MRI datasets just don't exist", pointing to the difficulty of obtaining the specialized medical imaging data needed to train advanced diagnostic models.
Furthermore, the privacy and regulatory landscape surrounding medical data adds another layer of complexity. Hospitals, as potential sources of this invaluable data, are often hesitant to share it with large technology companies due to reputational risks. According to the interview, "Hospitals actually don't want to give their data to a big tech company because it's a bad look," creating significant hurdles for researchers at larger tech companies. Even when data is available, the cost can be prohibitive, with hospitals sometimes selling datasets "for millions of dollars," a budget inaccessible to smaller research teams. This environment necessitates creative approaches to data acquisition and highlights the critical need for partnerships and collaborations with institutions while navigating stringent privacy regulations.
Synthetic Data: A Game-Changer for Robotics and 3D Modeling
In response to the limitations of real-world data, synthetic data generation has emerged as a promising avenue, particularly in fields where realistic simulation is feasible. For example, one research team generated synthetic images with a video generation model to teach an AI object permanence and how to model 3D object rotations. Capturing such data in the real world would otherwise require expensive equipment, setup costs, and extra time. Thus, synthetic data generation, though it lacks an explicit physics model of the world, can alleviate many data acquisition pain points for 3D AI researchers.
The benefits of synthetic data are particularly evident in areas like 3D modeling and robotics, where creating diverse, well-annotated datasets through physical collection can be expensive and time-consuming. The PhoMoH research paper, for example, demonstrated the generation of 450K training examples from only 1,000 images. The ability of LLMs to generate synthetic datasets for training smaller, cheaper, fine-tuned models for inference is another significant development. As one leading AI researcher noted, this trend suggests a future where LLMs might be more valuable for their data generation capabilities than for direct inference in certain cost-sensitive applications. That vision is also more equitable: research teams of any size or budget could use large foundation models to generate training datasets for custom models.
One paper, "Observations on Synthetic Image Distributions with Stable Diffusion," explores the potential and limitations of using synthetic image data for model training. While this research found that synthetic data does not yet scale as effectively as real data for supervised classification tasks like ImageNet, it also highlights several scenarios where synthetic data proves advantageous. These include better scaling behavior in certain classes, effectiveness when real data is scarce (as in CLIP training with limited datasets), and the potential for models trained on synthetic data to exhibit superior generalization to out-of-distribution data.
However, the quality of synthetic data remains a critical concern. One interviewee described synthetic data as "junk food": it can create a false sense of security if not carefully evaluated. Another researcher noted that inspecting a model’s latent embedding space showed that, when trained on too many synthetic images of a chair, the model’s ‘mental model’ of a chair shifted. The analogy: eat enough fast-food pizza and, over time, you come to believe that fast-food pizza is the epitome of ‘pizza’ as a concept.
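One hedged way to quantify the kind of drift described above is to compare the centroid of a concept's embeddings before and after synthetic examples are added to the training set. The sketch below uses toy 2-D vectors in place of real model embeddings; all data and thresholds are illustrative, not drawn from the interviews.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]

def drift(real, synthetic):
    """Euclidean distance between the real-only centroid and the
    centroid of the combined (real + synthetic) set."""
    a = centroid(real)
    b = centroid(real + synthetic)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy 'chair' embeddings: the synthetic set is systematically offset,
# so adding it pulls the concept centroid away from the real data.
real_chairs = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
synthetic_chairs = [[3.0, 3.0], [3.2, 2.8]]
print(round(drift(real_chairs, synthetic_chairs), 3))
```

A large drift value would be one signal that the model's internal notion of the concept is being reshaped by the synthetic data rather than enriched by it.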
These challenges underscore the need for ongoing research to improve the realism and fidelity of synthetic data.
Other Trends Beyond Synthetic Data
Despite the allure of synthetic data, the overarching trends in data acquisition underscore the primacy of data quality, the growing demand for customized datasets, and the rising importance of data flywheels.
Quality over Quantity: The historical emphasis on data volume is rapidly shifting toward a focus on data quality. Interviewees emphasize that "Quality is the big obstacle... [You need] a lot of high quality data... there’s no shortcut". While data cleaning and feature engineering have traditionally addressed quality concerns, evaluating data quality in the context of generative modeling is less straightforward. The lack of consensus frameworks and standardized metrics for defining "quality" remains an open problem, as several research papers point out. As one researcher noted in her interview, "Quality is important but nobody knows how to define quality -- and who should define quality?" This ambiguity calls for a more nuanced understanding of data quality, often requiring ablation studies, or even A/B tests with end users, to determine whether a dataset truly improves model performance.
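The shape of such an ablation study can be sketched in a few lines: train the same model with and without the candidate dataset and compare held-out metrics. The sketch below stands in a trivial majority-class "model" for real training; all datasets and function names are illustrative assumptions, not any specific team's pipeline.

```python
def evaluate(train_data, test_data):
    """Toy stand-in for 'train a model, then score it on held-out data'.
    The 'model' simply predicts the majority class seen in training."""
    if not train_data:
        return 0.0
    counts = {}
    for _, label in train_data:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    correct = sum(1 for _, label in test_data if label == majority)
    return correct / len(test_data)

def ablation(base_data, candidate_data, test_data):
    """Held-out accuracy with the candidate dataset minus accuracy
    without it; a positive delta suggests the data actually helps."""
    without = evaluate(base_data, test_data)
    with_candidate = evaluate(base_data + candidate_data, test_data)
    return with_candidate - without

# Illustrative run: the candidate data shifts the majority class toward
# the one that dominates the test set, so the delta comes out positive.
base = [("x1", "a"), ("x2", "a"), ("x3", "b")]
candidate = [("x4", "b"), ("x5", "b"), ("x6", "b")]
test = [("t1", "b"), ("t2", "b"), ("t3", "a")]
print(ablation(base, candidate, test))
```

The same with/without comparison generalizes directly to A/B tests, with end-user metrics replacing held-out accuracy.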
Niche Datasets: The need for datasets tailored to specific tasks and domains is becoming increasingly crucial. Generic, broadly scraped data may not adequately address the nuances of specialized applications. This is evident in the medical domain, where the distribution of conditions in general datasets often differs significantly from the target population for a specific diagnostic tool. Similarly, for tasks like voice conversion, the need for "diverse high quality data" is a significant bottleneck. Creating custom crawlers that target specific data based on URL patterns, or engaging expert networks to acquire niche, high-quality data, are becoming essential strategies.
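The URL-pattern targeting mentioned above amounts to a filter on the crawl frontier. The minimal sketch below shows the idea; the regexes, domains, and URLs are invented for illustration and are not taken from any real crawler described in the interviews.

```python
import re

# Hypothetical patterns for a niche medical-imaging crawl: keep case
# reports, DICOM files, and radiology pages; skip everything else.
PATTERNS = [
    re.compile(r"/case-reports?/"),
    re.compile(r"\.dicom$"),
    re.compile(r"radiology"),
]

def is_target(url: str) -> bool:
    """True if the URL matches at least one niche pattern."""
    return any(p.search(url) for p in PATTERNS)

frontier = [
    "https://example.org/radiology/atlas",
    "https://example.org/blog/company-news",
    "https://example.org/case-reports/1234",
]
targets = [u for u in frontier if is_target(u)]
print(targets)
```

In a real crawler this predicate would sit in front of the fetch queue, so bandwidth and storage are spent only on pages likely to contain the niche data.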
Rise of Customized Data Acquisition Strategies: Generic datasets are often insufficient for specialized AI applications, driving the need for customized data acquisition strategies. These strategies include creating niche-targeted web crawlers, leveraging expert networks for access to specialized knowledge, forming partnerships with academic institutions and hospitals (while navigating legal frameworks), and even crowdsourcing data through innovative methods like online advertising.
Data Flywheels: The concept of a "data flywheel," where model performance is used to iteratively refine the data acquisition and preparation processes, is gaining traction. This feedback loop, exemplified by Gemini's quality filtering based on model predictions of URL usefulness, allows for a more dynamic and targeted approach to data acquisition. By continuously evaluating the impact of data on model performance, organizations can optimize their data pipelines and focus on acquiring data that yields the most significant improvements. This iterative process helps to move beyond simply acquiring large volumes of data towards a more strategic and quality-driven approach.
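One turn of such a flywheel can be sketched as: score candidate documents with a quality model, keep the top slice for the next training run, and repeat. The scoring function below is a deliberately crude stand-in (word count penalized by spammy punctuation) for a learned usefulness predictor; the corpus and the keep fraction are illustrative, not Gemini's actual pipeline.

```python
def quality_score(doc: str) -> float:
    """Stand-in for a learned quality classifier: longer, less spammy
    text scores higher. A real flywheel would use model predictions."""
    return len(doc.split()) / (1 + doc.count("!!!"))

def flywheel_round(candidates, keep_fraction=0.5):
    """One turn of the flywheel: rank candidates by predicted quality
    and keep the best fraction for the next training run."""
    ranked = sorted(candidates, key=quality_score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

corpus = [
    "A careful explanation of MRI contrast mechanisms in several sentences.",
    "BUY NOW!!! CLICK!!!",
    "Short note.",
    "Detailed protocol for annotating chest X-rays with inter-rater checks.",
]
kept = flywheel_round(corpus, keep_fraction=0.5)
print(kept)
```

The loop closes when the model retrained on `kept` is used to re-score the next batch of candidates, which is what makes the process a flywheel rather than a one-off filter.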
Trends towards and against Open-Source and Academia: While the AI research community often values open-sourcing datasets and models, private industry, particularly large tech companies, often holds back its most cutting-edge work because of competitive advantage and production use. This can create a "lagging effect" in publicly available research and potentially stifle innovation in academia and smaller organizations due to limited access to data and compute.
Compute as a Bottleneck Intertwined with Data: Access to sufficient compute power is a significant bottleneck for training large AI models, even within large companies. This is intrinsically linked to data, as larger datasets and more complex models require more computational resources.
Other Challenges in Data for AI
Persistent Scarcity and Specificity: Despite the vast amounts of data being generated, there is a persistent scarcity of high-quality, domain-specific data for training advanced AI models, particularly in specialized fields like medical imaging. The need is not just for volume but also for data that accurately represents the target use case and population. The distribution of conditions in general medical datasets, for example, may not align with the needs of a specific diagnostic tool. Obtaining data for less common tasks or modalities (like high-quality hair scans for 3D modeling or diverse multi-answer question datasets) remains a significant hurdle.
Double-Edged Sword of Synthetic Data: Current synthetic data techniques have limitations in realism and may not scale as effectively as real data for all tasks, particularly supervised classification on complex datasets like ImageNet. One perspective is that synthetic data can be "junk food" if data quality isn't carefully considered. The effectiveness of synthetic data is highly dependent on the domain and the fidelity with which the real world can be simulated.
Privacy, Legal, and Ethical Constraints: Privacy regulations and ethical concerns significantly complicate data acquisition, particularly for sensitive data like medical records. Hospitals are often reluctant to share data with large tech companies due to reputational and regulatory risks. Legal hurdles can prevent the use of valuable internal datasets, even if they could improve model quality. De-identification processes for sensitive data can be complex and manual. Bias in training data is a major concern, and while it's often inevitable to have some human bias, downstream models need careful design and evaluation on diverse populations to avoid perpetuating harmful biases.
Challenges in Data Tooling and Infrastructure: Despite advancements in AI models, the tools and infrastructure used by data practitioners sometimes lag behind. Basic tools like spreadsheets and Python notebooks are still prevalent. There's a demand for better sensemaking tools for dataset analysis that are not being widely adopted.
Conclusion
Data acquisition in the age of AI is a complex landscape marked by persistent challenges, particularly in specialized domains like medical research. While synthetic data offers a promising alternative in certain areas, the focus on data quality, the need for customized datasets, and the implementation of data flywheels are key trends shaping how organizations approach this critical aspect of AI development. Overcoming the bottlenecks in data acquisition will require not only technological advances but also strategic partnerships, innovative sourcing methods, and a clearer understanding of what constitutes high-quality data for specific AI applications.