AI has placed unprecedented demands on the underlying infrastructure and compute resources. This document outlines the key pain points, major shifts, remaining challenges, and future opportunities in the domain of AI infrastructure and compute.
Pain Points
The journey of AI from research to real-world application is often impeded by significant bottlenecks in infrastructure and compute. Several key pain points have been identified:
Limited Access to Sufficient Compute: A fundamental challenge, particularly for academic institutions and smaller research teams, is the lack of access to the massive computational power required to train and run state-of-the-art AI models. At Stanford HAI at Five, Fei-Fei Li noted the difficulties universities face in obtaining GPUs. A biomedical AI researcher observed that AI research is now "more expensive because the compute power needed to run inference and train these models is higher." This disparity in access can stall progress and limit the diversity of research contributions.
High Cost for Large Models: Even when access is available, the cost of the necessary compute resources can be prohibitive. Training large models often requires significant investment in cloud computing services or dedicated hardware. One researcher highlighted that while "you can fine-tune Gemini," it is "much too expensive to serve."
Inefficient Use of Compute: Simply having access to powerful hardware is not enough; efficient utilization is crucial. One researcher pointed out, "It's one thing to get GPUs, it's another to use them effectively for a problem." He further elaborated, "You can write a software very inefficiently or very efficiently. If you write your software effectively, you can 10x your GPU efficiency," emphasizing the need for software innovations alongside hardware acquisition.
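The "10x from software" point can be illustrated with a toy cost model. The sketch below uses hypothetical timings (not measurements from any real GPU) to show why amortizing fixed per-call launch overhead via batching is one of the simplest software-side wins:

```python
# Illustrative cost model with assumed numbers: the same workload run with
# one kernel call per sample vs. batched execution. Fixed per-call overhead
# dominates at batch size 1, so batching raises effective hardware utilization.

def throughput(n_samples: int, batch_size: int,
               overhead_s: float = 1e-3, per_sample_s: float = 1e-4) -> float:
    """Samples processed per second under a fixed-overhead-per-call model."""
    n_calls = -(-n_samples // batch_size)  # ceiling division
    total_s = n_calls * overhead_s + n_samples * per_sample_s
    return n_samples / total_s

naive = throughput(10_000, batch_size=1)     # one call per sample
batched = throughput(10_000, batch_size=64)  # overhead amortized over 64 samples

print(f"naive:   {naive:,.0f} samples/s")
print(f"batched: {batched:,.0f} samples/s")
print(f"speedup: {batched / naive:.1f}x")    # → speedup: 9.5x
```

Under these assumed numbers, batching alone yields roughly a 9.5x throughput gain, the same order of improvement the researcher describes, purely from how the software issues work to the hardware.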
Operating Large-Scale Compute Clusters: For organizations that do possess large compute clusters, operating and maintaining them presents its own set of challenges. Harshdeep Banwait, Director of Product at CoreWeave, explained: "The challenge is how do you operate a large, super computer for a long time. How is it maintained? Fixed? Updated? For traditional cloud computing, certain aspects might fall apart. But new foundational model structure relies on the entire nodes to work. One node can cause the entire training to halt."
Energy Availability in Concentrated Areas: The massive compute required for AI necessitates large data centers, which consume vast amounts of energy. Banwait identified this as a critical pain point: "All of these data centers that we need to install these GPUs and large clusters - they require tens of megawatts of energy. There are not enough data center spaces in the US or World that can offer that amount of energy. The world just isn't ready for the amount of energy needed in concentrated areas to enable this." Like many other researchers and startups, he named this as one of the biggest bottlenecks to further foundation model improvements.
Trends: Democratizing Access to Compute
The AI landscape has witnessed significant shifts in how infrastructure and compute are approached:
Emergence of Specialized Cloud Infrastructure: Traditional cloud computing platforms are being augmented and, in some cases, challenged by providers specializing in GPU-accelerated workloads. One provider explained, "We are priced like a traditional cloud provider - you pay us per hour. We help customers get access to large scale compute instantaneously." This shift acknowledges the unique demands of AI and machine learning.
Focus on Efficient Interconnects and Networking: As model sizes and training data volumes grow, the need for high-bandwidth, low-latency communication between compute nodes has become paramount. CoreWeave explains their efforts in this space: "We designed our own backbone and clusters in a way so that these GPUs are able to talk to each other. We use Nvidia Infiniband networking so that these GPUs work as one super brain. Even though they are individual components, they communicate through an interconnected fabric so they can act as brain cells." This emphasis on specialized networking is crucial for scaling AI training effectively.
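The "one super brain" behavior rests on collective communication primitives such as all-reduce, which production clusters run over libraries like NCCL on InfiniBand fabrics. As an illustration only (a single-process simulation, not vendor code), here is the classic ring all-reduce in plain Python:

```python
def ring_all_reduce(node_grads):
    """Simulate ring all-reduce: every node ends with the element-wise sum.

    Runs in two phases, as on real interconnects:
      1. reduce-scatter: each node accumulates one fully summed chunk
      2. all-gather: completed chunks circulate until every node has all
    """
    n = len(node_grads)
    dim = len(node_grads[0])
    assert dim % n == 0, "illustration assumes length divisible by node count"
    chunk = dim // n
    grads = [list(g) for g in node_grads]  # one local copy per node

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) mod n to
    # its ring neighbor, which accumulates it into its own copy.
    for s in range(n - 1):
        outgoing = [grads[i][sl((i - s) % n)] for i in range(n)]
        for i in range(n):
            dst, c = (i + 1) % n, (i - s) % n
            for k in range(chunk):
                grads[dst][c * chunk + k] += outgoing[i][k]

    # Phase 2: all-gather. At step s, node i forwards its completed chunk
    # (i + 1 - s) mod n, which overwrites the neighbor's stale copy.
    for s in range(n - 1):
        outgoing = [grads[i][sl((i + 1 - s) % n)] for i in range(n)]
        for i in range(n):
            dst, c = (i + 1) % n, (i + 1 - s) % n
            grads[dst][sl(c)] = outgoing[i]

    return grads

# Three simulated nodes, each holding a 6-element gradient
result = ring_all_reduce([[1, 2, 3, 4, 5, 6],
                          [10, 20, 30, 40, 50, 60],
                          [100, 200, 300, 400, 500, 600]])
print(result[0])  # → [111, 222, 333, 444, 555, 666]
```

Each node transmits only its share of the data per step, so total traffic per node stays near 2(n-1)/n times the gradient size regardless of cluster scale, which is precisely why the bandwidth of the fabric between chips matters so much.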
Public and Private Sector Initiatives for Compute Access: Recognizing the importance of broad access to AI compute, initiatives like the National Artificial Intelligence Research Resource (NAIRR) and state-level efforts like Empire AI are emerging. The NAIRR effort aims to establish a national AI cloud for the public sector, while the Empire AI initiative provides "$400M funding for compute infrastructure for universities in NY State." Both aim to democratize access and foster innovation in academia and the public sector.
Remaining Challenges
Inter-Chip Communication Bottlenecks: The speed and efficiency of communication between individual processing units has become a critical limitation. One researcher stated, "[The] bottleneck of compute is not necessarily compute itself but rather communication between the two chips." Overcoming these bottlenecks is essential for further scaling training and inference.
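A rough back-of-envelope calculation makes this bottleneck concrete. Every number below (model size, gradient precision, link speed, node count) is an illustrative assumption chosen only to show the shape of the problem, not a measurement:

```python
# Back-of-envelope: why inter-chip links, not FLOPs, can bound scaling.
# All figures are hypothetical assumptions for illustration.

params = 7e9            # assumed model size, parameters
bytes_per_grad = 2      # fp16 gradients
link_gbps = 400         # assumed per-node interconnect, gigabits/s
n_nodes = 8

grad_bytes = params * bytes_per_grad               # 14 GB of gradients per step
# Ring all-reduce traffic per node: 2 * (n-1)/n * message size
traffic = 2 * (n_nodes - 1) / n_nodes * grad_bytes
comm_s = traffic / (link_gbps / 8 * 1e9)           # link speed in bytes/s

print(f"per-step sync traffic: {traffic / 1e9:.1f} GB")  # → 24.5 GB
print(f"communication time:    {comm_s:.2f} s per step") # → 0.49 s
```

If a training step's compute finishes faster than this synchronization, the chips idle while waiting on the link, which is exactly the limitation the researcher describes and what motivates faster fabrics and techniques that overlap communication with compute.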
Sustainability and Energy Consumption: The environmental impact of the massive energy consumption of AI data centers remains a significant concern. Finding more sustainable and energy-efficient ways to power AI infrastructure is a crucial ongoing challenge, one that sits at odds with customer demand to deploy the largest and latest models.
Lack of Standardization: High-density compute environments often require advanced cooling solutions such as liquid cooling, but there is no industry standard for deploying it: "We think the market needs to go to liquid cooling a lot faster than what data centers are ready for... There's no set way or standard for liquid cooling that data center providers have adopted. Every site I need to work with the local data center provider based on what they're capable of doing." This gap hinders wider adoption and efficient deployment of liquid cooling technologies.
Effectively Utilizing Available Compute: Even with increased access, ensuring that researchers and practitioners can effectively utilize the available compute resources remains a challenge. This includes developing efficient algorithms, optimizing software, and having the necessary engineering expertise.
Future Opportunities
Development of More Energy-Efficient Architectures: Research into novel computing architectures beyond traditional CPUs and GPUs, including neuromorphic computing and specialized AI accelerators, holds the potential for significant gains in energy efficiency. As one researcher put it, "There are more energy efficient architectures than Transformers probably."
Advancements in Interconnect Technology: Breakthroughs in interconnect technologies, such as photonics, could alleviate the communication bottlenecks between compute units, enabling more efficient scaling of AI systems.
Advancement and Standardization of Data Center Cooling Techniques: Standardizing and improving the accessibility of liquid cooling and other advanced thermal management technologies will be crucial for deploying increasingly power-dense AI hardware.
Public-Private Partnerships for Democratizing Access: Continued and expanded collaborations between public and private sectors can help bridge the compute gap, providing researchers and smaller organizations with access to the resources they need. Siegel emphasized the importance of "private sector partnership too, not just federal funding" for initiatives like Empire AI.
Software Innovations for Efficient Resource Utilization: Further advancements in algorithms, software frameworks, and resource management tools will be critical for maximizing the efficiency of available compute resources. Siegel noted that the academic world needs to "go beyond just getting GPUs. Think about software innovations" to effectively utilize compute.
Conclusion
While significant progress has been made, persistent pain points and emerging challenges remain. Addressing these issues through technological innovation, strategic partnerships, and a focus on efficiency and sustainability will be crucial for realizing the full potential of AI across various domains.