Nearly every AI researcher and practitioner we interviewed cited compute as one of the main bottlenecks to further advances in AI foundation models. But does this mean we simply need better, faster GPUs, CPUs, and TPUs?
We find that the real bottlenecks lie not in the chips themselves, but in everything that surrounds them.
While the relentless pursuit of more powerful AI models continues, the infrastructure surrounding those powerful chips—energy, networking, installation, and cooling—often dictates the true pace of advancement. We provide an analysis of the trends and opportunities within this critical area.
Compute's Hidden Bottlenecks
Data Centers: Constrained by energy providers
One of the most immediate and impactful bottlenecks is securing sufficient energy. CoreWeave, for instance, worked with local authorities to procure data center space and designed the data center from the ground up.
Modern AI data centers are incredibly power-hungry, often requiring tens of megawatts. Harshdeep Banwait from CoreWeave highlighted this, stating, "There are not enough data center spaces globally that can offer the required tens of megawatts of energy, and cities are not always ready to supply such high energy demands." This isn't just about availability; it's about the process of securing energy contracts.
This process can be complex and time-consuming, often involving lengthy negotiations with utility providers for Power Purchase Agreements (PPAs). Utility interconnection is a crucial aspect of data center site selection, underscoring the importance of these energy agreements. What utilities will require before signing a PPA becomes a key question for those in the data center world, where negotiations and specific requirements are not publicly transparent. This scarcity of suitable locations, and the competition for energy resources, leads to often secretive dealings conducted through shell companies operating on behalf of large tech companies.
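A back-of-the-envelope estimate shows why "tens of megawatts" is the right order of magnitude. The cluster size, per-GPU draw, server overhead, and PUE below are illustrative assumptions, not figures for any specific site:

```python
# Rough facility power estimate for a large GPU cluster.
# All figures are illustrative assumptions, not vendor specs.

GPU_COUNT = 16_384        # assumed cluster size
GPU_TDP_W = 700           # assumed per-accelerator draw, watts
SERVER_OVERHEAD = 1.5     # CPUs, memory, networking per GPU (assumed multiplier)
PUE = 1.3                 # assumed power usage effectiveness (cooling, losses)

it_load_mw = GPU_COUNT * GPU_TDP_W * SERVER_OVERHEAD / 1e6
facility_mw = it_load_mw * PUE

print(f"IT load:       {it_load_mw:.1f} MW")
print(f"Facility load: {facility_mw:.1f} MW")
```

Under these assumptions the facility draws over 20 MW, which is roughly the continuous consumption of a small city, and explains why utility interconnection dominates site selection.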
Inter-Chip Communication is a Bottleneck
Even with sufficient power, the ability of chips to communicate with each other within a massive compute cluster presents a significant hurdle. Jensen Huang of Nvidia has emphasized that building software to distribute computation across thousands of GPUs, each itself containing thousands of cores, is a fundamental innovation. This requires incredibly high-bandwidth, low-latency interconnects.
Nvidia's NVLink Switch Chip, with its 50 billion transistors, is a direct response to this challenge. It allows every GPU to communicate with every other GPU at full speed. The scale is staggering: a DGX GB200 NVL72 system, considered "one giant GPU," supports an aggregate bandwidth greater than that of the entire internet and requires miles of high-performance cables. As one interviewee explains, "Bottleneck of compute is not necessarily compute itself but rather communication between the two chips...you reach the physical limits of the network." Even with the most powerful chips, the network connecting them can become the limiting factor.
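The arithmetic behind that scale is simple. Using the per-GPU NVLink bandwidth from Nvidia's public materials (treat the figures as approximate):

```python
# Aggregate NVLink bandwidth in an NVL72-class rack.
# Figures from Nvidia's public materials; treat as approximate.

GPUS = 72
NVLINK_PER_GPU_TBPS = 1.8   # bidirectional NVLink bandwidth per GPU, TB/s

aggregate_tbps = GPUS * NVLINK_PER_GPU_TBPS
print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.0f} TB/s")
```

Roughly 130 TB/s moving inside a single rack is what makes the "one giant GPU" framing plausible, and why the cabling alone runs to miles.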
People and Process Bottlenecks: Liquid Cooling, Installation, Maintenance
The process of actually setting up and running these data centers, or "AI factories" as Jensen Huang likes to call them, is often underestimated. Building data centers and deploying thousands of interconnected, power-hungry chips is a complex engineering feat, where specialized requirements and logistical planning become key.
Liquid cooling, in particular, is a major concern. Modern high-performance compute racks generate immense heat, making advanced cooling solutions essential. Liquid cooling is increasingly becoming the standard, but, as one researcher puts it, “We think the market needs to go to liquid cooling a lot faster than what data centers are ready for. We are trying to standardize our data centers with liquid cooling but the skills, the plumbing required - we don't think the world is ready for mass adoption of liquid cooling just yet. The technology is there theoretically but skills and standards gaps exist." This lack of standardization can lead to inefficiencies, higher costs, and potential downtime.
On the Horizon: Potential Solutions and Emerging Trends
Nuclear Fusion: While speculative, the potential of nuclear fusion to provide abundant, clean energy could be transformative. If nuclear fusion solves energy challenges, it could significantly impact the cost of training and serving AI models, where power is a major expense. Early internet pioneers like Bob Metcalfe are known to support nuclear fusion startups.
Floating Data Centers: These facilities, typically housed on barges or platforms in bodies of water like oceans, seas, or large lakes, leverage the surrounding water for natural cooling. Nautilus Data Technologies is a pioneer in this field, with its first water-borne data center commissioned in Stockton, California, on the San Joaquin River. However, challenges remain, including regulatory hurdles related to environmental impact on marine ecosystems, ensuring robust connectivity (using methods such as submarine cables), and addressing potential maintenance issues related to saltwater corrosion.
More Efficient Architectures: There's a growing recognition that current model architectures, particularly Transformers, may be inefficient. Christian Szegedy, a Research Scientist at xAI, suggests, "with transformers, we might be wasting a lot of hardware," and that more sparse, energy-efficient architectures are likely to exist. The observation that current models require a trillion parameters for simple tasks like "2+2" indicates significant room for optimization.
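The potential savings from sparsity can be sketched with illustrative numbers. In a mixture-of-experts-style model, only a fraction of parameters is active per token; the expert counts below are assumptions chosen for illustration, not a description of any specific model:

```python
# Dense vs sparse (mixture-of-experts-style) compute per token.
# Illustrative numbers only; real architectures vary widely.

TOTAL_PARAMS = 1_000_000_000_000   # 1T parameters, as in the "2+2" example
EXPERTS = 64                       # assumed number of experts
ACTIVE_EXPERTS = 2                 # assumed top-2 routing

dense_flops_per_token = 2 * TOTAL_PARAMS            # ~2 FLOPs per param per token
sparse_active_params = TOTAL_PARAMS * ACTIVE_EXPERTS // EXPERTS
sparse_flops_per_token = 2 * sparse_active_params

print(f"dense:   {dense_flops_per_token:.2e} FLOPs/token")
print(f"sparse:  {sparse_flops_per_token:.2e} FLOPs/token")
print(f"savings: {dense_flops_per_token // sparse_flops_per_token}x")
```

Even this crude accounting shows a 32x reduction in per-token compute, which is the kind of headroom Szegedy's remark points at.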
Software Accelerators and Parallelization: Distributed computation across numerous GPUs is already a cornerstone of modern AI training. Pathways, from Jeff Dean et al., emphasizes efficient parallelization of training tasks in the acceleration layer, which sits between the hardware (TPUs, GPUs) and the models being trained. Continued advancements in this area are crucial for maximizing hardware utilization.
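The core pattern can be shown in a minimal sketch: in data parallelism, each device computes gradients on its shard of the batch, and the gradients are then averaged (an all-reduce) before the weight update. Real systems such as Pathways schedule this across TPU/GPU pods; here plain Python lists stand in for devices:

```python
# Minimal data-parallelism sketch: shard the batch, compute local gradients,
# all-reduce (average), then update. Toy 1-parameter linear regression.

def local_grad(w, shard):
    # gradient of mean squared error for y = w * x on one device's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # stand-in for the collective communication step across devices
    return sum(grads) / len(grads)

# toy data: y = 3x, split round-robin across 4 "devices"
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    grads = [local_grad(w, shard) for shard in shards]
    w -= 0.01 * all_reduce_mean(grads)

print(f"learned w = {w:.2f}")   # converges toward 3.0
```

The expensive step at scale is `all_reduce_mean`: in a real cluster it is a network collective over thousands of devices, which is exactly where the interconnect limits discussed earlier bite.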
Improved Reliability of Infrastructure: One data center provider states, "Hardware very frequently fails. It's up to the cloud provider to detect that failure and help our customers deal with their failures. If you went to a traditional legacy cloud provider, you'd have a failure, you'd have to wait to detect it, you'd have to file a support ticket."
Compression Techniques: Advances in compression could significantly reduce the compute burden. One research scientist points out that while image compression is mature, video and 3D compression still have substantial room for breakthroughs. Compressing the massive datasets used in AI training and inference, particularly for video modalities, could reduce bandwidth requirements and overall compute needs. But the goalposts keep shifting: around 2015, neural image compression was the research frontier; by 2018, the focus had moved to video compression. The challenge is the ever-increasing volume of data.
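The headroom compression research targets is easy to demonstrate: highly redundant data (like near-duplicate video frames) compresses dramatically, while incompressible data does not. A quick illustration with a general-purpose codec (zlib, standing in for the specialized codecs actually used):

```python
import os
import zlib

def ratio(raw: bytes) -> float:
    """Compression ratio (original size / compressed size) with zlib."""
    return len(raw) / len(zlib.compress(raw, level=9))

# Structured, repetitive payload -- a crude stand-in for redundant video frames
structured = b"frame" * 20_000
# Random bytes are essentially incompressible
random_bytes = os.urandom(100_000)

print(f"structured: {ratio(structured):.0f}x")
print(f"random:     {ratio(random_bytes):.2f}x")
```

The repetitive payload shrinks by orders of magnitude while the random one does not compress at all; the open research question is how much of video and 3D data is "structured" in a way future codecs can exploit.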
Demand for Compute Will Continue to Grow
The evolution of AI's data appetite has been a story of continuous expansion. What began with text-based models quickly progressed to image-based applications, each requiring a significant leap in computational resources. Now, with the explosion of video content, and the imminent arrival of 3D modeling, robotics, spatial computing, and other immersive experiences, the demand for compute is poised to increase not incrementally, but exponentially. Each new data modality represents a step-change in complexity, driving an insatiable need for more processing power.
Conclusion
The path to more powerful AI foundational models is not solely paved with faster chips. While silicon advancements are essential, the surrounding ecosystem—energy infrastructure, high-speed networking, streamlined installation, and efficient cooling—presents a complex web of bottlenecks. Addressing these multifaceted challenges will be crucial for unlocking the full potential of AI. While advancements in model architectures and compression techniques offer hope for increased efficiency, the shift towards richer, more complex data modalities ensures that the overall need for computational power will continue its relentless upward trajectory. The future of AI depends not just on building better chips, but on building a better system around them.