According to the State of AI Report, as the AI market heats up, NVIDIA has been shortening the lead time between generations of its data center GPUs since the launch of the A100 in 2020, while significantly improving TFLOPs performance. (TFLOPs measure computing power as trillions of floating-point operations per second. In high-performance computing, supercomputing, graphics processing, and deep learning, TFLOPs are a key indicator of a system's capability, and the metric matters most for computationally intensive tasks such as scientific simulation, big data analytics, and machine learning: the higher the TFLOPs, the stronger the hardware's floating-point throughput.) The cycle from the A100 to the H100 was 60% shorter, and the cycle from the H200 to the GB200 was another 80% shorter; over the same period, TFLOPs performance increased sixfold. Many large cloud companies are reportedly buying GB200 systems in bulk: Microsoft has ordered between 700,000 and 1.4 million units, Google 400,000, and Amazon 360,000. OpenAI is also rumored to have purchased at least 400,000 GB200 units.
Figure: NVIDIA accelerates its product launch time while improving TFLOPs performance
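To make the metric above concrete, here is a minimal sketch of the underlying arithmetic: theoretical peak TFLOPs follow from core count, clock speed, and FLOPs issued per core per cycle. The accelerator figures used are hypothetical placeholders, not NVIDIA specifications.

```python
# Minimal sketch: deriving theoretical peak TFLOPs from basic chip parameters.
# All hardware numbers below are illustrative placeholders, not vendor specs.

def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs issued per core per cycle."""
    flops_per_second = cores * clock_ghz * 1e9 * flops_per_cycle
    return flops_per_second / 1e12  # 1 TFLOP/s = 10^12 FLOP/s

# Hypothetical accelerator: 10,000 cores at 1.5 GHz doing a fused
# multiply-add (2 FLOPs) per core per cycle.
print(f"{peak_tflops(10_000, 1.5, 2):.0f} TFLOPs")  # -> 30 TFLOPs
```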
Faster connections between GPUs and nodes improve cluster performance
For large-scale clusters, the speed of data communication between GPUs within a node (scale-up) and between nodes (scale-out) is critical. NVIDIA's NVLink technology has dramatically improved cluster performance over the past eight years by increasing per-link bandwidth, the number of links, and the number of GPUs per node, and NVIDIA has further strengthened its market leadership by connecting large-scale clusters with InfiniBand. Meanwhile, China's Tencent is innovating actively as well: its Xingmai 2.0 high-performance computing network claims to support single clusters of more than 100,000 GPUs, with network communication efficiency improved by 60% and LLM training performance by 20%. However, it remains unclear whether Tencent has actually built a cluster of that size.
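To put the scale-up numbers in perspective, the sketch below tallies per-GPU NVLink bandwidth as the number of links times per-link bandwidth, using commonly cited per-generation figures; treat them as approximate public specs rather than authoritative values.

```python
# Sketch: per-GPU NVLink bandwidth = links per GPU x per-link bandwidth.
# Figures are commonly cited bidirectional numbers and may be approximate.

NVLINK_GENERATIONS = {
    # generation: (links per GPU, GB/s per link, flagship GPU)
    "NVLink 1.0": (4, 40, "P100"),
    "NVLink 2.0": (6, 50, "V100"),
    "NVLink 3.0": (12, 50, "A100"),
    "NVLink 4.0": (18, 50, "H100"),
    "NVLink 5.0": (18, 100, "B200"),
}

for gen, (links, per_link, gpu) in NVLINK_GENERATIONS.items():
    total = links * per_link
    print(f"{gen} ({gpu}): {links} links x {per_link} GB/s = {total} GB/s")
```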
Managing large clusters is challenging
Although clusters keep scaling, running large clusters remains challenging. When Meta released the Llama 3 series, it disclosed an average of 8.6 job interruptions per day during 405B pre-training. GPUs are more prone to failure than CPUs, and every cluster is different, so continuous monitoring is critical (a minimal node-level probe is sketched after the figure below). Misconfiguration, inadequate testing, and faulty components routinely undermine system stability, and the availability of low-cost power and sufficient network bandwidth is equally critical.
Figure: Challenges in managing large clusters
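As one illustration of what continuous monitoring can look like at the node level, here is a minimal health-probe sketch. It assumes the NVML Python bindings (the pynvml package) are installed; the temperature threshold and polling interval are arbitrary assumptions to be tuned per deployment, not recommended values.

```python
# Minimal node-level GPU health probe built on NVML (pynvml).
# Threshold and interval are illustrative assumptions, not vendor guidance.
import time
import pynvml

TEMP_LIMIT_C = 85   # assumed alert threshold
POLL_SECONDS = 30   # assumed polling interval

def poll_gpus() -> None:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            status = "ALERT" if temp >= TEMP_LIMIT_C else "ok"
            print(f"gpu{i}: {temp}C, {util.gpu}% util [{status}]")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    while True:               # continuous monitoring loop
        poll_gpus()
        time.sleep(POLL_SECONDS)
```

In a real cluster, a probe like this would feed a fleet-wide alerting system rather than print to stdout, and would also track ECC error and NVLink counters, which NVML exposes as well.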
Big tech companies are accelerating in-house hardware R&D to reduce their dependence on NVIDIA
To compete with NVIDIA, many large technology companies have accelerated development of their own hardware. Google, for example, has introduced the Axion chip, based on the Armv9 architecture, which delivers 30% higher performance than the fastest Arm-based general-purpose instances currently available. Meta has launched its second-generation AI inference accelerator, with twice the compute and memory bandwidth of its predecessor, and plans to extend it to generative AI training workloads in the future. OpenAI is also recruiting talent from Google's TPU team and is in talks with Qualcomm about jointly developing a new generation of AI chips.
Challengers in the field of AI chips are emerging
Despite NVIDIA's dominance, challengers in the AI chip market are actively vying for share. Cerebras, known for its wafer-scale engines, has filed for an IPO, reporting $136 million in revenue for the first half of 2024, a 15.6x year-over-year increase, with 87% of that revenue coming from Abu Dhabi-based G42. Groq, which focuses on specialized chips for AI inference, recently closed a $640 million Series D round at a valuation of $2.8 billion. Both Cerebras and Groq compete on speed and are aggressively entering the cloud services market to gain a foothold against NVIDIA's ecosystem.
SoftBank enters the AI chip market
SoftBank is also accelerating its entry into the AI chip market: its subsidiary Arm plans to launch its first AI chip in 2025, and SoftBank has acquired British startup Graphcore, reportedly for $600 million to $700 million. Graphcore focuses on developing Intelligence Processing Units (IPUs), designed to handle AI workloads more efficiently and with less data than GPUs and CPUs. Although Arm has been involved in AI to some extent, its instruction set architecture is not fully suited to the parallel processing needs of data centers, where it faces NVIDIA's entrenched advantages.