NVIDIA drops another big one: generative AI on the cloud

With the rapid development and widespread adoption of AI, more and more organizations and enterprises are migrating their AI workloads to the cloud to take advantage of its elasticity and scalability. After the explosion of large-model applications such as ChatGPT, generative AI is also moving to the cloud.
What requirements does generative AI place on the cloud?
InfiniBand is undoubtedly the highest-performance choice, but some users prefer to use Ethernet to deliver services to their customers, including for generative AI clouds. In response, NVIDIA recently launched Spectrum-X, which it describes as the world's first Ethernet networking platform and network architecture built specifically for generative AI, making it easier for customers to run generative AI and other AI workloads in a cloud environment with improved network performance.
Today, NVIDIA addresses two distinct AI application scenarios:
1) For large-scale, compute-intensive, high-performance workloads, NVIDIA has created a new network application scenario, the AI factory, where the combination of NVLink and InfiniBand networking is commonly used. Many recent large language models were trained on this high-performance, lossless NVLink + InfiniBand architecture, which provides mega-scale data centre computing power for training large models and meets their demands for compute and performance;
2) For AI cloud applications: ordinary cloud computing needs can be served by a traditional Ethernet network, but multi-tenant environments with diverse workloads that incorporate AI and generative AI can now use the new Spectrum-X high-performance network platform, enabling customers to build their own generative AI or AI workloads in a high-performance cloud environment.
"The network has become a compute unit in the generative AI or AI factory, just as the InfiniBand network is not only responsible for communication but is also part of the compute in AI training. We therefore need to consider not only the computing power provided by CPUs and GPUs, but also the computing power of the network," Song Qingchun noted.
Why is NVIDIA adding a separate network architecture to support AI workloads in the cloud? According to Cui Yan, an NVIDIA network technologist, traditional Ethernet is used for network management or user access, where applications are loosely coupled, whereas AI workloads are far more sensitive to bandwidth, latency and jitter. Different cloud scenarios therefore require different Ethernet networks: Ethernet for enterprise applications must support many complex network functions, but its bandwidth demand is low and traffic is mostly north-south; Ethernet for the cloud requires higher bandwidth with fewer complex functions, and the mixed loads running on it suffer long-tail latency and jitter; and Ethernet for the telecom industry must support features completely different from those of the data centre. For generative AI, none of these three networks is the best choice. NVIDIA Spectrum-X introduces new network features and uses that accelerate generative AI and deliver high performance at scale and under heavy load.
Traditional Ethernet networks are better suited to north-south traffic and application access. AI, by contrast, requires hundreds of GPUs to handle a single workload or a very small number of workloads, producing east-west, distributed, tightly coupled traffic that places far higher demands on data transfer. The new Spectrum-X architecture is designed to accelerate this east-west traffic within the data centre.
Spectrum-X's four key features enable massive scale and high performance
The Spectrum-X networking platform combines the NVIDIA Spectrum-4 switch, BlueField-3 DPUs, LinkX cables/modules and acceleration software. The Spectrum-4 switch packs 100 billion transistors on TSMC's 4N process and delivers 51.2 Tb/s of switching bandwidth with 100G SerDes, supporting 64 ports of 800G or 128 ports of 400G, with end-to-end optimisation alongside the BlueField-3 DPU. Its four main features (lossless networking, dynamic routing, traffic control and performance isolation) enable a new kind of AI cloud network.
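The two quoted port configurations are simply the 51.2 Tb/s switching capacity carved into different port speeds; a quick arithmetic check makes the relationship explicit:

```python
# Sanity check (illustrative only): Spectrum-4's quoted port configurations
# should each add up to the 51.2 Tb/s switching capacity.
CAPACITY_TBPS = 51.2

def total_tbps(num_ports: int, port_speed_gbps: int) -> float:
    """Aggregate bandwidth of a port configuration, in Tb/s."""
    return num_ports * port_speed_gbps / 1000

for ports, speed in [(64, 800), (128, 400)]:
    agg = total_tbps(ports, speed)
    print(f"{ports} x {speed}G = {agg} Tb/s")  # both print 51.2 Tb/s
    assert agg == CAPACITY_TBPS
```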
Cui Yan explained that Spectrum-X performs dynamic routing over lossless Ethernet, with RoCE optimisations aimed at AI collective communication libraries such as NCCL. The principle: the BlueField-3 DPU on the sender side sends data into the switching network, the Spectrum-4 switches distribute the packets across all the best available routes, and the BlueField-3 DPU on the receiver side reorders and reassembles the data. Effective data throughput is 1.6 times that of traditional Ethernet, and every link can be fully utilised.
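The mechanism described above can be sketched as a toy simulation (purely illustrative, not NVIDIA's implementation): packets of one flow are sprayed round-robin across paths with different latencies, arrive out of order, and a receiver-side reorder buffer restores the original sequence before delivery:

```python
import heapq
import itertools
import random

def spray_and_reorder(num_packets: int, paths: list[float], seed: int = 0) -> list[int]:
    """Send packets round-robin over paths with different latencies,
    then reassemble them in sequence at the receiver."""
    rng = random.Random(seed)
    # Each packet's arrival time is its path latency plus a little jitter.
    arrivals = []
    for seq, path_latency in zip(range(num_packets), itertools.cycle(paths)):
        arrivals.append((path_latency + rng.uniform(0, 0.1), seq))
    arrivals.sort()  # the (out-of-order) sequence in which packets arrive

    delivered, buffer, expected = [], [], 0
    for _, seq in arrivals:
        heapq.heappush(buffer, seq)
        # Deliver every packet that is now in sequence.
        while buffer and buffer[0] == expected:
            delivered.append(heapq.heappop(buffer))
            expected += 1
    return delivered

print(spray_and_reorder(8, paths=[1.0, 3.0, 2.0]))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

The key point is that per-packet spraying keeps every path busy (unlike flow-based ECMP, which pins a whole flow to one path), while the receiver-side reorder buffer hides the resulting out-of-order arrival from the application.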
Lossless networking is a crucial technology for generative AI. "For large models, a single training run often takes weeks or months, and if something goes wrong the training must be retuned or interrupted, an almost unacceptable cost. Traditional Ethernet's packet-loss-and-retransmit mechanism is unacceptable in training, because retransmissions slow training down," said Qing Meng, Director of Network Marketing at NVIDIA.
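A deliberately simplified back-of-envelope model (my own assumption, not NVIDIA data) shows why even tiny loss rates hurt: in a synchronous collective such as all-reduce, a training step finishes only when the slowest flow finishes, so each retransmission timeout delays every GPU in the job:

```python
# Simplified model: each lost packet adds one retransmit timeout to the step.
def step_time_ms(base_ms: float, loss_rate: float, packets: int, rto_ms: float) -> float:
    """Expected communication time per training step under packet loss."""
    expected_losses = loss_rate * packets
    return base_ms + expected_losses * rto_ms

base = 10.0  # ideal communication time per step (ms), chosen for illustration
for loss in (0.0, 1e-5, 1e-4):
    t = step_time_ms(base, loss, packets=1_000_000, rto_ms=4.0)
    print(f"loss={loss:g}: step {t:.0f} ms ({t / base:.0f}x slower)")
```

Under these (assumed) numbers, a loss rate of just 1 in 100,000 packets already inflates the step several-fold, which is why a lossless fabric matters for training.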
Concluding remarks
"Generative AI must be a market where performance is king, so our focus is on performance: InfiniBand + NVLink is definitely the best, then Spectrum-X, and traditional Ethernet the lowest," Song said. "Spectrum-X lets cloud customers get good performance while still enjoying the convenience of Ethernet, rather than making such a large investment in AI infrastructure on traditional Ethernet without getting comparably good performance."
Spectrum-X is a new market created by NVIDIA, and another big move in the large-model application space. It promises to significantly cut training costs for generative AI, shorten training time and bring big models to market as quickly as possible. To quote NVIDIA founder and CEO Jen-Hsun Huang: the more you buy, the more you save.