Nvidia is bolstering the capabilities of its InfiniBand networking platform to address the increasingly common requirements for both cloud services providers (CSPs) and supercomputing centers around performance, security, and the ability to run such modern workloads as artificial intelligence (AI), data analytics, and high-performance computing (HPC).
At the company’s virtual GTC event, founder and CEO Jensen Huang unveiled Quantum-2, a 400 Gb/s InfiniBand networking platform that will include not only Nvidia’s Quantum-2 switch, but also software for supporting it and a choice of the vendor’s ConnectX-7 network interface controller (NIC) or upcoming BlueField-3 data processing unit (DPU).
The new networking platform comes as supercomputing facilities increasingly are being accessed by more users—many from outside of their organizations—while CSPs like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud are offering more supercomputing services to organizations, according to Nvidia officials.
“Quantum-2 is the first networking platform to offer the performance of a supercomputer and the shareability of cloud computing,” Huang said during his GTC keynote address. “This has never been possible before. Until Quantum-2, you get either bare-metal high-performance or secure multi-tenancy, never both. With Quantum-2, your valuable supercomputer will be cloud-native and far better utilized.”
Computing Becoming More Distributed
As computing becomes more distributed among on-premises data centers, multiple public clouds and the fast-growing edge, the network becomes the key connectivity tool not only for moving data around, but also for managing and securing it.
“In distributed computing, the network is the vital central nervous system of the computer,” Huang said. “The network connects thousands of GPUs into a giant supercomputer, determining its scalability and ultimate performance.”
The Quantum-2 includes a new 7-nanometer InfiniBand switch chip, about the size of Nvidia’s A100 GPU, that offers 64 ports at 400 Gb/s or 128 ports at 200 Gb/s, the CEO said. It can connect up to 2,048 ports, a significant jump over the 800 ports in Quantum-1, delivering more than five times the switching capacity. In addition, Quantum-2 can scale to 1 million endpoints within the three-hop Dragonfly interconnect topology, 6.5 times more than the current generation, he said.
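The port figures above can be sanity-checked with some quick arithmetic. The short Python sketch below uses only the numbers quoted in this article; the assumption that Quantum-1 ports run at half the per-port speed comes from Huang’s statement that Quantum-2 doubles the network speed, and the variable names are purely illustrative.

```python
# Figures quoted in the article (per switch chip).
PORTS_AT_400G = 64
PORTS_AT_200G = 128

# Aggregate one-way bandwidth of the chip in either port configuration, in Gb/s.
chip_gbps_400 = PORTS_AT_400G * 400
chip_gbps_200 = PORTS_AT_200G * 200

# Maximum port counts quoted for each generation.
QUANTUM2_MAX_PORTS = 2048
QUANTUM1_MAX_PORTS = 800

# Assumption: Quantum-2 doubles per-port speed relative to Quantum-1,
# as stated elsewhere in the article.
SPEED_RATIO = 2

port_ratio = QUANTUM2_MAX_PORTS / QUANTUM1_MAX_PORTS
capacity_ratio = port_ratio * SPEED_RATIO

print(f"Chip bandwidth: {chip_gbps_400} Gb/s (64 x 400) vs {chip_gbps_200} Gb/s (128 x 200)")
print(f"Port ratio: {port_ratio:.2f}x, capacity ratio: {capacity_ratio:.2f}x")
```

Under those assumptions, both port configurations give the same aggregate chip bandwidth, and the combined gain in ports and per-port speed works out to a little over five times, in line with the figure Huang cited.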
“Nanosecond timing will also allow cloud data centers to become part of the telecommunications network and host software-defined 5G radio services,” Huang said. “If Nvidia’s Selene DGX supercomputer were equipped with Quantum-2 today, the total bandwidth would be 224,000 Gigabytes per second, or roughly one and a half times the total traffic over the internet.”
At 400 Gb/s, Quantum-2 doubles the network speed and triples the number of network ports. It triples the performance and reduces the need for data center fabric switches six-fold, while reducing data center power consumption and data center space by 7 percent each, he said.
Quantum-2’s New Features
Among the key new features are performance isolation, which keeps the activity of one tenant from disturbing others, and a cloud-native, telemetry-based congestion-control system that ensures high data-rate senders don’t overwhelm the network and jam traffic. The platform provides SHARPv3 in-switch processing with 32 times the acceleration engines to speed up AI application training, while a nanosecond-precision timing system can synchronize distributed applications, including database processing, which lowers the overhead of waiting and handshaking within the network.
The system provides predictive maintenance capabilities via Nvidia’s UFM Cyber-AI platform.
Nvidia is offering two networking endpoint options for Quantum-2. ConnectX-7 will come with 8 billion transistors and will double the data rate of ConnectX-6 as well as the performance of remote direct memory access (RDMA), GPUDirect Storage, GPUDirect RDMA and in-network computing, according to Nvidia officials. The NIC will begin sampling in January.
Meanwhile, BlueField-3 InfiniBand will include 22 billion transistors and 16 64-bit Arm CPUs to offload and isolate the data center infrastructure stack. It will sample in May 2022.
Rise of Data Processing Units (DPUs)
A range of semiconductor vendors—not only Nvidia, but also Intel, Broadcom, Marvell, Hewlett Packard Enterprise’s Aruba Networks, and Xilinx (which AMD is trying to buy for $35 billion)—are leveraging such technologies as field-programmable gate arrays (FPGAs) to develop DPUs, which offload networking, storage and other tasks from the CPU to accelerate performance.
Nvidia, which broadened its networking capabilities when it bought interconnect vendor Mellanox for $6.9 billion in 2019, sees the BlueField DPUs as a way to take an array of such tasks off the CPU, where Nvidia officials say they eat up as much as 30 percent of a computing chip’s capacity. Huang said there are about 1,400 developers working with BlueField DPUs.
At the show, Nvidia also announced BlueField DOCA 1.2, a collection of cybersecurity capabilities that will enable enterprises to more quickly build a zero-trust architecture by offloading infrastructure software.
“Protection at the perimeter and workgroup segmentation are no longer sufficient,” Huang said. “Every touch point of applications, data, users and devices are potential attack surfaces. Since BlueField is the networking endpoint, we can secure a data center at virtually every touch point.”
Security Concerns Rise
He noted that both cloud computing and machine learning are changing the nature of data centers and that container-based applications enable hyperscalers to rapidly scale out their environments and bring aboard millions of users to take advantage of their services at the same time.
“The ease of scale out and orchestration comes at a cost: east-west network traffic increased incredibly with machine-and-machine message passing and these disaggregated applications open many ports inside the data center that need to be secured from cyberattack,” he said.
BlueField DOCA 1.2, which will go into early access at the end of the month, is part of a larger zero-trust security platform that also includes Nvidia’s Morpheus, a deep-learning framework that offers a new workflow for creating digital fingerprints to detect and respond to anomalies in the network.
The Quantum-2 switch is available from a wide array of infrastructure vendors and system makers, including Dell Technologies, HPE, Lenovo, IBM, DataDirect Networks and Inspur.