Wednesday, September 27, 2023

Nvidia Announces the DGX GH200 for Generative AI

Nvidia has announced a new DGX class, the GH200, for generative AI workloads.

The DGX GH200 combines 256 Grace Hopper Superchips into a single 144TB GPU system. The Superchip itself is a combination of Nvidia’s Grace Arm CPU and Hopper GPU, connected by an NVLink-C2C chip-to-chip interconnect. Those superchips are then linked using the new NVLink Switch System interconnect.

Together, the 256 superchips hold 144 terabytes of shared memory. The system is also available in 32-, 64- and 128-chip variants.
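The figures above imply a per-superchip share of the pooled memory, which the article does not state directly. A back-of-the-envelope sketch (the per-superchip value and the linear scaling of the smaller variants are inferred, not quoted):

```python
# Illustrative arithmetic on the memory figures quoted above.
# The per-superchip value is implied by the article, not stated in it.
TOTAL_MEMORY_TB = 144
SUPERCHIPS = 256

per_superchip_gb = TOTAL_MEMORY_TB * 1024 / SUPERCHIPS
print(f"Shared memory per superchip: {per_superchip_gb:.0f} GB")

# Assuming the smaller variants scale linearly with chip count:
for chips in (32, 64, 128, 256):
    print(f"{chips:>3} superchips -> {chips * per_superchip_gb / 1024:.1f} TB")
```

This works out to 576GB of shared memory per superchip, with the 32-, 64- and 128-chip variants scaling proportionally.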

In addition, Nvidia plans to launch a new supercomputer, Helios, with four full-spec DGX GH200s, for a total of 1,024 Grace Hopper superchips.

Nvidia said Google Cloud, Meta and Microsoft will be among those gaining access to the DGX GH200 to explore its capabilities for generative AI workloads.

“Building advanced generative models requires innovative approaches to AI infrastructure,” said Mark Lohmeyer, vice president of computing at Google Cloud.

“The new NVLink scale and shared memory from the Grace Hopper superchips address key barriers in large-scale AI and we look forward to exploring its capabilities for Google Cloud and our Generative AI initiatives.”

Traditional DGX implementations have combined two x86 CPUs with eight GPUs, but this system has a 1:1 CPU-to-GPU ratio. “What this brings, beyond the memory footprint, is a lot more processing power,” Charlie Boyle, vice president and general manager of Nvidia’s DGX Systems, told DCD.

“In an AI pipeline, there are parts that are highly parallel GPU operations, but there are always parts, whether it’s data preparation or image transformation, that you might also need CPU resources for. And so having a very strong CPU, coupled directly to a GPU, a) improves processing, but b) means that some of the things in the pipeline, which you previously might have had to do on separate systems, can now live on one coherent system and pipe everything into it.”

Girish Bablani, Corporate Vice President, Azure Infrastructure at Microsoft, said: “Training large AI models has traditionally been a resource-intensive and time-consuming task. The DGX GH200’s ability to work with terabyte-sized data sets enables developers to work at scale and will allow advanced research to be done at a faster pace.”

Despite the high-density computing, Boyle told DCD that the GH200 is still fully air-cooled. “It was a big consideration in the system design when talking to our customers,” he said. “We know people will have to move to liquid cooling at some point, but we’re also hearing feedback from customers that it’s a challenge; they don’t have the data centers for it and would need to build new ones.”

Boyle added: “We stay ahead of the curve, making our own liquid-cooled gear in house so we can test it for our customers, but even getting the liquid parts takes a long time.”

Another customer request was that the system be usable immediately. Boyle revealed that Nvidia will now use an integration facility to fully test and configure every system so that it is ready to use as soon as it is installed.

And he said customers are requesting larger deployments than ever before to support the enormous demands of generative AI. “People used to buy some systems, try them out, and then scale up the implementation,” Boyle said. “[Now] my teams are getting calls from customers saying, ‘When can you deliver hundreds of systems to me?'”

Nation World News Desk