Intel Paints Their AI Future with Gaudi 3


Friday, May 3, 2024

At its Vision event in Phoenix, Arizona, this week, Intel announced its third-generation data center AI accelerator, Gaudi 3. In doing so, Intel is painting a future AI accelerator competitive landscape with three viable options: itself, AMD and, of course, Nvidia.

Targeted at large-scale accelerated data centers designed for AI training and inferencing workloads, Gaudi 3 is a two-die chip with 64 fifth-generation tensor processor cores and eight matrix math engines, built on Taiwan Semiconductor Manufacturing Co.’s (TSMC’s) 5-nm process. It also has 128GB of high-bandwidth memory (HBM) capable of 3.7 TB/s of bandwidth and 96MB of SRAM communicating at 12.8 TB/s.

On the networking side, Gaudi 3 supports 24 ports of 200GbE RDMA over Converged Ethernet (RoCE) and 16 lanes of PCIe 5.0. It is offered in three form factors: an OCP (Open Compute Project) Accelerator Module (OAM)-compliant accelerator card, a universal baseboard and a PCIe add-in card.

What Does Gaudi 3 Look Like?

As an accelerator card, Gaudi 3 delivers 1,835 TFLOPS of AI performance using the 8-bit floating point (FP8) data format. With its on-chip networking paired with associated network interface cards, Gaudi 3 delivers 1.2 TB/s of bi-directional communications bandwidth. This networking capability allows all-to-all communications, which enables the universal baseboard form factor to support eight accelerator cards while still acting as a single accelerator.
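
As a sanity check, the 1.2 TB/s figure follows directly from the port count quoted above; a quick back-of-the-envelope calculation in Python (units only, ignoring protocol overhead) bears this out:

PORTS = 24                      # 200GbE RoCE ports per Gaudi 3
PORT_RATE_GBPS = 200            # gigabits per second per port

one_way_gb_s = PORTS * PORT_RATE_GBPS / 8       # bits -> bytes: 600 GB/s
bidirectional_tb_s = 2 * one_way_gb_s / 1000    # both directions: 1.2 TB/s

print(f"{one_way_gb_s:.0f} GB/s each way, {bidirectional_tb_s:.1f} TB/s bi-directional")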

This form factor provides 14.6 PFLOPS of FP8 performance, more than 1 TB of HBM2e with 29.6 TB/s of aggregate HBM bandwidth, and 9.6 TB/s of bi-directional networking. In its PCIe form factor, Gaudi 3 comes in a dual-slot, 10.5-inch-long package operating at a passively cooled 600W TDP.
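
The baseboard figures are, to a first approximation, straight multiples of the per-card numbers quoted earlier; the sketch below reproduces them from the article's own specs (Intel's 14.6 PFLOPS figure rounds slightly below the raw product, presumably because 1,835 TFLOPS is itself rounded):

CARDS_PER_BOARD = 8             # universal baseboard capacity
CARD_FP8_TFLOPS = 1835          # FP8 performance per accelerator card
CARD_HBM_GB = 128               # HBM2e capacity per card
CARD_HBM_TB_S = 3.7             # HBM bandwidth per card
CARD_NET_TB_S = 1.2             # bi-directional networking per card

print(f"FP8: {CARDS_PER_BOARD * CARD_FP8_TFLOPS / 1000:.2f} PFLOPS")  # ~14.68, quoted as 14.6
print(f"HBM: {CARDS_PER_BOARD * CARD_HBM_GB:,} GB")                   # 1,024 GB (> 1 TB)
print(f"HBM bandwidth: {CARDS_PER_BOARD * CARD_HBM_TB_S:.1f} TB/s")   # 29.6 TB/s
print(f"Networking: {CARDS_PER_BOARD * CARD_NET_TB_S:.1f} TB/s")      # 9.6 TB/s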

Supporting all of this is the Gaudi Software Suite. The suite includes firmware, tools and APIs at the emulation, framework and model layers. Software support also extends past the frameworks and models into an AI application layer that supports common AI functions, such as 3D/video/image/text generation, classification, summarization, translation, sentiment analysis and question-and-answer interactions.

With some exceptions—as can be inferred from the list above—Intel focused this layer primarily on generative AI workloads based on multi-modal models, large language models and retrieval-augmented generation.

Surmounting the insurmountable with Gaudi 3?

There is no arguing that Nvidia has established a formidable lead in data center AI acceleration, based on performance, the ability to scale to larger models and its developer ecosystem. With Gaudi 3, Intel is attempting to close that gap on all three fronts.

For ease of comparison, normalizing Gaudi 3’s performance to a single accelerator card, Intel claims that for training workloads, Gaudi 3 was able to train a Llama 2 13-billion-parameter model up to 1.7× faster than Nvidia’s H100. For inferencing workloads using models such as the 180-billion-parameter Falcon and 70-billion-parameter Llama models, Intel claims Gaudi 3 is, on average, 1.3× faster than the H200 and 1.5× faster than the H100, while being up to 2.6× more power efficient.

Additionally, given its 256 × 256 matrix math engines and its SRAM architecture, Gaudi 3’s efficiency is maximized when working with longer input and output sequences. This bodes well for Intel given TIRIAS Research’s expectation that prompts, as well as generated outputs, will continue to grow in length as users demand more context and specificity for increased relevance.

According to Intel, Gaudi 3 was designed to “scale up and scale out” from the chip level through to a cluster-level implementation. One of the fundamental design tenets that enables this is giving all compute resources simultaneous access to all memory resources. This tenet is present throughout the product family: at the chip level, the two dies within Gaudi 3 act as one die, while at the board and cluster levels, high-speed Ethernet interconnects allow multiple Gaudi 3s, or entire racks of them, to serve as one accelerator.

Intel developed four reference architectures, ranging from a single node consisting of eight accelerator cards (essentially a universal baseboard configuration) all the way up to a 1,024-node cluster consisting of 8,192 accelerator cards, with compute, memory and networking bandwidth scaling accordingly.
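
To put those endpoints in perspective, the sketch below scales the article's per-card figures across the stated range of reference architectures (the aggregate totals are illustrative extrapolations from the per-card specs, not Intel-published figures):

CARDS_PER_NODE = 8              # one universal baseboard per node

def cluster_totals(nodes):
    """Return card count, FP8 EFLOPS and HBM capacity (TB) for a cluster."""
    cards = nodes * CARDS_PER_NODE
    eflops = cards * 1835 / 1e6     # 1,835 TFLOPS FP8 per card
    hbm_tb = cards * 128 / 1000     # 128 GB HBM2e per card
    return cards, eflops, hbm_tb

for nodes in (1, 1024):
    cards, eflops, hbm_tb = cluster_totals(nodes)
    print(f"{nodes:>5} nodes: {cards:>5} cards, "
          f"{eflops:.3f} EFLOPS FP8, {hbm_tb:,.1f} TB HBM")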

The other barrier protecting Nvidia is its mature and entrenched software ecosystem. Intel is attempting to lower this barrier to entry on the software side by making it as easy as possible to port existing Nvidia-based software and models to Intel’s environment.

To this end, Intel has built in API support at both the emulation and framework levels. For the former, Intel has included the Habana Collective Communications Library (HCCL), which serves as an emulation layer for the Nvidia Collective Communication Library (NCCL). At the framework level, the software suite has PyTorch API support to enable access to hundreds of thousands of generative AI models. With these capabilities, Stability.ai, Intel’s current marquee Gaudi 2 customer, has stated that porting its models took less than one day. Intel expects that Gaudi 3 customers will have a similar experience.
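
In practice, the PyTorch-level port is often little more than a device change. The sketch below shows what a minimal Gaudi port of a CUDA-targeted training step typically looks like, assuming the Habana PyTorch bridge (habana_frameworks.torch) from Intel's Gaudi software documentation; the module name and the mark_step() calls come from that documentation's lazy-execution mode, not from the article:

import torch
import torch.nn as nn

# Importing Habana's bridge registers the "hpu" device type with PyTorch.
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")    # was: torch.device("cuda")

model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 1024, device=device)
y = torch.randn(32, 1024, device=device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
htcore.mark_step()              # flush the lazily built graph to the accelerator
optimizer.step()
htcore.mark_step()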

The new AI accelerator landscape?

What makes Gaudi 3 a viable option is not just the market’s inherent need for alternatives, nor even just the performance of the chip. It is also the product family’s scalability across both east-west interconnects (rack-to-rack communications within the same server/data center) and north-south interconnects (communications to external networks or other data centers), its complete set of form factors, going from accelerator card to universal baseboard to PCIe add-in card, and, last but certainly not least, its full software stack, from firmware to drivers, APIs and model/framework support through to AI application support.

Of all of this, given Nvidia’s current software ecosystem position, the Gaudi Software Suite’s API capabilities, and especially its HCCL emulation layer, may prove the most beneficial to Intel’s aspirations in this market.

The most telling indicator of how viable an option Gaudi has the potential to be is the roster of OEM partners that Intel announced. With Dell, HPE, Lenovo and Supermicro on board, Intel will at least have a seat at the table similar to AMD’s, the other challenger to Nvidia’s dominance in this space.

With Nvidia reportedly sold out for the rest of the year given its capacity allocation at TSMC, both Intel and AMD have a window of opportunity to capitalize on pent-up demand in what can only be described as a feeding frenzy of accelerated AI data center buildouts. According to Intel, Gaudi 3’s air-cooled variant is sampling now, with the liquid-cooled variant sampling this quarter; volume production of the former begins in the third quarter and of the latter in the fourth quarter.

Assuming its OEM partners can also deliver, this could provide a 6- to 12-month window in which Intel-based servers could fill the shortfall. Making the most of this window is critical for the challengers: once these servers are deployed, they will give them a beachhead to protect and ultimately grow from.

By: DocMemory