Enfabrica's first chip allows I/O to scale, so AI inference isn't memory bound


Wednesday, April 12, 2023

Startup Enfabrica has emerged from stealth with a new concept for data center networking—the accelerated compute fabric (ACF), Enfabrica CEO Rochan Sankar told EE Times. The ACF will, the company said, elastically bind CPUs and accelerators to memory and storage, eliminating I/O bottlenecks.

Enfabrica is working on its first chip, the ACF switch, which will replace multiple NICs, PCIe/CXL switches and top-of-rack (TOR) switch chips in each rack. The result is lower cost plus higher GPU utilization, with Enfabrica suggesting large language model (LLM) inference could use half the number of GPUs, for example.

As the need for AI accelerators in the data center increases, GPUs and other accelerators have become the biggest consumers and producers of data. As accelerated computing gains steam, more data needs to move between accelerators, memory and storage, yet I/O technologies haven't changed much in a couple of decades, Sankar said.

“The lag of I/O bandwidth on those devices is already becoming the bottleneck,” he said. “Compute-bound [workloads] have raced ahead, but I/O and memory bound [workloads] are going to be the critical limiting factors—they drive the cost of compute very high, because you need to buy more units of compute that package all of the I/O and memory to fit the bound.”

Enfabrica execs believe that I/O is actually the critical resource for scaling AI computing, and that both server-facing and network-facing I/O can be disaggregated.

“An evolved server and rack I/O architecture is needed to support this very pronounced transition in compute infrastructure, and that architecture needs to deliver a step function in bandwidth per dollar, performance per dollar and performance per Watt, to make AI and accelerated computing sustainable in nature,” Sankar said.

The startup is proposing a new class of product, which it calls the accelerated compute fabric switch (ACF-S): a memory interface with Ethernet-based physical transfer of up to 800 Gbps that requires no changes to the existing network stack. It can plug into existing IP ECMP networks in cloud infrastructure. It also features host bus interfaces over PCIe/CXL, which can connect to any kind of computing resource (CPU, GPU, FPGA or ASIC), or to any disaggregated CXL memory or NVMe storage, at “many terabits per second.”

“We have refactored these three layers into a single device,” Sankar said. “We haven’t just sandwiched them all into one piece of silicon—that wouldn’t fit!—we’ve refactored functionality so that a data movement operation that emanates from an accelerator can jump straight to the network and understand its quality of service, understand the queue it’s in, understand the attributes of the tenants it’s associated with… so network bandwidth can be right-sized.”

The result is a hub-and-spoke model that can disaggregate and scale accelerators, hosts and memories across capacity and bandwidth, while trading off latency and bandwidth at various points in the hierarchy. The ACF-S can enable 100 ns access times to tens of terabytes of dynamic random-access memory (DRAM) via CXL, whereas today's accelerators may have only tens of gigabytes of high-bandwidth memory (HBM) in package and, beyond that, would need to reach additional memory via a CPU.
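
As a rough illustration of the tiering described above, the sketch below places a working set in the most-preferred tier that can hold it. The capacities and the ~100 ns figure echo the article's order-of-magnitude numbers; the tier names, the placement logic and the example sizes are assumptions for illustration, not Enfabrica's software.

# Minimal sketch of capacity tiering, assuming two tiers as described above.
# Figures are order-of-magnitude placeholders, not product specifications.
from dataclasses import dataclass

@dataclass
class MemoryTier:
    name: str
    capacity_gb: float  # assumed usable capacity for this example
    note: str

# Preferred tier first: local HBM, then the fabric-attached DRAM pool.
TIERS = [
    MemoryTier("on-package HBM", 80, "tens of gigabytes per accelerator"),
    MemoryTier("CXL-attached DRAM pool", 40_000, "tens of terabytes at ~100 ns via the fabric"),
]

def place(working_set_gb: float) -> MemoryTier:
    """Return the most-preferred tier that can hold the working set."""
    for tier in TIERS:
        if working_set_gb <= tier.capacity_gb:
            return tier
    raise ValueError("working set exceeds every tier in this sketch")

print(place(60).name)     # fits in HBM
print(place(2_000).name)  # spills into the CXL DRAM pool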

“Memory is the big wall in AI,” Sankar said. “This architecture collapses the functionality of 12 different chips [per system], eliminating thousands of serdes that are going up and down the rack, and enables resources connected to the fabric to be presented to the network, and not to be stranded. This dramatically increases the utilization of expensive accelerated computing resources [like GPUs].”

On the chip

The ACF-S is a data path-oriented chip, with an Ethernet switching plane, a large shared buffering surface and an array of high-performance copy engines that interface flexibly to any resource over PCIe or CXL. All the PCIe/CXL ports are identical and can be bifurcated. Enfabrica plans a family of devices addressing a range of power/performance points and use cases in accelerated computing networks.
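
To make the bifurcation point concrete, here is a small sketch of splitting one wide host-bus port into narrower, independently usable links. The x16/x8/x4 widths follow common PCIe practice and are illustrative assumptions, not published Enfabrica port specifications.

# Illustrative port bifurcation: one wide port split into narrower links.
# Lane counts follow common PCIe convention (x16 -> 2x8 or 4x4); they are
# assumptions for this sketch, not Enfabrica specifications.
def bifurcate(total_lanes: int, widths: list[int]) -> list[int]:
    """Validate and return a split of total_lanes into the requested link widths."""
    if sum(widths) != total_lanes:
        raise ValueError(f"split {widths} does not use exactly {total_lanes} lanes")
    return widths

print(bifurcate(16, [8, 8]))        # two x8 links, e.g. two accelerators
print(bifurcate(16, [4, 4, 4, 4]))  # four x4 links, e.g. NVMe devices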

“This is not a CXL fabric, this is not an Ethernet switch, this is not a DPU—it can do all those things,” Sankar said. “It is a different class of product to deal with a different class of problem.”

While CXL is a powerful standard, Sankar said, Enfabrica believes CXL will require a lot of additional effort to connect resources across racks while maintaining low latency and coherency.

“The devices as implemented today on CXL 2.0 do not solve the AI memory problem, because you can’t attach CXL memory to a GPU,” he said. “Leading GPUs today don’t support CXL interfaces. But even if that were the case, the standard doesn’t allow it, unless it’s attached through a CPU.”

This leaves a finite number of memory channels serving many cores, which means there will still be contention, and contention will still result in tail latencies, Sankar argued.

Adding more GPUs and CPUs connected at high bandwidth scales memory horizontally; with Enfabrica's accelerated compute fabric, memory sits in tiers of latency, and data can be dynamically dispatched, at equal latency, to whichever processor needs it.

The effects are significant: Sankar asserted that LLM inference can achieve the same performance with half as many GPUs, because Enfabrica's pool of memory can dynamically dispatch user data to the GPU on demand, utilizing the GPU more effectively. For applications that are more memory-bound, such as recommendation-model (DLRM) inference, the number of GPUs can be reduced by 75%.

In at-scale AI systems today, memory constraints mean systems use more GPUs than the compute alone requires, and those GPUs are then not fully utilized. While CPUs can reach almost 100% utilization with virtualization, average GPU utilization might be below 50%, or even below 30%, Sankar said.

Compared with an Nvidia DGX H100 system or a Meta Grand Teton with eight Nvidia H100 GPUs, an Enfabrica-accelerated equivalent could reduce I/O power by up to 50%, saving 2 kW per rack, according to the company. In this setup, two Enfabrica ACF-S chips plus one post-processor per rack would replace eight NICs and four PCIe switches per DGX system, or a total of up to 24 NICs and 16 PCIe switches per rack.
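
For readers who want the rack-level arithmetic spelled out, the back-of-the-envelope sketch below simply restates the company's claimed figures; the implied baseline I/O power is derived from the "2 kW" and "up to 50%" numbers and is an inference, not a published specification.

# Back-of-the-envelope restatement of the rack-level comparison above.
# All inputs are the company's claimed figures; the baseline I/O power is
# inferred from them and is not a published specification.
baseline_per_rack = {"NICs": 24, "PCIe switches": 16}          # "up to" figures per rack
enfabrica_per_rack = {"ACF-S chips": 2, "post-processors": 1}  # claimed replacement

print("I/O components per rack:",
      sum(baseline_per_rack.values()), "->", sum(enfabrica_per_rack.values()))

io_power_saved_kw = 2.0          # "saving 2 kW per rack"
io_power_saving_fraction = 0.50  # "up to 50%" I/O power reduction
implied_baseline_kw = io_power_saved_kw / io_power_saving_fraction
print(f"Implied I/O power per rack: ~{implied_baseline_kw:.0f} kW -> "
      f"~{implied_baseline_kw - io_power_saved_kw:.0f} kW")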

AI infrastructure

While AI inference at scale is the most pressing problem today, Enfabrica's concept also applies to AI training and to non-AI use cases such as databases and grid computing.

“We’re the alternative to create low-cost AI training—when it becomes important not just to train at breakneck speed, but to do it in an envelope that’s affordable,” Sankar said.

Enfabrica plans to sell chips to system builders, not build systems itself. Sankar said Enfabrica is engaged with the Nvidia ecosystem, but that the company intends to support “everyone who builds their own accelerated computing and wants to leverage accelerated compute resources.”

“We have to recognize we’re at the very infancy of defining AI infrastructure and how it will scale,” he said. “My belief is that you go after the inalienable and persistent need, and what is clear here is that I/O is a major pivot point for the infrastructure, and there needs to be choice there.”

By: DocMemory
Copyright © 2023 CST, Inc. All Rights Reserved
