Wednesday, January 22, 2025

Industry News

Conquering Generative AI Bottlenecks


Monday, January 8, 2024

Makers of generative AI hardware are focused on reducing the cost of running large language models (LLMs) while improving efficiency and flexibility.

One bottleneck stems from the growing parameter counts of LLMs, which now run into the billions or even trillions, Marshall Choy, senior VP of products at SambaNova Systems, a full-stack developer of software and hardware, said during a recent EE Times panel.

“So, what we’ve done is just put a whole lot of memory with a three-tier architecture for addressing things like latency, bandwidth and capacity all in one, to then bring down the size and the economics of what’s required to run these types of models,” he said. “And so, while the compute piece of this is somewhat commoditized when it comes to chips, it really becomes a memory issue for us.”
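A back-of-the-envelope calculation shows why parameter count turns into a memory problem. The sketch below multiplies parameter count by bytes per parameter to estimate the space needed for the weights alone. The model sizes are illustrative round numbers rather than figures from the panel, the function name is hypothetical, and the estimate ignores activations, caches and any multi-tier memory layout.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Byte widths are the standard sizes of each numeric format; the model
# sizes below are illustrative round numbers, not figures from the panel.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the gigabytes (10**9 bytes) needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Print a small table for three hypothetical model sizes.
for label, n in [("7B", 7_000_000_000),
                 ("70B", 70_000_000_000),
                 ("1T", 1_000_000_000_000)]:
    row = ", ".join(f"{d}: {weight_memory_gb(n, d):,.0f} GB"
                    for d in BYTES_PER_PARAM)
    print(f"{label:>4} params -> {row}")
```

At 16-bit precision, a trillion parameters works out to roughly 2,000 GB of weights before any other memory use is counted.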

The panel discussion, “How Can We Keep Up With Generative AI?” was part of EE Times’ AI Everywhere 2023 virtual event, held in November and available for streaming.

The ballooning size of LLMs is creating yet another bottleneck, that of accessibility, Choy said. The growth in size of LLMs leads to diminishing returns once a model reaches a trillion parameters: At that point, the hardware and cost to operate the LLM are out of reach for all but the Fortune 10 or 20, he said. To democratize the use of large models, SambaNova tweaked the classic “mixture-of-experts” approach and revised the name.

“How do we make this more accessible to the ‘fortune everyone’ is what we’ve done with what we call ‘composition-of-experts’,” Choy said.

Rather than using the mixture-of-experts approach, in which a complex predictive modeling problem is divided into subtasks, SambaNova trains domain expert models for the greatest accuracy and task relevance. It then assembles them into a trillion-parameter composition-of-experts model, which can be trained on new data without sacrificing previous learning while saving on compute latency and the costs of training, fine-tuning and inferencing.
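One way to picture a composition of separately trained experts is a router that hands each prompt to a single domain model. The toy sketch below is not SambaNova's implementation: the expert functions, the `ExpertComposition` class and the keyword-based router are all hypothetical stand-ins, and a production system would presumably use a learned router and real models. It only illustrates that a new expert can be registered without touching the existing ones.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Model = Callable[[str], str]


# Hypothetical stand-ins for independently trained domain expert models.
def legal_expert(prompt: str) -> str:
    return f"[legal] {prompt}"


def finance_expert(prompt: str) -> str:
    return f"[finance] {prompt}"


def general_expert(prompt: str) -> str:
    return f"[general] {prompt}"


class ExpertComposition:
    """Toy router that sends each prompt to exactly one domain expert."""

    def __init__(self, fallback: Model) -> None:
        self.fallback = fallback
        self.experts: Dict[str, Tuple[Set[str], Model]] = {}

    def add_expert(self, name: str, keywords: Set[str], model: Model) -> None:
        # Registering a new expert leaves the existing experts untouched.
        self.experts[name] = (set(keywords), model)

    def route(self, prompt: str) -> Optional[str]:
        # Keyword overlap is an assumed placeholder for a learned router.
        words = set(prompt.lower().split())
        best_name, best_hits = None, 0
        for name, (keywords, _) in self.experts.items():
            hits = len(words & keywords)
            if hits > best_hits:
                best_name, best_hits = name, hits
        return best_name

    def generate(self, prompt: str) -> str:
        name = self.route(prompt)
        model = self.experts[name][1] if name else self.fallback
        return model(prompt)


coe = ExpertComposition(fallback=general_expert)
coe.add_expert("legal", {"contract", "clause", "liability"}, legal_expert)
coe.add_expert("finance", {"revenue", "earnings", "dividend"}, finance_expert)

print(coe.generate("Review this contract clause"))
print(coe.generate("Summarize quarterly earnings"))
print(coe.generate("Hello there"))
```

Because only one expert runs per prompt in this sketch, the cost of a request tracks the size of the selected expert rather than the size of the whole collection.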

Breaking the training model chain

Matt Mattina, VP of AI hardware and models at Tenstorrent, an AI computer builder, sees efficiency gains in breaking the “inherent feedback loop where model architecture winds up being shaped by the hardware on which it’s trained,” he told EE Times.

By using techniques such as network architecture search with hardware in the loop, a model trainer can specify, during training, the hardware that inference will run on, including what it looks like and its characteristics. The search will then find a model suited not necessarily to the machine it is being trained on but to the machine it will ultimately run inference on, Mattina said.
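The idea of searching for a model that fits the inference target can be sketched in a few lines. Everything in the example below is an assumption made for illustration: the `Hardware` and `Candidate` types, the two hardware profiles, the roofline-style latency estimate and the made-up quality proxy. A real hardware-in-the-loop search would measure latency on the actual device and train or evaluate each candidate; here an exhaustive grid search stands in for the search algorithm.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Hardware:
    """Hypothetical description of the inference target."""
    name: str
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s


@dataclass(frozen=True)
class Candidate:
    layers: int
    width: int

    @property
    def params(self) -> int:
        # Rough count for a stack of width x width dense layers.
        return self.layers * self.width * self.width


def estimated_latency_s(c: Candidate, hw: Hardware) -> float:
    # Roofline-style estimate for one forward pass: 2 FLOPs per weight,
    # and each 2-byte (fp16) weight read from memory once.
    compute = 2 * c.params / hw.peak_flops
    memory = 2 * c.params / hw.mem_bandwidth
    return max(compute, memory)


def proxy_quality(c: Candidate) -> float:
    # Made-up stand-in for validation accuracy: grows with log(params).
    return math.log10(c.params)


def search(hw: Hardware, latency_budget_s: float) -> Optional[Candidate]:
    """Return the best-scoring candidate that fits the target's budget."""
    best = None
    for layers, width in product([4, 8, 16], [512, 1024, 2048]):
        c = Candidate(layers, width)
        if estimated_latency_s(c, hw) > latency_budget_s:
            continue  # too slow on the inference target
        if best is None or proxy_quality(c) > proxy_quality(best):
            best = c
    return best


# Two hypothetical inference targets with the same 1 ms latency budget.
edge = Hardware("edge", peak_flops=1e12, mem_bandwidth=2e10)
server = Hardware("server", peak_flops=1e14, mem_bandwidth=1e12)

print(edge.name, search(edge, 1e-3))
print(server.name, search(server, 1e-3))
```

The same search space yields a smaller model for the bandwidth-limited target and the largest one for the faster target, which is the decoupling of training platform from inference platform that the passage describes.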

“There’s definitely a connection between the training platform and the models we see today,” he said. “But I think there’s a lot of interest and real engineering technology to break that connection so that we can find even more efficient models for inferencing.”

Specialize at system level

AI evolves so quickly that it’s hard to know how to balance dedicated chips and custom silicon against system-level flexibility, but Jeff Wittich, chief product officer at Ampere Computing, a cloud-native processor supplier, has some advice.

“I think today a lot of the specialization is best done at the system level, because that gives you the flexibility to mix and match, and create a solution that’s flexible, regardless of what happens over the next year or two when you’re kind of inside the window in which hardware can’t rapidly change,” he said. Traditionally, it’s taken five years to create and commercialize new hardware, he added.

To promote that flexibility, Ampere has partnered with several companies that are building different training and inference accelerators. Coupling a general-purpose CPU with an inference or training accelerator that does one particular task extremely well is a great approach, in Wittich’s view. Over time, those accelerators will likely be coupled more tightly with the CPU itself, he said.

“We also always need to just be cognizant of where integration benefits you and where it doesn’t,” Wittich said. “If you can improve performance and efficiency by integrating, that’s a great idea. If you just reduce flexibility, that’s likely not a great idea. So, I think we’ll see a range of options here, and a lot of it … in the next year or so, is probably at the system level.”

By: DocMemory
Copyright © 2023 CST, Inc. All Rights Reserved
