Wednesday, June 26, 2024
At Computex 2024, AMD CEO Lisa Su took the spotlight to preview a new version of the company’s flagship MI300X data center GPU, the MI325X. The MI325X will have more and faster memory than the current-generation MI300X (HBM3E versus HBM3, and 1.5x the capacity at 288 GB), which will be critical for at-scale inference of large language models (LLMs). Su also announced that AMD’s Instinct data center accelerators will move to an annual cadence from here on, echoing market leader Nvidia. But is AMD’s software stack keeping pace with its hardware roadmap?
AMD is working very hard on ROCm, Vamsi Boppana, senior VP of the AI group at AMD, told EE Times.
“Getting MI300X onto the market with ROCm supported [in December] was a big deal for us,” he said. “In addition to supporting the device itself, our big push last year was we said: any model under the sun has to run. That is table stakes…I’m very pleased with the progress we’ve made.”
AMD announced a partnership with online model library Hugging Face last year, under which all Hugging Face models will run on AMD’s Instinct data center accelerators (including MI300 and MI300X). Currently, all of Hugging Face’s 600,000 models are guaranteed to run out of the box on Instinct accelerators. (Backing up this guarantee is a suite of nightly tests, currently at 62,000 and growing, that the two companies have created based on the backbones of Hugging Face models rather than the entire models themselves.)
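For a sense of what “out of the box” means in practice, the flow on an Instinct GPU is the same as on any other PyTorch backend. Below is a minimal sketch, assuming a ROCm build of PyTorch and the transformers library are installed; on ROCm, PyTorch exposes AMD GPUs through the familiar "cuda" device alias, and the model name is only a placeholder.

```python
# Minimal sketch: load a Hugging Face model and run it on an Instinct GPU.
# Assumes a ROCm build of PyTorch; "cuda" is the alias ROCm builds use for AMD GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Hub model ID follows the same flow
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

inputs = tokenizer("ROCm out of the box:", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```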
“Based on that and based on what we are seeing with customers, the customer experience is pretty good out of the box,” Boppana said.
Collective communication
AMD has also been working on its algorithms, libraries and optimization techniques for generative AI in ROCm version 6.1. This version supports the latest attention algorithms and has implemented improved libraries for FP16 and FP8, Boppana said. AMD has also been working with lead customers and the open-source community to improve its ROCm Communication Collectives Library (RCCL) for GPU-to-GPU communication, which is critical in unlocking generative AI inference performance.
Large generative AI models are both compute and memory intensive. For inference, their size means they have to be split across several of even the biggest accelerators.
“When you cut the problem and do these computations inside [different] GPUs, then you need to synchronize, saying: this sub-problem got done here, this one got done here, let’s exchange the data and make sure everything is synchronized before the next set of work gets dispatched,” Boppana explained. This layer is referred to as communication collectives, and it includes strategies to improve latency and link utilization at different message sizes—plus schemes for overlapping communication and computation.
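As a rough illustration of such a collective, the sketch below uses PyTorch’s torch.distributed API, whose "nccl" backend maps to RCCL on ROCm builds of PyTorch. The tensor contents and launch command are placeholders.

```python
# Minimal sketch of an all_reduce collective across GPUs (RCCL under ROCm).
# Launch with e.g.: torchrun --nproc_per_node=4 allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # served by RCCL on ROCm
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)              # one GPU per process

    # Each GPU holds a partial result, e.g. one shard of a layer's output.
    partial = torch.full((4,), float(rank), device="cuda")

    # Synchronize: sum the partial results across all GPUs before the
    # next set of work is dispatched.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {partial.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```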
JAX support
Ongoing work with the open-source community includes extending ROCm’s existing support for the JAX framework.
“PyTorch is still dominant; the majority of our engagements are still PyTorch-based, but we do see JAX,” Boppana said, noting that because JAX emerged from the Google ecosystem, customers with a heritage at DeepMind or in the Google community are asking for JAX support.
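The programming model itself does not change on AMD hardware. A minimal sketch, assuming the ROCm-enabled jax and jaxlib packages are installed:

```python
# Minimal sketch: a jitted function on a ROCm-backed JAX installation.
import jax
import jax.numpy as jnp

print(jax.devices())  # should list the AMD GPU(s) when the ROCm packages are installed

@jax.jit
def gelu(x):
    # Tanh approximation of GELU; XLA compiles this for the underlying device.
    return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x**3)))

x = jnp.linspace(-3.0, 3.0, 8)
print(gelu(x))
```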
OpenAI’s Triton framework also now supports AMD Instinct accelerators. Triton enables code portability between GPU vendors, whether developers are writing new code they want to keep portable or porting existing code to new hardware. It lets developers program at a higher level of abstraction, with optimization algorithms handling the lower-level work, including partitioning large workloads and optimizing data movement.
“The industry wants to program at a higher level of abstraction—it’s difficult to program at the lowest level,” Boppana said. “If you have mature algorithms implemented in the frameworks, that’s the easiest path. But the field is evolving so fast that everybody wants to get the next step of evolution and develop the next set of optimized libraries. We need that intermediate layer of abstraction at which you get the best in terms of hardware capabilities, but you also need programmability efficiencies. Triton is starting to emerge as one [solution].”
Developers comfortable at the lower levels can, of course, continue to use ROCm to write custom kernels.
“The velocity of new AI models is so fast, you may not have the time to develop all these optimization algorithms at the low level,” Boppana said. “That’s when the industry needs something like Triton.”
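To give a feel for that level of abstraction, here is a minimal vector-add kernel in the style of Triton’s introductory tutorial. It assumes the Triton package alongside a ROCm (or CUDA) build of PyTorch; the block size is an arbitrary choice.

```python
# Minimal Triton kernel sketch: the same source targets AMD or Nvidia GPUs,
# with Triton handling tiling, masking and data movement.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program instance per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(4096, device="cuda")                   # "cuda" aliases the ROCm device
b = torch.randn(4096, device="cuda")
assert torch.allclose(add(a, b), a + b)
```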
Software stack unification
Boppana previously told EE Times that while AMD intends to unify AI software stacks across its portfolio (including Instinct’s ROCm, Vitis AI for FPGAs and Ryzen 7040, and ZenDNN on its CPUs), and while there is customer pull for this, it will not “disassemble the engine while the plane is flying.”
“We are committed to that vision and roadmap, but we’re not going to do unification for the sake of it. We’ll do it where there is value,” Boppana reiterated.
Use cases that will see value in software stack unification include platforms like PCs. PCs will increasingly have all three AI processor types (CPUs, GPUs and NPUs), and today, workloads are developed in silos for one of the three. In the future, apps will spread their workloads across hardware types. For example, a game might use the GPU for rendering and the NPU to run an LLM that powers a non-player character’s dialogue.
“Fundamentally, there’s no reason why these three engines have to have three different software stacks, three different experiences being put together by system integrators,” Boppana said. “Our vision is, if you have an AI model, we will provide a unified front end that it lands on and it gets partitioned automatically—this layer is best supported here, run it here—so there’s a clear value proposition and ease of use for our platforms that we can enable.”
The other thing customers sometimes run into is difficulty managing and maintaining a coherent stack internally, he said. A single, unified AMD stack would help, since AMD can then figure out the consistencies required in the software environment.
“The approach we will take will be a unified model ingest that will sit under an ONNX endpoint,” he said. “The front end we provide will decide through an intermediate level of abstraction which parts of the graph will be run where. Not all parts of the stacks need to be unified—lower levels that are device specific will always be separate—but the model ingest and its user experience will be consistent.”
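AMD has not detailed this unified front end, but ONNX Runtime’s existing execution-provider mechanism gives a rough feel for the idea: one ONNX ingest point, with the graph partitioned across backends in priority order and unsupported nodes falling back to the next provider. The sketch below is only an analogy, not AMD’s forthcoming stack, and the model path is a placeholder.

```python
# Illustrative analogy only: ONNX Runtime partitions an ONNX graph across
# execution providers listed in priority order.
import onnxruntime as ort

providers = [
    "ROCMExecutionProvider",   # AMD GPU (available in ROCm builds of onnxruntime)
    "CPUExecutionProvider",    # fallback for nodes the GPU provider cannot handle
]

session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path
print([inp.name for inp in session.get_inputs()])  # inspect the single ingest point
```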
The first parts of the stacks to be unified will be tools that are hardware-agnostic, like quantizers.
“The internal benefit we see right now is that as we leverage our investments across all our software teams, we don’t need to develop three different front ends for three different platforms,” Boppana said.
Consumer GPUs
With version 6.1 of ROCm, AMD introduced support for more of its Radeon consumer GPU products.
“The most important reason for us is we want more people with access to our platforms, we want more developers using ROCm,” Boppana said. “There’s obviously a lot of demand for using these products in different use cases, but the overarching reason is for us to enable the community to program our targets.”
This community includes consumers, but also smaller AI startups and research groups who cannot afford bigger hardware, he added.
Is AMD planning to extend support to all consumer GPUs?
“We would love to do more, it’s just a priority decision of how much resource we have and how much time we have,” Boppana said. AMD started with its most powerful GPUs as they are more relevant for AI, but will move down the list, he said.
Overall, is ROCm still AMD’s number one priority?
“AI is, for sure,” Boppana said. “Lisa [Su] has been extremely clear that we must have leadership software to be able to compete in AI. Many meetings I would be in front of her and she would say: great update, but are you going fast enough? Can you go faster? So she’s been very supportive.”