Thursday, July 11, 2019
Xilinx has shipped the first Versal devices to select customers as part of its early access program, a milestone for the company’s heterogeneous compute architecture. Versal devices use Xilinx’s adaptive compute acceleration platform (ACAP), part of the company’s strategy for modern workloads including high-speed networking, 5G, and artificial intelligence (AI).
“Having our first Versal ACAP silicon back from TSMC ahead of schedule and shipping to early access customers is a historic milestone and engineering accomplishment,” said Victor Peng, president and CEO of Xilinx, in a statement. “It is the culmination of many years of software and hardware investments and everything we’ve learned about architectures over the past 35 years.”
Built on TSMC’s 7-nm FinFET process technology, the first devices to ship are from the Versal Prime series (for a variety of applications) and the Versal AI Core series (for acceleration of AI inference workloads). According to Xilinx, the AI Core series can outperform GPUs by 8X on AI inference (based on sub-2ms latency convolutional neural network performance versus Nvidia V100).
In an interview with EE Times Europe, Xilinx’s Nick Ni, director of product marketing for AI, software and ecosystem, said that in the AI accelerator market in particular, there is a lot at stake.
“Everybody is betting on something, including [established players] and ASIC startups … we think this is an intelligent bet,” said Ni. “The truth is, nobody has captured the market in AI … it’s just really getting started. Everyone agrees the hardware will be the bottleneck of mass deployment — whoever gets the hardware right for this moving target workload, which is very hard to design for, [will win the bet].”
Ni said that the rapid pace of innovation in fields such as neural networks and artificial intelligence has so far left hardware running to catch up.
“Developing an ASIC for this market might take a couple of years to develop, verify, and go to market — by then the workload has completely changed,” he said. “When workloads continue to change, efficiency goes down, or you can’t [continue to] support the feature sets needed. … With AI workloads, which require low power and high efficiency, you need domain specific architecture, that is, you need a specific accelerator for each problem that you have.”
AI workloads are notoriously diverse and fast-moving in terms of structure. While all neural networks require huge amounts of compute and a lot of data transfer between different multiply-accumulate units (MACs), even basic image recognition workloads differ vastly depending on which neural network is used.
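As a rough illustration of how much the compute load can vary, the sketch below counts MACs for a single standard 2-D convolution layer. The function name and the two layer shapes are hypothetical choices for illustration and do not come from Xilinx or from any specific network.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k_h, k_w):
    """Multiply-accumulate operations for one standard 2-D convolution layer.

    Each output element (h_out * w_out * c_out of them) needs one MAC per
    input channel and per kernel tap. Bias and activation are ignored.
    """
    return h_out * w_out * c_out * c_in * k_h * k_w


# Two hypothetical layers with identical input and output shapes,
# differing only in kernel size.
macs_3x3 = conv2d_macs(56, 56, 64, 64, 3, 3)
macs_1x1 = conv2d_macs(56, 56, 64, 64, 1, 1)

print(f"3x3 layer: {macs_3x3:,} MACs")
print(f"1x1 layer: {macs_1x1:,} MACs")
print(f"ratio: {macs_3x3 // macs_1x1}x")
```

Under these assumed shapes, the kernel size alone changes the MAC count by a factor of nine, before any differences in depth or topology between networks are considered.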
“FPGAs have always been in the sweet spot of being able to work on all sorts of workloads efficiently,” he said. “AI has pushed us to the point where that is nice, but it’s not good enough: you really need ASIC-like performance in some portions of your workload, and you also need some flexibility to adapt to the latest innovations.”
To accelerate neural networks efficiently in hardware, three things have to be customized for every AI network, Ni explained.
First, the data path has to be customized. Data paths vary from the simplest feed-forward networks (e.g., AlexNet) to more complex paths with branches (e.g., GoogLeNet), to the latest networks with skip connections and merging paths (e.g., DenseNet).
Second is precision. The lower the precision, the more power you save, which is important for inference on edge devices; AI workloads have different “sweet spots” in terms of the number of bits required.
Third is data movement: how data gets moved around the chip.
“Many chip [makers] say they have huge peak performance, but if you can’t pump the data into the engine fast enough, most of the time it’s sitting idle,” said Ni. “The [pace of] innovation of AI is making this part difficult because the more complex the neural networks become, the more complex the data flow becomes, and how you move the data from memory to the engine and back efficiently becomes the bottleneck. This has to be customized for every network.”
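Ni’s point about idle engines can be made concrete with a simple roofline-style estimate, in which attainable throughput is the lesser of the peak compute rate and what the memory system can feed. This is a minimal sketch, not Xilinx’s methodology; the function names and all of the numbers are hypothetical.

```python
def attainable_tops(peak_tops, bandwidth_gb_per_s, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by data supply.

    bandwidth (GB/s) * arithmetic intensity (ops/byte) gives Gops/s,
    so divide by 1000 to express the memory-bound ceiling in TOPS.
    """
    memory_bound_tops = bandwidth_gb_per_s * ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)


def utilization(peak_tops, bandwidth_gb_per_s, ops_per_byte):
    """Fraction of peak compute that the engine can actually sustain."""
    return attainable_tops(peak_tops, bandwidth_gb_per_s, ops_per_byte) / peak_tops


# Hypothetical accelerator: 100 TOPS peak, 100 GB/s of memory bandwidth,
# running a workload that performs 50 operations per byte moved.
peak, bandwidth, intensity = 100.0, 100.0, 50.0
print(f"attainable: {attainable_tops(peak, bandwidth, intensity)} TOPS")
print(f"utilization: {utilization(peak, bandwidth, intensity):.0%}")
```

With these assumed figures the engine sustains only a small fraction of its peak rate, which is why the data flow, and not just the compute array, has to match the network.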
Domain Specific Architecture
Xilinx’s vision for neural network acceleration is a domain specific architecture in which the data path, precision, and memory hierarchy can all be customized.
“We are pushing from general purpose architecture, from CPUs and GPUs, into special hardware architectures for each problem … creating a custom system for each problem,” he said. “That’s extremely difficult for GPUs and ASICs, but it’s easier for FPGAs as we can reprogram the different hardware architecture for each [neural network].”
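To show what customizing precision means in practice, the sketch below applies symmetric linear quantization to a handful of weights at two bit widths. It is a generic textbook scheme used here as an assumption, not a description of how Versal or Xilinx’s tools quantize networks; the function names and sample weights are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


def max_error(values, num_bits):
    """Largest absolute round-trip error at the given bit width."""
    quantized, scale = quantize_symmetric(values, num_bits)
    restored = dequantize(quantized, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
for bits in (8, 4):
    print(f"{bits}-bit max round-trip error: {max_error(weights, bits):.4f}")
```

Fewer bits mean narrower arithmetic and less data to move, at the cost of a larger round-trip error; where that trade-off stops being acceptable differs from one network to the next.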
Versal is the first device family built on Xilinx’s ACAP. ACAP combines scalar processing blocks (a dual-core ARM Cortex-A72 application processor and an ARM Cortex-R5F real-time processor) with adaptable hardware engines (Xilinx’s new name for programmable logic) and intelligent engines (specialized DSP and AI engines), plus memory and interfaces.
The AI engine is “basically AI ASIC hardware — it has more flexibility to it [than an ASIC] but it achieves hundreds of TOPS for AI workloads,” Ni said. Up to 400 of these inference engines are included on the Versal AI Core devices.
Between the blocks is a software-programmable, multi-terabit network-on-chip, which enables data transfer between the engines, the memory, and the interfaces.
“This has been one of the choking points for FPGAs in the past, routing between different logics could degrade the performance, and having an actual ASIC (a hardened network-on-chip topology) we can move the data at a very fast speed, like gigahertz kinds of speed,” said Ni.
While FPGAs have traditionally had their own specialist programming languages, today’s AI accelerators will need to be accessible to software developers, hardware developers, and data scientists alike.
“As a company, we have invested significantly in bringing up the ease of use of software tools and frameworks for AI and software,” said Ni, referring to Xilinx’s purchase of Chinese AI startup DeePhi last year.
“If we want to go after AI and software developers, we have to support C, C++, OpenCV, OpenCL — those kinds of languages. Then for AI developers, new languages like Python, Caffe, TensorFlow. … We are transforming the way we are hiring more software [people] and investing into tools and frameworks.”
Available solutions include Xilinx’s DNNDK and ML Suite platforms, which can compile models from deep learning frameworks to run on FPGA boards without the developer writing any C or RTL code, Ni said.
“We will continue to enhance this in the future and import it to Versal,” he said.
Versal currently comes with its own development environment with a software stack that includes drivers, middleware, libraries, and software framework support. The Versal AI Core and Versal Prime series will be available in the second half of 2019.