Chinese startup GPU company aims at AI and Data Center applications


Tuesday, August 30, 2022

Chinese startup Biren has emerged from stealth, detailing a large, general-purpose GPU (GPGPU) chip intended for AI training and inference in the data center. The BR100 is composed of two identical compute chiplets, built on TSMC 7nm at 537 mm² each, plus four stacks of HBM2e in a CoWoS package.

“We were determined to build larger chips, so we had to be creative with packaging to make BR100’s design economically viable,” said Biren CEO Lingjie Xu. “BR100’s cost can be measured by better architectural efficiency in terms of performance per watt and performance per square millimeter.”

The BR100 can achieve 2 POPS of INT8 performance, 1 PFLOPS of BF16, or 256 TFLOPS of FP32. This is doubled to 512 TFLOPS of 32-bit performance when using Biren’s new TF32+ number format. The GPU also supports other 16- and 32-bit formats but not 64-bit (64-bit is not widely used for AI workloads outside of scientific computing).
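To put the quoted peak figures side by side, the short sketch below expresses each format's throughput relative to FP32. It assumes 1 PFLOPS is exactly 1,000 TFLOPS (the rounded figures in the article may hide a power-of-two value), and the names `PEAK_TOPS` and `speedup_over_fp32` are illustrative, not part of any Biren software.

```python
# Peak throughput figures quoted in the article, in tera-operations per second.
# Assumption: "2 POPS" and "1 PFLOPS" are taken as exactly 2000 and 1000.
PEAK_TOPS = {
    "INT8": 2000.0,
    "BF16": 1000.0,
    "TF32+": 512.0,
    "FP32": 256.0,
}


def speedup_over_fp32(fmt: str) -> float:
    """Return the quoted peak throughput of `fmt` as a multiple of FP32."""
    return PEAK_TOPS[fmt] / PEAK_TOPS["FP32"]


if __name__ == "__main__":
    for fmt in PEAK_TOPS:
        print(f"{fmt:>6}: {PEAK_TOPS[fmt]:7.1f} T(FL)OPS, "
              f"{speedup_over_fp32(fmt):.2f}x FP32")
```

These are peak numbers only; sustained throughput on a real workload depends on memory bandwidth and utilization, which the article does not quantify.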

Using chiplets meant Biren could exceed the reticle limit while retaining the yield advantages of smaller dies, which reduces cost. Xu said that compared with a hypothetical reticle-sized design based on the same GPU architecture, the two-chiplet BR100 achieves 30% more performance (it is 25% larger in compute die area) and 20% better yield.
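The area claim can be checked with simple arithmetic, and the yield argument can be illustrated with a textbook model. The sketch below assumes the hypothetical monolithic die fills a standard 26 mm × 33 mm lithography field (858 mm²) and uses a Poisson yield model with a made-up defect density; neither the model nor the defect density comes from Biren, so the yield numbers are illustrative and are not an attempt to reproduce the 20% figure.

```python
import math

# Standard maximum lithography field: 26 mm x 33 mm.
RETICLE_LIMIT_MM2 = 26.0 * 33.0  # 858 mm^2
# Per-chiplet compute die area quoted in the article.
CHIPLET_AREA_MM2 = 537.0


def area_ratio() -> float:
    """Total BR100 compute area relative to one reticle-sized die."""
    return 2 * CHIPLET_AREA_MM2 / RETICLE_LIMIT_MM2


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson die-yield model: Y = exp(-A * D0), with A converted to cm^2."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


if __name__ == "__main__":
    d0 = 0.1  # hypothetical defect density in defects/cm^2, for illustration
    print(f"compute area vs. reticle limit: {area_ratio():.3f}x")
    print(f"yield, 537 mm^2 chiplet:  {poisson_yield(CHIPLET_AREA_MM2, d0):.1%}")
    print(f"yield, 858 mm^2 monolith: {poisson_yield(RETICLE_LIMIT_MM2, d0):.1%}")
```

Under these assumptions, two 537 mm² chiplets come to about 1.25 times the reticle-limit area, in line with the "25% larger" figure, and the smaller die yields better at any positive defect density.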

Another advantage of the chiplet design is that the same tapeout can be used to make multiple products. Biren also has the single-chiplet BR104 on its roadmap.

The BR100 will come in OCP accelerator module (OAM) format, while the BR104 will come on PCIe cards. Together, 8 × BR100 OAM modules will form “the most powerful GPGPU server in the world, purpose-built for AI,” said Xu. The company is also working with OEMs and ODMs.

PETAFLOPS-CAPABLE

High-speed serial links between the chiplets offer 896-GB/s bidirectional bandwidth, which allows the two compute tiles to operate like one SoC, said Biren CTO Mike Hong.

As well as its GPU architecture, Biren has also developed a dedicated 412-GB/s chip-to-chip (BR100 to BR100) interconnect called BLink, with eight BLink ports per chip. This is used to connect to other BR100s in a server node.

Each compute tile has 16 streaming processor clusters (SPCs), connected by a 2D mesh-like network on chip (NOC). The NOC has multi-tasking capability for data-parallel or model-parallel operation.

Each SPC has 16 execution units (EUs), which can be split into compute units (CUs) of four, eight, or 16 EUs.

Each EU has 16 streaming processing cores (V-cores) and one tensor core (T-core). The V-cores are general-purpose SIMT processors with a full-set ISA for general-purpose computing; they handle data preprocessing and operations like Batch Norm and ReLU, and they manage the T-core. The T-core accelerates matrix multiplication and addition, plus convolution, the operations that make up the bulk of a typical deep-learning workload.
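Multiplying out the hierarchy described above gives the total unit counts for a full BR100. The sketch below is simple bookkeeping based on the per-level figures in the article; the class and method names are hypothetical, and shipping parts may enable fewer units than the full design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BR100Topology:
    """Per-level unit counts as described in the article (illustrative)."""
    chiplets: int = 2
    spcs_per_chiplet: int = 16
    eus_per_spc: int = 16
    v_cores_per_eu: int = 16
    t_cores_per_eu: int = 1

    @property
    def total_eus(self) -> int:
        return self.chiplets * self.spcs_per_chiplet * self.eus_per_spc

    @property
    def total_v_cores(self) -> int:
        return self.total_eus * self.v_cores_per_eu

    @property
    def total_t_cores(self) -> int:
        return self.total_eus * self.t_cores_per_eu

    def cus_per_spc(self, eus_per_cu: int) -> int:
        """Number of compute units per SPC for a given CU size (4, 8, or 16 EUs)."""
        if eus_per_cu not in (4, 8, 16):
            raise ValueError("the article lists CU sizes of 4, 8, or 16 EUs")
        return self.eus_per_spc // eus_per_cu


if __name__ == "__main__":
    chip = BR100Topology()
    print("EUs:", chip.total_eus)
    print("V-cores:", chip.total_v_cores)
    print("T-cores:", chip.total_t_cores)
    for size in (4, 8, 16):
        print(f"CUs per SPC with {size} EUs each:", chip.cus_per_spc(size))
```

With the article's figures this works out to 512 EUs, 8,192 V-cores, and 512 T-cores across the two chiplets.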

Biren has also invented its own number format, E8M15 (8 exponent bits, 15 mantissa bits), which it calls TF32+. This format is intended for AI training; it has the same-sized exponent (same dynamic range) as Nvidia’s TF32 format but with five extra bits of mantissa (in other words, it is five bits more precise). According to Biren, this means the BF16 multiplier can be reused for TF32+, simplifying the design of the T-core.
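The precision difference between these formats can be emulated in software by clearing the low mantissa bits of an FP32 value. The sketch below is an illustration only: it truncates toward zero, which is an assumption, since the article does not say how the hardware rounds, and the function name `to_e8m` is hypothetical. The mantissa widths used are 7 for BF16, 10 for TF32, 15 for TF32+, and 23 for FP32.

```python
import struct


def to_e8m(x: float, mantissa_bits: int) -> float:
    """Truncate `x` to a format with an 8-bit exponent and `mantissa_bits`
    explicit mantissa bits by zeroing the low bits of its FP32 encoding.

    Assumption: truncation toward zero; real hardware may round differently.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("mantissa_bits must be between 0 and 23")
    # Reinterpret the FP32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


if __name__ == "__main__":
    x = 1.0 + 2.0 ** -15  # needs 15 mantissa bits to be represented exactly
    for name, m in (("BF16", 7), ("TF32", 10), ("TF32+", 15), ("FP32", 23)):
        print(f"{name:>5} (E8M{m}): {to_e8m(x, m)!r}")
```

In this emulation a value such as 1 + 2^-15 survives in TF32+ and FP32 but collapses to 1.0 in TF32 and BF16, which is what five extra mantissa bits buy.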

Xu said the company has already submitted results for the next round of MLPerf inference benchmarks, which should be available in the next few weeks.

By: DocMemory
Copyright © 2023 CST, Inc. All Rights Reserved