
Startup Takes on Nvidia with FPGA-based AI


Wednesday, February 26, 2025

Data center AI systems startup Positron, just 18 months old, has been shipping its FPGA-based LLM inference systems to customers since last summer, and recently delivered the first systems in a multi-million-dollar order to its Tier 2 CSP customer, Positron CTO and co-founder Thomas Sohmers told EE Times.

“It’s been a really great start to the year and we expect that customer and a number of others to be scaling significantly in the first half of this year,” Sohmers said.

A further 20 potential customers are currently evaluating Positron’s FPGA-based AI appliance, Atlas, either directly or remotely, Sohmers said. Positron customers include enterprises running on-prem or co-located infrastructure, and Tier 2 CSPs.

“Most of the conversations we’ve been having, especially about larger scale [deployments], have been with companies that either are themselves CSPs or companies providing scaled web services,” Sohmers added.

Seed funding

Positron was founded in April 2023 by Sohmers and chief scientist Edward Kmett, both of whom had previously worked at AI inference startup Groq. The company appointed a new CEO, Mitesh Agrawal, who joined from AI CSP Lambda earlier this month, and has raised $23.5 million in seed funding.

“When we founded Positron, it was focused on the fact that only two things matter—having a completely seamless experience going from Nvidia-based systems…and the failure point we saw for so many AI chip startups is they just took way too long and way too much to get to market,” Sohmers said, noting that while the company is working on its own AI inference accelerator ASIC, its first- and second-generation Atlas systems are FPGA-based.

FPGAs cannot offer the FLOPS of GPUs or ASIC solutions, but they have other benefits, he said.

“We don’t want to spend the massive amount of time and money on building an ASIC until we are absolutely sure we have product-market fit,” Sohmers said. “While other AI chip companies have their own unique problems, they all have problems with product-market fit, especially with first-generation devices. Going with FPGAs enabled us to do very rapid iteration, and start that iteration with customers [on board].”

Customer dollars coming in are the best indicator of product-market fit, he added.

Sohmers’ pre-Groq experience at AI CSP Lambda highlighted the need for rapid iteration cycles and for constant dialog with customers. Not all AI chip companies appreciate this, he said.

“The most important thing is making sure you’re building the right thing, and constantly updating your assumptions about that,” Sohmers said.

While Positron does have a customer purchasing its PCIe cards by the thousands, Atlas' appliance form factor better suits most cloud companies, which are used to buying Nvidia boxes, plugging them in and renting them out.

“We believe this appliance model—just tokens in, tokens out, a black box—is the easiest way for customers to purchase our hardware and get all the benefits of owning it from a capex and opex perspective,” Sohmers said. “But it also makes it easy for those customers to replace or augment their existing Nvidia-based systems.”

Memory bandwidth utilization

Positron’s Atlas LLM inference appliance offers 70% faster performance (tokens per second) than the same inference workload on Nvidia Hopper-based systems, at 3.5× the performance per watt and 3.5× the performance per dollar, based on Positron’s current software release. It is built on Altera’s Agilex-7M FPGAs, which only Positron is currently authorized to ship, since the parts have not yet reached general availability, Sohmers said. These FPGAs come with 32 GB of HBM.
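
Taken together, those ratios also imply how much power and money an Atlas box would draw relative to the Hopper baseline. The short Python sketch below works through the arithmetic; the baseline throughput, power and price figures are arbitrary placeholders, and only the 1.7×, 3.5× and 3.5× multipliers come from Positron's claims.

    # Back-of-the-envelope check of the claimed ratios against a Hopper-based
    # baseline. Baseline figures are arbitrary placeholders; only the 1.7x
    # throughput, 3.5x perf/W and 3.5x perf/$ multipliers come from the article.
    baseline_tps = 1_000.0       # assumed baseline tokens/s
    baseline_watts = 10_000.0    # assumed baseline system power (W)
    baseline_price = 250_000.0   # assumed baseline system price ($)

    atlas_tps = 1.7 * baseline_tps                                 # "70% faster"
    atlas_perf_per_watt = 3.5 * (baseline_tps / baseline_watts)    # 3.5x perf/W
    atlas_perf_per_dollar = 3.5 * (baseline_tps / baseline_price)  # 3.5x perf/$

    # Power and price the Atlas system would need for those ratios to hold:
    atlas_watts = atlas_tps / atlas_perf_per_watt    # = baseline_watts * 1.7 / 3.5
    atlas_price = atlas_tps / atlas_perf_per_dollar  # = baseline_price * 1.7 / 3.5

    print(f"implied power ratio: {atlas_watts / baseline_watts:.2f}x")  # ~0.49x
    print(f"implied price ratio: {atlas_price / baseline_price:.2f}x")  # ~0.49x

Whatever the absolute baseline, the claims jointly imply an appliance delivering 1.7× the throughput at roughly half the power draw and half the price of the Hopper-based system it replaces.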

The current generation of Atlas is a 4U system using four FPGAs on PCIe cards. It is designed as a turnkey appliance, ingesting model binaries from Hugging Face or customers’ proprietary models in a zero-step process (no recompilation required).

“A one-step process was a step too many,” Sohmers said. “There is nothing new or unique that needs to be done to run models on Positron.”
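
As a purely illustrative sketch of that "tokens in, tokens out" model, the Python snippet below sends a prompt to an inference appliance over HTTP; the hostname, port, path and JSON schema are assumptions made for the example and are not Positron's documented interface.

    # Illustrative only: a "tokens in, tokens out" request to an inference
    # appliance. The endpoint and request/response schema are assumptions for
    # this sketch, not Positron's documented API.
    import json
    import urllib.request

    payload = {
        "model": "meta-llama/Llama-3.1-70B-Instruct",  # e.g. a Hugging Face model id
        "prompt": "Summarize why LLM inference is memory-bound.",
        "max_tokens": 64,
    }
    req = urllib.request.Request(
        "http://atlas.example.local:8000/v1/completions",  # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["text"])       # tokens out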

A next-generation platform will use Positron’s custom module form factor (analogous to Nvidia’s SXM) to shrink a four-FPGA system into 2U, with significantly expanded DDR memory. That system is due later this year, and Sohmers said it is projected to offer 5× the performance of Nvidia Blackwell. The big jump over Positron’s first generation comes from forthcoming software and FPGA optimizations, including shifting more operations from the host CPU onto the FPGA; the first-generation Atlas will also receive these updates as they become available.

So how does Positron get better performance from hardware with fewer FLOPS and less memory? Sohmers explained that while CNNs are compute-bound, transformers are memory-bound, in terms of both memory bandwidth and memory capacity. GPU-based inference solutions have been shown to use less than 30% of their theoretical peak memory bandwidth for transformer inference. The Altera Agilex-7M is the only FPGA with both HBM and DDR5 memory, and while its compute FLOPS may be limited, memory bandwidth is what matters, Sohmers said.

“You may be paying for a very expensive memory and very high theoretical memory bandwidth [with GPUs], but fundamentally due to GPU architectures, you’re not able to achieve anywhere close to that memory bandwidth,” he said. “Our design implemented on the FPGA is actually achieving and sustaining 93% of its theoretical memory bandwidth, across all use cases.”

The remaining 7% cannot be captured because HBM’s refresh cycles cannot be directly controlled, he added.
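
The practical effect of bandwidth utilization follows from a standard back-of-the-envelope model of memory-bound decoding: generating one token requires streaming the model weights once, so per-stream throughput is roughly the achieved bandwidth divided by the size of the weights. The model size and peak bandwidth below are illustrative assumptions, not Positron or Nvidia measurements.

    # Memory-bound decoding: tokens/s per stream ~= achieved_bandwidth / weight_bytes,
    # since every weight is read once per generated token. Figures are illustrative.
    params = 70e9                 # assumed 70B-parameter model
    weight_bytes = params * 2     # FP16 weights, ~140 GB
    peak_bw = 1.0e12              # assumed 1 TB/s theoretical peak bandwidth

    for label, utilization in [("<30% of peak (typical GPU figure cited above)", 0.30),
                               ("93% of peak (Positron's claim)", 0.93)]:
        tokens_per_s = peak_bw * utilization / weight_bytes
        print(f"{label}: ~{tokens_per_s:.1f} tokens/s per stream")

At the same nominal peak bandwidth, tripling the sustained fraction roughly triples decode speed per stream, which is why Sohmers treats bandwidth utilization, not FLOPS, as the figure of merit.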

How the company gets this memory bandwidth utilization is Positron’s key IP; Sohmers said the company works at lower levels than Altera’s Quartus tools allow to maximize the density of its matmul array and the memory interconnect that feeds it. Positron was achieving 65-70% of theoretical peak memory bandwidth with its initial prototypes based on previous-generation HBM-equipped Stratix devices. But upgrading to Agilex meant the team could take advantage of Altera’s new hardened Fabric NoC, which is designed to support fast transfer between the FPGA’s memories, rather than relying on channels that are also used for the rest of the chip’s programmable logic resources. The new NoC has dedicated pathways from the HBM to SRAM blocks anywhere in the programmable logic array.

“Since this is a new feature, we worked very closely with the Altera team to make sure we could actually utilize it to its maximum potential,” Sohmers said. “There was a lot of new thinking required within our linear algebra systolic array design to make sure we could keep up at the reprogrammable clock rate, to make sure there was that one-to-one balance between FLOPS and memory.”

The Agilex-7M has four channels of DDR5 as well as 32 GB of HBM2e, which Positron uses as separate memories (not a tiered cache system) combined with some “fancy tricks” in SRAM. HBM is used where high performance is required, in this case storing model weights. DDR is used to store user context, KV cache and different models to be swapped in (such as different LoRA fine-tunings, which can be applied to different users within a batch, Sohmers said). Up to 512 GB of additional DDR5 can be attached.
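
A rough capacity budget shows how that split works out for a four-FPGA box: HBM holds the weights while DDR5 absorbs the KV cache, which grows with the number of users and their context lengths. The model shape and weight precision below are assumptions for illustration, not a published Positron configuration.

    # Rough capacity budget: HBM for weights, DDR5 for KV cache. The model shape
    # (Llama-2-70B-like: 80 layers, 8 KV heads of dim 128) and the 8-bit weight
    # format are assumptions for this sketch.
    hbm_gb = 4 * 32                 # HBM across the four FPGAs
    ddr_gb = 512                    # attachable DDR5, per the article

    weights_gb = 70e9 * 1 / 1e9     # ~70 GB of weights at 8 bits each (assumed)

    layers, kv_heads, head_dim = 80, 8, 128
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, FP16: ~0.33 MB
    kv_gb_per_user = kv_bytes_per_token * 4096 / 1e9            # ~1.3 GB for a 4k context

    print(f"weights: ~{weights_gb:.0f} GB vs {hbm_gb} GB of HBM")
    print(f"KV cache per 4k-token user: ~{kv_gb_per_user:.1f} GB")
    print(f"concurrent 4k contexts that fit in DDR: ~{ddr_gb / kv_gb_per_user:.0f}")

Under these assumptions the weights fit comfortably in HBM, while a few hundred 4k-token user contexts, plus swappable LoRA adapters, live in the much larger and cheaper DDR5 pool.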

Hardware roadmap

Given the FPGA’s reprogrammability, could Positron specialize its appliances further for certain applications within inference to gain extra performance?

Sohmers said the company did not want to risk ending up in a niche (beyond LLM inference), but that as the company expands its team beyond its 15 current employees following its new funding, “there are definitely areas we could optimize and potentially have that available to customers in the same physical appliance.”

Positron is also working on an ASIC version of its FPGA-based design. This ASIC will use LPDDR only (no HBM).

“With LPDDR5X and 6, we’re able to get massively higher capacity than HBM at a quarter of the cost per gigabyte,” Sohmers said. “The packaging will be a regular organic substrate, and that drastically reduces the cost of the product.”

While LPDDR is not as fast as HBM, using Positron’s IP to get close to the theoretical peak memory bandwidth more than makes up for it, he said. Positron can also directly control the memory refreshes on DDR, which enables the company to get even closer to theoretical peak performance than it could with HBM, without the power or cost overheads HBM has.
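
One way to frame that trade is effective bandwidth, i.e. peak bandwidth multiplied by sustained utilization. The peak figures below are generic ballpark numbers for a single HBM2e stack versus a wide LPDDR5X configuration, and the 97% LPDDR utilization is an assumed stand-in for 'even closer to theoretical peak'; none of these are Positron specifications.

    # Effective bandwidth = peak bandwidth x sustained utilization. Peak numbers
    # are generic ballpark figures for illustration; 93% echoes the article's
    # claim and 97% is an assumed stand-in for "even closer to peak" on DDR.
    def effective_gbs(peak_gbs: float, utilization: float) -> float:
        return peak_gbs * utilization

    print(f"HBM2e stack (~410 GB/s peak) at 93%:        ~{effective_gbs(410, 0.93):.0f} GB/s")
    print(f"wide LPDDR5X array (~400 GB/s peak) at 97%: ~{effective_gbs(400, 0.97):.0f} GB/s")

If utilization stays high, a wide, cheaper LPDDR array can land in the same effective-bandwidth range as an HBM stack while offering far more capacity per dollar, which is the bet the ASIC is making.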

Sohmers also hinted that the new ASIC will be set up for networking; Agilex FPGAs each have three 400 Gbps networking transceivers, unlike GPUs, which require additional NICs. Positron can connect 256 FPGAs point-to-point without additional switches, he said.
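
For a sense of scale under that claim, the sketch below simply totals the link capacity of three 400 Gbps transceivers per device across 256 devices; the actual interconnect topology is not described in the article.

    # Totals the link capacity implied by three 400 Gbps transceivers per FPGA
    # across 256 FPGAs connected point-to-point (no external switches). The
    # topology itself is not specified in the article.
    links_per_fpga, link_gbps, fpgas = 3, 400, 256

    per_fpga_gbps = links_per_fpga * link_gbps       # 1,200 Gb/s egress per FPGA
    total_links = fpgas * links_per_fpga // 2        # each point-to-point link joins two FPGAs
    aggregate_tbps = total_links * link_gbps / 1000  # ~153.6 Tb/s across the fabric

    print(f"per-FPGA egress:  {per_fpga_gbps} Gb/s ({per_fpga_gbps // 8} GB/s)")
    print(f"aggregate fabric: {aggregate_tbps:.1f} Tb/s across {total_links} links")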

U.S. fabrication of chips and systems for AI infrastructure may hold some advantages, given U.S. Vice President JD Vance’s comments at the AI Summit in Paris last week that “the Trump administration will ensure that the most powerful AI systems are built in the U.S. with American designed and manufactured chips.”

Agilex-7M FPGAs are fabbed at IFS in Chandler, Arizona. Positron is aiming for U.S. fabrication of its ASIC on TSMC N5 in Tucson, Arizona, while LPDDR vendors are expanding U.S. fabrication, Sohmers said. PCIe cards will be assembled in Camarillo, California, with final assembly and test currently done in Spokane, Washington, and planned to move to Reno, Nevada, in the near future.

Positron is aiming to sample its ASIC in Q1 2026.

By: DocMemory
Copyright © 2023 CST, Inc. All Rights Reserved
