Monday, February 10, 2025
Former Intel CEO Pat Gelsinger revealed he has invested in British AI chip startup Fractile.
“With the advent of reasoning models, which require memory-bound generation of thousands of output tokens, the limitations of existing hardware roadmaps have compounded. To achieve our aspirations for AI, we will need radically faster, cheaper and much lower power inference. I’m pleased to share that I’ve recently invested in Fractile, a U.K.-founded AI hardware company who are pursuing a path that’s radical enough to offer such a leap,” Gelsinger said in a LinkedIn post.
Gelsinger also noted that the role of inference performance is still underappreciated in the development of frontier AI models; better inference performance, he argued, is equivalent to years of lead in model development.
So, who is Fractile and what is the company working on?
Fractile was founded in 2022 by CEO Walter Goodwin. The company raised a $15 million seed round in summer 2024 and received a $6.52 million grant from the U.K. government’s ARIA program in October. It will use the funding to develop its data center LLM inference accelerator, which uses in-memory compute and targets models with billions to hundreds of billions of parameters. Fractile projects its accelerator will run Llama2-70B 100× faster than Nvidia H100s (in decode tokens per second) at one-tenth of the system cost.
Before starting Fractile, Goodwin had completed a Ph.D. in AI and robotics at the University of Oxford, working on using large multi-modal foundation models to make robots that are better able to generalize.
“I was excited by the scaling hypothesis and the idea that deep learning would [change] from thousands of companies training thousands of different models to tackle particular problems, we’d all just say thank you to whoever spent the most money to train something on a web-scale dataset and then find ways to leverage that same backbone architecture to power different things,” Goodwin told EE Times in an earlier interview. “That’s what led to me starting Fractile—I started to see that for inference, because of the memory bottleneck, you spend a lot of time not at the top of your roofline.”
Fractile’s AI accelerator concept uses in-memory compute. While Goodwin declined to say whether there are any analog compute elements to the design, he said it will use Fractile’s own CMOS SRAM cell design. Fractile’s Technical Design Authority, Tony Stansfield, is a 10-year veteran of SureCore, a British SRAM designer.
“It’s transistor-level design, for sure,” Goodwin said. “This is not a fundamentally novel approach to memory, we’re using relatively standard, albeit somewhat modified memory cells, and doing a custom circuit layout and design for that. That’s part of how we’re driving up density and TOPS/W.”
In-memory compute is a well-known technique for fast, high-throughput matrix-vector multiplication, and it becomes more important as workloads shift to foundation models, Goodwin said.
In-memory compute offers only modest advantages for CNN inference, Goodwin argues. CNN workloads mix matmul with other operations and use smaller matrices and kernels; an in-memory accelerator can keep weights stationary so it no longer shuttles them between processor and memory, but for CNNs, moving activations around the chip remains a relatively large share of the work, which limits the gain.
For LLMs, by contrast, there are many times more weights than activations, and the nature of the workload amplifies the advantages in-memory compute offers.
“One of the hallmarks of multi-billion-parameter models is the matrix multiply, in particular the very wide matrix,” Goodwin said. “Because activations come out of the sides of those matrices, they’re dramatically smaller, one ten-thousandth as many activations as weights for inference. That’s a change in design point in terms of how far you can push the advantage from keeping matrices or weights stationary in memory.”
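To make that weight-to-activation ratio concrete, here is a minimal sketch counting the values involved in a single decode-step matrix-vector multiply for one feed-forward projection. The dimensions are assumptions roughly in the range of a 70B-parameter model, not figures from Fractile.

```python
# Illustrative only: count weights vs. activations for one decode step.
# Dimensions are assumed values in the ballpark of a 70B-parameter model.
import numpy as np

d_model, d_ff = 8192, 28672                      # assumed hidden and feed-forward widths

# Values don't matter for the count, so zeros keep the sketch lightweight.
W = np.zeros((d_ff, d_model), dtype=np.float16)  # weights that can stay resident in memory
x = np.zeros(d_model, dtype=np.float16)          # one token's input activations

y = W @ x                                        # a decode step is a matrix-vector product

n_weights = W.size                               # values that can stay stationary
n_acts = x.size + y.size                         # values that must move per generated token
print(f"weights: {n_weights:,}  activations: {n_acts:,}  ratio ~ 1:{n_weights // n_acts:,}")
```

With these assumed dimensions the ratio works out to roughly one activation for every several thousand weights per layer, the same order of magnitude Goodwin cites.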
While in-memory compute is well suited to LLMs, many existing in-memory compute architectures built for the CNN era are also at a disadvantage because, unlike CNNs, LLMs are characterized by variable-length inputs and outputs, Goodwin added.
“[The existing concept] is strained by LLMs even for a single user, where there are two distinct stages and those stages are of uncertain duration,” he said. “If your compiler paradigm expects a static list of what needs to be done in what order, and has compiled it to flow through the chip in a certain way, [and calculated when] things are going to trigger, you’re inherently going to be leaving some performance on the table because you’ll have to be padding things to fit that sequence length, and so on.”
Existing architectures are built around matrix-matrix multiplication, either for better data reuse or because they have a systolic array of a particular size. For workloads that switch between prompt processing (long sequences of data) and the decode stage (one token at a time), matrix-vector multiplication is more flexible and therefore a better fit, Goodwin added, noting that flexibility is a key part of Fractile’s architecture.
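The following sketch contrasts the shapes involved in those two phases. The layer sizes and prompt length are deliberately scaled-down, illustrative assumptions, not Fractile's numbers.

```python
# Sketch of the two inference phases; dimensions are illustrative assumptions.
import numpy as np

d_model, d_ff = 1024, 4096
W = np.random.randn(d_ff, d_model).astype(np.float32)   # one weight matrix

# Prefill: the whole prompt is processed at once, so activations form a matrix
# and the operation is matrix-matrix -- each fetched weight is reused across
# every position in the sequence.
prompt_len = 512
X_prefill = np.random.randn(prompt_len, d_model).astype(np.float32)
Y_prefill = X_prefill @ W.T                              # shape (512, 4096)

# Decode: tokens come out one at a time, so each step is matrix-vector --
# every weight is fetched for a single multiply-accumulate unless it is
# already stationary in memory.
x_decode = np.random.randn(d_model).astype(np.float32)
y_decode = W @ x_decode                                  # shape (4096,)

print("prefill:", Y_prefill.shape, " decode:", y_decode.shape)
```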
“This sounds like an LLM-specific concept, but one of the things that is really sticky about AI is, even five years into the future, we’re still going to be operating on data as sequences, we’ll still be tokenizing everything,” he said. “In-memory compute, by not having the memory access bottleneck, allows efficient matrix-vector multiply, which allows you to create systems that are much more effective at sequence processing.”
With Goodwin’s background in robotics, can he see a need for efficient LLM inference at the edge, in robots in particular? While the company is open to talking with teams who are defining frontier models in every sector, Fractile’s solution may not be suitable for current edge applications, he said.
“When you can serve huge throughput, it makes sense to look for those places where there is throughput demand, and amortize your hardware and get those cost savings,” he said. “Right now, that’s quite clearly a data center grade solution. But the edge actually does have that all-you-can-eat appetite for token processing as well, it’s just that we’re only just starting to build those systems.”
The dawn of agentic AI in robotics could be the turning point, since it would require many times more throughput at low latency.
“For future robot platforms, I think we should be rethinking what it means to have inference running there,” he said. “It’s not going to be something processing one image every 30 milliseconds. It might very well look like a multi-user data center grade inference server, with perhaps 32 separate threads running at thousands of tokens per second to come up with the next action that you take 20 milliseconds later.”
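Taken at face value, Goodwin's hypothetical numbers imply a token budget per action along these lines; the per-thread rate below is an assumed value within the "thousands of tokens per second" range he mentions, not a quoted figure.

```python
# Back-of-envelope sketch of the robot inference budget Goodwin describes.
threads = 32                       # concurrent inference threads he suggests
tokens_per_sec_per_thread = 2_000  # assumption within "thousands of tokens per second"
action_interval_s = 0.020          # a new action every 20 ms

tokens_per_action = tokens_per_sec_per_thread * action_interval_s
aggregate_tokens_per_sec = threads * tokens_per_sec_per_thread
print(f"~{tokens_per_action:.0f} tokens of reasoning per thread per action, "
      f"~{aggregate_tokens_per_sec:,} tokens/s aggregate")
```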