Cerebras IPO: Launching the next phase of the AI bubble.
The first true AI pure play is going public. How much is substance and how much is hype?
Disclaimer: The information contained in this article is not and should not be construed as investment advice. This is my investing journey and I simply share what I do and why I do it, for educational and entertainment purposes.
This article is entirely free to read.
Today, Cerebras Systems Inc. $CBRS filed an S-1 for an imminent IPO. I touched on this company last year when I wrote my deep dive on the commercial rise of supercomputing. Back then I joked that this IPO would launch the final stage of the AI bubble. Naturally, I am interested in what this company has to say about their business when they finally go public. So, here is my first glance at their prospectus.
TLDR Summary
The company plans to raise $750m to $1bn, which would value them at $7-8bn. With 1H24 revenues of $136m, this would value them at about 30x annualized 1H24 revenues.
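A quick back-of-the-envelope check of that multiple, using the rough deal figures above (the final pricing and valuation are of course still unknown at this point):

```python
# Back-of-the-envelope check of the implied revenue multiple.
# Figures are the rough ranges discussed above, not exact deal terms.
h1_2024_revenue = 136e6              # 1H24 revenue in USD
annualized_revenue = 2 * h1_2024_revenue

for valuation in (7e9, 8e9):         # assumed post-IPO valuation range
    multiple = valuation / annualized_revenue
    print(f"${valuation/1e9:.0f}bn valuation -> {multiple:.0f}x annualized 1H24 revenue")
# Prints roughly 26x and 29x, i.e. about 30x annualized revenues.
```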
The stock comes with a story that can very well spark investor imagination. It was founded by a group of tech executives who have been working on energy-efficient servers for almost two decades. Their previous venture was bought by AMD and later shut down, which likely triggered the idea to try again with a new company, this time with potent financial backing from Sequoia.
They claim to have developed a secret sauce for AI computing. Traditional GPU-based computing systems are not ideal for the task: individual chips are too small, with insufficient computing capacity, and stacking many of them together is inefficient.
Cerebras’ solution is to go really big with their chips. Each chip is made from an entire wafer. They coined this technology Wafer-Scale Engine (WSE). Their latest generation is 57x larger than the leading commercially available GPU and has 52x more compute cores. It comes with 880x more on-chip memory and 7,000x more memory bandwidth. Cerebras claims that this allows them to run even the largest, multi-trillion parameter GenAI models on a single chip which speeds up training. And it keeps all critical data on-chip which improves bandwidth and latency for inference.
Cerebras’ financials point to explosive growth. But almost 90% of their revenues come from a single customer, G42, a technology holding owned by the UAE’s sovereign wealth fund. G42 appears legit, but such concentration does raise some question marks over Cerebras’ credibility as an NVIDIA challenger.
Company Background
Cerebras was founded in 2016 by a group of tech executives who had previously built a server technology company called SeaMicro. SeaMicro was founded in 2007 and was later acquired by AMD. Its goal was to build energy-efficient servers for data centers. Energy efficiency is a crucial scaling challenge today, and the fact that the founders and leaders of this company anticipated it 17 years ago lends their efforts credibility, in my opinion.
AMD shut down SeaMicro in 2015, which may have been the trigger to found Cerebras. They had prominent venture capital support from the beginning, most notably from Sequoia, which led their Series A funding round.
Cerebras designs processors for AI training and inference, and they develop supercomputers for AI applications. Their latest Condor Galaxy 3 supercomputer in Dallas delivers 8 exaFLOPs of compute. For reference, Meta’s Research SuperCluster is quoted at 5 exaFLOPs.
Why are Supercomputers so important?
Last year, I wrote an article about the rise of commercial-scale supercomputers.
Before the AI breakthrough epitomized by ChatGPT, supercomputers were mostly an academic phenomenon: large, clunky systems for niche applications. Interest in them surged after the launch of ChatGPT, because it revealed the power of deep learning to the general public for the first time.
It’s amazing how it can imitate (or should I say replicate) human communication and comprehension with very simple statistical methods. And it’s even more amazing how close deep learning is to human learning. An LLM learns and communicates just like you and me. When we talk to each other, we don’t know the entire sentence or monologue we’re about to utter in advance. We speak one word at a time. And at the end of it, a profound thought or idea might have been born. ChatGPT sparks our imagination about what AI with a human-like style of knowledge acquisition and distribution might soon be capable of. This fascination started a gold rush in the IT industry.
One of the defining features of deep learning is the enormous size of the underlying models and the computing power and capacity needed to train them. General-purpose data centers are not ideal for this task. Linear scaling is the holy grail in deep learning: you want two GPUs to perform twice as well as one. This is very difficult in traditional cloud setups, which are not designed to optimally balance compute, bandwidth and latency.
What are compute, bandwidth and latency?
The ideal computer system has to be optimized for its intended task along all relevant dimensions. It’s not just about how many calculations a system can hypothetically perform per time interval. It’s also about how much and how fast it can be provided with data. Achieving maximum performance is a lot about digital logistics, i.e. shifting bits and bytes around intelligently.
In my article last year, I used the restaurant analogy below: If there are only burgers on the menu and there are 10 chefs in the kitchen, all of whom are capable of making burgers, then the restaurant will work very efficiently and productively. But what if the restaurant also wants to be able to serve sushi? Then at least one of the 10 chefs needs to be an Itamae. And if there are no sushi orders at a given point in time, the Itamae will be idle and the restaurant’s productivity will be down.
An ideal restaurant will optimize the number and qualifications of their chefs, the number of their tables and their serving time to maximize revenues. Likewise, a computing system should strive to optimize compute, bandwidth and latency.
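One common way to formalize this trade-off is a roofline-style calculation: attainable throughput is the lesser of what the cores can compute and what the memory system can feed them. The sketch below uses purely illustrative numbers of my own choosing, not Cerebras or NVIDIA specs.

```python
# Roofline-style sketch: attainable throughput is capped either by raw compute
# or by how fast memory can feed the cores. All numbers are illustrative assumptions.
peak_compute = 1000e12        # peak FLOP/s of the accelerator (1 PFLOP/s, assumed)
mem_bandwidth = 3e12          # memory bandwidth in bytes/s (3 TB/s, assumed)

def attainable_flops(arithmetic_intensity):
    """FLOP/s achievable at a given arithmetic intensity (FLOPs per byte moved)."""
    return min(peak_compute, mem_bandwidth * arithmetic_intensity)

for intensity in (1, 10, 100, 1000):   # FLOPs per byte
    print(f"{intensity:>5} FLOPs/byte -> {attainable_flops(intensity)/1e12:.0f} TFLOP/s")
# Low-intensity workloads (like generative inference) sit far below peak compute.
```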
In their S-1 filing, Cerebras touches on this balancing act when they discuss the challenges in training and inference.
Difference between Training and Inference
The two fundamental workloads in AI computing are training and inference. During training, the system processes a large amount of data and adjusts the model’s internal parameters (weights and biases) to minimize errors and improve its ability to recognize patterns or make predictions.
Inference is the process where a trained model makes predictions or decisions based on new, unseen data. The model applies what it has learned during training to generate outputs (e.g., recognizing images, predicting outcomes).
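For readers who prefer code to prose, here is a minimal sketch of the two workloads using a toy linear model in NumPy. It has nothing to do with Cerebras’ stack; it just makes the distinction concrete.

```python
import numpy as np

# Toy model: a single linear layer, y = x @ w. Purely illustrative.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(256, 8))
true_w = rng.normal(size=(8, 1))
y_train = x_train @ true_w

# Training: repeatedly adjust the weights to reduce prediction error.
w = np.zeros((8, 1))
lr = 0.01
for step in range(500):
    pred = x_train @ w
    grad = x_train.T @ (pred - y_train) / len(x_train)   # gradient of the MSE loss
    w -= lr * grad                                        # update the parameters

# Inference: apply the frozen, trained weights to new, unseen data.
x_new = rng.normal(size=(4, 8))
print(x_new @ w)   # predictions; the weights are no longer updated
```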
Cerebras argues that traditional GPU-based computing systems are not optimized for either task. For training, individual GPUs don’t have enough computational resources and combining them is too inefficient. And for inference, they don’t have enough bandwidth for the frequent data movement to and from off-chip memory.
Here’s how they say that in their own words:
For Training – Individual GPUs Are Too Small, and Scaling to Many GPUs is Highly Inefficient
Large GenAI models far exceed the memory and processing limits of a single GPU. For example, to train GPT-3, it would take a single NVIDIA H100 more than eight years of running at peak theoretical performance to train the model. Recent models like GPT-4 and Gemini are over 10 times larger in parameter size than GPT-3. Consequently, training a large GenAI model on GPUs in a tractable amount of time requires breaking up the model and calculations, and distributing the pieces across hundreds or thousands of GPUs, creating extreme communication bottlenecks and power inefficiencies.
This distributed compute problem also creates a high level of complexity for developers, who are responsible for partitioning and coordinating the compute, memory, and communication across GPUs, so that they can work together in a complex choreography. This is an ongoing cost and slows down time-to-solution, as the delicate balance of bottlenecks needs to be reconfigured every time the ML developer wants to change the model architecture, model size, or run on a different number of GPUs.
For Inference – GPU Efficiency is Low and Limited by Memory Bandwidth
During generative inference, the full model must be run for each word that is generated. Since large models exceed on-chip GPU memory, this requires frequent data movement to and from off-chip memory. GPUs have relatively low memory bandwidth, meaning that the rate at which they can move information from off-chip HBM to on-chip SRAM, where the computation is done, is severely limited. This leads to low performance as GPU cores are idle while waiting for data – they can run at less than 5% utilization on interactive generative inference tasks. Low utilization and limited memory bandwidth impact the responsiveness and throughput of GPU-based systems and hinders real-time applications for larger models. This inefficiency also necessitates larger GPU deployments and dramatically drives up the cost of inference.
Cerebras S-1 filing
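The memory-bandwidth argument lends itself to a rough back-of-the-envelope estimate. The sketch below assumes a 70B-parameter model in 16-bit precision and roughly 3 TB/s of off-chip bandwidth; these are my own illustrative assumptions, not figures from the S-1.

```python
# Rough upper bound on tokens/second when inference is memory-bandwidth-bound:
# each generated token requires reading every weight once from off-chip memory.
params = 70e9            # assumed model size: 70B parameters
bytes_per_param = 2      # FP16/BF16 weights
hbm_bandwidth = 3e12     # assumed off-chip memory bandwidth in bytes/s (~3 TB/s)

bytes_per_token = params * bytes_per_param
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"~{max_tokens_per_s:.0f} tokens/s per chip, no matter how many FLOP/s it has")
# Roughly 21 tokens/s: the compute units sit idle waiting for weights, which is
# the utilization problem the S-1 describes and what keeping data on-chip avoids.
```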
Cerebras’ answer to this challenge: make chips bigger!
In fact, as big as an entire wafer:
They call this technology the Wafer-Scale Engine (WSE). The latest generation is 57x larger than the leading commercially available GPU, has 52x more compute cores, 880x more on-chip memory and 7,000x more memory bandwidth. According to Cerebras, this is enough to run even the largest, multi-trillion-parameter GenAI models on a single chip, which speeds up training, and to keep all critical data on-chip, which improves bandwidth and latency for inference.
Financials
In 1H24, Cerebras generated $136m in revenues, 75% of which was hardware sales vs. 25% services. Their operating loss was $42m, implying a negative operating margin of 31%.
These big system sales are naturally lumpy, but their 1H24 performance does suggest explosive growth vs. the $79m in revenues for full year 2023 and $25m in 2022.
Furthermore, the company has bold goals for the future. They expect their TAM to grow by 51% annually to $453bn by 2027.
We believe that further adoption of AI, accelerated by the advent of GenAI, and the widespread integration of AI into business processes, will rapidly expand our total addressable market (“TAM”) from an estimated $131 billion in 2024 to $453 billion by 2027, a compounded annual growth rate (“CAGR”) of 51%.
Cerebras S-1 filing
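A quick sanity check of that CAGR claim:

```python
# Does a 51% CAGR take $131bn (2024) to roughly $453bn (2027)?
tam_2024 = 131          # $bn, per the S-1
cagr = 0.51
tam_2027 = tam_2024 * (1 + cagr) ** 3
print(f"${tam_2027:.0f}bn")   # ~$451bn, consistent with the quoted $453bn figure
```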
Customer concentration
In 2023 and 2024 respectively, a single customer accounted for 83% and 87% of Cerebras’ revenues. This customer is G42, a UAE-based technology holding owned by the sovereign wealth fund of the UAE. G42 works closely with Cerebras; for example, the two are partnering to develop the Condor Galaxy supercomputer. Per the S-1 filing, G42 has committed to billions of dollars of purchases from Cerebras in the coming years.
My perception is that G42 has assumed the role of Cerebras’ primary distributor. That is probably helpful to have, but it also puts Cerebras in a dependent position, possibly with little bargaining power.
Possible insider sales
This is a much-anticipated IPO that will likely come with substantial volatility. That volatility may be amplified by a curious lock-up structure: apparently, insiders can sell shares shortly after the 3Q24 earnings release if the stock has risen 25% above the IPO price and stayed there for at least two days. I am not sure what to make of that, but I certainly have not seen such a clause before.
Sincerely,
Your Fallacy Alarm
The music industry has 3 sectors: recording studios (creation), radio stations and streaming (distribution) and home receivers (playback). The AI industry has training (creation) and inference (playback), but no real distribution sector.
In 1948, Western Electric dominated all three music sectors: its gear was everywhere. Even today, every radio broadcast tower has a giant (3-foot? 4-foot?) vacuum tube manufactured in one of a few remaining specialized factories.
NVidia dominates both sectors. They have such a huge head start in hardware and software synergies for training that I don't really see Cerebras taking them on there.
Training a giant model concentrates on one big technical artifact that is not necessarily useful by itself; you have to surround it with apps and glue code to make it useful. I can see how Cerebras allows an all-new take on this problem. It's possible that Cerebras could dominate cloud-based inference, especially if it lets you weld together a bunch of models into one giant image running on the wafer; that's a really intriguing option that nobody offers today.
I haven't looked at the Cerebras architecture, and it's not clear to me that it is a coherent design tuned to AI's needs. NVidia happened to get this completely right by accident: it concentrated on graphics cards 25 years ago, and the deep learning crowd discovered them 15 years ago.
Cheers!