Elon Musk’s AI company, xAI, recently unveiled a new supercomputer dubbed Colossus. And as its name implies, it’s big.
The computer is an artificial intelligence training system that Musk says runs on a whopping 100,000 Nvidia H100 chips, powerful graphics processing units that have become critical to the AI race.
To put that into perspective, Meta’s Llama 3 large language model was trained on 16,000 H100 chips. Meta said in March it would continue to invest in its AI infrastructure by adding two new 24,000-chip clusters.
In other words, Musk’s Colossus is powerful. And it could help him catch up to the AI industry’s frontrunners.
But some prominent tech leaders are not so sure.
LinkedIn cofounder Reid Hoffman told The Information, a tech publication, that the xAI supercomputer was mere “table stakes” in the competitive field of generative AI.
According to The Information, Hoffman meant that Colossus only allows xAI to catch up to other, more advanced AI companies, like OpenAI and Anthropic.
Chris Lattner, the CEO of Modular AI, said during a panel discussion at The Information’s AI Summit last week that Musk’s heavy reliance on Nvidia’s expensive and finite chips is also inconsistent with the billionaire’s effort to build his own AI training hardware, called Dojo, The Information reported.
Meta, Microsoft, Alphabet, and Amazon are all developing their own AI chips even as they continue to stockpile Nvidia GPUs.
“The difference is that Elon has been working on Dojo for many years now,” Lattner told Business Insider in an email.
Musk has expressed concern about the challenges of acquiring more of Nvidia’s highly sought-after chips and said that his Dojo project will help decrease his company’s dependence on the chipmaker.
“We do see a path to being competitive with Nvidia with Dojo,” Musk said during a Tesla earnings call in July. “We kind of have no choice.”
When talking about Colossus on X in early September, Musk said he aims to double the size of the supercomputer to 200,000 chips in a few months.
He said the cluster was built in just 122 days — an impressive feat that no other company has matched, according to The Information.
It’s unclear whether Colossus runs all 100,000 GPUs at the same time, which would require sophisticated networking technology and a lot of energy.
“Musk previously said the 100,000-chip cluster was up and running in late June,” The Information reported. “But at that time, a local electric utility said publicly that xAI only had access to a few megawatts of power from the local grid.”
Last month, CNBC reported that an environmental advocacy group had complained that xAI was running gas turbines to produce more power for the data center without authorization.
The outlet reported that the Southern Environmental Law Center wrote in a letter to the local health department that xAI had installed and was operating at least 18 unpermitted turbines “with more potentially on the way” to supplement its massive energy needs.
The local utility, Memphis Light, Gas and Water, told CNBC it has provided 50 megawatts of power to xAI since the beginning of August, but that the facility requires an additional 100 megawatts to operate.
Data cluster developers told The Information that a few megawatts could power only a few thousand GPUs, and that Musk’s company would need another electric substation to get enough power to run 100,000 chips.
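The power figures above lend themselves to a rough back-of-the-envelope check. The Python sketch below is illustrative only: the 700-watt per-chip draw is roughly the rated maximum of an H100 in its highest-power configuration, and the 1.5x overhead multiplier for servers, networking, and cooling is an assumption rather than a figure from xAI, the utility, or The Information. The function names are hypothetical.

```python
# Back-of-the-envelope power estimate for a large GPU cluster.
# Both constants are assumptions made for illustration, not reported figures.
GPU_WATTS = 700   # assumed draw of one H100 at roughly its rated maximum
OVERHEAD = 1.5    # assumed multiplier for servers, networking, and cooling


def megawatts_needed(gpu_count, gpu_watts=GPU_WATTS, overhead=OVERHEAD):
    """Estimate the total power, in megawatts, needed to run gpu_count GPUs."""
    return gpu_count * gpu_watts * overhead / 1_000_000


def gpus_supported(megawatts, gpu_watts=GPU_WATTS, overhead=OVERHEAD):
    """Estimate how many GPUs a supply of the given size could run."""
    return int(megawatts * 1_000_000 // (gpu_watts * overhead))


if __name__ == "__main__":
    print(f"100,000 GPUs need about {megawatts_needed(100_000):.0f} MW")
    print(f"50 MW supports about {gpus_supported(50):,} GPUs")
    print(f"3 MW supports about {gpus_supported(3):,} GPUs")
```

Under these assumptions, 100,000 chips would need roughly 105 megawatts, 50 megawatts would cover fewer than half of them, and a few megawatts would run only a few thousand.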
Hoffman and Musk did not immediately respond to requests for comment from Business Insider.