SIGforum.com    Main Page  Hop To Forum Categories  The Lounge    AI models and the absurd machines that run them
AI models and the absurd machines that run them
Member
Picture of maladat
posted
There have been a number of threads recently about the generative AI systems that have been getting so much media attention.

There has been some confusion and discussion regarding how the models work and what sort of systems they run on. I thought I might offer some information for the tech-interested of the forum.

All of these models are types of neural networks. In computing, neural networks are a type of algorithm that was very vaguely inspired by biological neural systems, but an AI neural network has virtually nothing in common with a biological neural system and is in NO WAY remotely an emulation of a biological neural system.

You can think of a neural network as a giant flowchart with nodes arranged in layers. Each node does a simple computation on its inputs, and then sends the result to a bunch of nodes in the next layer. There's an input layer at the beginning and an output layer at the end. Because there are so many connections between layers, the model needs very fast access to all of its data - moving data around inside one machine is fast, while moving it between multiple machines over an ordinary network is much, much slower and would cripple the model's performance. That's why it matters that the entire model run on a single machine (or on something that behaves like one).

Taking ChatGPT as an example, the underlying AI model is a text completion model. You feed the input layer a piece of text, and what comes out at the output layer is a continuation of the input text that more or less "looks like" what a human would have written next.

There are two separate things that define a neural network. First, its organization or architecture - how many layers, how many nodes in each layer, which nodes in the next layer each node connects to, and what mathematical function each node performs. The other part is the "parameters" - the specific coefficients of the mathematical functions in each node. Say we have a node that takes one input from the previous layer, multiplies it by a constant, and sends the result to the next layer: output = constant * input. The value of that constant is a parameter of the node.

This setup entails a huge number of simple calculations that can be done in parallel (every node in a layer can be computed at once, since each depends only on the previous layer), much like computer graphics (every pixel can be computed independently). So over time, AI algorithms have evolved to run more efficiently on graphics cards, and graphics cards have evolved to run AI algorithms more efficiently.
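The node-and-parameter idea above can be sketched in a few lines of Python. This is a toy illustration, not any real model's code: the weights are made up, and each node here is just a weighted sum (real networks also apply a nonlinear function on top):

```python
# Toy sketch of one neural-network layer. Each node multiplies the
# previous layer's outputs by its own learned constants (the
# "parameters") and sums them up.

def node(inputs, weights):
    # One node: a weighted sum of the previous layer's outputs.
    return sum(i * w for i, w in zip(inputs, weights))

def layer(inputs, weight_rows):
    # One layer: each row of weights defines one node. These calls are
    # independent of each other, so they could all run in parallel.
    return [node(inputs, row) for row in weight_rows]

prev_layer = [1.0, 2.0]              # outputs from the previous layer
weights = [[0.5, 0.5], [2.0, -1.0]]  # parameters: one row per node
print(layer(prev_layer, weights))    # -> [1.5, 0.0]
```

Note that each node() call depends only on the previous layer, never on its neighbors - that independence is exactly the parallelism GPUs exploit.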

Now virtually all AI models run on specialized graphics cards.

But how? ChatGPT runs on a version of the GPT-3 model that has 175 billion 16-bit parameters. That means that JUST the parameters, not including the data about the architecture of the network, would take up 350 GB of GPU RAM.
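That 350 GB figure is just the parameter count times the parameter size:

```python
# Sanity check on the memory figure: 175 billion parameters,
# 16 bits (2 bytes) each.
params = 175e9
bytes_per_param = 2   # 16-bit
total_gb = params * bytes_per_param / 1e9
print(total_gb)       # -> 350.0 GB, just for the parameters
```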

The best gaming graphics card currently available, the NVIDIA RTX 4090, has a $1,600 MSRP and only 24 GB of RAM. You'd need 15 of them just to hold the GPT-3 parameters, no ordinary computer can run 15 of them, and the RTX 4090 isn't really designed for multi-GPU systems anyway.

Well, it turns out that there is a whole separate product category of "graphics cards" designed for industrial AI applications. The current top of the heap is the NVIDIA H100, which costs around $30,000 and has 80 GB of RAM (and VASTLY more processing power than an RTX 4090).

But that still isn't enough to run GPT-3, so what gives? Industrial AI servers are typically multi-GPU systems. An H100-based server can have up to 8 H100 GPUs - so $240,000 worth of "graphics cards" with 640 GB of total GPU RAM. These AI servers use a special NVLink bridge between the GPUs that gives each GPU direct access to the RAM of all the other GPUs at 900 GB/s (that's not a typo - gigaBYTES per second, not gigaBITS per second). Now we can run GPT-3.
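The 8-GPU totals work out like this (using the per-card figures above; the $30,000 is the rough per-card price mentioned, not an official list price):

```python
# Totals for a fully populated 8x H100 server.
h100_ram_gb = 80        # RAM per H100
h100_price = 30_000     # rough per-card price, per the post
gpus = 8
print(gpus * h100_ram_gb)  # -> 640 GB of pooled GPU RAM
print(gpus * h100_price)   # -> $240,000 in GPUs alone
```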

Models are going to keep getting bigger, though, and GPT-3 already takes up most of that 640 GB of GPU RAM, so where do we go from there?

That was actually the impetus for writing this post - seeing NVIDIA's announcement about their next-gen system. The GH200 is a combined CPU/GPU "superchip." The CPU has 480 GB of RAM and the GPU has 96 GB of RAM. They can still be used in groups of up to 8 (in total, almost 4 TB of CPU RAM and 768 GB of GPU RAM) with 900 GB/s bandwidth direct memory access between GPUs.

CPUs aside, that isn't THAT big a bump... except that NVIDIA has developed a second layer of NVLink that extends the 900 GB/s direct memory access between GPUs to up to THIRTY-TWO 8-GPU groups. That's 256 of the chips, all in one machine, all with 900 GB/s direct memory access to each other.

That's effectively ONE SERVER with 256 CPUs with 122 TB of RAM and 256 GPUs with 24 TB of VRAM (over 1,000 times the VRAM of an RTX 4090). Based on the prices of current-gen chips, the 256-chip machine would probably run you in the neighborhood of $15,000,000.
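The 256-chip totals follow directly from the per-chip numbers above:

```python
# Totals for a full 32-group (256-chip) GH200 machine.
chips = 256
cpu_ram_gb = 480   # CPU RAM per GH200 superchip
gpu_ram_gb = 96    # GPU RAM per GH200 superchip
rtx4090_gb = 24    # VRAM on one RTX 4090, for comparison

cpu_total_tb = chips * cpu_ram_gb / 1000
gpu_total_tb = chips * gpu_ram_gb / 1000
print(cpu_total_tb)                     # -> 122.88 TB of CPU RAM
print(gpu_total_tb)                     # -> 24.576 TB of GPU RAM
print(chips * gpu_ram_gb / rtx4090_gb)  # -> 1024.0x one RTX 4090
```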
 
Posts: 6319 | Location: CA | Registered: January 24, 2011
Member
posted
Interesting post. When the AI threads were staying on page 1, I did some research on what it took to stand up a large-scale system. Insane amounts of hardware, software, and data. ChatGPT 3's data sets are 2-3 years old? ChatGPT 4's training sets were even larger, with the addition of optical and aural training. I knew NVIDIA was a player in this space, but I didn't realize how big until this thread. All you've posted about hardware (and more) will be required to keep the training sets current and more real-time. Something I hadn't really thought about until now.
 
Posts: 7570 | Registered: October 31, 2008
Alea iacta est
Picture of Beancooker
posted
Very interesting. Long ago I built and overclocked computers. It’s pretty incredible how the hardware has developed.

I find it interesting that the graphics cards have become one of the more important pieces in computing. When I was building them, they were merely graphics cards.



quote:
Originally posted by parabellum: You must have your pants custom tailored to fit your massive balls.
The “lol” thread
 
Posts: 4031 | Location: Staring down at you with disdain, from the spooky mountaintop castle. | Registered: November 20, 2010
Member
Picture of 229DAK
posted
quote:
That's effectively ONE SERVER with 256 CPUs with 122 TB of RAM and 256 GPUs with 24 TB of VRAM (literally 1,000 times more VRAM than an RTX 4090). Based on the prices of current-gen chips, the 256-chip machine would probably run you in the neighborhood of $15,000,000.
Gotta wonder how much power it takes to run all this?


_________________________________________________________________________
“A man’s treatment of a dog is no indication of the man’s nature, but his treatment of a cat is. It is the crucial test. None but the humane treat a cat well.”
-- Mark Twain, 1902
 
Posts: 9058 | Location: Northern Virginia | Registered: November 04, 2005
Animis Opibusque Parati
posted
quote:
Originally posted by 229DAK:
quote:
That's effectively ONE SERVER with 256 CPUs with 122 TB of RAM and 256 GPUs with 24 TB of VRAM (literally 1,000 times more VRAM than an RTX 4090). Based on the prices of current-gen chips, the 256-chip machine would probably run you in the neighborhood of $15,000,000.
Gotta wonder how much power it takes to run all this?


I took a stab at what ChatGPT says if asked the question about the power requirements for the info above: "Total Power Consumption:
Adding up the power requirements for the CPUs, RAM, and GPUs, we get an estimated total power consumption of around 177.2 kilowatts (38.4 kW + 62 kW + 76.8 kW). Keep in mind that this is a rough estimate, and the actual power requirements may vary depending on the specific hardware and usage patterns."




"Prepared in mind and resources"
 
Posts: 1353 | Location: SC | Registered: October 28, 2011
Member
posted
Interesting. I took Neural Networks in EE grad school in the early '90s. Your description of NNs matches what I recall learning, in general. The processing power available was, and still is, not up to the challenge for any prime-time use on a large scale.
 
Posts: 3956 | Location: UNK | Registered: October 04, 2009
Member
Picture of maladat
posted
The GH200 "superchip" is rated at 1000W (that includes CPU, CPU RAM, GPU, and GPU VRAM).

So without any of the supporting hardware, just the chips on a full 256-chip machine would max out at 256kW.
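That ceiling is just the per-chip rating times the chip count (supporting hardware like networking, storage, and cooling would add more on top):

```python
# Worst-case power draw for the chips alone, using the 1000 W
# GH200 superchip rating quoted above.
chips = 256
watts_each = 1000
total_kw = chips * watts_each / 1000
print(total_kw)   # -> 256.0 kW
```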
 
Posts: 6319 | Location: CA | Registered: January 24, 2011
Ignored facts
still exist
posted
quote:
Originally posted by maladat:
The GH200 "superchip" is rated at 1000W (that includes CPU, CPU RAM, GPU, and GPU VRAM).

So without any of the supporting hardware, just the chips on a full 256-chip machine would max out at 256kW.


Which is enough power for 100 to 150 houses using normal planning numbers. Eek Eek


----------------------
Let's Go Brandon!
 
Posts: 10946 | Location: 45 miles from the Pacific Ocean | Registered: February 28, 2003
Member
Picture of maladat
posted
quote:
Originally posted by Jimineer:
Interesting. I took Neural Networks in EE grad school in the early '90s. Your description of NNs matches what I recall learning, in general. The processing power available was, and still is, not up to the challenge for any prime-time use on a large scale.


It depends on your definitions. If we are talking about language models like ChatGPT, then I pretty much agree with you. With current language model performance, the amount of computing power required to run a model that gives pretty good responses in a reasonable amount of time really prohibits widespread, large-scale use - at least in the sense of everybody having full-time personal AI assistants.

There are some lower-density use cases, like customer service chat bots, that will be in production use very soon if they aren't already. There are some surprisingly effective techniques for taking a good general-purpose language model that required a VAST amount of training to produce (hundreds of thousands or millions of dollars of GPU time) and then doing a very small amount of additional training (hundreds, maybe thousands of dollars of GPU time) to "fine-tune" it to perform a specific task with a specific knowledge base like that.

Obviously if you do a high-volume, large-scale contract or build your own data center the prices would be lower, but if Joe Schmo goes to one of the big cloud server providers, time on a multi-GPU server that can run one instance of ChatGPT gets billed at around $20-30 PER HOUR ($15-20k per month).
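A quick sanity check on those cloud numbers (the hourly rate is the ballpark from this post, not any specific provider's price list):

```python
# Monthly cost of renting a multi-GPU server around the clock at the
# rough $20-30/hour rate mentioned above.
hours_per_month = 730   # average hours in a month
low = 20 * hours_per_month
high = 30 * hours_per_month
print(low, high)        # -> 14600 21900, i.e. roughly $15-22k/month
```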

However, smaller, less complex neural networks definitely ARE in widespread, mainstream use in many different areas.

It's not just weird behind-the-scenes industry stuff, either. A couple of big public-facing examples:

The Tesla full-self-driving software uses 48 separate neural networks (a bunch that process data from individual sensors, some that aggregate those outputs to generate a model of what's happening around the car, some to decide what the car will do next, etc).

Starting with the iPhone X in 2017, iPhones have had hardware designed to efficiently run neural networks for speech and image processing (for stuff like Siri voice recognition, face unlock, facial recognition in photos, etc). Android phones have been using neural networks for the same kind of stuff for a while, too.
 
Posts: 6319 | Location: CA | Registered: January 24, 2011
Ammoholic
Picture of Skins2881
posted
Very strange... I just finished reading an article about this same new supercomputer.



Jesse

Sic Semper Tyrannis
 
Posts: 20851 | Location: Loudoun County, Virginia | Registered: December 27, 2014
Ammoholic
Picture of Skins2881
posted
quote:
Originally posted by Minnow:
quote:
Originally posted by 229DAK:
quote:
That's effectively ONE SERVER with 256 CPUs with 122 TB of RAM and 256 GPUs with 24 TB of VRAM (literally 1,000 times more VRAM than an RTX 4090). Based on the prices of current-gen chips, the 256-chip machine would probably run you in the neighborhood of $15,000,000.
Gotta wonder how much power it takes to run all this?


I took a stab at what ChatGPT says if asked the question about the power requirements for the info above: "Total Power Consumption:
Adding up the power requirements for the CPUs, RAM, and GPUs, we get an estimated total power consumption of around 177.2 kilowatts (38.4 kW + 62 kW + 76.8 kW). Keep in mind that this is a rough estimate, and the actual power requirements may vary depending on the specific hardware and usage patterns."


See the article link I just posted - up to 326 kW in only 16 racks. Pretty damn dense.



Jesse

Sic Semper Tyrannis
 
Posts: 20851 | Location: Loudoun County, Virginia | Registered: December 27, 2014
Do the next
right thing
Picture of bobtheelf
posted
Pretty amazing that a wet lump of flesh weighing just a few pounds does so much.
 
Posts: 3666 | Location: Nashville | Registered: July 23, 2012
Ignored facts
still exist
posted
quote:
Originally posted by bobtheelf:
Pretty amazing that a wet lump of flesh weighing just a few pounds does so much.


that's a darn good point. Maybe Silicon isn't the right approach Smile Smile Smile


----------------------
Let's Go Brandon!
 
Posts: 10946 | Location: 45 miles from the Pacific Ocean | Registered: February 28, 2003
Member
Picture of IntrepidTraveler
posted
I read somewhere a while back that the human brain runs on about 25 watts of power. A pretty dim bulb by comparison.




Thus the metric system did not really catch on in the States, unless you count the increasing popularity of the nine-millimeter bullet.
- Dave Barry

"Never go through life saying 'I should have'..." - quote from the 9/11 Boatlift Story (thanks, sdy for posting it)
 
Posts: 3302 | Location: Carlsbad NM/ Augusta GA | Registered: July 15, 2007
Ammoholic
Picture of Skins2881
posted
Holy cow! Moore's law be damned.




Jesse

Sic Semper Tyrannis
 
Posts: 20851 | Location: Loudoun County, Virginia | Registered: December 27, 2014
Get my pies
outta the oven!

Picture of PASig
posted
...We have only bits and pieces of information but what we know for certain is that at some point in the early twenty-first century all of mankind was united in celebration. We marveled at our own magnificence as we gave birth to AI.

AI? You mean artificial intelligence?

A singular consciousness that spawned an entire race of machines. We don’t know who struck first, us or them. But we know that it was us that scorched the sky...


 
Posts: 33901 | Location: Pennsylvania | Registered: November 12, 2007

© SIGforum 2024