The latest details of GPT-4 exposed: architecture, infrastructure, training dataset, cost, vision, and MoE

OpenAI keeps the GPT-4 architecture closed not because of some existential risk to humans, but because what they build is reproducible. In fact, we expect companies such as Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, etc. to have models as capable as or even more powerful than GPT-4 in the short term.

Don't get me wrong, OpenAI has amazing engineering capabilities, and what they've built is incredible, but the solutions they've found aren't magic. This is an elegant solution with many complex tradeoffs. Scaling up is only part of the battle. OpenAI's most enduring competitive advantage is that they have the most practical applications, leading engineering talent, and can continue to outperform other companies with future models.

We have gathered a wealth of information about GPT-4 from multiple sources, and today we want to share it. This includes the model architecture, training infrastructure, inference infrastructure, number of parameters, training dataset composition, token counts, layer counts, parallelism strategies, multimodal vision adaptation, the thought process behind different engineering tradeoffs, the unique techniques implemented, and how they alleviate some of the biggest bottlenecks associated with inference on huge models.

The most interesting aspect of GPT-4 is understanding why they made certain architectural decisions.

Additionally, we outline the cost of training and inference for GPT-4 on A100s, and how that scales with H100s for next-generation model architectures.

First, let's look at the problem statement. From GPT-3 to GPT-4, OpenAI wanted to scale up 100x, but the problem is cost. Dense transformer models will not scale further. The dense transformer is the architecture used by OpenAI GPT-3, Google PaLM, Meta LLaMA, TII Falcon, MosaicML MPT, and other models. We could easily name more than 50 companies training LLMs with this same architecture. It's a nice architecture, but flawed for scaling.

Before the release of GPT-4, we discussed the relationship between training costs and the upcoming AI brick wall. There, we reveal OpenAI's high-level approach to the GPT-4 architecture and the training cost of various existing models.

Over the past six months, we've realized that training costs are irrelevant.

Sure, it might seem crazy on the surface, spending tens or even hundreds of millions of dollars of computing time to train a model, but for these companies, that's an insignificant expense. It's really a fixed capex that always yields better results when it comes to scaling. The only limiting factor is scaling the computation to a timescale where humans can get feedback and modify the architecture.

Over the next few years, multiple companies like Google, Meta, and OpenAI/Microsoft will train models on supercomputers worth over $100 billion. Meta burns $16 billion a year on the "Metaverse", Google wastes $10 billion a year on various projects, Amazon loses over $50 billion on Alexa, and crypto wastes over $100 billion on worthless things.

These companies and society at large can and will spend over a hundred billion dollars on creating supercomputers that can train a single gigantic model. These huge models can then become products in a number of ways. This work will be replicated across multiple countries and companies. This is a new space race. Unlike the waste of the past, today's AI has tangible value that will be gained in the short term from human assistants and autonomous agents.

A more important problem in scaling AI is inference.

The goal is to decouple training compute from inference compute. This is why it makes sense to train well beyond Chinchilla-optimal, regardless of the model that will be deployed. This is also why a sparse model architecture is used: during inference, not every parameter needs to be activated.

The real challenge is that the cost of deploying these models to users and agents is far too high. The cost of inference exceeds the cost of training many times over. This is where OpenAI's innovation in model architecture and infrastructure is aimed.

Inference with large models is a multivariate problem, and for dense models, model size is the killer. We discussed the issues around edge computing in detail here, but the problem statement in the data center is very similar. In simple terms, devices can never have enough memory bandwidth to reach the desired throughput levels for large language models. Even with enough bandwidth, the utilization of hardware compute resources on edge devices will be very low.

In the data center and in the cloud, utilization is critical. Half of the reason Nvidia is admired for its superior software is that, over the lifetime of a GPU, Nvidia keeps updating the low-level software to move data more intelligently within the chip, between chips, and to and from memory, raising FLOPS utilization.

In most current use cases, the goal of LLM inference is to run as a real-time assistant, which means that it must achieve high enough throughput for users to actually use it. The average human reads at about 250 words per minute, but some people even go as high as 1000 words per minute. This means you need to output at least 8.33 tokens per second, but closer to 33.33 tokens per second to cover all cases.
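As a quick sanity check on those targets (a minimal sketch, assuming roughly 2 tokens per word, which is the conversion these figures imply):

```python
# Back-of-envelope check of the throughput targets above.
# Assumption: roughly 2 tokens per English word (the conversion these figures imply).
TOKENS_PER_WORD = 2

def tokens_per_second(words_per_minute: float) -> float:
    return words_per_minute * TOKENS_PER_WORD / 60

print(tokens_per_second(250))    # ~8.33 tokens/s for an average reader
print(tokens_per_second(1000))   # ~33.33 tokens/s for the fastest readers
```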

Given the memory bandwidth requirements, a dense model with a trillion or more parameters mathematically cannot achieve this kind of throughput on the latest Nvidia H100 GPU servers.

Every generated token requires every parameter to be loaded from memory onto the chip. The generated token is then fed into the prompt and the next token is generated. Additionally, additional bandwidth is required to stream the KV cache for the attention mechanism.

This graph assumes that the inefficiencies from not being able to fuse every operation, the memory bandwidth required by the attention mechanism, and hardware overhead are together equivalent to parameter reads. In reality, even with an "optimized" library like Nvidia's FasterTransformer, the total overhead is even larger.

The graph above shows the memory bandwidth required to serve an LLM at high enough throughput for a single user. It shows that even with 8 H100s, it is impossible to serve a 1-trillion-parameter dense model at 33.33 tokens per second.
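A minimal sketch of why, assuming FP16 weights, perfect memory-bandwidth utilization, and ignoring KV-cache traffic (real numbers would be worse):

```python
# Best-case single-user (batch size 1) decode speed when every parameter must be
# streamed from HBM for each generated token.
# Assumptions: FP16 weights (2 bytes/param), 100% bandwidth utilization, no KV-cache traffic.
H100_HBM_BW = 3.35e12     # bytes/s per H100 SXM
NUM_GPUS = 8
PARAMS = 1.0e12           # a 1-trillion-parameter dense model
BYTES_PER_PARAM = 2       # FP16

weight_bytes = PARAMS * BYTES_PER_PARAM
aggregate_bw = H100_HBM_BW * NUM_GPUS
print(f"Upper bound: {aggregate_bw / weight_bytes:.1f} tokens/s")  # ~13 tokens/s, well below 33.33
```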

Furthermore, the FLOPS utilization of 8 H100s at 20 tokens per second would still be under 5%, making inference costs very high. In fact, an 8-way tensor-parallel H100 system today has an inference limit of about 300 billion feed-forward parameters.

However, OpenAI is using the A100 to achieve human reading speed, using over 1 trillion model parameters, and making it widely available at a low price of just $0.06 per 1,000 tokens. This is because it is sparse, i.e. not every parameter is used.

Below, we cover GPT-4's model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token counts, layer counts, parallelism strategies, multimodal vision encoder, the thought process behind different engineering tradeoffs, the unique techniques implemented, and how OpenAI alleviates some of the biggest bottlenecks of inference on huge models.

#1 GPT-4 Model Architecture

GPT-4 is more than 10 times larger than GPT-3. As far as we know, it has about 1.8 trillion parameters, spread over 120 layers, while GPT-3 has about 175 billion parameters.

OpenAI managed to control the cost by using a Mixture of Experts (MoE) model. If you're not familiar with MoE, read our six-month-old article on the generalized GPT-4 architecture and training costs.

In addition, OpenAI uses 16 experts in its model, each with about 111 billion MLP parameters. Two of these experts are routed to per forward pass.

While the literature talks about advanced routing algorithms for choosing which expert to route each token to, OpenAI's current GPT-4 model's routing algorithm is said to be fairly simple.

Furthermore, the attention mechanism shares approximately 55 billion parameters.

Each forward pass of inference (generating 1 token) only uses about 280 billion parameters and 560 TFLOPs. This is in contrast to the ~1.8 trillion parameters and 3,700 TFLOPs that would be required per forward pass of a purely dense model.
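A minimal sketch of where the ~280B figure comes from, using the numbers above (2 of 16 experts at ~111B MLP parameters each, plus ~55B of shared attention parameters):

```python
# Active parameters per generated token for the reported GPT-4 MoE configuration.
EXPERTS_PER_TOKEN = 2
MLP_PARAMS_PER_EXPERT = 111e9    # ~111B MLP parameters per expert
SHARED_ATTENTION_PARAMS = 55e9   # ~55B attention parameters shared across experts

active = EXPERTS_PER_TOKEN * MLP_PARAMS_PER_EXPERT + SHARED_ATTENTION_PARAMS
print(f"~{active / 1e9:.0f}B active parameters per token")   # ~277B, i.e. roughly 280B
```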

#2 Dataset Composition

OpenAI trained GPT-4 on about 13 trillion tokens. This is plausible considering that the CommonCrawl-based RefinedWeb contains about 5 trillion high-quality tokens. For reference, DeepMind's Chinchilla and Google's PaLM were trained on about 1.4 trillion and 0.78 trillion tokens respectively. PaLM 2 is even claimed to have been trained on about 5 trillion tokens.

This dataset does not contain 13 trillion unique tokens. Instead, it spans multiple epochs because of the shortage of high-quality tokens: 2 epochs for text data and 4 epochs for code data. Interestingly, this is nowhere near Chinchilla-optimal, which would call for training the model on roughly double that token count. It points to a lack of easily obtainable tokens on the web. There are 1,000x more high-quality text tokens out there, and even more audio and visual tokens, but getting them is not as simple as web scraping.

They have millions of rows of instruction fine-tuning data from Scale AI and from internal sources, but unfortunately we could not find much about their reinforcement-learning data.

The context length in the pre-training stage was 8k. The 32k-token version was fine-tuned from the 8k pre-trained model.

The batch size was gradually ramped up over several days, but by the end OpenAI was using a batch size of 60 million tokens. Of course, since not every expert sees every token, this is really only 7.5 million tokens per expert per batch.
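A quick check of that per-expert figure, assuming tokens are spread evenly across the 2-of-16 routed experts:

```python
# Effective per-expert batch size, assuming an even spread of tokens across experts.
GLOBAL_BATCH_TOKENS = 60e6   # 60 million tokens per batch
EXPERTS_PER_TOKEN = 2
NUM_EXPERTS = 16

per_expert = GLOBAL_BATCH_TOKENS * EXPERTS_PER_TOKEN / NUM_EXPERTS
print(f"{per_expert / 1e6:.1f}M tokens per expert")   # 7.5M
```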

#3 Parallel Strategies

The strategy for parallelizing across all those A100 GPUs matters a lot. They used 8-way tensor parallelism, because that is the limit of NVLink. We have also heard they used 15-way pipeline parallelism. From a compute-time and data-communication standpoint that is theoretically too many pipeline stages, but it makes sense if they were limited by memory capacity.

In pure pipeline + tensor parallelism, each GPU needs about 30GB (FP16) for parameters alone. Once the KV cache and overhead are added, this theoretically makes sense if most of OpenAI's GPUs are 40GB A100s. They probably used ZeRO Stage 1, and possibly block-level FSDP or hybrid sharded data parallelism.
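A minimal sketch of that per-GPU footprint, assuming the ~1.8T parameters are split evenly across 8-way tensor × 15-way pipeline parallelism and stored in FP16:

```python
# Per-GPU weight memory under 8-way tensor x 15-way pipeline parallelism.
# Assumptions: ~1.8T parameters, FP16 (2 bytes each), even split, ignoring KV cache and overhead.
TOTAL_PARAMS = 1.8e12
BYTES_PER_PARAM = 2
TENSOR_PARALLEL = 8
PIPELINE_PARALLEL = 15

num_gpus = TENSOR_PARALLEL * PIPELINE_PARALLEL              # 120 GPUs
per_gpu_gb = TOTAL_PARAMS * BYTES_PER_PARAM / num_gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB of weights per GPU")           # ~30 GB
```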

As for why they didn't use full-model FSDP, it may be due to the higher communication overhead. While most of OpenAI's nodes have high-speed network connections between them, not all of them do, and we believe the bandwidth between at least some clusters is much lower than between others.

We don't understand how they avoid huge bubbles in every batch with such deep pipeline parallelism. Most likely they simply ate the cost.

#4 Training Cost

OpenAI used approximately 25,000 A100s to train GPT-4, achieving an MFU (model FLOPS utilization) of roughly 32% to 36% over 90 to 100 days. This extremely low utilization is partly due to an absurd number of failures requiring restarts from checkpoints; the bubbles mentioned above were extremely costly.

Another reason is that a global all-reduce across so many GPUs is very expensive. If our guess is correct, the cluster is actually made up of many smaller clusters with relatively thin network connections between them, i.e. 800G/1.6T non-blocking connectivity within different parts of the cluster, while those parts are connected to each other at only 200G/400G.

If their cost in the cloud was roughly $1 per A100-hour, this training run alone cost about $63 million. This does not account for all the experiments, failed training runs, and other costs such as data collection, reinforcement learning, and staff. Because of those factors, the real cost is much higher. It also assumes someone else buys the chips, networking, and data center, covers the capex, and leases it to you.

Today, pre-training could be done in about 55 days with roughly 8,192 H100s at $2 per hour, for about $21.5 million. Note that we believe nine companies will have more H100s than that by the end of this year. Not all of them will devote all those chips to a single training run, but those that do will have models at a much larger scale. Meta will have over 100,000 H100s by year's end, though a significant share will be spread across its data centers for inference. Its largest single cluster will still be well over 25,000 H100s.
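A rough reconstruction of these figures as GPU-hours times an hourly price (the small gap to the quoted ~$63 million presumably comes from a slightly higher assumed A100 price or additional overhead):

```python
# Back-of-envelope training cost: GPUs x days x 24 hours x price per GPU-hour.
def training_cost_usd(num_gpus: int, days: float, price_per_gpu_hour: float) -> float:
    return num_gpus * days * 24 * price_per_gpu_hour

# ~25k A100s for ~90-100 days at ~$1 per GPU-hour (assumptions from the text above).
print(f"A100 run: ~${training_cost_usd(25_000, 95, 1.00) / 1e6:.0f}M")   # ~$57M vs the quoted ~$63M
# ~8,192 H100s for ~55 days at ~$2 per GPU-hour.
print(f"H100 run: ~${training_cost_usd(8_192, 55, 2.00) / 1e6:.1f}M")    # ~$21.6M vs the quoted ~$21.5M
```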

By the end of this year, many companies will have enough computing resources to train models on the scale of GPT-4.

#5 Tradeoffs of MoE

MoE is a great way to reduce the number of parameters used during inference while still increasing the total parameter count, which is needed to encode more information per training token, because obtaining enough high-quality tokens is very difficult. If OpenAI were really trying to be Chinchilla-optimal, they would have had to train on twice as many tokens as they did.

Still, OpenAI made several tradeoffs. For example, MoE is notoriously difficult to handle during inference, because not every part of the model is used for every generated token. This means some parts may sit idle while others are in use when serving users, which hurts utilization badly.

Researchers have shown that using 64 to 128 experts has less loss than using 16 experts, but that is pure research. There are several reasons for reducing the number of experts. One of the reasons OpenAI chose 16 experts is because more experts have a hard time generalizing on many tasks. It may also be more difficult to achieve convergence using more experts. In such a large training run, OpenAI chose to be more conservative in the number of experts.

Furthermore, reducing the number of experts also helps their inference infrastructure. There are all kinds of difficult tradeoffs in adopting a mixture-of-experts inference architecture. Before exploring the tradeoffs OpenAI faced and the choices they made, let us start with the basic tradeoffs of LLM inference.

#6 Inference Tradeoffs

By the way, before we start, we want to point out that every LLM company we have talked to thinks Nvidia's FasterTransformer inference library is pretty bad, and TensorRT is even worse. Because people cannot take Nvidia's templates and modify them, they end up building their own solutions from scratch. If you work at Nvidia and are reading this, you need to fix this ASAP, or the default choice will shift to open tools where third-party hardware support can be added more easily. A huge wave of models is coming. If there is no software advantage in inference, and kernels still need to be hand-written, then AMD's MI300 and other hardware will have a much bigger market.

In inference for large language models, there are 3 main trade-offs that occur between batch size (number of concurrent users served) and number of chips used.

  1. Latency - The model must respond with reasonable latency. People don't want to wait several seconds before output starts flowing into a chat application. Prefill (processing the input tokens) and decoding (generating the output tokens) take different amounts of time.

  2. Throughput - The model must output a certain number of tokens per second. About 30 tokens per second is required for human use. For various other uses, both lower and higher throughputs are acceptable.

  3. Utilization - The hardware running the model must achieve high utilization, otherwise the cost will be prohibitive. Higher utilization can be reached by batching more user requests together, but this comes at the cost of higher latency and lower per-user throughput, which makes the tradeoff harder.

LLM inference is all about balancing two main factors: memory bandwidth and compute. In the most oversimplified terms, every parameter must be read, and each one is associated with 2 FLOPs. As a result, the ratio on most chips (e.g. the H100 SXM, which has only about 3.35TB/s of memory bandwidth but 2,000 TFLOPs/s of FP8 compute) is completely unbalanced for inference at batch size 1. If only one user is served at batch size 1, then the memory bandwidth needed to generate each token dominates the inference time; compute time is almost nil. To scale large language models efficiently to many users, the batch size must exceed 4, so that multiple users share the cost of reading the parameters. For example, with a batch size of 256 or 512, there are 512 FLOPs or 1,024 FLOPs per byte of memory read.
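A minimal sketch of that balance, assuming FP8 weights (1 byte per parameter), 2 FLOPs per parameter per token, and the H100 SXM's roughly 3.35 TB/s of memory bandwidth against ~2,000 TFLOPs of FP8 compute:

```python
# Memory-bound vs compute-bound decode, in the most oversimplified terms:
# each parameter is read once per step (1 byte in FP8) and costs 2 FLOPs per token in the batch.
H100_FP8_FLOPS = 2.0e15    # ~2,000 TFLOPs/s
H100_HBM_BW = 3.35e12      # ~3.35 TB/s
chip_ratio = H100_FP8_FLOPS / H100_HBM_BW   # ~600 FLOPs available per byte read from memory

def flops_per_byte(batch_size: int) -> int:
    return 2 * batch_size   # 2 FLOPs per parameter per token, 1 byte read per parameter

for bs in (1, 4, 256, 512):
    regime = "memory-bound" if flops_per_byte(bs) < chip_ratio else "compute-bound"
    print(f"batch {bs:4d}: {flops_per_byte(bs):5d} FLOPs/byte -> {regime}")
```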

This ratio is closer to the ratio between memory bandwidth and FLOPS of H100. This helps achieve higher utilization, but at the cost of higher latency.

Many people see memory capacity as the major bottleneck for LLM inference, because large models require multiple chips and more memory capacity means they fit on fewer chips. In reality, though, it is better to use more chips than capacity alone requires, so that latency is lower, throughput is higher, and larger batch sizes can be used for ever-higher utilization.

Google demonstrates these tradeoffs in their PaLM inference paper. However, it is worth noting that this is for a dense model like PaLM, not a sparse model like GPT-4. 

If an application requires the lowest latency, we need to apply more chips and divide the model into as many parts as possible. Smaller batch sizes generally allow for lower latency, but smaller batch sizes also result in poorer utilization, resulting in a higher total cost per token (in chip seconds or dollars). If an application requires offline inference, and latency is not an issue, the main goal is to maximize the throughput per chip (i.e. minimize the total cost per token).

Increasing the batch size is most efficient because larger batches generally achieve better utilization, but certain partitioning strategies that are not efficient for small batch sizes become efficient as the batch size increases. More chips and higher batch sizes are cheapest because they increase utilization, but this also introduces a third variable, network time. Certain methods of splitting models across chips are more latency efficient, but trade off with utilization. 

Both memory time and non-attention compute time are proportional to model size and inversely proportional to the number of chips. For a given partitioning layout, however, the time required for chip-to-chip communication decreases more slowly (or not at all), so as the chip count grows it becomes an increasingly important bottleneck. While we only touch on it briefly today, note that the memory requirements of the KV cache balloon as batch size and sequence length grow. If an application needs to generate text with long attention contexts, inference time increases significantly.

For a 500B+ model with multi-head attention, the attention KV cache gets large: at a batch size of 512 and a context length of 2048, the KV cache totals 3TB, three times the size of the model's parameters. This KV cache must be loaded from off-chip memory onto the chip, and while that happens the chip's compute cores sit essentially idle. Long sequence lengths are especially painful for memory bandwidth and memory capacity. OpenAI's 16k-sequence-length GPT-3.5 turbo and 32k-sequence-length GPT-4 cost much more because memory constraints prevent them from using larger batch sizes.

Lower batch sizes result in lower hardware utilization. Also, as the sequence length increases, the KV cache also becomes larger. The KV cache cannot be shared between users, so separate memory reads are required, further bottlenecking memory bandwidth.
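For intuition on how fast the KV cache grows, here is the standard multi-head-attention formula with purely hypothetical model dimensions (not the GPT-4 or PaLM configuration):

```python
# KV cache for standard multi-head attention: keys and values are stored for every layer,
# so bytes per token = 2 (K and V) x n_layers x d_model x bytes_per_value.
# Hypothetical dense-model dimensions, purely to illustrate the scaling:
N_LAYERS = 80
D_MODEL = 8_192
BYTES_PER_VALUE = 2   # FP16

kv_per_token = 2 * N_LAYERS * D_MODEL * BYTES_PER_VALUE
kv_per_sequence = kv_per_token * 2048               # one user at a 2k-token context
print(f"{kv_per_token / 1e6:.2f} MB per token, {kv_per_sequence / 1e9:.1f} GB per 2k sequence")
# Because the cache cannot be shared between users, total KV memory grows linearly with
# batch size and with sequence length -- which is why long contexts cap the batch size.
```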

#7 Inference Tradeoffs and Infrastructure for GPT-4

All of the above is hard enough for GPT-4 inference, but the model architecture uses a Mixture of Experts (MoE), which introduces a whole new set of difficulties. The forward pass for each generated token can be routed to a different set of experts. This causes problems for the tradeoff between throughput, latency, and utilization at large batch sizes.

OpenAI's GPT-4 has 16 experts, with 2 routed to per forward pass. This means that with a batch size of 8, each expert's parameter reads may serve a batch of only 1. Worse, one expert might see a batch of 8 while others see 4, 1, or 0. Every time a token is generated, the routing algorithm sends the forward pass in a different direction, causing significant variation in token-to-token latency as well as in expert batch sizes. Inference infrastructure is one of the main reasons OpenAI chose a small number of experts. Had they chosen more experts, memory bandwidth would bottleneck inference even more.

OpenAI regularly reaches batch sizes of 4k+ on its inference clusters, which means that even with optimal load balancing among experts, each expert's batch size is only ~500. Achieving this requires a very large amount of usage. We understand that OpenAI runs inference on clusters of 128 GPUs, and that they have multiple such clusters across multiple data centers and geographies. Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130B parameters, i.e. less than 30GB per GPU in FP16 and less than 15GB in FP8/int8. This allows inference to run on 40GB A100s, provided the KV cache across all batches does not grow too large.
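A toy simulation of that imbalance under the simplest possible router (uniform random top-2 of 16); the real routing algorithm is not public, and load balancing would narrow the spread:

```python
# Simulate per-expert batch sizes under uniform random top-2-of-16 routing (a toy model).
import random

NUM_EXPERTS = 16
EXPERTS_PER_TOKEN = 2

def expert_batch_sizes(batch_size: int, seed: int = 0) -> list:
    random.seed(seed)
    counts = [0] * NUM_EXPERTS
    for _ in range(batch_size):
        for expert in random.sample(range(NUM_EXPERTS), EXPERTS_PER_TOKEN):
            counts[expert] += 1
    return counts

print(expert_batch_sizes(8))      # tiny global batch: many experts see 0 or 1 tokens
print(expert_batch_sizes(4096))   # ~512 tokens per expert when the load is well spread
```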

A single layer containing various experts would not be split across different nodes, as this would make the network traffic too irregular, and it would be too expensive to recompute the KV cache between each token generation. For any future extension of the MoE model and conditional routing, how to handle the routing of the KV cache is the biggest difficulty. 

The model has 120 layers, so distributing them evenly across 15 different nodes would be trivial, but since the first node needs to handle data loading and embedding, it makes sense to place fewer layers on the head node of the inference cluster. Additionally, we have heard some rumors about speculative decoding being used for inference, which we discuss later, though we are not sure whether to believe them. That would also explain why the head node needs to contain fewer layers.

#8 Inference cost of GPT-4

 

Compared to the Davinci model with 175B parameters, GPT-4 costs 3 times as much, even though its feed-forward parameters grow only 1.6 times. This is mainly because GPT-4 requires larger clusters and achieves lower utilization.

We estimate that inference on GPT-4 with 8k sequence length costs 0.0049 cents per 1k tokens on a cluster of 128 A100s, and 0.0021 cents per 1k tokens on a cluster of 128 H100s.

It is worth noting that we assume high utilization and high batch sizes. This may be a flawed assumption, since it is clear that OpenAI is at times severely underutilized. We assume that OpenAI shuts down clusters during off-peak hours and repurposes those nodes to resume training of smaller test models from checkpoints, experimenting with various new techniques. This helps keep inference costs down. If OpenAI did not do this, their utilization would be even lower and our cost estimate would more than double.

#9 Multi-query attention

MQA is a technique other companies are using, but we want to point out that OpenAI uses it too. In short, only one head is needed for the keys and values, so the memory capacity of the KV cache can be greatly reduced. Even so, the 32k-sequence-length GPT-4 definitely cannot run on 40GB A100s, and the 8k version is capped in its maximum batch size. Without MQA, the maximum batch size of the 8k version would be so limited that it would not be economically viable.
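A minimal sketch of the saving, comparing the per-token KV footprint of full multi-head attention against a single shared key/value head (the head counts and sizes below are hypothetical, not GPT-4's):

```python
# Per-token KV cache: multi-head attention vs multi-query attention.
# Hypothetical dimensions for illustration: 120 layers, 96 heads of size 128, FP16.
N_LAYERS = 120
N_HEADS = 96
D_HEAD = 128
BYTES_PER_VALUE = 2

def kv_bytes_per_token(n_kv_heads: int) -> int:
    return 2 * N_LAYERS * n_kv_heads * D_HEAD * BYTES_PER_VALUE   # 2 = keys + values

mha = kv_bytes_per_token(N_HEADS)   # every head keeps its own keys and values
mqa = kv_bytes_per_token(1)         # one K/V head shared by all query heads
print(f"MHA: {mha / 1e6:.2f} MB/token, MQA: {mqa / 1e3:.1f} KB/token ({mha // mqa}x smaller)")
```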

#10 Continuous Batching

 

OpenAI implements variable batch sizes and continuous batching. This allows them to bound worst-case latency to some degree while optimizing inference cost. If you're new to the concept, this article by AnyScale is worth a read.

#11 On Speculative Decoding

We have heard from some reliable sources that OpenAI uses speculative decoding for GPT-4 inference. We are not sure we fully believe it. The general variation in token-to-token latency, and the difference between simple retrieval tasks and more complex tasks, seem to suggest it is possible, but there are too many variables to be sure. Just in case, we will reuse some text from "Accelerating LLM Inference with Staged Speculative Decoding" here, with slight modifications and added clarifications.

There are usually two phases to LLM inference. The first is prefill, which runs the prompt text through the model to generate the KV cache and the logits (the probability distribution over possible token outputs) for the first output token. This stage is usually fast, since the entire prompt can be processed in parallel.

The second stage is decoding. A token is selected from the output logits and fed back into the model to generate the logits for the next token. This repeats until the desired number of tokens has been generated. Because decoding must happen sequentially, with the weights streamed through the compute units each time to generate a single token, the arithmetic intensity of this second stage (i.e. FLOPs of compute per byte of memory bandwidth) is very low when run in small batches.

Therefore, decoding is usually the most expensive part of autoregressive generation. This is why in OpenAI's API calls, input tokens are much cheaper than output tokens.

The basic idea of speculative decoding is to use a smaller, faster draft model to decode several tokens in advance and then feed them to the oracle model as a single batch. If the draft model's predictions are correct, i.e. the larger model agrees with them, then several tokens can be decoded in a single batch, saving considerable memory bandwidth and time per token.

However, if the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm naturally falls back to standard token-by-token decoding. Speculative decoding can also be paired with a rejection sampling scheme to sample from the original distribution. Note that this only helps in small-batch settings where bandwidth is the bottleneck.
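For intuition only, here is a heavily simplified sketch of that accept/reject loop with greedy acceptance; real implementations use rejection sampling over the full distributions, and the model functions below are placeholders, not OpenAI's code:

```python
# Toy speculative decoding step: a cheap draft model proposes k tokens, the large model
# checks them all in one batched forward pass, and we keep the longest agreeing prefix.
def speculative_step(ctx, draft_next, large_next_batch, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    proposals = []
    for _ in range(k):
        proposals.append(draft_next(ctx + proposals))
    # 2. Large model scores ctx + proposals in ONE pass, returning its own greedy
    #    prediction for each of the k + 1 positions after ctx (one weight read total).
    verdicts = large_next_batch(ctx, proposals)
    # 3. Accept proposals while they match; on the first mismatch take the large
    #    model's token and stop (standard token-by-token behavior resumes next step).
    out = []
    for i, proposal in enumerate(proposals):
        if proposal == verdicts[i]:
            out.append(proposal)
        else:
            out.append(verdicts[i])
            return ctx + out
    out.append(verdicts[k])   # all k accepted: bonus token from the large model
    return ctx + out

# Toy usage with placeholder "models": the draft always predicts 1; the large model
# agrees for two positions, then diverges.
toy_draft = lambda ctx: 1
toy_large = lambda ctx, proposals: [1, 1, 7, 0, 0]
print(speculative_step([5, 9], toy_draft, toy_large))   # [5, 9, 1, 1, 7]
```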

Speculative decoding trades compute for bandwidth. It is an attractive performance-engineering target for two key reasons. First, it does not degrade model quality at all. Second, the gains it provides are generally orthogonal to other methods, since its performance comes from converting sequential execution into parallel execution.

Current speculative methods predict a single sequence for the batch. However, this does not scale well to large batch sizes or low draft-model alignment. Intuitively, the probability that the two models agree on a long run of consecutive tokens decreases exponentially, which means the payoff from speculative decoding diminishes quickly as arithmetic intensity scales up.
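To make the diminishing returns concrete, assume (purely as an illustration) that each drafted token matches the large model independently with probability p; the expected number of tokens produced per large-model pass is then a quickly saturating geometric sum:

```python
# Expected tokens per large-model forward pass if each drafted token matches
# independently with probability p (a simplifying assumption, not measured data).
def expected_tokens_per_pass(p: float, k: int) -> float:
    # 1 guaranteed token from the large model + expected length of the accepted draft prefix.
    return 1 + sum(p ** i for i in range(1, k + 1))

for p in (0.9, 0.7, 0.5):
    print(p, [round(expected_tokens_per_pass(p, k), 2) for k in (1, 2, 4, 8, 16)])
# Pushing k from 4 to 16 barely helps unless p is very high, so long guesses pay off poorly.
```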

We think that if OpenAI uses speculative decoding, they probably only apply it to sequences of about 4 tokens. Incidentally, the whole conspiracy theory that GPT-4's quality was lowered might simply be because they let the oracle model accept lower-probability sequences from the speculative decoding model. Another note: some have speculated that Bard uses speculative decoding, because Google waits for a sequence to finish generating before sending the whole thing to the user, but we do not believe this speculation is true.

#12 About visual multimodality

Visual multimodal capabilities are the least impressive part of GPT-4, at least compared to leading research. Certainly, no company has yet commercialized research on multimodal LLM.

It is a standalone vision encoder, separate from the text encoder, with cross-attention. We hear the architecture is similar to Flamingo. It adds more parameters on top of GPT-4's 1.8T. After text-only pre-training, it is fine-tuned on roughly another 2 trillion tokens.

For the vision model, OpenAI had hoped to train from scratch, but this method was not mature enough, so they decided to start with text to mitigate the risk.

The next model, GPT-5, will allegedly be trained for vision from scratch and be able to generate images on its own. Additionally, it will also be able to handle audio.

One of the main purposes of this vision capability is to enable autonomous agents to read web pages and transcribe the content of images and videos. Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampled frames, with Whisper run on the audio to obtain transcripts.

The interesting thing about all this over-optimization for text LLMs is that the cost profile of the vision model differs from that of the text model. In the text model, as we described in our "Amazon Cloud Crisis" article, the cost is very low. In the vision model, however, the IO for data loading is roughly 150 times higher: each vision token is about 600 bytes, versus 4 bytes for text. There is a lot of ongoing research into image compression.

This is very important for hardware vendors who are optimizing their hardware around the use cases and ratios of LLMs over the next 2-3 years. They may find themselves in a world where every model has powerful visual and audio capabilities, and discover that their architecture is poorly adapted to it. Overall, architectures will certainly evolve beyond today's simplified text-based dense and/or MoE models.
