How 16, 8 and 4-bit floating point numbers work

Since the first edition of Kernighan and Ritchie's book on the C language in 1978, it has been known that the single-precision "float" type is 32 bits wide and the double-precision type is 64 bits. There is also an 80-bit "long double" type with extended precision, and these types cover almost all needs for floating-point data processing. But in recent years, especially with the rise of large language models (LLMs), developers have begun to shrink floating point types as much as possible in order to reduce the storage and memory footprint of their models.

In this article, we will look at the most popular floating point formats, create a simple neural network, and see how reduced-precision types work in practice.

"Standard" 32-bit floating point numbers

Let's review the standard format first. The IEEE 754 standard for floating-point arithmetic was adopted in 1985. A typical 32-bit float looks like this:

The first bit is the sign, the next 8 bits represent the exponent, and the remaining 23 bits represent the mantissa (fraction). The final value is calculated as (-1)^S * 2^(E - 127) * (1 + M / 2^23), where S is the sign bit, E is the raw exponent and M is the mantissa treated as an integer.

Let's create a helper function to print floating point values in binary form:

 import struct
 
 
 def print_float32(val: float):
     """ Print Float32 in a binary form """
     m = struct.unpack('I', struct.pack('f', val))[0]
     return format(m, 'b').zfill(32)
 
 
 print_float32(0.15625)
 
 # > 00111110001000000000000000000000 

Let's also create the inverse transformation function; it will be useful later:

 def ieee_754_conversion(sign, exponent_raw, mantissa, exp_len=8, mant_len=23):
     """ Convert binary data into the floating point value """
     sign_mult = -1 if sign == 1 else 1
     exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
     mant_mult = 1
     for b in range(mant_len - 1, -1, -1):
         if mantissa & (2 ** b):
             mant_mult += 1 / (2 ** (mant_len - b))
 
     return sign_mult * (2 ** exponent) * mant_mult
 
 
 ieee_754_conversion(0b0, 0b01111100, 0b01000000000000000000000)
 
 #> 0.15625
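
As a sanity check, we can decode that bit pattern by hand using the formula above (a minimal sketch that reuses the print_float32 helper):

 bits = int(print_float32(0.15625), 2)   # 0b00111110001000000000000000000000
 sign = bits >> 31                       # 0
 exponent = (bits >> 23) & 0xFF          # 124, i.e. 124 - 127 = -3
 mantissa = bits & 0x7FFFFF              # fraction bits, equal to 0.25 here
 
 value = (-1) ** sign * 2 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
 print(value)
 
 # > 0.15625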

As a developer, you probably know that floating point types have limited accuracy, for example:

 val = 3.14
 print(f"{val:.20f}")
 
 # > 3.14000000000000012434

In general, this isn't a big problem, but the fewer bits we have, the lower the accuracy we get.

16-bit floating point numbers (FP16)

There was not much demand for this format in the early days, and it was not until 2008 that the 16-bit floating point type was added to the IEEE 754 standard. It has a sign bit, 5 exponent bits and 10 mantissa (fraction) bits:

Its conversion logic is the same as that of 32-bit floating point numbers, but with lower precision. Print a 16-bit floating point number in binary form:

 import numpy as np
 
 
 def print_float16(val: float):
     """ Print Float16 in a binary form """
     m = struct.unpack('H', struct.pack('e', np.float16(val)))[0]
     return format(m, 'b').zfill(16)
 
 print_float16(3.14)
 
 # > 0100001001001000

Using the method we used before, we can do the reverse transformation:

 ieee_754_conversion(0, 0b10000, 0b1001001000, exp_len=5, mant_len=10)
 
 # > 3.140625

We can also find the maximum value that can be represented in Float16:

 ieee_754_conversion(0, 0b11110, 0b1111111111, exp_len=5, mant_len=10)
 
 #> 65504.0

0b11110 is used here because in the IEEE 754 standard, the all-ones exponent 0b11111 is reserved for special values (infinity and NaN). In the same way, we can find the smallest normalized Float16 value:

 ieee_754_conversion(0, 0b00001, 0b0000000000, exp_len=5, mant_len=10)
 
 #> 0.00006104
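
We can cross-check these limits with numpy's finfo:

 import numpy as np
 
 
 print(np.finfo(np.float16).max)    # largest finite float16: 65504
 print(np.finfo(np.float16).tiny)   # smallest normalized float16: about 6.1e-05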

For most developers, types like this are "uncharted territory" because there are no standard 16-bit floating point types in C++.

16-bit " bfloat " (BFP16)

This floating-point format was developed by a team at Google specifically for machine learning (the "b" in the name stands for "brain"). The type is a modification of the "standard" 16-bit float: the exponent is expanded to 8 bits, so the dynamic range of bfloat16 is effectively the same as that of a 32-bit float. But the mantissa is reduced to 7 bits:

Let's do a similar calculation as before:

 ieee_754_conversion(0, 0b10000000, 0b1001001, exp_len=8, mant_len=7)
 
 #> 3.140625

You can see that the bfloat16 format has a wider range due to the larger exponent:

 ieee_754_conversion(0, 0b11111110, 0b1111111, exp_len=8, mant_len=7)
 
 #> 3.3895313892515355e+38
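
Because bfloat16 keeps the same 8-bit exponent as a 32-bit float, it can be pictured as a float32 with the lower 16 bits simply cut off. Here is a minimal sketch of that truncation (note that real conversions usually round to the nearest value rather than truncate):

 import struct
 
 
 def float_to_bfloat16_bits(val: float) -> str:
     """ Take the top 16 bits of a float32: the bfloat16 bit pattern """
     bits32 = struct.unpack('I', struct.pack('f', val))[0]
     return format(bits32 >> 16, 'b').zfill(16)
 
 
 print(float_to_bfloat16_bits(3.14))
 
 # > 0100000001001000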

This is much better than 65504.0 in the previous example, but as mentioned before, bfloat16 is less precise because it has fewer bits in the mantissa. You can test both types in TensorFlow:

 import tensorflow as tf
 
 
 print(f"{tf.constant(1.2, dtype=tf.float16).numpy().item():.12f}")
 
 # > 1.200195312500
 
 print(f"{tf.constant(1.2, dtype=tf.bfloat16).numpy().item():.12f}")
 
 # > 1.203125000000

8-bit floating point (FP8)

This (relatively new) format was proposed in 2022 and was also created for machine learning: as models get larger, fitting them into GPU memory becomes a challenge. There are two variants of the FP8 format: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa).

Let's get the maximum possible value in both formats. Note that E4M3 does not reserve the all-ones exponent for infinity (only the all-ones pattern S.1111.111 encodes NaN), which is why its largest finite value uses mantissa 0b110:

 ieee_754_conversion(0, 0b1111, 0b110, exp_len=4, mant_len=3)
 
 # > 448.0
 
 ieee_754_conversion(0, 0b11110, 0b11, exp_len=5, mant_len=2)
 
 # > 57344.0

It is also possible to use FP8 with TensorFlow:

 import tensorflow as tf
 from tensorflow.python.framework import dtypes
 
 
 a_fp8 = tf.constant(3.14, dtype=dtypes.float8_e4m3fn)
 print(a_fp8)
 
 # > 3.25
 
 a_fp8 = tf.constant(3.14, dtype=dtypes.float8_e5m2)
 print(a_fp8)
 
 # > 3.0

Let's draw a sine wave in these two types:

 import numpy as np
 import tensorflow as tf
 from tensorflow.python.framework import dtypes
 import matplotlib.pyplot as plt
 
 
 length = np.pi * 4
 resolution = 200
 xvals = np.arange(0, length, length / resolution)
 wave = np.sin(xvals)
 wave_fp8_1 = tf.cast(wave, dtypes.float8_e4m3fn)
 wave_fp8_2 = tf.cast(wave, dtypes.float8_e5m2)
 
 plt.rcParams["figure.figsize"] = (14, 5)
 plt.plot(xvals, wave_fp8_1.numpy())
 plt.plot(xvals, wave_fp8_2.numpy())
 plt.show()

As you can see, there are some differences between the two FP8 formats: some loss of accuracy is clearly visible, but both curves still look like a sine wave.

4-bit floating point types (FP4 and NF4)

Now let's look at the craziest thing of all: 4-bit floating point values! A 4-bit float (FP4) is the smallest format that still follows the IEEE-style layout, with 1 sign bit, 2 exponent bits and 1 mantissa bit:
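
With only 16 possible codes, we can simply list every positive value such a format can represent. A quick sketch using the ieee_754_conversion helper from above (it treats every code as a normalized number and ignores zero and subnormals, which real FP4 implementations such as bitsandbytes handle specially):

 for exp in range(4):
     for mant in range(2):
         print(ieee_754_conversion(0, exp, mant, exp_len=2, mant_len=1), end=" ")
 
 # > 0.5 0.75 1.0 1.5 2.0 3.0 4.0 6.0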

The second possible 4-bit implementation is the so-called NormalFloat (NF4) data type. NF4 values are optimized for storing normally distributed variables. This is hard to do with the other data types, but all 16 possible NF4 values can simply be listed:

 [-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453, 
  -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
   0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224, 
   0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0]

Both the FP4 and NF4 types have corresponding implementations in the bitsandbytes library. As an example, let's convert the array [1.0, 2.0, 3.0, 4.0] to FP4:

 import torch
 from bitsandbytes import functional as bf
 
 
 def print_uint(val: int, n_digits=8) -> str:
     """ Convert 42 => '00101010' """
     return format(val, 'b').zfill(n_digits)
 
 
 device = torch.device("cuda")
 x = torch.tensor([1.0, 2.0, 3.0, 4.0], device=device)
 x_4bit, qstate = bf.quantize_fp4(x, blocksize=64)
 
 print(x_4bit)
 # > tensor([[117], [35]], dtype=torch.uint8)
 
 print_uint(x_4bit[0].item())
 # > 01110101
 print_uint(x_4bit[1].item())
 # > 00100011
 
 print(qstate)
 # > (tensor([4.]), 
 # >  'fp4', 
 # >  tensor([ 0.0000,  0.0052,  0.6667,  1.0000,  0.3333,  0.5000,  0.1667,  0.2500,
 # >           0.0000, -0.0052, -0.6667, -1.0000, -0.3333, -0.5000, -0.1667, -0.2500])])

As output, we get two objects: an array of two uint8 values, [117, 35], which packs our four numbers (two 4-bit values per byte), and a "state" object that contains the scaling factor 4.0 and a tensor with all 16 FP4 values.

For example, the first 4-bit code is "0111" (= 7), and in the state object we can see that the corresponding floating point value is 0.25; 0.25 * 4 = 1.0. The second code is "0101" (= 5), which maps to 0.5, and 0.5 * 4 = 2.0. For the third number, "0010" (= 2) maps to 0.6667, and 0.6667 * 4 = 2.667, which is close to but not equal to 3.0; there is clearly some loss of precision for 4-bit values. For the last value, "0011" (= 3) maps to 1.0, and 1.0 * 4 = 4.0.
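
To make the packing explicit, here is a minimal manual decode that reuses the numbers printed above (the scale 4.0 and the 16-entry code table from qstate):

 code = [ 0.0000,  0.0052,  0.6667,  1.0000,  0.3333,  0.5000,  0.1667,  0.2500,
          0.0000, -0.0052, -0.6667, -1.0000, -0.3333, -0.5000, -0.1667, -0.2500]
 absmax = 4.0
 packed = [117, 35]   # the two uint8 values from x_4bit
 
 values = []
 for byte in packed:
     hi, lo = byte >> 4, byte & 0x0F   # two 4-bit codes per byte
     values += [code[hi] * absmax, code[lo] * absmax]
 
 print(values)
 
 # > [1.0, 2.0, 2.6668, 4.0]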

The reverse conversion does not have to be done by hand; bitsandbytes does it for us automatically:

 x = bf.dequantize_fp4(x_4bit, qstate)
 print(x)
 
 # > tensor([1.000, 2.000, 2.666, 4.000])

The 4-bit format also has a limited dynamic range. For example, the array [1.0, 2.0, 3.0, 64.0] will be converted to [0.333, 0.333, 0.333, 64.0]: the large outlier forces a large scaling factor, and the three smaller values all collapse onto the same 4-bit code. We can reproduce this with the same quantize/dequantize calls (using the device created earlier):
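
 x = torch.tensor([1.0, 2.0, 3.0, 64.0], device=device)
 x_4bit, qstate = bf.quantize_fp4(x, blocksize=64)
 print(bf.dequantize_fp4(x_4bit, qstate))
 
 # > approximately [0.333, 0.333, 0.333, 64.0]

But for normalized data, this is still acceptable. As an example, let's draw a sine wave in FP4 format: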

 import matplotlib.pyplot as plt
 import numpy as np
 import torch
 from bitsandbytes import functional as bf
 
 
 length = np.pi * 4
 resolution = 256
 xvals = np.arange(0, length, length / resolution)
 wave = np.sin(xvals)
 
 x_4bit, qstate = bf.quantize_fp4(torch.tensor(wave, dtype=torch.float32, device=device), blocksize=64)
 dq = bf.dequantize_fp4(x_4bit, qstate)
 
 plt.rcParams["figure.figsize"] = (14, 5)
 plt.title('FP4 Sine Wave')
 plt.plot(xvals, wave)
 plt.plot(xvals, dq.cpu().numpy())
 plt.show()

You can see the loss of accuracy:

As a special note, at the time of writing this article, the 4-bit type NF4 is only available in CUDA; CPU computing is not currently supported.

Testing

As the final step in this article, let's create a neural network model and test it. With the transformers library, you can load a pretrained model in 4-bit simply by setting the load_in_4bit parameter to True, but that doesn't help us understand how it works. So instead we will create a small neural network, train it, and then run it with 4-bit precision.
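
For reference, the transformers route looks roughly like this (a sketch only; the model name is just an example, and newer library versions prefer passing a BitsAndBytesConfig instead of the bare flag):

 from transformers import AutoModelForCausalLM
 
 
 model = AutoModelForCausalLM.from_pretrained(
     "facebook/opt-350m",     # example model name
     load_in_4bit=True,       # quantize the weights to 4-bit while loading
     device_map="auto",       # place the layers on the available GPU(s)
 )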

First, let's create a neural network model:

 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 import torch.optim as optim
 from typing import Any
 
 
 class NetNormal(nn.Module):
     def __init__(self):
         super().__init__()
         self.flatten = nn.Flatten()
         self.model = nn.Sequential(
             nn.Linear(784, 128),
             nn.ReLU(),
             nn.Linear(128, 64),
             nn.ReLU(),
             nn.Linear(64, 10)
         )
       
     def forward(self, x):
         x = self.flatten(x)
         x = self.model(x)
         return F.log_softmax(x, dim=1)

We use the MNIST dataset, which is split into 60,000 training images and 10,000 test images; the split is selected with the train=True|False parameter of datasets.MNIST.

 from torchvision import datasets, transforms
 
 
 batch_size = 64
 
 train_loader = torch.utils.data.DataLoader(
     datasets.MNIST("data", train=True, download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
     batch_size=batch_size, shuffle=True)
 
 test_loader = torch.utils.data.DataLoader(
     datasets.MNIST("data", train=False, transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
     batch_size=batch_size, shuffle=True)

The training process runs in the "normal" way, using the default precision:

 import time
 
 device = torch.device("cuda")
 
 batch_size = 64
 epochs = 4
 log_interval = 500
 
 def train(model: nn.Module, train_loader: torch.utils.data.DataLoader,
           optimizer: Any, epoch: int):
     """ Train the model """
     model.train()
     for batch_idx, (data, target) in enumerate(train_loader):
         data, target = data.to(device), target.to(device)
         optimizer.zero_grad()
         output = model(data)
         loss = F.nll_loss(output, target)
         loss.backward()
         optimizer.step()
         
         if batch_idx % log_interval == 0:
             print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}]\tLoss: {loss.item():.5f}')
             
             
 def test(model: nn.Module, test_loader: torch.utils.data.DataLoader):
     """ Test the model """
     model.eval()
     test_loss = 0
     correct = 0
     with torch.no_grad():
         for data, target in test_loader:
             data, target = data.to(device), target.to(device)
             t_start = time.monotonic()
             output = model(data)
             test_loss += F.nll_loss(output, target, reduction='sum').item()
             pred = output.argmax(dim=1, keepdim=True)
             correct += pred.eq(target.view_as(pred)).sum().item()
 
     test_loss /= len(test_loader.dataset)
     t_diff = time.monotonic() - t_start
 
     print(f"Test set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset)}%)\n")
 
 
 def get_size_kb(model: nn.Module):
     """ Get model size in kilobytes """
     size_model = 0
     for param in model.parameters():
         if param.data.is_floating_point():
             size_model += param.numel() * torch.finfo(param.data.dtype).bits
         else:
             size_model += param.numel() * torch.iinfo(param.data.dtype).bits
     print(f"Model size: {size_model / (8*1024)} KB")
 
 
 # Train
 model = NetNormal().to(device)
 optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
 for epoch in range(1, epochs + 1):
     train(model, train_loader, optimizer, epoch)
     test(model, test_loader)
 
 get_size_kb(model)
 
 # Save
 torch.save(model.state_dict(), "mnist_model.pt")

There is also a "get_size_kb" method here to get the model size in kb.

The training process is as follows:

 Train Epoch: 1 [0/60000] Loss: 2.31558
 Train Epoch: 1 [32000/60000] Loss: 0.53704
 Test set: Average loss: 0.2684, Accuracy: 9225/10000 (92.25%)
 
 Train Epoch: 2 [0/60000] Loss: 0.19791
 Train Epoch: 2 [32000/60000] Loss: 0.17268
 Test set: Average loss: 0.1998, Accuracy: 9401/10000 (94.01%)
 
 Train Epoch: 3 [0/60000] Loss: 0.30570
 Train Epoch: 3 [32000/60000] Loss: 0.33042
 Test set: Average loss: 0.1614, Accuracy: 9530/10000 (95.3%)
 
 Train Epoch: 4 [0/60000] Loss: 0.20046
 Train Epoch: 4 [32000/60000] Loss: 0.19178
 Test set: Average loss: 0.1376, Accuracy: 9601/10000 (96.01%)
 
 Model size: 427.2890625 KB

Our simple model achieved 96% accuracy with a neural network size of 427 KB.

Next, we replace the "Linear" layers with "Linear8bitLt":

 from bitsandbytes.nn import Linear8bitLt
 
 
 class Net8Bit(nn.Module):
     def __init__(self):
         super().__init__()
         self.flatten = nn.Flatten()
         self.model = nn.Sequential(
             Linear8bitLt(784, 128, has_fp16_weights=False),
             nn.ReLU(),
             Linear8bitLt(128, 64, has_fp16_weights=False),
             nn.ReLU(),
             Linear8bitLt(64, 10, has_fp16_weights=False)
         )
       
     def forward(self, x):
         x = self.flatten(x)
         x = self.model(x)
         return F.log_softmax(x, dim=1)
 
 
 device = torch.device("cuda")
 
 # Load
 model = Net8Bit()
 model.load_state_dict(torch.load("mnist_model.pt"))
 get_size_kb(model)
 print(model.model[0].weight)
 
 # Convert: moving the model to the GPU quantizes the Linear8bitLt weights to int8
 model = model.to(device)
 
 get_size_kb(model)
 print(model.model[0].weight)
 
 # Run
 test(model, test_loader)

The result is as follows:

 Model size: 427.2890625 KB
 Parameter(Int8Params([[ 0.0071,  0.0059,  0.0146,  ...,  0.0111, -0.0041,  0.0025],
             ...,
             [-0.0131, -0.0093, -0.0016,  ..., -0.0156,  0.0042,  0.0296]]))
 
 Model size: 107.4140625 KB
 Parameter(Int8Params([[  9,   7,  19,  ...,  14,  -5,   3],
             ...,
             [-21, -15,  -3,  ..., -25,   7,  47]], device='cuda:0',
            dtype=torch.int8))
 
 Test set: Average loss: 0.1347, Accuracy: 9600/10000 (96.0%)

The original model was loaded in the standard floating point format: its size is unchanged and the weights look like [0.0071, 0.0059, …]. After the model is moved to the GPU, the weights are quantized to int8 and the model size shrinks by a factor of 4. As we can see, the int8 weights are in a proportionally similar range, so the conversion is straightforward, and during the test run there was practically no accuracy loss at all.

Let's continue with the 4-bit version:

 from bitsandbytes.nn import LinearFP4, LinearNF4
 
 
 class Net4Bit(nn.Module):
     def __init__(self):
         super().__init__()
         self.flatten = nn.Flatten()
         self.model = nn.Sequential(
             LinearFP4(784, 128),
             nn.ReLU(),
             LinearFP4(128, 64),
             nn.ReLU(),
             LinearFP4(64, 10)
         )
       
     def forward(self, x):
         x = self.flatten(x)
         x = self.model(x)
         return F.log_softmax(x, dim=1)
 
 
 # Load
 model = Net4Bit()
 model.load_state_dict(torch.load("mnist_model.pt"))
 get_size_kb(model)
 print(model.model[2].weight)
 
 # Convert: moving the model to the GPU quantizes the LinearFP4 weights to 4-bit
 model = model.to(device)
 
 get_size_kb(model)
 print(model.model[2].weight)
 
 # Run
 test(model, test_loader)

The output looks like this:

 Model size: 427.2890625 KB
 Parameter(Params4bit([[ 0.0916, -0.0453,  0.0891,  ...,  0.0430, -0.1094, -0.0751],
             ...,
             [-0.0079, -0.1021, -0.0094,  ..., -0.0124,  0.0889,  0.0048]]))
 
 Model size: 54.1015625 KB
 Parameter(Params4bit([[ 95], [ 81], [109],
             ...,
             [ 34], [ 46], [ 33]], device='cuda:0', dtype=torch.uint8))
 
 Test set: Average loss: 0.1414, Accuracy: 9579/10000 (95.79%)

The model size was reduced by 8 times, from 427 KB to 54 KB, while the accuracy dropped only slightly (from 96.01% to 95.79%). How is this possible? The answer, at least for this model, is simple:

  • The weights are more or less evenly distributed, so the accuracy loss from quantization is not too great.
  • The network uses a (log-)softmax output, and only the index of the maximum value determines the predicted class. For finding that index, the exact value doesn't matter: it makes no difference whether it is 0.8 or 0.9 when the other values are 0.1 or 0.2 (a toy illustration follows below).
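
A toy illustration of the second point (hypothetical numbers, not taken from the model):

 import numpy as np
 
 
 logits_fp32 = np.array([0.1, 0.2, 0.8, 0.1])   # full-precision outputs
 logits_4bit = np.array([0.2, 0.1, 0.9, 0.2])   # the same outputs with quantization noise
 
 print(np.argmax(logits_fp32), np.argmax(logits_4bit))
 
 # > 2 2  (the predicted class is the same)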

Let's load a digit from the test dataset and check the model outputs:

 dataset = datasets.MNIST('data', train=False, transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ]))
 
 np.set_printoptions(precision=3, suppress=True)  # No scientific notation
 
 data_in = dataset[4][0]
 for x in range(28):
     for y in range(28):
         print(f"{data_in[0][x][y]: .1f}", end=" ")
     print()

The printout shows the digit we want to predict:

Let's see what the "standard" model will return:

 # Suppress scientific notation
 np.set_printoptions(precision=2, suppress=True)  
 
 # Predict
 with torch.no_grad():
     output = model(data_in.to(device))
     print(output[0].cpu().numpy())
     ind = output.argmax(dim=1, keepdim=True)[0].cpu().item()
     print("Result:", ind)
 
 # > [ -8.27 -13.89  -6.89 -11.13  -0.03  -8.09  -7.46  -7.6   -6.43  -3.77]
 # > Result: 4

The largest element (-0.03) is the fifth one; since numpy arrays are indexed from 0, its index is 4, which corresponds to the digit "4".

This is the output of the 8-bit model:

 # > [ -9.09 -12.66  -8.42 -12.2   -0.01  -9.25  -8.29  -7.26  -8.36  -4.45]
 # > Result: 4

And the 4-bit model's output is as follows:

 # > [ -8.56 -12.12  -7.52 -12.1   -0.01  -8.94  -7.84  -7.41  -7.31  -4.45]
 # > Result: 4

You can see that the actual output values are slightly different, but the maximum index remains the same.

Summary

In this article, we tested different scenarios with 16-bit, 8-bit, and 4-bit floating point numbers, created a neural network, and were able to run it with 8-bit and 4-bit precision. By reducing the precision from standard floating point to 4-bit floating point, the memory footprint is reduced by a factor of 8 with minimal loss of precision.

As mentioned in a previous article, even 4 bits is not the limit: the GPTQ paper discusses quantizing weights to 3 or even 2 bits, and ExLlamaV2 can apply different levels of quantization to different layers.

https://avoid.overfit.cn/post/51c2993a2f824910b241199a52d2c994

Author: Dmitrii Eliuseev
