Floating point processing

This article first appeared on the author's personal blog.

Floating-point representation

IEEE 754 is the IEEE standard for floating-point arithmetic. It was created to resolve the confusion caused by incompatible floating-point formats. Shortly after it was ratified, almost all processor manufacturers adopted it, which greatly benefited software development. A floating-point number is stored in the following format:

[Figure float.png: storage format of a floating-point number]

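A floating-point number consists of three parts: the sign bit, the exponent, and the mantissa. Its value is given by the following formula, where s is the sign bit, M is the mantissa, and E is the exponent:

$$ V = (-1)^{s} \times M \times 2^{E} $$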

The IEEE 754 standard mainly specifies two kinds of floating-point numbers, single precision (float) and double precision (double):

Type                                       Sign bits   Exponent bits   Mantissa bits
Single-precision floating point (float)    1           8               23
Double-precision floating point (double)   1           11              52
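The field widths in the table can be checked by splitting a stored value into its three parts. The following is a minimal sketch using Python's standard `struct` module; the function names `float32_fields` and `float64_fields` are illustrative and not part of the original article.

```python
import struct

def float32_fields(x):
    """Round x to single precision and return (sign, exponent field, mantissa field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # 32-bit pattern
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with the bias applied
    mantissa = bits & 0x7FFFFF        # 23 bits, hidden 1 not included
    return sign, exponent, mantissa

def float64_fields(x):
    """Return (sign, exponent field, mantissa field) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # 64-bit pattern
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits
    mantissa = bits & ((1 << 52) - 1)      # 52 bits
    return sign, exponent, mantissa
```

For example, `float32_fields(1.0)` gives a sign of 0, an exponent field of 127, and a mantissa field of 0.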

Consider the sign bit first: a sign bit of 0 means the number is positive, and a sign bit of 1 means it is negative. The exponent can be negative, so it is stored in biased (offset) form:

$$ E = e - \mathrm{bias} $$

Here E is the actual exponent, e is the exponent field stored in the floating-point number, and bias is the offset. For single precision, for example, the exponent field has 8 bits, so bias = 127 and E = e - 127; the actual exponent ranges over -126 ~ 127 (e = 0 and e = 255 are reserved for special cases). The mantissa is normalized, i.e., the binary representation has a hidden binary 1 in front of the stored mantissa bits. With s the sign bit and f the stored mantissa bits, the value is:

$$ V = (-1)^{s} \times 1.f \times 2^{e - \mathrm{bias}} $$
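As a worked example of the normalized form (the value 6.5 is chosen here for illustration and does not come from the original article), the single-precision encoding can be derived step by step:

```latex
\begin{aligned}
6.5 &= 110.1_2 = 1.101_2 \times 2^{2} \\
s &= 0, \qquad e = E + \mathrm{bias} = 2 + 127 = 129 = 10000001_2 \\
f &= 101\,0000\,0000\,0000\,0000\,0000_2 \quad (\text{the hidden leading 1 is not stored}) \\
\text{bits} &= 0\;10000001\;10100000000000000000000_2 = \mathtt{0x40D00000}
\end{aligned}
```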

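When e = 0, the number is a denormalized floating-point number: there is no hidden 1 and the exponent is fixed at 1 - bias (-126 for single precision). With s the sign bit and f the stored mantissa bits, its value is:

$$ V = (-1)^{s} \times 0.f \times 2^{1 - \mathrm{bias}} $$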
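The normalized and denormalized cases can be put side by side in code. This is a small sketch for single precision only; the name `decode_float32` is hypothetical, and the special values with e = 255 are deliberately rejected rather than decoded.

```python
def decode_float32(bits):
    """Decode a 32-bit pattern into its value (normalized and denormalized only)."""
    sign = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0xFF:
        raise ValueError("e = 255 encodes a special value, not a finite number")
    if e == 0:
        # Denormalized: no hidden 1, exponent fixed at 1 - bias = -126
        significand = f / 2**23
        exponent = 1 - 127
    else:
        # Normalized: hidden 1 in front of the stored mantissa bits
        significand = 1 + f / 2**23
        exponent = e - 127
    return (-1) ** sign * significand * 2.0 ** exponent
```

For example, `decode_float32(0x40D00000)` returns 6.5, and `decode_float32(0x00000001)` returns the smallest positive denormalized value, 2^-149.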

The standard further defines several special values:

Special value   Explanation
0               Exponent part and mantissa part are both 0
Infinity        Exponent part is all 1s (maximum exponent), mantissa part is 0
NaN             Exponent part is all 1s (maximum exponent), mantissa part is not 0
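The table of special values translates directly into a check on the exponent and mantissa fields. The sketch below assumes single precision; `classify_float32` and its string labels are illustrative names, not a standard API.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0xFF:                      # exponent field all 1s
        return "infinity" if f == 0 else "nan"
    if e == 0:                         # exponent field all 0s
        return "zero" if f == 0 else "denormalized"
    return "normalized"
```

The sign bit is ignored here, so +0 and -0 are both reported as zero, and both infinities as infinity.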

Floating-point calculation

Floating-point multiplication

Floating-point multiplication is divided into the following steps:

  • Compute the sign bit: XOR the sign bits of the two operands. If the two sign bits are the same, the result sign bit is 0; otherwise it is 1
  • Compute the original mantissa: multiply the mantissas of the two operands to obtain the original mantissa
  • Compute the original exponent: add the exponents of the two operands to obtain the original exponent
  • Normalize and round: normalize the original exponent and original mantissa to obtain the result exponent, then round the mantissa to obtain the result mantissa
[Figure mul_flow.png: floating-point multiplication flow]

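Written in scientific notation, with s_i the sign bits, M_i the mantissas, and E_i the exponents of the two operands, the multiplication is:

$$ (-1)^{s_1} M_1 \, 2^{E_1} \times (-1)^{s_2} M_2 \, 2^{E_2} = (-1)^{s_1 \oplus s_2} \, (M_1 M_2) \, 2^{E_1 + E_2} $$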

Now consider 32-bit single-precision floating point (float), which has an 8-bit exponent and a 23-bit mantissa. The original exponent and original mantissa are obtained as follows:

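  • Original exponent: the sum of the two 8-bit exponents, 9 bits in total
  • Original mantissa: the product of the two mantissas. The two 23-bit mantissa fields contribute 46 fraction bits; counting the hidden bit of each operand, the product of the two 24-bit values is 48 bits wide (2 integer bits and 46 fraction bits)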
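The first three steps (sign, original mantissa, original exponent) can be sketched as follows for two normalized single-precision operands. This is only an illustration under stated assumptions: the name `raw_product` is hypothetical, the hidden bit is included so each operand contributes a 24-bit value, and special or denormalized inputs are not handled.

```python
import struct

def raw_product(a, b):
    """Return (sign, original exponent, original mantissa) for normalized float32 a * b."""
    def fields(x):
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                               # XOR of the sign bits
    exponent = (ea - 127) + (eb - 127)           # sum of the actual exponents
    # 24-bit x 24-bit product: up to 48 bits, scaled by 2**46
    mantissa = ((1 << 23) | fa) * ((1 << 23) | fb)
    return sign, exponent, mantissa
```

Dividing the returned mantissa by 2**46 gives its value as a number between 1 and 4; for `raw_product(1.5, -2.0)` that value is 1.5, with sign 1 and original exponent 1.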

After the original exponent and original mantissa are obtained, the result is normalized. If the original exponent is less than -126, it is below the representable range: the original mantissa is shifted right, and the original exponent is increased by 1 for each bit shifted, until the original exponent reaches -126; the result is then a denormalized number. If the original exponent is not less than -126, normalization proceeds as usual:

  • Normalized number times normalized number: the mantissa product lies between 1 and 4, i.e., its highest two bits have the following possibilities:
    • Highest 2 bits are 01: the original mantissa is shifted left by 2 bits (which also removes the implicit 1), and the original exponent is used directly as the normalized exponent. Of the 48-bit product, 46 fraction bits remain, and the excess part is rounded. If the original exponent is -127, the leading 1 is kept in front of the shifted mantissa and the result is represented as a denormalized number
    • Highest 2 bits are 10 or 11: the original mantissa is shifted left by 1 bit (removing the implicit 1), and the normalized exponent is the original exponent + 1. Of the 48-bit product, 47 fraction bits remain, and the excess part is rounded
  • Normalized number times denormalized number: the mantissa product lies between 0 and 2, and is handled similarly to the case above
  • Denormalized number times denormalized number: the original exponent is -252 and the mantissa product has only 46 fraction bits, so it can never be brought up to an exponent of -126; the result is 0 directly
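The normalized-times-normalized case can be sketched as a test on the top two bits of the product. This sketch assumes a 48-bit product that includes both hidden bits (2 integer bits, 46 fraction bits); `normalize_product` is a hypothetical helper, and it leaves rounding and the range check on the exponent to later steps.

```python
def normalize_product(exponent, mantissa):
    """Normalize a 48-bit mantissa product with value in [1, 4).

    Returns (normalized exponent, fraction bits after the implicit 1,
    number of fraction bits).
    """
    top = mantissa >> 46                 # the two integer bits
    if top == 0b01:
        frac_bits = 46                   # value in [1, 2): exponent unchanged
    elif top in (0b10, 0b11):
        exponent += 1                    # value in [2, 4): exponent + 1
        frac_bits = 47
    else:
        raise ValueError("top bits 00: an operand was not normalized")
    fraction = mantissa & ((1 << frac_bits) - 1)   # drop the implicit 1
    return exponent, fraction, frac_bits
```

For 1.5 x 1.5 the product is 2.25, so the top bits are 10: the exponent becomes 1 and the fraction encodes 0.125, since 2.25 = 1.125 x 2^1.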

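After normalization, the original exponent has been corrected. If the mantissa has more than 23 bits, it also needs to be rounded. In the normalized mantissa, the high 23 bits are kept as the mantissa, and the bits from the 24th onward are the part to be rounded off. Rounding uses the "round half to even" rule (round down below one half, round up above one half, ties go to even), as follows:

  • When the discarded part is less than half of the lowest kept bit: it is dropped, and the result is rounded down
  • When the discarded part is greater than half of the lowest kept bit: a carry is added, and the result is rounded up
  • When the discarded part is exactly half of the lowest kept bit: round to even, so that the kept mantissa becomes even (carry if it is odd, otherwise drop)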
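The three rounding rules above can be written as a small integer routine. This is a sketch, not taken from the original article: `round_half_even` is an illustrative name, `value` is a non-negative integer holding the mantissa bits, and `drop` is the number of low bits to round off.

```python
def round_half_even(value, drop):
    """Drop the low `drop` bits of `value`, rounding half to even."""
    if drop <= 0:
        return value << -drop            # nothing to round off
    kept = value >> drop                 # bits that are kept
    rest = value & ((1 << drop) - 1)     # bits that are rounded off
    half = 1 << (drop - 1)               # exactly one half of the lowest kept bit
    if rest > half:
        kept += 1                        # above half: round up
    elif rest == half and kept & 1:
        kept += 1                        # tie and kept part is odd: carry to make it even
    return kept                          # below half, or tie with even kept part: drop
```

Note that a carry can overflow the kept width (for example all 1s rounding up), in which case the caller has to shift right by one and add 1 to the exponent.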

After rounding, the rounded mantissa replaces the original mantissa, and the multiplication is complete.
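Putting the steps together, a complete single-precision multiplication might look like the sketch below. It is restricted to the simplest case as an assumption: both operands and the result must be normalized, so denormalized numbers, zero, infinity, and NaN are rejected. The names `f32_bits`, `bits_f32`, and `mul_float32` are illustrative.

```python
import struct

def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(bits):
    """Value of a single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def mul_float32(a_bits, b_bits):
    """Multiply two normalized float32 bit patterns (normalized result only)."""
    sa, ea, fa = a_bits >> 31, (a_bits >> 23) & 0xFF, a_bits & 0x7FFFFF
    sb, eb, fb = b_bits >> 31, (b_bits >> 23) & 0xFF, b_bits & 0x7FFFFF
    if not (0 < ea < 255 and 0 < eb < 255):
        raise ValueError("only normalized operands are handled")
    sign = sa ^ sb                                    # step 1: sign
    product = ((1 << 23) | fa) * ((1 << 23) | fb)     # step 2: 48-bit mantissa product
    exponent = (ea - 127) + (eb - 127)                # step 3: exponent sum
    # Step 4: normalize. Top bits 10/11 mean the value is in [2, 4).
    if product >> 47:
        exponent += 1
        drop = 24
    else:
        drop = 23
    kept = product >> drop                            # hidden 1 plus 23 mantissa bits
    rest = product & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rest > half or (rest == half and kept & 1):    # round half to even
        kept += 1
        if kept >> 24:                                # rounding carried out of 24 bits
            kept >>= 1
            exponent += 1
    if not -126 <= exponent <= 127:
        raise OverflowError("result is outside the normalized range")
    return (sign << 31) | ((exponent + 127) << 23) | (kept & 0x7FFFFF)
```

A quick way to exercise the sketch is to compare `mul_float32(f32_bits(a), f32_bits(b))` with the bit pattern that `struct` produces for the same product computed in double precision.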

Floating-point addition

Floating-point addition is divided into the following steps:

  • Exponent alignment: the mantissa of the operand with the smaller exponent is shifted right, and its exponent is increased in step, until the exponents of the two operands are equal
  • Summation: the mantissas are added
  • Normalization: the exponent and mantissa are normalized, and the mantissa is rounded
[Figure add_flow.png: floating-point addition flow]

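Written in scientific notation, with s_i the sign bits, M_i the mantissas, and E_i the exponents of the two operands, the addition is:

$$ (-1)^{s_1} M_1 \, 2^{E_1} + (-1)^{s_2} M_2 \, 2^{E_2} $$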

The first step is exponent alignment: the two exponents must be made equal before the mantissas can be added. By convention the smaller exponent is aligned to the larger one, so the larger exponent becomes the original exponent. For the operand with the smaller exponent, the mantissa is shifted right; each one-bit shift increases its exponent by one, and alignment is complete when the two exponents are equal. Assuming E_1 ≥ E_2, where E_i and M_i are the exponents and mantissas of the two operands, the alignment can be expressed as:

$$ E = E_1, \qquad M_2' = M_2 \times 2^{-(E_1 - E_2)} $$
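The alignment step can be sketched on (exponent, mantissa) pairs with integer mantissas. The name `align` is hypothetical, and as a simplifying assumption the bits shifted out on the right are simply dropped here, whereas real hardware keeps a few extra bits (guard, round, and sticky bits) so that the later rounding is still correct.

```python
def align(exp_a, man_a, exp_b, man_b):
    """Align two (exponent, mantissa) pairs to the larger exponent.

    Mantissas are non-negative integers. Returns
    (common exponent, aligned mantissa a, aligned mantissa b).
    """
    if exp_a >= exp_b:
        # b has the smaller exponent: shift its mantissa right by the difference
        return exp_a, man_a, man_b >> (exp_a - exp_b)
    # a has the smaller exponent: shift its mantissa right by the difference
    return exp_b, man_a >> (exp_b - exp_a), man_b
```

For example, `align(3, 0b1100, 1, 0b1000)` keeps the first operand as it is and shifts the second mantissa right by two bits, giving a common exponent of 3.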

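The second step is summation: once alignment is complete, the two mantissas can be added directly to obtain the original mantissa. With s_i the sign bits and M_2' the aligned mantissa of the second operand, the summation is:

$$ (-1)^{s} M = (-1)^{s_1} M_1 + (-1)^{s_2} M_2' $$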

The third step is normalization and rounding: the original exponent and original mantissa are normalized and rounded to obtain the new exponent and mantissa, in the same way as for multiplication. This completes the floating-point addition.
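The three addition steps can be combined into one sketch for single precision. Several assumptions apply and are not from the original article: only normalized operands and normalized, nonzero results are handled, and alignment is done by shifting the operand with the larger exponent left instead of shifting the smaller one right. That is numerically equivalent to a right shift that keeps every shifted-out bit, and it is easy in Python because integers are unbounded. The names `f32_bits`, `bits_f32`, and `add_float32` are illustrative.

```python
import struct

def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(bits):
    """Value of a single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def add_float32(a_bits, b_bits):
    """Add two normalized float32 bit patterns (normalized, nonzero result only)."""
    def unpack(bits):
        s, e, f = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
        if not 0 < e < 255:
            raise ValueError("only normalized operands are handled")
        m = (1 << 23) | f                       # restore the hidden 1
        return (-m if s else m), e - 127        # signed mantissa, actual exponent

    ma, ea = unpack(a_bits)
    mb, eb = unpack(b_bits)
    # Step 1: align both mantissas to a common exponent (here the smaller one)
    exp = min(ea, eb)
    # Step 2: sum the aligned, signed mantissas
    total = (ma << (ea - exp)) + (mb << (eb - exp))
    if total == 0:
        raise ValueError("an exact zero result is not handled in this sketch")
    sign = 1 if total < 0 else 0
    mag = abs(total)
    # Step 3: normalize so that the leading 1 sits at bit 23, then round
    shift = mag.bit_length() - 24
    exponent = exp + shift
    if shift > 0:
        kept = mag >> shift
        rest = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rest > half or (rest == half and kept & 1):   # round half to even
            kept += 1
            if kept >> 24:                               # rounding carried out
                kept >>= 1
                exponent += 1
    else:
        kept = mag << -shift                             # cancellation: shift left
    if not -126 <= exponent <= 127:
        raise OverflowError("result is outside the normalized range")
    return (sign << 31) | ((exponent + 127) << 23) | (kept & 0x7FFFFF)
```

For instance, adding the patterns for 16777216.0 (2^24) and 1.0 gives back the pattern for 16777216.0: the exact sum lies halfway between two representable values, and the tie goes to the even mantissa.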

Origin blog.csdn.net/weixin_34087307/article/details/90995604