Float32, Float16 or BFloat16!

Why does that matter for Deep Learning?

Those are just different levels of precision. Float32 is a way to represent a floating-point number with 32 bits (each a 1 or a 0), and Float16 / BFloat16 are ways to represent the same number with just 16 bits. With Float32, we allocate the first bit to the sign, the next 8 bits to the exponent, and the remaining 23 bits to the fractional part (also called the Mantissa). We can go from the bit representation to the decimal value with a simple formula, where the Mantissa bits are read as a binary fraction:

Float32 = (-1)^sign * 2^(exponent - 127) * (1 + mantissa)

And this can range between about -3.4 × 10^38 and 3.4 × 10^38.
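To make the bit layout concrete, here is a minimal Python sketch (standard library only; the helper name float32_parts is just for illustration) that splits a Float32 into its three fields and rebuilds the value with the formula above:

import struct

def float32_parts(x):
    # Reinterpret the 32-bit float as a raw unsigned integer to get at the bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    mantissa = bits & 0x7FFFFF        # 23 bits
    # Formula from the text (valid for normalized numbers, i.e. exponent field not 0 or 255);
    # the Mantissa bits are read as a binary fraction: mantissa / 2^23.
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
    return sign, exponent, mantissa, value

print(float32_parts(3.14))  # sign=0, exponent=128 (so 2^1), and the reconstructed value is ~3.14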

Float16 uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the Mantissa with the formula:

Float16 = (-1)^sign * 2^(exponent - 15) * (1 + mantissa)

And the range is between about -6.55 × 10^4 and 6.55 × 10^4 (so a much smaller range!). To convert from Float32 to Float16, you need to drop the bits that cannot fit in the 5 exponent bits and 10 Mantissa bits. Dropping Mantissa bits only creates a rounding error, but if the magnitude of the Float32 number is greater than about 6.55 × 10^4, the exponent no longer fits and you get a float overflow (the result becomes infinity)! So it is quite possible to get conversion errors when going from Float32 to Float16.
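A quick way to see both effects, sketched here with NumPy (assuming it is installed):

import numpy as np

x = np.float32(3.1415927)
print(np.float16(x))       # ~3.14: the low Mantissa bits are rounded away

big = np.float32(70000.0)  # larger than the Float16 maximum of ~65504
print(np.float16(big))     # inf: the exponent no longer fits, so the value overflows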

Brain Float 16 (BFloat16) is another 16-bit float representation. It gives up some decimal precision but keeps as much range as Float32: 1 bit for the sign, 8 bits for the exponent, and 7 bits for the Mantissa, with the same conversion formula:

BFloat16 = (-1)^sign * 2^(exponent - 127) * (1 + mantissa)

This gives the same range as Float32 (about -3.4 × 10^38 to 3.4 × 10^38). So converting from Float32 to BFloat16 is trivial: the sign and exponent bits are copied as-is, and you only need to round the Mantissa from 23 bits down to 7.
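The sketch below (standard library only; real hardware usually rounds to nearest rather than simply truncating, so this only illustrates the bit layout, and the helper names are hypothetical) shows that the conversion amounts to dropping the low 16 bits of the Float32 pattern:

import struct

def float32_to_bfloat16_bits(x):
    # BFloat16 keeps the sign, the full 8-bit exponent, and the top 7 Mantissa bits,
    # so a simple (truncating) conversion drops the low 16 bits of the Float32 pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16

def bfloat16_bits_to_float32(b):
    # Going back is just padding the low 16 bits with zeros.
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.1415927)))  # 3.140625: same magnitude, less precision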

This is quite important for Deep Learning because, in the backpropagation algorithm, the model parameters are updated by a gradient-descent optimizer (e.g., Adam), and those computations are done in Float32 precision to limit rounding errors. The model parameters and the gradients, however, are usually stored in memory in a 16-bit format to reduce memory pressure, so we need to convert back and forth between the 16-bit and 32-bit representations (this is the idea behind mixed-precision training). BFloat16 is a good choice because it prevents float overflow errors while keeping enough precision for the forward and backward passes of the backpropagation algorithm.
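As a concrete example of how this is typically used, here is a minimal PyTorch sketch of one mixed-precision training step with torch.autocast, assuming PyTorch and a GPU that supports BFloat16 (the model, sizes, and data are placeholders): the parameters stay in Float32 while the forward and backward computations run in BFloat16, and because BFloat16 keeps the Float32 range, no loss scaling is needed (unlike Float16).

import torch

model = torch.nn.Linear(128, 10).cuda()                    # parameters stored in Float32
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam updates run in Float32
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)   # forward pass computed in BFloat16
loss.backward()                   # backward pass also uses BFloat16 where safe
optimizer.step()                  # Adam updates the Float32 parameters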

Author

AiUTOMATING PEOPLE. ABN ASIA was founded by people with deep roots in academia and work experience in the US, Holland, Hungary, Japan, South Korea, Singapore, and Vietnam. ABN Asia is where academia and technology meet opportunity. With our cutting-edge solutions and competent software development services, we're helping businesses level up and take on the global scene. Our commitment: Faster. Better. More reliable. In most cases: Cheaper as well.

Feel free to reach out to us whenever you require IT services, digital consulting, off-the-shelf software solutions, or if you'd like to send us requests for proposals (RFPs). You can contact us at [email protected]. We're ready to assist you with all your technology needs.
