Finite precision representations#
Binary representation#
Humans use a base-10 numbering system called ‘decimal’, probably because that’s how many fingers we (typically) have.
Except for the Mayans, who used base 20: the ‘vigesimal’ system! I guess they could only do math sitting down!
The number 1305 is expressed in decimal with each column indicating a power of the base (\(10\)): \(1305_{10} = 5 \times 10^0 + 0 \times 10^1 + 3 \times 10^2 + 1 \times 10^3\)
NB: We wrote the order of digits backwards so we could go in increasing powers.
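We can check this place-value expansion with a one-line sum (digits listed least-significant first, matching the increasing powers above):

```python
# Digits of 1305, least-significant first, matching the expansion above
digits = [5, 0, 3, 1]
value = sum(d * 10**i for i, d in enumerate(digits))
print(value)  # 1305
```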
Computers use base 2 (binary) since a bit can only be 0 or 1. The same number is written as:
# prompt: convert 1305 into binary
from numpy import binary_repr
binary_repr(1305)
'10100011001'
which we can check:
\(10100011001_2 = 1 \times 2^0 + 0 \times 2^1 + 0 \times 2^2 + 1 \times 2^3 + 1 \times 2^4 + ... + 1 \times 2^8 + 0 \times 2^{9} + 1 \times 2^{10}\) \(=1305_{10}\)
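Python's built-in `int` can parse a binary string directly, which confirms the expansion:

```python
# Parse the binary string back into a decimal integer
print(int('10100011001', 2))  # 1305
```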
We can also use a decimal point with binary.
\(54.75_{10} = 5\times 10^{-2} + 7\times 10^{-1} + 4 \times 10^{0} + 5 \times 10^{1}\)
# prompt: express 54.75 in binary
from numpy import binary_repr

def decimal_to_binary(number):
    # Convert the integer part to binary
    integer_part = binary_repr(int(number))
    # Convert the fractional part to binary, one bit at a time
    fractional_part = number - int(number)
    binary_fractional_part = ""
    for i in range(20):
        fractional_part *= 2
        if fractional_part >= 1:
            binary_fractional_part += "1"
            fractional_part -= 1
        else:
            binary_fractional_part += "0"
        if fractional_part == 0:
            break
    # Combine the integer and fractional parts
    binary_representation = integer_part + "." + binary_fractional_part
    print(binary_representation)

decimal_to_binary(54.75)
110110.11
Check: \(1 \times 2^{-2} + 1 \times 2^{-1} + 0 \times 2^{0} + 1 \times 2^{1} + 1 \times 2^{2} + 0 \times 2^{3} + 1 \times 2^{4} + 1 \times 2^{5} = 54.75_{10}\)
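The same check can be done in code: parse the integer part as base 2 and add the two fractional bits explicitly:

```python
# Evaluate 110110.11 in base 2: integer part plus the two fractional bits
value = int('110110', 2) + 1 * 2**-1 + 1 * 2**-2
print(value)  # 54.75
```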
Example Convert 0.1 to binary#
decimal_to_binary(0.1)
0.00011001100110011001
The binary representation of 0.1 is a repeating number!
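Since 0.1 has no finite binary representation, it must be stored approximately, which leads to a famous surprise:

```python
# 0.1 and 0.2 are both stored approximately, so their sum misses 0.3 slightly
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004
```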
Precision#
Computers use a standard data unit, called a word. The number of bits in each word is called the precision and comes, by IEEE convention, in multiples of 32:

| Precision | # bits |
|---|---|
| single | 32 |
| double | 64 |
| quad | 128 |
For comparison, the previous number 10100011001 takes 11 bits.
The most common precision in modern computing, and the standard in Python 3, is double precision. Quad precision is occasionally available for high-precision calculations.
Integers#
Integers are a fundamental data type when you don’t need fractions, and they do not suffer from roundoff error! However, since they have a finite number of digits (bits), their size is limited.

The number of distinct values an integer can store is \(2^{bits}\). Integers are signed, so we must also record whether the number is positive or negative. Furthermore, \(-0 = +0\) is a redundancy; giving that spare pattern to the negatives makes their range one larger than the positives’.

The minimum and maximum numbers an integer can represent are therefore:

\(min = -2^{bits-1}\)

\(max = 2^{bits-1} -1\)
You may be tempted to use a bit to represent the sign. This is not modern practice for integers, which instead use a method called Two’s complement.
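numpy’s `binary_repr` can display two’s complement bit patterns when given a `width`, which makes the asymmetric range visible (8-bit integers here for readability):

```python
from numpy import binary_repr

# 8-bit two's complement: the representable range is -128 .. 127
print(binary_repr(5, width=8))     # 00000101
print(binary_repr(-5, width=8))    # 11111011
print(binary_repr(-128, width=8))  # 10000000  (the most negative value)
print(binary_repr(127, width=8))   # 01111111  (the most positive value)
```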
Example: What is the largest integer a double precision variable can store?#
print('min ', -2**63, '\nmax ', 2**63-1)
print('\nCheck with the built-in numpy examiner')
import numpy as np
print(np.iinfo(np.int64))
min -9223372036854775808
max 9223372036854775807
Check with the built-in numpy examiner
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
Example 2: Let’s break it!#
# Works
print(np.int64(2**62))
#Overflow error
print(np.int64(2**63))
4611686018427387904
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[5], line 5
2 print(np.int64(2**62))
4 #Overflow error
----> 5 print(np.int64(2**63))
OverflowError: Python int too large to convert to C long
#Works
print(np.int64(1000000000000000000))
#Overflow error
print(np.int64(10000000000000000000))
1000000000000000000
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[6], line 5
2 print(np.int64(1000000000000000000))
4 #Overflow error
----> 5 print(np.int64(10000000000000000000))
OverflowError: Python int too large to convert to C long
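Note that the overflow is a property of the fixed-width numpy type, not of Python itself: plain Python integers are arbitrary precision and grow as needed.

```python
# Plain Python integers are arbitrary precision: no overflow
big = 10**19                # the same value that overflowed np.int64 above
print(big > 2**63 - 1)      # True: it exceeds the int64 maximum, yet works fine
```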
Floating point numbers#
Writing a number like \(10 000 000 000 000 000 000\) isn’t really useful. It is much better to isolate the magnitude in units, or as an exponent:
\(10 000 000 000 000 000 000 = 10^{19}\)
Floating point Decimal numbers (aka: Engineering notation)#
Remove leading zeros and placeholder trailing zeros by letting the decimal point “float”, separating the fractional part (the mantissa, or significand) from the order of magnitude (the exponent).

Engineering notation = \(mantissa \times 10^{exponent}\)
| Decimal | Engineering | Mantissa | Exponent |
|---|---|---|---|
| \(265.73\) | \(2.6573 \times 10^2\) | 2.6573 | 2 |
| \(.0001\) | \(1 \times 10^{-4}\) | 1 | -4 |
| \(-0.0034123\) | \(-3.4123 \times 10^{-3}\) | -3.4123 | -3 |
| \(1500^*\) | \(1.5 \times 10^3\) | 1.5 | 3 |
*only if the trailing zeros are not actually measured.
Note:

- The mantissa is a fraction, but if we normalize it so the point sits after the first digit, we can represent it as an integer.
- The exponent is the power of the number system’s base, in this case \(10\).
# prompt: Convert a number to engineering notation
def to_engineering_notation(number):
    if number == 0:
        print("0E0")
        return
    exponent = 0
    while abs(number) < 1:
        number *= 10
        exponent -= 1
    while abs(number) >= 10:
        number /= 10
        exponent += 1
    print(f"{number}E{exponent}")

to_engineering_notation(265.73)
to_engineering_notation(0.0001)
to_engineering_notation(-.0034123)
to_engineering_notation(1500)
to_engineering_notation(0)
# You can also use log10 to calculate the exponent directly
2.6573E2
1.0E-4
-3.4123E-3
1.5E3
0E0
Floating point Binary numbers#
The same technique can be applied to binary by using base 2:
\(mantissa \times 2^{exponent}\)
Example convert 54.75 into floating point binary#
\(54.75_{10} = 110110.11_2 \)
\(= 1.1011011_2 \times 2^5\)
\(= 1.1011011_2 \times 2^{101_2}\)
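Python’s `math.frexp` performs exactly this mantissa/exponent decomposition, though it normalizes the mantissa into the range \([0.5, 1)\) rather than \([1, 2)\):

```python
import math

# frexp returns (m, e) with number == m * 2**e and 0.5 <= |m| < 1
m, e = math.frexp(54.75)
print(m, e)      # 0.85546875 6
print(m * 2**e)  # 54.75
```

Note that \(0.85546875 \times 2^6\) and \(1.1011011_2 \times 2^5\) are the same number; only the normalization convention differs.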
Precision in floating point numbers#
If the mantissa and exponent had infinite range, we could represent all numbers in floating point. However, we are once again limited by the number of bits (precision). Now the bits are divided between sign, mantissa, and exponent by convention: the IEEE Standard for Floating-Point Arithmetic (IEEE 754):
| Precision | # bits | Sign | Exponent | Mantissa |
|---|---|---|---|---|
| Single | 1/8/23 | S | EEEEEEEE | FFFFFFFFFFFFFFFFFFFFFFF |
| Double | 1/11/52 | S | EEEEEEEEEEE | FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF |
| Quad | 1/15/112 | S | EEEEEEEEEEEEEEE | FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF |
Note that the ‘sign’ bit is the sign of the number itself. The sign of the exponent is encoded within the ‘exponent’ block.
The actual storage is a bit complicated, but the key for us is the finite precision of the mantissa.
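We can peek at that storage by reinterpreting the 64 bits of a double. This sketch extracts the three fields for \(54.75 = 1.1011011_2 \times 2^5\); two of the complications are visible here: the exponent is stored with a bias of 1023, and the leading 1 of the mantissa is implicit (not stored).

```python
import struct

# Reinterpret the 64 bits of a double as an unsigned integer
bits = struct.unpack('>Q', struct.pack('>d', 54.75))[0]

sign     = bits >> 63              # 1 bit
exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
mantissa = bits & ((1 << 52) - 1)  # 52 bits; the leading 1 is implicit

print(sign)                          # 0  (positive)
print(exponent - 1023)               # 5  (the true exponent)
print(format(mantissa, '052b')[:8])  # 10110110  (the fraction bits 1011011...)
```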
Example How is 0.1 actually stored?#
print(format(0.1, '.55f'))
0.1000000000000000055511151231257827021181583404541015625
error = 0.0000000000000000055511151231257827021181583404541015625
eps_r = error / 0.1
print(eps_r)
5.551115123125783e-17
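This relative error is characteristic of double precision: it is about a quarter of machine epsilon, \(2^{-52}\), the gap between 1 and the next representable double:

```python
import numpy as np

# Machine epsilon for double precision is exactly 2**-52
eps = np.finfo(np.float64).eps
print(eps)  # 2.220446049250313e-16
```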
Summary#
In practice, we have to be careful when mixing terms of different magnitude: adding terms of very different magnitude, or subtracting terms of nearly equal magnitude.
We cannot count on the associative property:
print(-1+(1+1e-20))
print((-1+1)+1e-20)
0.0
1e-20
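When summing terms of mixed magnitude, one remedy is `math.fsum`, which tracks the lost low-order bits and returns a correctly rounded sum regardless of ordering:

```python
import math

terms = [1e16, 1.0, -1e16]
print(sum(terms))        # 0.0  (the 1.0 is absorbed by the huge first term)
print(math.fsum(terms))  # 1.0  (correctly rounded, order-independent)
```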
Beware of subtractive cancellation!
# Define two nearly equal numbers
a = np.float32(1.23456789)
b = np.float32(1.23456780)
# Perform subtraction
result = a - b
# Print the results with higher precision
print("a =", format(a, '.20f'))
print("b =", format(b, '.20f'))
print("a - b =", result)
a = 1.23456788063049316406
b = 1.23456776142120361328
a - b = 1.1920929e-07
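To quantify the damage, we can compare the float32 result with the same subtraction done in double precision (a rough check: the double-precision reference is itself an approximation, but an accurate enough one for this comparison):

```python
import numpy as np

a = np.float32(1.23456789)
b = np.float32(1.23456780)

true_diff = 1.23456789 - 1.23456780   # double precision reference, ~9e-8
computed  = float(a - b)              # float32 result from above

rel_err = abs(computed - true_diff) / true_diff
print(rel_err)  # roughly 0.32, i.e. about 32% relative error!
```

Only a few significant digits survive the subtraction; the rest is roundoff noise.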