A brief comparison of integer and double representation. The following slides describe signed data types (types that can hold both positive and negative values). An integer primitive can also be made unsigned with the unsigned keyword, which gives up the negative values and roughly doubles the positive range (one extra bit of magnitude). amwallis
integer: 32 bits. Range: −2,147,483,648 to 2,147,483,647, i.e. from −2^31 to 2^31 − 1. Stored using two’s complement (a special mathematical operation performed on binary numbers).
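As a quick check of these limits, here is a minimal Python sketch that computes the signed and unsigned 32-bit ranges from the bit width alone. The variable names are illustrative only, and Python integers are unbounded, so the code just evaluates the formulas rather than relying on a real 32-bit type.

```python
BITS = 32

# Signed (two's complement): half of the bit patterns are negative, and zero
# takes one of the non-negative patterns, so the positive side stops one short.
signed_min = -(2 ** (BITS - 1))
signed_max = 2 ** (BITS - 1) - 1

# Unsigned: all 32 bits carry magnitude, so the range starts at 0.
unsigned_min = 0
unsigned_max = 2 ** BITS - 1

print(signed_min, signed_max)      # -2147483648 2147483647
print(unsigned_min, unsigned_max)  # 0 4294967295
```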
what is two’s complement? Two’s complement is a method of storing binary numbers that allows negative and positive numbers to be added without any special logic. If the number is positive, just use its binary representation: nothing needs to be changed. If the number is negative, take the binary representation of its absolute value, find the complement (invert the 0s and 1s), then add 1 to the complement.
for example: the decimal number 42. It is represented in binary as 42 = 101010(2) = 2^5 + 2^3 + 2^1 = 32 + 8 + 2. As a 32-bit integer: 00000000 00000000 00000000 00101010. Since it is positive, nothing needs to be changed: it’s already in its two’s complement form!
Now let’s look at -42. To start, we take the number 42 in binary:
42 = 00000000 00000000 00000000 00101010
Then we take its complement (flip the bits):
11111111 11111111 11111111 11010101
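Now let’s look at -42 (cont.). Then we add 1 to the complement:
  11111111 11111111 11111111 11010101
+                                   1
= 11111111 11111111 11111111 11010110
So -42 in two’s complement form = 11111111 11111111 11111111 11010110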
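The invert-then-add-1 recipe can be checked with a short Python sketch. Python integers are not 32 bits wide, so this assumes a mask (`MASK32`) to simulate a 32-bit register. The helper names `to_bits32` and `negate_by_hand` are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits

def to_bits32(n):
    """Return the 32-bit two's complement pattern of n as a string."""
    return format(n & MASK32, "032b")

def negate_by_hand(n):
    """Negate a non-negative n by following the two steps from the slide."""
    inverted = ~n & MASK32          # step 1: flip every bit
    return (inverted + 1) & MASK32  # step 2: add 1

print(to_bits32(42))                       # 42 as a 32-bit pattern
print(format(negate_by_hand(42), "032b"))  # pattern for -42 built by hand
print(to_bits32(-42))                      # same pattern, via masking
```

The last two lines print the same pattern, which suggests that masking a negative Python integer yields the same result as the manual procedure.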
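this makes binary addition very easy. For example: 2 + (-1)
     2    00000000 00000000 00000000 00000010
+ (-1)    11111111 11111111 11111111 11111111
=    1  1 00000000 00000000 00000000 00000001
Discard the carry (the extra 33rd bit on the left), and the remaining 32 bits are the answer: 1.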
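The same addition can be sketched in Python, again assuming a 32-bit mask to stand in for a fixed-width register. `add32` is a hypothetical helper that returns both the 32-bit result and the carry that gets thrown away.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits

def add32(a_bits, b_bits):
    """Add two 32-bit patterns; return (32-bit result, discarded carry)."""
    total = a_bits + b_bits
    return total & MASK32, total >> 32

two = 2 & MASK32         # 00000000 ... 00000010
minus_one = -1 & MASK32  # 11111111 ... 11111111

result, carry = add32(two, minus_one)
print(format(result, "032b"), "carry:", carry)
```

No branch on the sign is needed: the adder treats both operands as plain bit patterns, which is the point of the representation.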
two’s complement cont. Using two’s complement, the first bit always indicates the sign: 1 for negative numbers, 0 for zero and positive numbers. That leaves 31 bits to represent the value of the integer.
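One way to read a stored pattern back is to treat the sign bit as carrying a weight of −2^31 and the other 31 bits as an ordinary binary number. The sketch below assumes that interpretation; `from_bits32` is an illustrative name, not a standard function.

```python
def from_bits32(pattern):
    """Interpret a 32-bit pattern (an int in 0..2**32 - 1) as a signed value."""
    sign_bit = pattern >> 31       # the first (leftmost) bit
    low31 = pattern & 0x7FFFFFFF   # the remaining 31 bits
    return low31 - (sign_bit << 31)

print(from_bits32(0b00000000000000000000000000101010))  # sign bit 0
print(from_bits32(0b11111111111111111111111111010110))  # sign bit 1
```

Note that for negative numbers the 31 remaining bits are not simply the magnitude: for -42 they hold 2^31 − 42, and the sign bit's negative weight brings the total back to -42.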
double: 64 bits. Range: approximately −1.8 × 10^308 to 1.8 × 10^308. But not all numbers in this range can be represented! Represented by: 1 sign bit, 11 exponent bits, 52 significand bits.
double bits visualized. For example, 5.5(10) is 101.1(2) in binary. This converts to binary scientific notation as 1.011(2) × 2^2. *note: (10) indicates decimal and (2) binary
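To see these fields for 5.5, the sketch below splits a double into sign, exponent, and significand bits using Python's `struct` module. It assumes Python floats are IEEE 754 doubles (true on common platforms) and that the exponent is stored with a bias of 1023, a detail of the standard that the slide does not state. `double_fields` is a hypothetical helper name.

```python
import struct

def double_fields(x):
    """Split a float (assumed IEEE 754 double) into its three bit fields."""
    # Pack as 8 big-endian bytes, then reread those bytes as one 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits after the implicit leading 1
    return sign, exponent, fraction

sign, exponent, fraction = double_fields(5.5)
print(sign)                      # 0
print(exponent - 1023)           # 2
print(format(fraction, "052b"))  # 011 followed by 49 zeros
```

The leading 1 of 1.011(2) is not stored; only the 011 after the binary point appears in the significand field.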
double cont. Floating-point representation can't represent all of the numbers in its range, and no fixed-size format could: 64 bits can represent only 2^64 distinct values, and there are infinitely many real numbers in the range.
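Two small checks make the gaps visible. This sketch assumes Python floats are IEEE 754 doubles. The first comparison uses the fact that a 53-bit significand cannot hold every integer above 2^53, and the second shows that 0.1, 0.2 and 0.3 are each stored as nearby approximations.

```python
big = 2 ** 53

# Two different integers map to the same double once 53 bits are not enough.
same_double = float(big) == float(big + 1)

# Decimal fractions such as 0.1 have no exact binary representation.
sum_is_exact = (0.1 + 0.2 == 0.3)

print(same_double)   # True
print(sum_is_exact)  # False
```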
double (8 bit example). To simplify things, let's use only 8 bits to represent our floating-point number (instead of the 64 in a double). We use the first bit for the sign (1 for negative, 0 for positive), the next four bits for the exponent plus 7 (adding 7 lets us store negative exponents), and the last three bits for the mantissa's fractional part.
double (8 bit example). To determine the largest positive number we can represent: we want the sign bit to be 0, we place 1 in all the exponent bits to get the largest possible exponent, and we put 1 in all the mantissa bits. This gives us 0 1111 111, which in our 8-bit scheme is 1.111(2) × 2^(15 − 7) = 1.111(2) × 2^8 = 111100000(2) = 480(10).
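The decoding rule for this 8-bit scheme can be sketched as a small function. The sketch assumes every pattern has an implicit leading 1 and that no exponent values are reserved, as the slides describe. `decode8`, `BIAS` and `FRACTION_BITS` are names invented for this illustration.

```python
BIAS = 7           # added to the real exponent before storing it
FRACTION_BITS = 3  # bits kept after the binary point

def decode8(pattern):
    """Decode 'S EEEE FFF' (spaces optional) in the toy 8-bit format."""
    bits = pattern.replace(" ", "")
    if len(bits) != 8:
        raise ValueError("expected exactly 8 bits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:5], 2) - BIAS
    mantissa = 1 + int(bits[5:], 2) / 2 ** FRACTION_BITS  # implicit leading 1
    return sign * mantissa * 2 ** exponent

print(decode8("0 1111 111"))  # largest positive value: 480.0
```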
double (8 bit example). Let's consider how to represent 51(10) in this scheme. In binary, this is 110011(2) = 1.10011(2) × 2^5. When we try to fit the mantissa into the 3-bit portion of our scheme, we find that the last two bits won't fit. We are forced to round to 1.101(2) × 2^5, resulting in a bit pattern of 0 1100 101.
double (8 bit example). That rounding means that we're not representing the number precisely. In fact, 0 1100 101 translates to 1.101(2) × 2^(12 − 7) = 1.101(2) × 2^5 = 110100(2) = 52(10). Thus, in our 8-bit floating-point representation, 51 equals 52!
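The round trip from 51 to 52 can be reproduced with a minimal encoder and decoder for this 8-bit scheme. This is a sketch under stated assumptions: it handles only positive values with an implicit leading 1, rounds the mantissa with Python's built-in `round`, and uses made-up helper names (`encode8`, `decode8`).

```python
import math

BIAS = 7           # added to the real exponent before storing it
FRACTION_BITS = 3  # bits kept after the binary point

def encode8(x):
    """Encode a positive number as 'S EEEE FFF' in the toy 8-bit format."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    m, e = math.frexp(x)               # x == m * 2**e with 0.5 <= m < 1
    mantissa, exponent = 2 * m, e - 1  # renormalise so 1 <= mantissa < 2
    fraction = round((mantissa - 1) * 2 ** FRACTION_BITS)
    if fraction == 2 ** FRACTION_BITS:  # rounding carried into the next power of 2
        fraction, exponent = 0, exponent + 1
    biased = exponent + BIAS
    if not 0 <= biased <= 15:
        raise ValueError("exponent does not fit in 4 bits")
    return "0 " + format(biased, "04b") + " " + format(fraction, "03b")

def decode8(pattern):
    """Decode 'S EEEE FFF' (spaces optional) back to a number."""
    bits = pattern.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:5], 2) - BIAS
    mantissa = 1 + int(bits[5:], 2) / 2 ** FRACTION_BITS
    return sign * mantissa * 2 ** exponent

pattern = encode8(51)
print(pattern)           # 0 1100 101
print(decode8(pattern))  # 52.0
```

Encoding 52 gives the same pattern, so the two inputs are indistinguishable once stored.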