Integer and Floating-Point Representation
Integer and Floating-Point Representation¶
Overview¶
Computer representation of numbers is divided into integers and real numbers. In this lesson, we'll learn about various representations of signed integers (sign-magnitude, one's complement, two's complement) and the IEEE 754 floating-point standard. We'll also cover integer overflow and floating-point precision issues.
Difficulty: ββ (Intermediate)
Table of Contents¶
- Integer Representation Overview
- Sign-Magnitude Representation
- One's Complement Representation
- Two's Complement Representation
- Integer Overflow
- Fixed-Point Representation
- IEEE 754 Floating-Point
- Floating-Point Precision Issues
- Practice Problems
1. Integer Representation Overview¶
Unsigned Integer¶
n-bit unsigned integer:
- Range: 0 ~ 2βΏ - 1
- All bits used to represent magnitude
8-bit example:
βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ
β 2β· β 2βΆ β 2β΅ β 2β΄ β 2Β³ β 2Β² β 2ΒΉ β 2β° β
β 128 β 64 β 32 β 16 β 8 β 4 β 2 β 1 β
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
00000000 = 0
11111111 = 255
Range: 0 ~ 255 (256 values)
Signed Integer Representation Methods¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Signed Integer Representation Comparison β
βββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββ€
β Method β Characteristics β
βββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€
β Sign- β MSB is sign, rest is magnitude β
β Magnitude β Both +0 and -0 exist, complex arithmetic β
βββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€
β One's β Negative = bitwise invert of positive β
β Complement β Both +0 and -0 exist β
βββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€
β Two's β Negative = one's complement + 1 β
β Complement β Unique zero, efficient arithmetic (modern β
β β standard) β
βββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββ
2. Sign-Magnitude Representation¶
Structure¶
Sign-Magnitude Representation:
In n bits:
βββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββ
β Sign β Magnitude β
β (1 bit) β (n-1 bits) β
βββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββ
β
0 = positive
1 = negative
8-bit example:
+45 = 0 0101101
β βββββββ
sign magnitude(45)
-45 = 1 0101101
β βββββββ
sign magnitude(45)
Advantages and Disadvantages¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Sign-Magnitude Representation Characteristics β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Advantages: β
β - Intuitive and easy to understand β
β - Simple sign checking (just check MSB) β
β - Easy absolute value calculation β
β β
β Disadvantages: β
β - Two zeros: +0 (00000000) and -0 (10000000) β
β - Complex addition/subtraction (sign comparison required) β
β - Complex hardware implementation β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β n-bit range: -(2^(n-1) - 1) ~ +(2^(n-1) - 1) β
β 8-bit range: -127 ~ +127 (255 values, two zeros) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
8-Bit Sign-Magnitude Examples¶
ββββββββββββ¬βββββββββββββ¬βββββββββββ
β Decimal β Binary β Note β
ββββββββββββΌβββββββββββββΌβββββββββββ€
β +127 β 01111111 β Maximum β
β +45 β 00101101 β β
β +1 β 00000001 β β
β +0 β 00000000 β Positive β
β -0 β 10000000 β Negative β
β -1 β 10000001 β β
β -45 β 10101101 β β
β -127 β 11111111 β Minimum β
ββββββββββββ΄βββββββββββββ΄βββββββββββ
3. One's Complement Representation¶
Structure¶
One's Complement Representation:
Positive: represented as is
Negative: all bits inverted
8-bit example:
+45 = 00101101
-45 = 11010010 (all bits inverted)
ββββββββ
00101101 β 11010010
Bit Inversion Process¶
Positive β Negative conversion:
+45 = 0 0 1 0 1 1 0 1
β β β β β β β β (invert each bit)
-45 = 1 1 0 1 0 0 1 0
Verification:
00101101 = 45
+ 11010010 = 210
ββββββββββββ
11111111 = 255 = 2βΈ - 1 β
One's complement property: N + (-N) = 2βΏ - 1
One's Complement Addition¶
One's complement addition feature: End-around carry
Example: 5 + (-3) in 4 bits
5 = 0101
-3 = 1100 (one's complement of 3)
0 1 0 1 (+5)
+ 1 1 0 0 (-3)
βββββββββ
1 0 0 0 1
β
End-around carry (add it back)
0 0 0 1
+ 1
βββββββββ
0 0 1 0 = +2 β
Zero Representation Problem¶
Zero representation in one's complement:
+0 = 00000000
-0 = 11111111 (one's complement of 0)
Both values mean zero but have different bit patterns
β Requires additional logic for comparisons and branching
4. Two's Complement Representation¶
Structure¶
Two's Complement Representation:
Positive: represented as is
Negative: one's complement + 1
8-bit example:
+45 = 00101101
-45 calculation:
1) One's complement: 11010010
2) +1: 11010011 β two's complement of -45
Two's Complement Calculation Methods¶
Method 1: One's complement + 1
45 = 00101101
ββββββββ invert
11010010
+ 1
ββββββββ
-45 = 11010011
Method 2: Keep from right up to first 1, invert rest
45 = 0 0 1 0 1 1 0 1
βββββ keep
1 1 0 1 0 1 0 1
βββββββ invert
-45 = 1 1 0 1 0 0 1 1
Method 3: 2βΏ - N
-45 = 256 - 45 = 211 = 11010011
Advantages of Two's Complement¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Advantages of Two's Complement β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Unique representation of zero β
β Only 00000000 represents 0 β
β -0 = two's complement(00000000) = 00000000 (itself) β
β β
β 2. Addition/subtraction use same operation β
β A - B = A + (-B) = A + (two's complement of B) β
β β
β 3. No end-around carry needed β
β Simply discard MSB carry β
β β
β 4. Simple hardware implementation β
β Single adder handles both addition and subtraction β
β β
β 5. Asymmetric range (one more negative) β
β n bits: -2^(n-1) ~ +2^(n-1) - 1 β
β 8 bits: -128 ~ +127 β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Two's Complement Addition/Subtraction¶
Example 1: Positive + Positive (8-bit)
45 + 30 = 75
00101101 (+45)
+ 00011110 (+30)
ββββββββββ
01001011 (+75) β
Example 2: Positive + Negative
45 + (-30) = 15
Two's complement of -30: 11100010
00101101 (+45)
+ 11100010 (-30)
ββββββββββ
1 00001111
β
Discard carry
Result: 00001111 = +15 β
Example 3: Negative + Negative
-45 + (-30) = -75
-45 = 11010011
-30 = 11100010
11010011 (-45)
+ 11100010 (-30)
ββββββββββ
1 10110101
β
Discard carry
Result: 10110101
Convert to positive: 01001011 = 75
Therefore: -75 β
8-Bit Two's Complement Table¶
ββββββββββββ¬βββββββββββββ¬βββββββββββββββββββββββββββββββββββββ
β Decimal β Binary β Description β
ββββββββββββΌβββββββββββββΌβββββββββββββββββββββββββββββββββββββ€
β +127 β 01111111 β Maximum positive β
β +126 β 01111110 β β
β ... β ... β β
β +2 β 00000010 β β
β +1 β 00000001 β β
β 0 β 00000000 β Unique zero β
β -1 β 11111111 β All bits are 1 β
β -2 β 11111110 β β
β ... β ... β β
β -127 β 10000001 β β
β -128 β 10000000 β Minimum negative (asymmetric) β
ββββββββββββ΄βββββββββββββ΄βββββββββββββββββββββββββββββββββββββ
Two's complement number line:
-128 -127 -1 0 1 126 127
β β β β β β β
βββββββ΄ββββ ββββββ΄ββββ΄ββββ΄ββββββββββββ΄βββββ
10000000 11111111 00000000 00000001 01111111
Comparison of Three Methods¶
4-bit representation comparison:
ββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ
βDecimal βSign-Mag. β1's Comp. β2's Comp. β
ββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββ€
β +7 β 0111 β 0111 β 0111 β
β +6 β 0110 β 0110 β 0110 β
β +5 β 0101 β 0101 β 0101 β
β +4 β 0100 β 0100 β 0100 β
β +3 β 0011 β 0011 β 0011 β
β +2 β 0010 β 0010 β 0010 β
β +1 β 0001 β 0001 β 0001 β
β +0 β 0000 β 0000 β 0000 β
β -0 β 1000 β 1111 β N/A β
β -1 β 1001 β 1110 β 1111 β
β -2 β 1010 β 1101 β 1110 β
β -3 β 1011 β 1100 β 1101 β
β -4 β 1100 β 1011 β 1100 β
β -5 β 1101 β 1010 β 1011 β
β -6 β 1110 β 1001 β 1010 β
β -7 β 1111 β 1000 β 1001 β
β -8 β N/A β N/A β 1000 β
ββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ
Ranges:
Sign-magnitude: -7 ~ +7 (15 values, two zeros)
One's complement: -7 ~ +7 (15 values, two zeros)
Two's complement: -8 ~ +7 (16 values, one zero)
5. Integer Overflow¶
What is Overflow?¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Overflow β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β When operation result exceeds representable range β
β β
β 8-bit two's complement example: β
β - Range: -128 ~ +127 β
β - 127 + 1 = ? (exceeds range!) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Positive overflow:
01111111 (+127)
+ 00000001 (+1)
ββββββββββ
10000000 (-128) β Unexpected negative!
Negative overflow (underflow):
10000000 (-128)
- 00000001 (+1)
ββββββββββ
01111111 (+127) β Unexpected positive!
Overflow Detection¶
Overflow detection conditions for signed integers:
1. Positive + Positive = Negative β Overflow
2. Negative + Negative = Positive β Overflow
3. Positive + Negative β No overflow possible
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Overflow Detection (V flag) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β V = C_in(MSB) XOR C_out(MSB) β
β β
β C_in: Carry input to MSB β
β C_out: Carry output from MSB β
β β
β V = 1 β Overflow occurred β
β V = 0 β No overflow β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Example: 127 + 1 (8-bit)
01111111
+ 00000001
ββββββββββ
C_in β 1 (carry from bit 6 to bit 7)
C_out β 0 (no carry out from bit 7)
V = 1 XOR 0 = 1 β Overflow!
Overflow in Programming¶
// C language example
#include <stdio.h>
#include <limits.h>
int main() {
// Signed integer overflow (undefined behavior!)
int max = INT_MAX; // 2147483647
printf("%d + 1 = %d\n", max, max + 1); // Unpredictable
// Unsigned integer overflow (defined behavior, wraps around)
unsigned int umax = UINT_MAX; // 4294967295
printf("%u + 1 = %u\n", umax, umax + 1); // 0
return 0;
}
# Python example - arbitrary precision integers
a = 2**63 - 1 # 9223372036854775807
b = a + 1 # 9223372036854775808 (no overflow!)
# Python automatically handles large integers
huge = 10**100
print(huge) # Prints normally
Overflow Prevention Strategies¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Overflow Prevention Strategies β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Check before operation β
β if (a > 0 && b > INT_MAX - a) β Overflow will occur β
β β
β 2. Use larger data types β
β long instead of int, long long instead of long β
β β
β 3. Use arbitrary precision libraries β
β Python, Java BigInteger, GMP, etc. β
β β
β 4. Use compiler options β
β -ftrapv (GCC): Halt program on overflow β
β β
β 5. Safe integer arithmetic libraries β
β SafeInt (C++), checked_add (Rust) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
6. Fixed-Point Representation¶
Fixed-Point Concept¶
Fixed-Point:
Representation with fixed decimal point position
Q notation: Qm.n
- m: Number of integer bits
- n: Number of fractional bits
- Total bits = m + n (excluding sign) or 1 + m + n (with sign)
8-bit Q3.4 example (unsigned):
βββββββββββββββββββββββββββββββββββββββββ
β Integer (3 bits) β Fraction (4 bits)β
βββββββββββββββββββββββββββββββββββββββββ
2Β² 2ΒΉ 2β° . 2β»ΒΉ 2β»Β² 2β»Β³ 2β»β΄
4 2 1 . 0.5 0.25 0.125 0.0625
Example: 01011010 in Q3.4
= 0Γ4 + 1Γ2 + 0Γ1 + 1Γ0.5 + 1Γ0.25 + 0Γ0.125 + 1Γ0.0625 + 0Γ0.03125
= 2 + 0.5 + 0.25 + 0.0625
= 2.8125
Fixed-Point Operations¶
Fixed-point addition/subtraction:
- Same Q format allows integer-like operations
Fixed-point multiplication:
- Result fractional bits = sum of operand fractional bits
- Requires increased result bits to prevent overflow
Q4.4 Γ Q4.4 = Q8.8 (8-bit Γ 8-bit = 16-bit)
Example:
2.5 Γ 1.5 (Q4.4)
2.5 = 00101000 (0010.1000)
1.5 = 00011000 (0001.1000)
Multiplication result: 00101000 Γ 00011000 = 0000001011100000 (Q8.8)
= 3.75 (0000001111.00000000)
7. IEEE 754 Floating-Point¶
Floating-Point Concept¶
Floating-Point:
Representation with flexible decimal point position
Similar to scientific notation:
6.022 Γ 10Β²Β³ (Avogadro's number)
β β β
mantissa base exponent
Binary floating-point:
1.01011 Γ 2β΅
β β
mantissa exponent
IEEE 754 Format¶
IEEE 754 structure:
βββββββββββ¬βββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββ
β Sign β Exponent β Mantissa β
β (Sign) β (Exponent) β (Mantissa) β
β 1 bit β E bits β M bits β
βββββββββββ΄βββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββ
Value = (-1)^S Γ 1.M Γ 2^(E - bias)
βββββββββββββββββ¬ββββββββββββ¬ββββββββββββ¬ββββββββββββ¬βββββββββββββ
β Format β Total Bitsβ Exponent(E)β Mantissa(M)β Bias β
βββββββββββββββββΌββββββββββββΌββββββββββββΌββββββββββββΌβββββββββββββ€
β Single β 32 β 8 β 23 β 127 β
β Precision β β β β β
βββββββββββββββββΌββββββββββββΌββββββββββββΌββββββββββββΌβββββββββββββ€
β Double β 64 β 11 β 52 β 1023 β
β Precision β β β β β
βββββββββββββββββΌββββββββββββΌββββββββββββΌββββββββββββΌβββββββββββββ€
β Half β 16 β 5 β 10 β 15 β
β Precision β β β β β
βββββββββββββββββ΄ββββββββββββ΄ββββββββββββ΄ββββββββββββ΄βββββββββββββ
Single Precision (32-bit) Details¶
32-bit single precision (float):
31 30 23 22 0
βββββ¬βββββββββ¬ββββββββββββββββββββββββ
β S β E β M β
β 1 β 8 β 23 β
βββββ΄βββββββββ΄ββββββββββββββββββββββββ
Value = (-1)^S Γ 1.M Γ 2^(E - 127)
Example: Convert -6.625 to single precision
Step 1: Binary conversion
6 = 110β
0.625 = 0.101β
6.625 = 110.101β
Step 2: Normalize
110.101 = 1.10101 Γ 2Β²
Step 3: Determine each field
S = 1 (negative)
E = 2 + 127 = 129 = 10000001β
M = 10101000000000000000000 (after decimal point, padded to 23 bits)
Result: 1 10000001 10101000000000000000000
= 0xC0D40000
Double Precision (64-bit) Details¶
64-bit double precision (double):
63 62 52 51 0
βββββ¬ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββ
β S β E β M β
β 1 β 11 β 52 β
βββββ΄ββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββ
Value = (-1)^S Γ 1.M Γ 2^(E - 1023)
Range comparison:
βββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββ
β Format β Range β
βββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€
β Single β Β±1.18Γ10β»Β³βΈ ~ Β±3.4Γ10Β³βΈ β
β (float) β ~7 significant digits β
βββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€
β Double β Β±2.23Γ10β»Β³β°βΈ ~ Β±1.8Γ10Β³β°βΈ β
β (double) β ~15-16 significant digits β
βββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββ
Special Values¶
IEEE 754 special values:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Special Value Representation β
βββββββββββ¬βββββββββββββββ¬ββββββββββββββ¬βββββββββββββββββββββββββ€
β Value β Exponent(E) β Mantissa(M) β Meaning β
βββββββββββΌβββββββββββββββΌββββββββββββββΌβββββββββββββββββββββββββ€
β +0 β 00000000 β 000...0 β Positive zero β
β -0 β 00000000 β 000...0 β Negative zero (S=1) β
βββββββββββΌβββββββββββββββΌββββββββββββββΌβββββββββββββββββββββββββ€
β +β β 11111111 β 000...0 β Positive infinity β
β -β β 11111111 β 000...0 β Negative infinity(S=1)β
βββββββββββΌβββββββββββββββΌββββββββββββββΌβββββββββββββββββββββββββ€
β NaN β 11111111 β β 0 β Not a Number β
β β β β (0/0, β-β, etc.) β
βββββββββββΌβββββββββββββββΌββββββββββββββΌβββββββββββββββββββββββββ€
β Denorm β 00000000 β β 0 β Very small numbers β
β β β β 0.M Γ 2^(-126) β
βββββββββββ΄βββββββββββββββ΄ββββββββββββββ΄βββββββββββββββββββββββββ
NaN properties:
- NaN β NaN (not even equal to itself)
- All comparisons with NaN are false
- All operations with NaN result in NaN
Floating-Point Operations¶
Addition process:
1. Align exponents (adjust smaller to larger)
2. Add mantissas
3. Normalize
4. Round
Example: 1.5 + 0.25 (single precision)
1.5 = 1.1 Γ 2β°
0.25 = 1.0 Γ 2β»Β²
Align exponents:
0.25 β 0.01 Γ 2β°
Addition:
1.10
+ 0.01
ββββββ
1.11 Γ 2β° = 1.75 β
Multiplication process:
1. Determine sign (XOR)
2. Add exponents (adjust bias)
3. Multiply mantissas
4. Normalize
5. Round
Example: 1.5 Γ 2.0
1.5 = 1.1 Γ 2β°
2.0 = 1.0 Γ 2ΒΉ
Sign: positive Γ positive = positive
Exponent: 0 + 1 = 1
Mantissa: 1.1 Γ 1.0 = 1.1
Result: 1.1 Γ 2ΒΉ = 3.0 β
8. Floating-Point Precision Issues¶
Representation Errors¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Representation Error Problem β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Many decimal fractions cannot be exactly represented β
β in binary β
β β
β Example: 0.1 (decimal) β
β = 0.0001100110011001100110011... (binary, infinite repeat)β
β β
β Truncated when stored, causing error β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Exactly representable:
- Numbers expressible as sums of powers of 2
- Examples: 0.5 (2β»ΒΉ), 0.25 (2β»Β²), 0.75 (2β»ΒΉ + 2β»Β²)
Not exactly representable:
- Most decimal fractions
- Examples: 0.1, 0.2, 0.3, etc.
Code Examples¶
# Python example
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False
>>> format(0.1, '.20f')
'0.10000000000000000555'
>>> format(0.2, '.20f')
'0.20000000000000001110'
>>> format(0.1 + 0.2, '.20f')
'0.30000000000000004441'
// C example
#include <stdio.h>
int main() {
float a = 0.1f;
float sum = 0.0f;
for (int i = 0; i < 10; i++) {
sum += a;
}
printf("0.1 * 10 = %.10f\n", sum);
// Output: 0.1 * 10 = 1.0000001192
// Expected: 1.0000000000
return 0;
}
Precision Loss Situations¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Precision Loss Situations β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Adding large and small numbers β
β 1000000.0 + 0.0000001 β 1000000.0 β
β (small number outside significant digit range) β
β β
β 2. Subtracting similar-sized numbers β
β (Catastrophic Cancellation) β
β 1.0000001 - 1.0000000 = 0.0000001 β
β Significant digits drastically reduced β
β β
β 3. Repeated operations β
β Small errors accumulate to large errors β
β β
β 4. Infinite loop risk β
β for (float x = 0.0f; x != 1.0f; x += 0.1f) β
β β May never terminate! β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Solutions¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Solutions to Precision Problems β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Use epsilon tolerance β
β |a - b| < epsilon instead of a == b β
β β
β 2. Use integer arithmetic β
β Money: $1.99 β store as 199 cents β
β β
β 3. Use Decimal type β
β Python: from decimal import Decimal β
β Java: BigDecimal β
β β
β 4. Optimize operation order β
β Add small numbers first, then larger numbers β
β β
β 5. Kahan Summation (compensated summation) β
β Track and compensate for accumulated error β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Epsilon Comparison Examples¶
# Python - epsilon comparison
import math
def float_equals(a, b, rel_tol=1e-9, abs_tol=1e-9):
"""Floating-point comparison"""
return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
# Or use math.isclose from Python 3.5+
print(math.isclose(0.1 + 0.2, 0.3)) # True
# Using Decimal
from decimal import Decimal, getcontext
getcontext().prec = 50 # Set precision
a = Decimal('0.1')
b = Decimal('0.2')
c = Decimal('0.3')
print(a + b == c) # True
// C - epsilon comparison
#include <math.h>
#include <float.h>
int float_equals(float a, float b) {
return fabs(a - b) < FLT_EPSILON * fmax(fabs(a), fabs(b));
}
int double_equals(double a, double b) {
return fabs(a - b) < DBL_EPSILON * fmax(fabs(a), fabs(b));
}
9. Practice Problems¶
Basic Problems¶
1. Represent the following numbers in 8-bit two's complement: - (a) +50 - (b) -50 - (c) -1 - (d) -128
2. Convert the following 8-bit two's complement values to decimal: - (a) 01100100 - (b) 11001110 - (c) 10000000 - (d) 11111111
3. Calculate the following operations in 8-bit two's complement: - (a) 45 + 30 - (b) 45 - 30 - (c) -45 - 30 - (d) -1 + 1
IEEE 754 Problems¶
4. Convert the following decimal numbers to IEEE 754 single precision (32-bit): - (a) 5.75 - (b) -0.375 - (c) 1.0
5. Convert the following IEEE 754 single precision bit patterns to decimal: - (a) 0 10000001 01000000000000000000000 - (b) 1 01111111 00000000000000000000000 - (c) 0 00000000 00000000000000000000000
Advanced Problems¶
6. Determine if overflow occurs in the following 8-bit two's complement operations: - (a) 01111111 + 00000001 - (b) 10000000 + 11111111 - (c) 01000000 + 01000000
7. Explain why 0.1 + 0.2 is not exactly 0.3.
8. Explain the Kahan Summation algorithm and its advantages compared to regular addition.
9. Explain why both +0 and -0 exist in IEEE 754 and their differences.
10. Identify the problem in the following code and suggest a solution:
float sum = 0.0f;
for (int i = 0; i < 1000000; i++) {
sum += 0.0001f;
}
printf("Expected: 100.0, Actual: %f\n", sum);
Answers
**1. 8-bit two's complement representation** - (a) +50 = 00110010 - (b) -50 = 11001110 (two's complement of 50) - (c) -1 = 11111111 - (d) -128 = 10000000 **2. Two's complement β Decimal** - (a) 01100100 = +100 (MSB is 0, so positive) - (b) 11001110 = -50 (MSB is 1, so negative; complement = 00110010 = 50) - (c) 10000000 = -128 - (d) 11111111 = -1 **3. Two's complement operations** - (a) 45 + 30 = 00101101 + 00011110 = 01001011 = +75 - (b) 45 - 30 = 00101101 + 11100010 = 00001111 = +15 - (c) -45 - 30 = 11010011 + 11100010 = 10110101 = -75 - (d) -1 + 1 = 11111111 + 00000001 = 00000000 = 0 **4. IEEE 754 single precision conversion** - (a) 5.75 = 101.11 = 1.0111 Γ 2Β² β 0 10000001 01110000000000000000000 - (b) -0.375 = -0.011 = -1.1 Γ 2β»Β² β 1 01111101 10000000000000000000000 - (c) 1.0 = 1.0 Γ 2β° β 0 01111111 00000000000000000000000 **5. IEEE 754 β Decimal** - (a) S=0, E=129, M=0.25 β +1.25 Γ 2Β² = +5.0 - (b) S=1, E=127, M=0 β -1.0 Γ 2β° = -1.0 - (c) S=0, E=0, M=0 β +0.0 **6. Overflow detection** - (a) 01111111 + 00000001 = 10000000 β Overflow! (positive+positive=negative) - (b) 10000000 + 11111111 = 01111111 β Overflow! (negative+negative=positive) - (c) 01000000 + 01000000 = 10000000 β Overflow! (64+64=-128) **7.** 0.1 and 0.2 cannot be exactly represented in binary floating-point. Both are infinite repeating fractions, so they're truncated when stored, and their errors add up. **8.** Kahan Summation tracks rounding error from each addition in a separate variable and compensates in the next addition. This minimizes error accumulation when adding many numbers. **9.** +0 and -0 have different bit patterns but compare as equal. -0 occurs when a very small negative number underflows, and preserves sign information: 1/+0 = +β, 1/-0 = -β. **10.** 0.0001f is not exactly representable in binary, causing errors that accumulate over 1 million iterations. Solutions: use double, Kahan Summation, or integer arithmetic followed by division.Next Steps¶
- 04_Logic_Gates.md - Basic logic gates and Boolean algebra
References¶
- IEEE 754-2019 Standard for Floating-Point Arithmetic
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg)
- Computer Organization and Design (Patterson & Hennessy)
- Float Toy - Interactive IEEE 754 Visualization
- Floating Point Guide