Integer and Floating-Point Representation

Integer and Floating-Point Representation

Overview

Computer representation of numbers is divided into integers and real numbers. In this lesson, we'll learn about various representations of signed integers (sign-magnitude, one's complement, two's complement) and the IEEE 754 floating-point standard. We'll also cover integer overflow and floating-point precision issues.

Difficulty: ⭐⭐ (Intermediate)


Table of Contents

  1. Integer Representation Overview
  2. Sign-Magnitude Representation
  3. One's Complement Representation
  4. Two's Complement Representation
  5. Integer Overflow
  6. Fixed-Point Representation
  7. IEEE 754 Floating-Point
  8. Floating-Point Precision Issues
  9. Practice Problems

1. Integer Representation Overview

Unsigned Integer

n-bit unsigned integer:
- Range: 0 ~ 2ⁿ - 1
- All bits used to represent magnitude

8-bit example:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚  2⁷ β”‚  2⁢ β”‚  2⁡ β”‚  2⁴ β”‚  2Β³ β”‚  2Β² β”‚  2ΒΉ β”‚  2⁰ β”‚
β”‚ 128 β”‚  64 β”‚  32 β”‚  16 β”‚  8  β”‚  4  β”‚  2  β”‚  1  β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

00000000 = 0
11111111 = 255
Range: 0 ~ 255 (256 values)

Signed Integer Representation Methods

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          Signed Integer Representation Comparison            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    Method     β”‚              Characteristics                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Sign-        β”‚  MSB is sign, rest is magnitude              β”‚
β”‚  Magnitude    β”‚  Both +0 and -0 exist, complex arithmetic    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  One's        β”‚  Negative = bitwise invert of positive       β”‚
β”‚  Complement   β”‚  Both +0 and -0 exist                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Two's        β”‚  Negative = one's complement + 1             β”‚
β”‚  Complement   β”‚  Unique zero, efficient arithmetic (modern   β”‚
β”‚               β”‚  standard)                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. Sign-Magnitude Representation

Structure

Sign-Magnitude Representation:

In n bits:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Sign   β”‚              Magnitude                    β”‚
β”‚ (1 bit) β”‚              (n-1 bits)                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
  0 = positive
  1 = negative

8-bit example:
+45 = 0 0101101
       ↑ β””β”€β”€β”€β”€β”€β”˜
     sign  magnitude(45)

-45 = 1 0101101
       ↑ β””β”€β”€β”€β”€β”€β”˜
     sign  magnitude(45)

Advantages and Disadvantages

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          Sign-Magnitude Representation Characteristics       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Advantages:                                                β”‚
β”‚  - Intuitive and easy to understand                         β”‚
β”‚  - Simple sign checking (just check MSB)                    β”‚
β”‚  - Easy absolute value calculation                          β”‚
β”‚                                                             β”‚
β”‚  Disadvantages:                                             β”‚
β”‚  - Two zeros: +0 (00000000) and -0 (10000000)              β”‚
β”‚  - Complex addition/subtraction (sign comparison required)  β”‚
β”‚  - Complex hardware implementation                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  n-bit range: -(2^(n-1) - 1) ~ +(2^(n-1) - 1)             β”‚
β”‚  8-bit range: -127 ~ +127 (255 values, two zeros)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

8-Bit Sign-Magnitude Examples

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Decimal  β”‚   Binary   β”‚  Note    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   +127   β”‚  01111111  β”‚ Maximum  β”‚
β”‚   +45    β”‚  00101101  β”‚          β”‚
β”‚   +1     β”‚  00000001  β”‚          β”‚
β”‚   +0     β”‚  00000000  β”‚ Positive β”‚
β”‚   -0     β”‚  10000000  β”‚ Negative β”‚
β”‚   -1     β”‚  10000001  β”‚          β”‚
β”‚   -45    β”‚  10101101  β”‚          β”‚
β”‚   -127   β”‚  11111111  β”‚ Minimum  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3. One's Complement Representation

Structure

One's Complement Representation:

Positive: represented as is
Negative: all bits inverted

8-bit example:
+45 = 00101101

-45 = 11010010  (all bits inverted)
      ↓↓↓↓↓↓↓↓
      00101101 β†’ 11010010

Bit Inversion Process

Positive β†’ Negative conversion:

  +45 = 0 0 1 0 1 1 0 1
        ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓  (invert each bit)
  -45 = 1 1 0 1 0 0 1 0

Verification:
  00101101 = 45
+ 11010010 = 210
────────────
  11111111 = 255 = 2⁸ - 1 βœ“

One's complement property: N + (-N) = 2ⁿ - 1

One's Complement Addition

One's complement addition feature: End-around carry

Example: 5 + (-3) in 4 bits

  5 = 0101
 -3 = 1100 (one's complement of 3)

    0 1 0 1  (+5)
  + 1 1 0 0  (-3)
  ─────────
  1 0 0 0 1
  ↑
  End-around carry (add it back)

    0 0 0 1
  +       1
  ─────────
    0 0 1 0  = +2 βœ“

Zero Representation Problem

Zero representation in one's complement:

+0 = 00000000
-0 = 11111111 (one's complement of 0)

Both values mean zero but have different bit patterns
β†’ Requires additional logic for comparisons and branching

4. Two's Complement Representation

Structure

Two's Complement Representation:

Positive: represented as is
Negative: one's complement + 1

8-bit example:
+45 = 00101101

-45 calculation:
  1) One's complement: 11010010
  2) +1:              11010011  ← two's complement of -45

Two's Complement Calculation Methods

Method 1: One's complement + 1

  45 = 00101101
       ↓↓↓↓↓↓↓↓  invert
       11010010
             + 1
       ────────
 -45 = 11010011

Method 2: Keep from right up to first 1, invert rest

  45 = 0 0 1 0 1 1 0 1
              β””β”€β”€β”€β”˜ keep
       1 1 0 1 0 1 0 1
       β””β”€β”€β”€β”€β”€β”˜ invert

 -45 = 1 1 0 1 0 0 1 1

Method 3: 2ⁿ - N

 -45 = 256 - 45 = 211 = 11010011

Advantages of Two's Complement

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Advantages of Two's Complement                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  1. Unique representation of zero                           β”‚
β”‚     Only 00000000 represents 0                              β”‚
β”‚     -0 = two's complement(00000000) = 00000000 (itself)     β”‚
β”‚                                                             β”‚
β”‚  2. Addition/subtraction use same operation                 β”‚
β”‚     A - B = A + (-B) = A + (two's complement of B)          β”‚
β”‚                                                             β”‚
β”‚  3. No end-around carry needed                              β”‚
β”‚     Simply discard MSB carry                                β”‚
β”‚                                                             β”‚
β”‚  4. Simple hardware implementation                          β”‚
β”‚     Single adder handles both addition and subtraction      β”‚
β”‚                                                             β”‚
β”‚  5. Asymmetric range (one more negative)                    β”‚
β”‚     n bits: -2^(n-1) ~ +2^(n-1) - 1                         β”‚
β”‚     8 bits: -128 ~ +127                                     β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two's Complement Addition/Subtraction

Example 1: Positive + Positive (8-bit)
  45 + 30 = 75

    00101101  (+45)
  + 00011110  (+30)
  ──────────
    01001011  (+75) βœ“


Example 2: Positive + Negative
  45 + (-30) = 15

  Two's complement of -30: 11100010

    00101101  (+45)
  + 11100010  (-30)
  ──────────
  1 00001111
  ↑
  Discard carry

  Result: 00001111 = +15 βœ“


Example 3: Negative + Negative
  -45 + (-30) = -75

  -45 = 11010011
  -30 = 11100010

    11010011  (-45)
  + 11100010  (-30)
  ──────────
  1 10110101
  ↑
  Discard carry

  Result: 10110101
  Convert to positive: 01001011 = 75
  Therefore: -75 βœ“

8-Bit Two's Complement Table

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Decimal  β”‚   Binary   β”‚           Description              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   +127   β”‚  01111111  β”‚  Maximum positive                  β”‚
β”‚   +126   β”‚  01111110  β”‚                                    β”‚
β”‚    ...   β”‚    ...     β”‚                                    β”‚
β”‚    +2    β”‚  00000010  β”‚                                    β”‚
β”‚    +1    β”‚  00000001  β”‚                                    β”‚
β”‚     0    β”‚  00000000  β”‚  Unique zero                       β”‚
β”‚    -1    β”‚  11111111  β”‚  All bits are 1                    β”‚
β”‚    -2    β”‚  11111110  β”‚                                    β”‚
β”‚    ...   β”‚    ...     β”‚                                    β”‚
β”‚   -127   β”‚  10000001  β”‚                                    β”‚
β”‚   -128   β”‚  10000000  β”‚  Minimum negative (asymmetric)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two's complement number line:

-128  -127        -1   0   1         126  127
  β”‚     β”‚          β”‚   β”‚   β”‚           β”‚    β”‚
  └─────┴──── β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
10000000     11111111 00000000 00000001  01111111

Comparison of Three Methods

4-bit representation comparison:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Decimal β”‚Sign-Mag.   β”‚1's Comp.   β”‚2's Comp.   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   +7   β”‚   0111     β”‚   0111     β”‚   0111     β”‚
β”‚   +6   β”‚   0110     β”‚   0110     β”‚   0110     β”‚
β”‚   +5   β”‚   0101     β”‚   0101     β”‚   0101     β”‚
β”‚   +4   β”‚   0100     β”‚   0100     β”‚   0100     β”‚
β”‚   +3   β”‚   0011     β”‚   0011     β”‚   0011     β”‚
β”‚   +2   β”‚   0010     β”‚   0010     β”‚   0010     β”‚
β”‚   +1   β”‚   0001     β”‚   0001     β”‚   0001     β”‚
β”‚   +0   β”‚   0000     β”‚   0000     β”‚   0000     β”‚
β”‚   -0   β”‚   1000     β”‚   1111     β”‚   N/A      β”‚
β”‚   -1   β”‚   1001     β”‚   1110     β”‚   1111     β”‚
β”‚   -2   β”‚   1010     β”‚   1101     β”‚   1110     β”‚
β”‚   -3   β”‚   1011     β”‚   1100     β”‚   1101     β”‚
β”‚   -4   β”‚   1100     β”‚   1011     β”‚   1100     β”‚
β”‚   -5   β”‚   1101     β”‚   1010     β”‚   1011     β”‚
β”‚   -6   β”‚   1110     β”‚   1001     β”‚   1010     β”‚
β”‚   -7   β”‚   1111     β”‚   1000     β”‚   1001     β”‚
β”‚   -8   β”‚   N/A      β”‚   N/A      β”‚   1000     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Ranges:
  Sign-magnitude: -7 ~ +7 (15 values, two zeros)
  One's complement: -7 ~ +7 (15 values, two zeros)
  Two's complement: -8 ~ +7 (16 values, one zero)

5. Integer Overflow

What is Overflow?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Overflow                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  When operation result exceeds representable range           β”‚
β”‚                                                             β”‚
β”‚  8-bit two's complement example:                            β”‚
β”‚  - Range: -128 ~ +127                                       β”‚
β”‚  - 127 + 1 = ? (exceeds range!)                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Positive overflow:
    01111111  (+127)
  + 00000001  (+1)
  ──────────
    10000000  (-128)  ← Unexpected negative!

Negative overflow (underflow):
    10000000  (-128)
  - 00000001  (+1)
  ──────────
    01111111  (+127)  ← Unexpected positive!

Overflow Detection

Overflow detection conditions for signed integers:

1. Positive + Positive = Negative β†’ Overflow
2. Negative + Negative = Positive β†’ Overflow
3. Positive + Negative β†’ No overflow possible

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Overflow Detection (V flag)                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  V = C_in(MSB) XOR C_out(MSB)                               β”‚
β”‚                                                             β”‚
β”‚  C_in:  Carry input to MSB                                  β”‚
β”‚  C_out: Carry output from MSB                               β”‚
β”‚                                                             β”‚
β”‚  V = 1 β†’ Overflow occurred                                  β”‚
β”‚  V = 0 β†’ No overflow                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example: 127 + 1 (8-bit)
      01111111
    + 00000001
    ──────────
C_in β†’  1 (carry from bit 6 to bit 7)
C_out β†’ 0 (no carry out from bit 7)

V = 1 XOR 0 = 1 β†’ Overflow!

Overflow in Programming

// C language example
#include <stdio.h>
#include <limits.h>

int main() {
    // Signed integer overflow (undefined behavior!)
    int max = INT_MAX;  // 2147483647
    printf("%d + 1 = %d\n", max, max + 1);  // Unpredictable

    // Unsigned integer overflow (defined behavior, wraps around)
    unsigned int umax = UINT_MAX;  // 4294967295
    printf("%u + 1 = %u\n", umax, umax + 1);  // 0

    return 0;
}
# Python example - arbitrary precision integers
a = 2**63 - 1  # 9223372036854775807
b = a + 1      # 9223372036854775808 (no overflow!)

# Python automatically handles large integers
huge = 10**100
print(huge)  # Prints normally

Overflow Prevention Strategies

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Overflow Prevention Strategies                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  1. Check before operation                                  β”‚
β”‚     if (a > 0 && b > INT_MAX - a) β†’ Overflow will occur    β”‚
β”‚                                                             β”‚
β”‚  2. Use larger data types                                   β”‚
β”‚     long instead of int, long long instead of long         β”‚
β”‚                                                             β”‚
β”‚  3. Use arbitrary precision libraries                       β”‚
β”‚     Python, Java BigInteger, GMP, etc.                     β”‚
β”‚                                                             β”‚
β”‚  4. Use compiler options                                    β”‚
β”‚     -ftrapv (GCC): Halt program on overflow                β”‚
β”‚                                                             β”‚
β”‚  5. Safe integer arithmetic libraries                       β”‚
β”‚     SafeInt (C++), checked_add (Rust)                      β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

6. Fixed-Point Representation

Fixed-Point Concept

Fixed-Point:
Representation with fixed decimal point position

Q notation: Qm.n
- m: Number of integer bits
- n: Number of fractional bits
- Total bits = m + n (excluding sign) or 1 + m + n (with sign)

8-bit Q3.4 example (unsigned):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Integer (3 bits)  β”‚ Fraction (4 bits)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    2²  2¹  2⁰    .   2⁻¹ 2⁻² 2⁻³ 2⁻⁴
    4   2   1     .   0.5 0.25 0.125 0.0625

Example: 01011010 in Q3.4
= 0Γ—4 + 1Γ—2 + 0Γ—1 + 1Γ—0.5 + 1Γ—0.25 + 0Γ—0.125 + 1Γ—0.0625 + 0Γ—0.03125
= 2 + 0.5 + 0.25 + 0.0625
= 2.8125

Fixed-Point Operations

Fixed-point addition/subtraction:
- Same Q format allows integer-like operations

Fixed-point multiplication:
- Result fractional bits = sum of operand fractional bits
- Requires increased result bits to prevent overflow

Q4.4 Γ— Q4.4 = Q8.8 (8-bit Γ— 8-bit = 16-bit)

Example:
  2.5 Γ— 1.5 (Q4.4)
  2.5 = 00101000 (0010.1000)
  1.5 = 00011000 (0001.1000)

  Multiplication result: 00101000 Γ— 00011000 = 0000001011100000 (Q8.8)
  = 3.75 (0000001111.00000000)

7. IEEE 754 Floating-Point

Floating-Point Concept

Floating-Point:
Representation with flexible decimal point position

Similar to scientific notation:
6.022 Γ— 10Β²Β³ (Avogadro's number)
  ↑     ↑ ↑
mantissa base exponent

Binary floating-point:
1.01011 Γ— 2⁡
  ↑        ↑
mantissa exponent

IEEE 754 Format

IEEE 754 structure:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Sign   β”‚     Exponent       β”‚         Mantissa               β”‚
β”‚ (Sign)  β”‚    (Exponent)      β”‚         (Mantissa)            β”‚
β”‚ 1 bit   β”‚     E bits         β”‚          M bits               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Value = (-1)^S Γ— 1.M Γ— 2^(E - bias)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Format     β”‚ Total Bitsβ”‚ Exponent(E)β”‚ Mantissa(M)β”‚   Bias    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Single       β”‚    32     β”‚     8     β”‚    23     β”‚    127     β”‚
β”‚  Precision    β”‚           β”‚           β”‚           β”‚            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Double       β”‚    64     β”‚    11     β”‚    52     β”‚    1023    β”‚
β”‚  Precision    β”‚           β”‚           β”‚           β”‚            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Half         β”‚    16     β”‚     5     β”‚    10     β”‚     15     β”‚
β”‚  Precision    β”‚           β”‚           β”‚           β”‚            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Single Precision (32-bit) Details

32-bit single precision (float):

  31 30    23 22                    0
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ S β”‚   E    β”‚          M           β”‚
β”‚ 1 β”‚   8    β”‚         23           β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Value = (-1)^S Γ— 1.M Γ— 2^(E - 127)

Example: Convert -6.625 to single precision

Step 1: Binary conversion
  6 = 110β‚‚
  0.625 = 0.101β‚‚
  6.625 = 110.101β‚‚

Step 2: Normalize
  110.101 = 1.10101 Γ— 2Β²

Step 3: Determine each field
  S = 1 (negative)
  E = 2 + 127 = 129 = 10000001β‚‚
  M = 10101000000000000000000 (after decimal point, padded to 23 bits)

Result: 1 10000001 10101000000000000000000
     = 0xC0D40000

Double Precision (64-bit) Details

64-bit double precision (double):

  63 62         52 51                                  0
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ S β”‚      E      β”‚                  M                   β”‚
β”‚ 1 β”‚     11      β”‚                 52                   β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Value = (-1)^S Γ— 1.M Γ— 2^(E - 1023)

Range comparison:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Format    β”‚                   Range                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Single     β”‚  Β±1.18Γ—10⁻³⁸ ~ Β±3.4Γ—10³⁸                      β”‚
β”‚  (float)    β”‚  ~7 significant digits                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Double     β”‚  Β±2.23Γ—10⁻³⁰⁸ ~ Β±1.8Γ—10³⁰⁸                    β”‚
β”‚  (double)   β”‚  ~15-16 significant digits                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Special Values

IEEE 754 special values:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Special Value Representation                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Value  β”‚ Exponent(E)  β”‚ Mantissa(M) β”‚       Meaning          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  +0     β”‚  00000000    β”‚  000...0    β”‚  Positive zero         β”‚
β”‚  -0     β”‚  00000000    β”‚  000...0    β”‚  Negative zero (S=1)   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  +∞     β”‚  11111111    β”‚  000...0    β”‚  Positive infinity     β”‚
β”‚  -∞     β”‚  11111111    β”‚  000...0    β”‚  Negative infinity(S=1)β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  NaN    β”‚  11111111    β”‚  β‰  0       β”‚  Not a Number          β”‚
β”‚         β”‚              β”‚             β”‚  (0/0, ∞-∞, etc.)     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Denorm  β”‚  00000000    β”‚  β‰  0       β”‚  Very small numbers    β”‚
β”‚         β”‚              β”‚             β”‚  0.M Γ— 2^(-126)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

NaN properties:
- NaN β‰  NaN (not even equal to itself)
- All comparisons with NaN are false
- All operations with NaN result in NaN

Floating-Point Operations

Addition process:
1. Align exponents (adjust smaller to larger)
2. Add mantissas
3. Normalize
4. Round

Example: 1.5 + 0.25 (single precision)

1.5  = 1.1 Γ— 2⁰
0.25 = 1.0 Γ— 2⁻²

Align exponents:
0.25 β†’ 0.01 Γ— 2⁰

Addition:
  1.10
+ 0.01
──────
  1.11 Γ— 2⁰ = 1.75 βœ“


Multiplication process:
1. Determine sign (XOR)
2. Add exponents (adjust bias)
3. Multiply mantissas
4. Normalize
5. Round

Example: 1.5 Γ— 2.0

1.5 = 1.1 Γ— 2⁰
2.0 = 1.0 Γ— 2ΒΉ

Sign: positive Γ— positive = positive
Exponent: 0 + 1 = 1
Mantissa: 1.1 Γ— 1.0 = 1.1

Result: 1.1 Γ— 2ΒΉ = 3.0 βœ“

8. Floating-Point Precision Issues

Representation Errors

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Representation Error Problem                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Many decimal fractions cannot be exactly represented        β”‚
β”‚  in binary                                                   β”‚
β”‚                                                             β”‚
β”‚  Example: 0.1 (decimal)                                     β”‚
β”‚  = 0.0001100110011001100110011... (binary, infinite repeat)β”‚
β”‚                                                             β”‚
β”‚  Truncated when stored, causing error                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Exactly representable:
- Numbers expressible as sums of powers of 2
- Examples: 0.5 (2⁻¹), 0.25 (2⁻²), 0.75 (2⁻¹ + 2⁻²)

Not exactly representable:
- Most decimal fractions
- Examples: 0.1, 0.2, 0.3, etc.

Code Examples

# Python example
>>> 0.1 + 0.2
0.30000000000000004

>>> 0.1 + 0.2 == 0.3
False

>>> format(0.1, '.20f')
'0.10000000000000000555'

>>> format(0.2, '.20f')
'0.20000000000000001110'

>>> format(0.1 + 0.2, '.20f')
'0.30000000000000004441'
// C example
#include <stdio.h>

int main() {
    float a = 0.1f;
    float sum = 0.0f;

    for (int i = 0; i < 10; i++) {
        sum += a;
    }

    printf("0.1 * 10 = %.10f\n", sum);
    // Output: 0.1 * 10 = 1.0000001192
    // Expected: 1.0000000000

    return 0;
}

Precision Loss Situations

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Precision Loss Situations                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  1. Adding large and small numbers                          β”‚
β”‚     1000000.0 + 0.0000001 β‰ˆ 1000000.0                      β”‚
β”‚     (small number outside significant digit range)          β”‚
β”‚                                                             β”‚
β”‚  2. Subtracting similar-sized numbers                       β”‚
β”‚     (Catastrophic Cancellation)                             β”‚
β”‚     1.0000001 - 1.0000000 = 0.0000001                      β”‚
β”‚     Significant digits drastically reduced                  β”‚
β”‚                                                             β”‚
β”‚  3. Repeated operations                                     β”‚
β”‚     Small errors accumulate to large errors                 β”‚
β”‚                                                             β”‚
β”‚  4. Infinite loop risk                                      β”‚
β”‚     for (float x = 0.0f; x != 1.0f; x += 0.1f)             β”‚
β”‚     β†’ May never terminate!                                  β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Solutions

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Solutions to Precision Problems                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  1. Use epsilon tolerance                                   β”‚
β”‚     |a - b| < epsilon  instead of  a == b                   β”‚
β”‚                                                             β”‚
β”‚  2. Use integer arithmetic                                  β”‚
β”‚     Money: $1.99 β†’ store as 199 cents                      β”‚
β”‚                                                             β”‚
β”‚  3. Use Decimal type                                        β”‚
β”‚     Python: from decimal import Decimal                    β”‚
β”‚     Java: BigDecimal                                        β”‚
β”‚                                                             β”‚
β”‚  4. Optimize operation order                                β”‚
β”‚     Add small numbers first, then larger numbers           β”‚
β”‚                                                             β”‚
β”‚  5. Kahan Summation (compensated summation)                 β”‚
β”‚     Track and compensate for accumulated error             β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Epsilon Comparison Examples

# Python - epsilon comparison
import math

def float_equals(a, b, rel_tol=1e-9, abs_tol=1e-9):
    """Floating-point comparison"""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

# Or use math.isclose from Python 3.5+
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Using Decimal
from decimal import Decimal, getcontext

getcontext().prec = 50  # Set precision

a = Decimal('0.1')
b = Decimal('0.2')
c = Decimal('0.3')

print(a + b == c)  # True
// C - epsilon comparison
#include <math.h>
#include <float.h>

int float_equals(float a, float b) {
    return fabs(a - b) < FLT_EPSILON * fmax(fabs(a), fabs(b));
}

int double_equals(double a, double b) {
    return fabs(a - b) < DBL_EPSILON * fmax(fabs(a), fabs(b));
}

9. Practice Problems

Basic Problems

1. Represent the following numbers in 8-bit two's complement: - (a) +50 - (b) -50 - (c) -1 - (d) -128

2. Convert the following 8-bit two's complement values to decimal: - (a) 01100100 - (b) 11001110 - (c) 10000000 - (d) 11111111

3. Calculate the following operations in 8-bit two's complement: - (a) 45 + 30 - (b) 45 - 30 - (c) -45 - 30 - (d) -1 + 1

IEEE 754 Problems

4. Convert the following decimal numbers to IEEE 754 single precision (32-bit): - (a) 5.75 - (b) -0.375 - (c) 1.0

5. Convert the following IEEE 754 single precision bit patterns to decimal: - (a) 0 10000001 01000000000000000000000 - (b) 1 01111111 00000000000000000000000 - (c) 0 00000000 00000000000000000000000

Advanced Problems

6. Determine if overflow occurs in the following 8-bit two's complement operations: - (a) 01111111 + 00000001 - (b) 10000000 + 11111111 - (c) 01000000 + 01000000

7. Explain why 0.1 + 0.2 is not exactly 0.3.

8. Explain the Kahan Summation algorithm and its advantages compared to regular addition.

9. Explain why both +0 and -0 exist in IEEE 754 and their differences.

10. Identify the problem in the following code and suggest a solution:

float sum = 0.0f;
for (int i = 0; i < 1000000; i++) {
    sum += 0.0001f;
}
printf("Expected: 100.0, Actual: %f\n", sum);

Answers **1. 8-bit two's complement representation** - (a) +50 = 00110010 - (b) -50 = 11001110 (two's complement of 50) - (c) -1 = 11111111 - (d) -128 = 10000000 **2. Two's complement β†’ Decimal** - (a) 01100100 = +100 (MSB is 0, so positive) - (b) 11001110 = -50 (MSB is 1, so negative; complement = 00110010 = 50) - (c) 10000000 = -128 - (d) 11111111 = -1 **3. Two's complement operations** - (a) 45 + 30 = 00101101 + 00011110 = 01001011 = +75 - (b) 45 - 30 = 00101101 + 11100010 = 00001111 = +15 - (c) -45 - 30 = 11010011 + 11100010 = 10110101 = -75 - (d) -1 + 1 = 11111111 + 00000001 = 00000000 = 0 **4. IEEE 754 single precision conversion** - (a) 5.75 = 101.11 = 1.0111 Γ— 2Β² β†’ 0 10000001 01110000000000000000000 - (b) -0.375 = -0.011 = -1.1 Γ— 2⁻² β†’ 1 01111101 10000000000000000000000 - (c) 1.0 = 1.0 Γ— 2⁰ β†’ 0 01111111 00000000000000000000000 **5. IEEE 754 β†’ Decimal** - (a) S=0, E=129, M=0.25 β†’ +1.25 Γ— 2Β² = +5.0 - (b) S=1, E=127, M=0 β†’ -1.0 Γ— 2⁰ = -1.0 - (c) S=0, E=0, M=0 β†’ +0.0 **6. Overflow detection** - (a) 01111111 + 00000001 = 10000000 β†’ Overflow! (positive+positive=negative) - (b) 10000000 + 11111111 = 01111111 β†’ Overflow! (negative+negative=positive) - (c) 01000000 + 01000000 = 10000000 β†’ Overflow! (64+64=-128) **7.** 0.1 and 0.2 cannot be exactly represented in binary floating-point. Both are infinite repeating fractions, so they're truncated when stored, and their errors add up. **8.** Kahan Summation tracks rounding error from each addition in a separate variable and compensates in the next addition. This minimizes error accumulation when adding many numbers. **9.** +0 and -0 have different bit patterns but compare as equal. -0 occurs when a very small negative number underflows, and preserves sign information: 1/+0 = +∞, 1/-0 = -∞. **10.** 0.0001f is not exactly representable in binary, causing errors that accumulate over 1 million iterations. Solutions: use double, Kahan Summation, or integer arithmetic followed by division.

Next Steps


References

to navigate between lessons