# Memory Hierarchy

## Overview

In computer systems, the speed gap between the CPU and memory is a major performance bottleneck. The memory hierarchy addresses this problem by optimizing the trade-offs between speed, capacity, and cost. In this lesson, we will learn about the principles of memory hierarchy, the concept of locality, and the characteristics of each memory level.

Difficulty: ⭐⭐
Prerequisites: Computer System Overview, CPU Architecture Basics
## Table of Contents
- The Need for Memory Hierarchy
- Principle of Locality
- Memory Technology Comparison
- Memory Levels
- Memory Bandwidth and Latency
- Memory Performance Optimization
- Practice Problems
## 1. The Need for Memory Hierarchy

### 1.1 CPU-Memory Speed Gap
CPU and Memory Performance Trends (1980-2020):

(chart omitted: CPU performance rises steeply from 1980 onward while memory performance rises only gradually, so the gap between the two curves widens every year)

- CPU: ~50% performance improvement per year (1980-2000)
- Memory: ~7% performance improvement per year
- Result: the "Memory Wall" - memory becomes the system performance bottleneck
### 1.2 Ideal Memory vs Reality
Ideal memory:

- Infinite capacity
- Zero access time
- Free

Reality:

- Speed ↑ = cost ↑, capacity ↓
- Capacity ↑ = cost ↑, speed ↓
- Cost ↓ = speed ↓, limited capacity
Solution: Memory Hierarchy
- Arrange multiple types of memory hierarchically
- Fast and small memory close to CPU
- Slow and large memory farther away
- Keep frequently used data in fast memory
### 1.3 Memory Hierarchy Concept
```
Registers     ~1KB          <- fastest, smallest, most expensive
      |
L1 Cache      ~64KB
      |
L2 Cache      ~256KB-1MB
      |
L3 Cache      ~2-32MB
      |
Main Memory   ~8-128GB
      |
SSD/HDD       ~256GB-10TB   <- slowest, largest, cheapest
```

Going up: faster, smaller, more expensive, closer to the CPU
Going down: slower, larger, cheaper, farther from the CPU
## 2. Principle of Locality

### 2.1 What is Locality?
Programs access memory in predictable patterns. This property is called locality, and it is the reason the memory hierarchy works.
### 2.2 Temporal Locality
Definition: Data accessed recently is likely to be accessed again in the near future
Example 1: Loop variables
```c
for (i = 0; i < 1000; i++) {
    sum += array[i];
}
```
Variables 'sum' and 'i' are accessed 1000 times
Example 2: Frequently called functions
```c
while (running) {
    process_event();  // repeated
    update_state();   // repeated
}
```
Function code is reused continuously
Exploiting temporal locality:
- Keep recently accessed data in cache
- Provide quickly from cache on re-access
### 2.3 Spatial Locality
Definition: Data near accessed data is likely to be accessed soon
Example 1: Sequential array access
```c
for (i = 0; i < N; i++) {
    sum += array[i];
}
```

Memory layout:

```
Address:  0x100  0x104  0x108  0x10C  0x110  0x114
array:    a[0]   a[1]   a[2]   a[3]   a[4]   a[5]
```

Sequential access → high spatial locality
Example 2: Struct access
```c
struct Point { int x, y, z; };
Point p;
distance = sqrt(p.x*p.x +
                p.y*p.y +
                p.z*p.z);
```
p.x, p.y, p.z are adjacent in memory
Exploiting spatial locality:
- Load data in cache line (block) units
- Fetch multiple adjacent data at once
- A 64-byte cache line holds 16 4-byte ints
### 2.4 Optimization Using Locality
Good code (high locality):
```c
// Row-major traversal - matches memory layout
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        sum += matrix[i][j];
    }
}
```

Memory access pattern:

```
[0,0] [0,1] [0,2] [0,3] ... [1,0] [1,1] ...
```

Sequential access → high cache hit rate
Bad code (low locality):
```c
// Column-major traversal - mismatches memory layout
for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
        sum += matrix[i][j];
    }
}
```

Memory access pattern:

```
[0,0] [1,0] [2,0] [3,0] ... [0,1] [1,1] ...
```

Jumping N elements per access → frequent cache misses
Performance difference (N=1024, 8-byte elements):
- Row-major: ~50ms
- Column-major: ~500ms (10x slower)
### 2.5 Working Set
Definition: The memory region actively used by a program during a specific time interval
The working set changes by program execution phase:

- Phase 1: Initialization — working set: [init code] + [config data], size ~100KB
- Phase 2: Data processing — working set: [processing code] + [input buffer] + [output], size ~10MB
- Phase 3: Result output — working set: [output code] + [output buffer], size ~500KB
Cache size and Working Set:
- Working Set < Cache size: High hit rate
- Working Set > Cache size: Thrashing may occur
## 3. Memory Technology Comparison

### 3.1 SRAM (Static RAM)
SRAM cell structure (6T SRAM):

(diagram omitted: two cross-coupled inverters — pull-up transistors P1/P2 and pull-down transistors N1/N2 — latch the stored bit; access transistors connect the latch to the bit line and its complement when the word line is asserted)
Characteristics:
- 6 transistors store 1 bit
- Data retained as long as power is supplied (no refresh needed)
- Fast access speed (~1-2ns)
- High cost, low density
- Usage: Cache memory, register files
### 3.2 DRAM (Dynamic RAM)
DRAM cell structure (1T1C DRAM):

```
          Word Line
              |
           +--+--+
Bit Line --|  T  |   (access transistor)
           +--+--+
              |
             === C   (storage capacitor)
              |
             GND
```
Characteristics:
- 1 transistor + 1 capacitor stores 1 bit
- Periodic refresh needed due to capacitor leakage (~64ms)
- Slow access speed (~50-100ns)
- Low cost, high density
- Usage: Main memory (DDR4, DDR5)
DRAM Access Process:
1. Row Activate (RAS): Select row, move data to sense amplifier
2. Column Read (CAS): Select column, output data
3. Precharge: Prepare for next access
RAS latency (~15ns) + CAS latency (~15ns) + Precharge (~15ns) ≈ 45ns+ total delay
### 3.3 Flash Memory (SSD)
NAND flash structure:

(diagram omitted: cells are arranged in a grid of series-connected strings; each word line gates one cell in every string, and each bit line reads out one string)
Characteristics:
- Stores charge in floating gate
- Non-volatile (data retained without power)
- Read: ~25us, Write: ~250us, Erase: ~2ms
- Only block-level erase possible
- Usage: SSD, USB drives, SD cards
### 3.4 Memory Technology Comparison Table

| Property     | SRAM   | DRAM        | NAND      | HDD          |
|--------------|--------|-------------|-----------|--------------|
| Access time  | 1-2ns  | 50-100ns    | ~25µs     | 5-10ms       |
| Cost/GB      | ~$5000 | ~$3-5       | ~$0.10    | ~$0.02       |
| Volatile     | Yes    | Yes         | No        | No           |
| Typical size | ~32MB  | ~128GB      | ~4TB      | ~20TB        |
| Density      | Low    | High        | Very high | Very high    |
| Power        | High   | Medium      | Low       | High         |
| Main use     | Cache  | Main memory | SSD       | Mass storage |
## 4. Memory Levels

### 4.1 Registers

Location: inside the CPU
Capacity: tens to hundreds of registers (x86-64: 16 general purpose + others)
Speed: 1 cycle (~0.3ns @ 3GHz)
Register types:

| Type             | Registers                         |
|------------------|-----------------------------------|
| General purpose  | RAX, RBX, RCX, RDX, RSI, RDI, ... |
| Program counter  | RIP (instruction pointer)         |
| Stack pointers   | RSP, RBP                          |
| Flag register    | RFLAGS (Zero, Carry, Sign, ...)   |
| Vector registers | XMM0-15, YMM0-15, ZMM0-31 (SIMD)  |
| Floating point   | ST0-ST7 (x87), XMM (SSE)          |
Register vs memory performance:

- Register operation: ADD R1, R2, R3 → 1 cycle
- Memory operation: ADD R1, [mem] → 4+ cycles (L1 hit)
### 4.2 Cache Memory
Cache hierarchy (modern CPU):

```
Per core (Core 0, Core 1, ...):

  L1 I-Cache 32KB (4 cycles)    L1 D-Cache 32KB (4 cycles)
            \___________________________/
                        |
              L2 Cache 256KB (12 cycles)

Shared by all cores:

              L3 Cache 8-32MB (40 cycles)
                        |
              Main Memory (100+ cycles)
```
Cache characteristics summary:

| Level | Capacity  | Latency   | Features            |
|-------|-----------|-----------|---------------------|
| L1    | 32-64KB   | 4 cycles  | Per-core, I/D split |
| L2    | 256KB-1MB | 12 cycles | Per-core or shared  |
| L3    | 8-64MB    | 40 cycles | Shared by all cores |
### 4.3 Main Memory (DRAM)
Modern memory system structure:

```
Memory Controller (integrated in the CPU, or in the northbridge on older systems)
   |-- Channel 0 -- DIMM 0 (16GB)
   |-- Channel 1 -- DIMM 0 (16GB)
   '-- Channel 2 -- DIMM 0 (16GB)
```
DDR (Double Data Rate) evolution:

| Gen  | Clock Speed (MT/s) | BW/Channel     | Features          |
|------|--------------------|----------------|-------------------|
| DDR3 | 800-2133           | 6.4-17 GB/s    | 1.5V              |
| DDR4 | 1600-3200          | 12.8-25.6 GB/s | 1.2V, bank groups |
| DDR5 | 3200-6400+         | 25.6-51.2 GB/s | 1.1V, on-die ECC  |
### 4.4 Secondary Storage
SSD structure:

(diagram omitted: an SSD controller sits in front of multiple NAND packages accessed in parallel; its Flash Translation Layer (FTL) performs logical-to-physical address translation, wear leveling, and garbage collection)
HDD vs SSD comparison:

| Property         | HDD      | SSD           |
|------------------|----------|---------------|
| Sequential read  | 150 MB/s | 500-7000 MB/s |
| Random read      | 0.5 MB/s | 50-500 MB/s   |
| Latency          | 5-10 ms  | 0.02-0.1 ms   |
| IOPS (random)    | ~200     | 10K-1M        |
| Shock resistance | Weak     | Strong        |
| Price per GB     | $0.02    | $0.10         |
## 5. Memory Bandwidth and Latency

### 5.1 Bandwidth
Definition: Amount of data that can be transferred per unit time (bytes/second)
Calculation:

Bandwidth = Bus Width × Transfer Rate

Example (DDR4-3200):

- Bus width: 64 bits = 8 bytes
- Transfer rate: 3200 MT/s (mega-transfers per second)
- Bandwidth = 8 × 3200 = 25,600 MB/s = 25.6 GB/s (per channel)
- Dual channel: 25.6 × 2 = 51.2 GB/s
- Quad channel: 25.6 × 4 = 102.4 GB/s
Memory bandwidth by system:

| System               | Bandwidth    |
|----------------------|--------------|
| Desktop (DDR4)       | 25-50 GB/s   |
| High-end workstation | 100-200 GB/s |
| Server (8-ch DDR5)   | 300-400 GB/s |
| GPU (HBM2)           | 1-2 TB/s     |
### 5.2 Latency
Definition: Time from data request to receipt
Latency by memory level:

| Level       | Latency (cycles) | Latency (ns) | Relative cost |
|-------------|------------------|--------------|---------------|
| Registers   | 1                | ~0.3         | 1x            |
| L1 cache    | 4-5              | ~1.5         | 4x            |
| L2 cache    | 12-14            | ~4           | 12x           |
| L3 cache    | 40-50            | ~15          | 40x           |
| Main memory | 100-300          | ~60-100      | 200x          |
| SSD         | 10,000-50,000    | 25-100µs     | 50,000x       |
| HDD         | 10,000,000+      | 5-10ms       | 20,000,000x   |
Visualization (bar length proportional to latency):

```
L1:  #
L2:  ###
L3:  #############
RAM: #################################################
SSD: (too long to display)
HDD: (even longer)
```
### 5.3 Latency vs Bandwidth Relationship
Little's Law:

Outstanding Requests = Bandwidth × Latency / Request Size
Meaning for Memory Systems:
- Multiple concurrent requests needed to utilize bandwidth
- Longer latency requires more concurrent requests
Example:
- Bandwidth: 50 GB/s
- Latency: 100 ns
- Cache line: 64 bytes
Required concurrent requests:

= (50 × 10^9 B/s) × (100 × 10^-9 s) / 64 B
≈ 78 concurrent requests
Solutions:
- Out-of-order execution to issue multiple loads simultaneously
- Prefetching to request data in advance
- Memory Level Parallelism (MLP) utilization
### 5.4 AMAT (Average Memory Access Time)
Formula:
AMAT = Hit Time + (Miss Rate × Miss Penalty)
Example:
- L1 hit time: 4 cycles
- L1 miss rate: 5%
- L2 hit time: 12 cycles
- L2 miss rate: 20% (of L1 misses)
- Memory access time: 200 cycles
Calculation:

AMAT = 4 + 0.05 × (12 + 0.20 × 200)
     = 4 + 0.05 × (12 + 40)
     = 4 + 0.05 × 52
     = 4 + 2.6
     = 6.6 cycles
AMAT for a multi-level cache:

AMAT = T_L1 + MR_L1 × (T_L2 + MR_L2 × (T_L3 + MR_L3 × T_Mem))
## 6. Memory Performance Optimization

### 6.1 Data Layout Optimization
Structure of Arrays (SoA) vs Array of Structures (AoS):

AoS (Array of Structures):

```c
struct Particle {
    float x, y, z;     // 12 bytes
    float vx, vy, vz;  // 12 bytes
    float mass;        // 4 bytes
    int id;            // 4 bytes
};                     // total 32 bytes

struct Particle particles[1000];
```

Memory layout:

[x,y,z,vx,vy,vz,mass,id][x,y,z,vx,vy,vz,mass,id][...]

When accessing only x coordinates: only 4 of every 32 bytes are used → wasteful
SoA (Structure of Arrays):

```c
struct Particles {
    float x[1000];
    float y[1000];
    float z[1000];
    float vx[1000];
    float vy[1000];
    float vz[1000];
    float mass[1000];
    int id[1000];
};
```

Memory layout:

[x0,x1,x2,...][y0,y1,y2,...][z0,z1,z2,...]...

When accessing only x coordinates: contiguous memory access → efficient
Performance difference: SoA can be up to 4-8x faster (with SIMD)
### 6.2 Loop Optimization
Loop Tiling (Blocking):

```c
// Original code - low cache efficiency
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

// With tiling - high cache efficiency
#define BLOCK 64
for (ii = 0; ii < N; ii += BLOCK)
    for (jj = 0; jj < N; jj += BLOCK)
        for (kk = 0; kk < N; kk += BLOCK)
            for (i = ii; i < min(ii+BLOCK, N); i++)
                for (j = jj; j < min(jj+BLOCK, N); j++)
                    for (k = kk; k < min(kk+BLOCK, N); k++)
                        C[i][j] += A[i][k] * B[k][j];
```
Effect:

- Original: repeatedly streams the entire large matrix → cache thrashing
- Tiled: fully processes small blocks that stay resident in cache
- N=1024, BLOCK=64: original ~10 seconds, tiled ~2 seconds (5x improvement)
### 6.3 Prefetching
Hardware Prefetch:
- CPU detects access patterns and automatically prefetches
- Effective for sequential and strided access
- Ineffective for irregular access
Software prefetch:

```c
// Intel intrinsic example (requires <xmmintrin.h> on x86)
for (i = 0; i < N; i++) {
    _mm_prefetch((const char *)&array[i + 16], _MM_HINT_T0);  // prefetch into L1
    sum += array[i];
}
```
Prefetch Distance Calculation:
Distance = Memory Latency / Time per Loop Iteration
Example:
- Memory latency: 100 cycles
- Time per loop iteration: 5 cycles
- Prefetch distance: 100 / 5 = 20 elements ahead
Cautions:
- Too early prefetch: Evicted from cache
- Too late prefetch: Data not arrived
- Unnecessary prefetch: Cache pollution
### 6.4 Memory Alignment
Importance of Alignment:
Aligned access:

- Address 0x1000 (64-byte aligned): all 64 bytes of the object fit in one cache line → 1 line access

Unaligned access:

- Address 0x1020 (offset 32 bytes into a line): the object straddles two cache lines (32 bytes in each) → 2 line accesses
Alignment directives:

```c
// C++11 alignas (in C11, use _Alignas or <stdalign.h>)
struct alignas(64) CacheLine {
    int data[16];
};

// Dynamic allocation (C11, <stdlib.h>)
void* ptr = aligned_alloc(64, size);
```
## 7. Practice Problems

### Basic Problems
1. Explain why memory hierarchy is necessary.
2. Identify temporal and spatial locality in the following code:
   ```c
   for (i = 0; i < 100; i++) { sum += array[i]; }
   ```
3. What are 3 major differences between SRAM and DRAM?
### Intermediate Problems
4. If L1 cache hit time is 4 cycles, miss rate is 8%, and L2 access time is 12 cycles, what is the AMAT?
5. Which code harms spatial locality?
   ```c
   // (a)
   for (i = 0; i < N; i++)
       for (j = 0; j < N; j++)
           sum += a[i][j];

   // (b)
   for (j = 0; j < N; j++)
       for (i = 0; i < N; i++)
           sum += a[i][j];
   ```
6. What is the theoretical maximum bandwidth of DDR4-3200 dual channel?
### Advanced Problems
7. Explain the problems and solutions when the working set size exceeds the cache size.
8. Apply loop tiling to optimize the following matrix multiplication code:
   ```c
   for (i = 0; i < N; i++)
       for (j = 0; j < N; j++)
           for (k = 0; k < N; k++)
               C[i][j] += A[i][k] * B[k][j];
   ```
9. Calculate the prefetch distance:
   - Memory latency: 60ns
   - CPU clock: 3GHz
   - Instructions per loop: 10, CPI: 1
## Answers

1. To bridge the speed gap between CPU and memory. By placing fast, expensive memory close to the CPU and slow, cheap memory farther away, we optimize performance relative to cost.
2. Temporal locality: variables `sum` and `i` (accessed 100 times). Spatial locality: `array[i]` (accessing contiguous memory addresses).
3. SRAM vs DRAM:
   - Structure: SRAM 6T, DRAM 1T1C
   - Refresh: SRAM not needed, DRAM needed
   - Speed: SRAM fast (~2ns), DRAM slow (~50ns)
   - Cost/Density: SRAM expensive/low-density, DRAM cheap/high-density
4. AMAT = 4 + 0.08 × 12 = 4 + 0.96 = 4.96 cycles
5. (b) Column-major traversal, because arrays are stored row-major in C/C++.
6. DDR4-3200: 8 bytes × 3200 MT/s × 2 channels = 51.2 GB/s
7. Cache thrashing occurs: needed data is repeatedly evicted and reloaded. Solutions: loop tiling, data layout optimization, algorithm changes.
8. With tiling applied (the block-size macro is named `BLOCK` rather than `B` so it does not collide with matrix `B`):
   ```c
   #define BLOCK 64
   for (ii = 0; ii < N; ii += BLOCK)
       for (jj = 0; jj < N; jj += BLOCK)
           for (kk = 0; kk < N; kk += BLOCK)
               for (i = ii; i < ii+BLOCK && i < N; i++)
                   for (j = jj; j < jj+BLOCK && j < N; j++)
                       for (k = kk; k < kk+BLOCK && k < N; k++)
                           C[i][j] += A[i][k] * B[k][j];
   ```
9. Prefetch distance:
   - Memory latency: 60ns × 3GHz = 180 cycles
   - Loop time: 10 instructions × CPI 1 = 10 cycles
   - Distance: 180 / 10 = 18 elements ahead

## Next Steps
- 15_Cache_Memory.md - Cache mapping, replacement policies, write policies
## References
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- What Every Programmer Should Know About Memory (Ulrich Drepper)
- Intel Memory Latency Checker
- Memory hierarchy visualization