Cache Memory¶
Overview¶
Cache memory is a high-speed buffer memory located between the CPU and main memory to bridge the speed gap between them. Cache design directly affects computer performance and requires various design decisions including mapping schemes, replacement policies, and write policies. In this lesson, we will learn about cache operation principles and various design techniques.
Difficulty: ★★★
Prerequisites: Memory Hierarchy, Principle of Locality
Table of Contents¶
- Cache Concept and Operation
- Cache Mapping Schemes
- Cache Replacement Policies
- Cache Write Policies
- Types of Cache Misses
- Multi-Level Cache
- Cache Optimization Techniques
- Practice Problems
1. Cache Concept and Operation¶
1.1 What is Cache?¶
Cache Etymology: French "cacher" (to hide)
Definition: Small, high-speed memory located between CPU and main memory
Purpose: Keep frequently used data accessible quickly
Role of Cache:

    CPU ──(fast, ~4 ns)──▶ Cache ──(slow, ~60 ns)──▶ Main Memory

Keeping frequently used data in the cache reduces the average access time.
1.2 Cache Structure Basics¶
Cache Line / Block:
- Basic unit of data transfer between cache and memory
- Typically 64 bytes (modern processors)
Cache line layout (metadata plus 64 bytes of data):

    Valid (1 bit) | Tag (upper address bits) | Data (64 bytes: 0x00 0x01 ... 0x3E 0x3F)
Cache Terminology:
- Hit: Requested data is in cache
- Miss: Requested data is not in cache
- Hit Rate: Number of hits / Total accesses
- Miss Rate: 1 - Hit Rate
1.3 Address Decomposition¶
Decomposing memory address for cache access:
32-bit address example (4KB cache, 64B line, direct mapped):
     31                  12 11         6 5          0
    ┌──────────────────────┬────────────┬────────────┐
    │       Tag (20b)      │ Index (6b) │ Offset (6b)│
    └──────────────────────┴────────────┴────────────┘
Offset (6 bits):
- Position within cache line
- 64 bytes = 2^6, thus 6 bits
Index (6 bits):
- Cache line selection
- 4KB / 64B = 64 lines = 2^6, thus 6 bits
Tag (20 bits):
- Distinguishes different addresses with same index
- Remaining bits = 32 - 6 - 6 = 20 bits
1.4 Cache Operation Flow¶
Cache Read Operation:
1. CPU issues an address request.
2. The Index field selects a cache line.
3. The line's Tag is compared with the address Tag, and the Valid bit is checked.
4. Tag match and Valid = 1 → Cache Hit: the Offset selects the data to return.
5. Tag mismatch or Valid = 0 → Cache Miss: the block is loaded from memory, stored in the cache, and then the data is returned.
2. Cache Mapping Schemes¶
2.1 Direct Mapped¶
Characteristic: Each memory block can only be stored in one cache location
Calculation:
Cache Index = (Memory Block Address) mod (Number of Cache Lines)
Example: 8-line cache
    Memory Block     Cache Index
    0        →       0
    1        →       1
    2        →       2
    ...
    7        →       7
    8        →       0  (conflict!)
    9        →       1
    ...
    16       →       0  (conflict!)
Direct Mapped Cache:

| Index | Valid | Tag   | Data (64B)              |
|-------|-------|-------|-------------------------|
| 0     | 1     | 0x100 | [block 0 or 8 or 16...] |
| 1     | 1     | 0x101 | [block 1 or 9 or 17...] |
| 2     | 0     | -     | -                       |
| 3     | 1     | 0x102 | [block 3 or 11...]      |
| 4     | 1     | 0x103 | [...]                   |
| 5     | 0     | -     | -                       |
| 6     | 1     | 0x100 | [...]                   |
| 7     | 1     | 0x101 | [...]                   |
Advantages:
- Simple hardware
- Fast access (no parallel tag comparison needed)
- Deterministic replacement (no replacement policy needed)
Disadvantages:
- Many conflict misses
- Blocks mapping to same index compete
2.2 Fully Associative¶
Characteristic: Memory block can be stored in any cache location
Fully Associative Cache:

| Entry | Valid | Tag       | Data (64B)  |
|-------|-------|-----------|-------------|
| 0     | 1     | 0x0001000 | [any block] |
| 1     | 1     | 0x0002008 | [any block] |
| 2     | 1     | 0x0001008 | [any block] |
| 3     | 1     | 0x0003010 | [any block] |
| 4     | 0     | -         | -           |
| 5     | 1     | 0x0002000 | [any block] |
| 6     | 1     | 0x0004020 | [any block] |
| 7     | 1     | 0x0001020 | [any block] |
Search Process:
The request tag is compared with all entry tags simultaneously (parallel comparison):

    Request Tag: 0x0001008

    Entry 0: 0x0001000 ≠ 0x0001008 → Miss
    Entry 1: 0x0002008 ≠ 0x0001008 → Miss
    Entry 2: 0x0001008 = 0x0001008 → Hit!
    Entry 3: 0x0003010 ≠ 0x0001008 → Miss
    ...
Advantages:
- No conflict misses
- Maximum flexibility
Disadvantages:
- Complex hardware (parallel tag comparison needed)
- Slow access (requires CAM - Content Addressable Memory)
- Expensive implementation
- Practically used only for small structures like TLB
2.3 Set Associative¶
Characteristic: Compromise between direct mapped and fully associative
- Cache divided into multiple sets
- Fully associative within each set
N-way Set Associative:
- N cache lines (ways) per set
- Memory block can only map to one set
- Can be placed in any way within the set
Example: 4-way Set Associative (32 lines, 8 sets)
4-way Set Associative Cache (V = Valid, T = Tag in hex):

| Set | Way 0     | Way 1     | Way 2     | Way 3     |
|-----|-----------|-----------|-----------|-----------|
| 0   | V=1 T=100 | V=1 T=108 | -         | V=1 T=110 |
| 1   | V=1 T=101 | V=1 T=109 | V=1 T=111 | -         |
| 2   | V=1 T=102 | -         | V=1 T=10A | V=1 T=112 |
| 3   | V=1 T=103 | V=1 T=10B | V=1 T=113 | V=1 T=11B |
| 4   | -         | V=1 T=10C | -         | V=1 T=114 |
| 5   | V=1 T=105 | V=1 T=10D | V=1 T=115 | -         |
| 6   | V=1 T=106 | -         | V=1 T=10E | V=1 T=116 |
| 7   | V=1 T=107 | V=1 T=10F | V=1 T=117 | V=1 T=11F |
Address decomposition (32-bit, 4-way, 8 sets, 64B line):
     31                    9 8          6 5            0
    ┌───────────────────────┬────────────┬──────────────┐
    │     Tag (23 bits)     │ Set Index  │ Block Offset │
    │                       │  (3 bits)  │   (6 bits)   │
    └───────────────────────┴────────────┴──────────────┘
2.4 Mapping Scheme Comparison¶
| Property        | Direct Mapped | Set Associative | Fully Associative |
|-----------------|---------------|-----------------|-------------------|
| Flexibility     | Low           | Medium          | High              |
| HW Complexity   | Low           | Medium          | High              |
| Access Speed    | Fast          | Medium          | Slow              |
| Conflict Misses | Many          | Few             | None              |
| # Comparators   | 1             | N (N-way)       | All lines         |
| Practical Use   | Early caches  | Most caches     | TLB               |
Typical configurations (modern processors):
- L1 cache: 8-way set associative
- L2 cache: 4-8 way set associative
- L3 cache: 16-way set associative
- TLB: Fully associative or high associativity
3. Cache Replacement Policies¶
3.1 Need for Replacement Policy¶
In set associative or fully associative caches:
- New block must be loaded on cache miss
- When set (or entire cache) is full
- Decision needed on which block to replace
Goal: Replace the block that won't be used in the future → minimize the miss rate
Problem: The future cannot be predicted → use heuristics
3.2 LRU (Least Recently Used)¶
Principle: Replace the block that was used longest ago
Rationale: Recently used blocks are likely to be used again soon (temporal locality)
Example: 4-way cache, access order A, B, C, D, E, A, B, F
    Initial state (empty):               [-, -, -, -]   LRU order: -
    Access A (Miss):                     [A, -, -, -]   LRU order: A
    Access B (Miss):                     [A, B, -, -]   LRU order: A, B
    Access C (Miss):                     [A, B, C, -]   LRU order: A, B, C
    Access D (Miss):                     [A, B, C, D]   LRU order: A, B, C, D
    Access E (Miss, replace A - oldest): [E, B, C, D]   LRU order: B, C, D, E
    Access A (Miss, replace B):          [E, A, C, D]   LRU order: C, D, E, A
    Access B (Miss, replace C):          [E, A, B, D]   LRU order: D, E, A, B
LRU Implementation:
- Exact LRU: N! states needed β High hardware cost
- Approximate LRU: Bit matrix, tree-based, etc.
3.3 Pseudo-LRU (Tree-based)¶
Tree-based approximate LRU (4-way example):
              [b0]            ← root: the bit points toward the half
             /    \              holding the next replacement victim
          [b1]    [b2]        ← internal nodes
          /  \    /  \
       Way0 Way1 Way2 Way3
Finding way to replace:
1. Start from root
2. If bit is 0, go left; if 1, go right
3. At leaf, replace that way
Update bits on access:
- Flip each bit on the path to the accessed way so it points away from that way
- The next replacement is thereby steered toward less recently used ways
Advantage: Only N−1 bits needed per set (4-way: 3 bits vs ⌈log2(4!)⌉ = 5 bits for exact LRU)
Disadvantage: Doesn't guarantee exact LRU order
3.4 FIFO (First In, First Out)¶
Principle: Replace the block that entered first
Example: 4-way cache, access order A, B, C, D, E, A
    [-, -, -, -]   Head=0

    A (Miss) → B (Miss) → C (Miss) → D (Miss):
    [A, B, C, D]   Head=0 (A entered first)

    E (Miss, replace A):
    [E, B, C, D]   Head=1 (B is next)

    A (Miss, replace B - even though A was just accessed!):
    [E, A, C, D]   Head=2
Advantages:
- Very simple implementation (one pointer)
- Fair replacement
Disadvantages:
- Ignores access patterns
- Subject to Belady's anomaly: the miss rate can increase even as the cache grows
3.5 Random Replacement¶
Principle: Randomly select block to replace
Implementation:
- Use random number generator
- Or simple counter
    [A, B, C, D]

    Insert E, random selection = Way 2:
    [A, B, E, D]   (C replaced)
Advantages:
- Very simple implementation
- Robust against pathological access patterns
- No Belady's anomaly
Disadvantages:
- Average performance lower than LRU
- Recently used blocks may be replaced
Practical Use:
- Some ARM processor caches
- Random+LRU hybrid in L2/L3 caches
3.6 Replacement Policy Comparison¶
Miss rate comparison (typical workloads):
| Policy        | Relative Miss Rate | Features                                        |
|---------------|--------------------|-------------------------------------------------|
| Optimal (OPT) | 1.00x              | Theoretical optimum (requires future knowledge) |
| LRU           | 1.05x              | Very good, complex implementation               |
| Pseudo-LRU    | 1.08x              | LRU approximation, efficient implementation     |
| Random        | 1.15x              | Simple, reasonable performance                  |
| FIFO          | 1.20x              | Simplest, lowest performance                    |
4. Cache Write Policies¶
4.1 Write-Through¶
Principle: Update both cache and memory on write
Write-Through operation:

    CPU ──(Write X = 5)──▶ Cache (X = 5) ──(X = 5)──▶ Memory (X = 5)

Cache and memory are always consistent.
Advantages:
- Cache and memory always consistent
- No write-back needed on cache replacement
- Simple implementation
- Easy data recovery (power failure, etc.)
Disadvantages:
- Memory access on every write → slow
- Consumes memory bandwidth
Mitigation: Write Buffer
    CPU → Cache → Write Buffer → Memory

The CPU continues executing while the buffer drains, stalling only if the buffer fills.
4.2 Write-Back¶
Principle: Update only cache on write, update memory on replacement
1. Write occurs:

    CPU ──(Write X = 5)──▶ Cache (X = 5, Dirty = 1)    ← no memory access!

2. On cache replacement:

    Cache (X = 5) ──(if Dirty = 1, write back)──▶ Memory (X = 5)
Dirty Bit:
- Indicates if cache line has been modified
- 1: Modified (inconsistent with memory)
- 0: Not modified (consistent with memory)
Advantages:
- Improved write performance (reduced memory access)
- Efficient for repeated writes to same address
- Saves memory bandwidth
Disadvantages:
- Complex implementation (dirty bit, write-back on replacement)
- Cache-memory inconsistency possible
- Delay on replacement
4.3 Write Allocate vs No-Write Allocate¶
How to handle write misses:
Write Allocate (Fetch on Write):
1. Write miss occurs
2. Load the block from memory into the cache
3. Perform the write in the cache

Typically used with Write-Back. Exploits spatial locality: adjacent data is also loaded.
No-Write Allocate (Write Around):
1. Write miss occurs
2. Write directly to memory (no cache load)

Typically used with Write-Through. Efficient for data written once and never read back.
Common combinations:
| Write Policy  | Allocate Policy   |
|---------------|-------------------|
| Write-Back    | Write Allocate    |
| Write-Through | No-Write Allocate |
4.4 Write Policy Comparison¶
| Property                  | Write-Through       | Write-Back          |
|---------------------------|---------------------|---------------------|
| Write Latency             | High (memory speed) | Low (cache speed)   |
| Implementation Complexity | Low                 | High                |
| Memory Traffic            | High                | Low                 |
| Data Consistency          | Guaranteed          | Needs management    |
| Replacement Overhead      | None                | Write-back if dirty |
| Practical Use             | Some L1, embedded   | L1, L2, L3          |
5. Types of Cache Misses¶
5.1 3C Classification¶
Three causes of cache misses (3C Model):
1. Cold Miss (Compulsory Miss)
2. Capacity Miss
3. Conflict Miss
5.2 Cold Miss (Compulsory Miss)¶
Definition: Miss on first access to a block
- Occurs when cache is empty
- Occurs regardless of cache size
- Unavoidable (can mitigate with prefetching)
At program start (empty cache):

    Access 1: A → Miss (first access to A)
    Access 2: B → Miss (first access to B)
    Access 3: A → Hit  (already in cache)
    Access 4: C → Miss (first access to C)

The three misses above are all Cold Misses.
Mitigation methods:
- Prefetching
- Larger block size (exploit spatial locality)
5.3 Capacity Miss¶
Definition: Miss due to insufficient cache capacity
- Occurs when Working Set > Cache size
- Previously cached block replaced due to capacity, then re-accessed
4-block cache, access order A, B, C, D, E, A:

    Access A → Miss, cache: [A, -, -, -]
    Access B → Miss, cache: [A, B, -, -]
    Access C → Miss, cache: [A, B, C, -]
    Access D → Miss, cache: [A, B, C, D]   ← cache full
    Access E → Miss, cache: [E, B, C, D]   ← A replaced
    Access A → Miss! cache: [E, A, C, D]   ← Capacity Miss (A was evicted purely for lack of capacity)
Mitigation methods:
- Increase cache size
- Optimize data structures (reduce Working Set)
- Loop tiling and other algorithm optimizations
5.4 Conflict Miss¶
Definition: Miss due to blocks competing for same cache location
- Occurs in direct mapped or set associative caches
- Miss despite space available in cache due to location contention
- Does not occur in fully associative caches
2-way cache with 2 sets; A, E, I all map to Set 0:

    Initial:            Set 0: [-, -], Set 1: [-, -]
    Access A → Miss,    Set 0: [A, -], Set 1: [-, -]
    Access E → Miss,    Set 0: [A, E], Set 1: [-, -]   ← Set 0 full
    Access I → Miss,    Set 0: [I, E], Set 1: [-, -]   ← A replaced, Set 1 still empty!
    Access A → Miss!    Set 0: [I, A], Set 1: [-, -]   ← Conflict Miss

Half of the cache is empty, yet the miss occurs: that is a Conflict Miss.
Mitigation methods:
- Increase associativity (increase N-way)
- Skewed cache
- Victim cache
5.5 3C Miss Proportions¶
Typical program miss proportions:
As the cache grows, the miss mix shifts: Capacity Misses shrink, Conflict Misses shrink with higher associativity, and Cold Misses stay nearly constant.

- Increasing cache size: reduces Capacity Misses
- Increasing associativity: reduces Conflict Misses
- Cold Misses: almost independent of cache size
6. Multi-Level Cache¶
6.1 Need for Multi-Level Cache¶
Problem: L1 cache optimization dilemma
- Optimizing hit time → small cache, low associativity
- Optimizing miss rate → large cache, high associativity
Solution: Multi-level cache
    CPU                                  fast access needed
     │
    L1 Cache    32KB      ~4 cycles      small and fast (hit time optimized)
     │
    L2 Cache    256KB     ~12 cycles     medium (balanced)
     │
    L3 Cache    8-32MB    ~40 cycles     large and shared (miss rate optimized)
     │
    Memory                100+ cycles
6.2 Inclusive vs Exclusive Cache¶
Inclusive Cache:
- Lower level contains data of upper levels
- All data in L1 is also in L2
- All data in L2 is also in L3
    L1: [A, B, C, D]
    L2: [A, B, C, D, E, F, G, H, ...]
    L3: [A, B, C, D, E, F, G, H, ..., X, Y, Z]

Advantages:
- Easy to maintain cache coherence
- Only L3 needs to be checked when snooping

Disadvantages:
- Reduced effective capacity (duplicate storage)
- An L3 replacement requires invalidating the line in L1/L2
Exclusive Cache:
- Each level stores different data
- A block exists in only one level
    L1: [A, B, C, D]
    L2: [E, F, G, H]   (data not in L1)
    L3: [I, J, K, L]   (data not in L1 or L2)

Advantages:
- Total effective capacity = L1 + L2 + L3

Disadvantages:
- Complex data movement between caches
- An L1 miss requires searching L2 and L3 sequentially
NINE (Non-Inclusive, Non-Exclusive):
- No consistency guarantee, freely placed
- Complex implementation but flexible
6.3 Private vs Shared Cache¶
Modern multicore processor structure:
    Core 0         Core 1         Core 2         Core 3
    L1-I / L1-D    L1-I / L1-D    L1-I / L1-D    L1-I / L1-D
    L2 (private)   L2 (private)   L2 (private)   L2 (private)
         └──────────────┴───────┬───────┴──────────────┘
                     L3 (shared, 8-32MB)
                                │
                          Main Memory
Private Cache (L1, L2):
- Dedicated to each core
- Minimizes access latency
- No inter-core interference
Shared Cache (L3):
- Shared by all cores
- Efficient for inter-core data sharing
- Dynamic capacity allocation (more space for busy cores)
- Requires coherence maintenance
6.4 Local vs Global Miss Rate¶
Local Miss Rate:
Miss rate at that level
Global Miss Rate:
Proportion of total accesses reaching that level
Example:
- L1 miss rate: 5%
- L2 miss rate (Local): 20%
- L3 miss rate (Local): 30%
Global miss rate calculation:
- L2 Global Miss Rate = L1 Miss Rate × L2 Local Miss Rate = 0.05 × 0.20 = 1%
  (1% of all accesses reach L3)
- L3 Global Miss Rate = L2 Global × L3 Local = 0.01 × 0.30 = 0.3%
  (0.3% of all accesses reach memory)
| Level | Local Miss Rate | Global Miss Rate   |
|-------|-----------------|--------------------|
| L1    | 5%              | 5%                 |
| L2    | 20%             | 1% (0.05 × 0.20)   |
| L3    | 30%             | 0.3% (0.01 × 0.30) |
7. Cache Optimization Techniques¶
7.1 Victim Cache¶
Purpose: Reduce Conflict Misses
Structure:
    Main Cache (Direct Mapped)
         │ replaced block
         ▼
    Victim Cache (Fully Associative, 4-8 entries)
Operation:
1. Main Cache miss occurs
2. Search Victim Cache
3. If found in Victim Cache → swap with the main-cache line and hit
4. If not → load from memory; the replaced block goes to the Victim Cache
Effect: Absorbs conflict misses up to Victim Cache size
7.2 Prefetching¶
Hardware Prefetch:
- CPU detects access patterns
- Preloads blocks expected to be needed
Software Prefetch:
- Programmer/compiler inserts prefetch instructions
// Software prefetch example
for (int i = 0; i < N; i++) {
    __builtin_prefetch(&a[i + 16], 0, 3);  // prefetch 16 elements ahead
    sum += a[i];
}
Prefetch effect:
    Without prefetch:  Request ──(full memory latency)──▶ Process
    With prefetch:     Prefetch issued early ──(latency overlaps other work)──▶
                       Request ──(data already arrived)──▶ Process (fast)

Prefetch hides memory latency.
7.3 Compiler Optimizations¶
Loop Interchange:
// Before: Column-major (poor locality)
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
a[i][j] = b[i][j] + c[i][j];
// After: Row-major (good locality)
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
a[i][j] = b[i][j] + c[i][j];
Loop Fusion:
// Before: Two loops
for (i = 0; i < N; i++) a[i] = b[i] + 1;
for (i = 0; i < N; i++) c[i] = a[i] * 2;
// After: One loop (use a[i] while in cache)
for (i = 0; i < N; i++) {
a[i] = b[i] + 1;
c[i] = a[i] * 2;
}
Loop Tiling:
// Before
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
// After (block-wise processing for cache reuse)
for (ii = 0; ii < N; ii += BLOCK)
for (jj = 0; jj < N; jj += BLOCK)
for (kk = 0; kk < N; kk += BLOCK)
for (i = ii; i < ii+BLOCK; i++)
for (j = jj; j < jj+BLOCK; j++)
for (k = kk; k < kk+BLOCK; k++)
C[i][j] += A[i][k] * B[k][j];
8. Practice Problems¶
Basic Problems¶
1. With a 32KB direct-mapped cache and 64-byte blocks, how many cache lines are there?
2. In a 4-way set associative cache, how many locations can a specific memory block be stored in?
3. Explain the difference between Write-Through and Write-Back.

Intermediate Problems¶

4. Find the Tag, Index, and Offset for address 0x12345678 (cache: 16KB, 2-way set associative, 64-byte lines).
5. Classify the type of each cache miss for the access pattern A, B, C, D, A, E, A (4-line direct-mapped cache; blocks A~E all map to the same index).
6. L1 cache: 2-cycle hit time, 10% miss rate. L2 cache: 10-cycle hit time, 5% miss rate. Memory: 100-cycle access time. Calculate the AMAT.

Advanced Problems¶

7. Analyze the cache performance of the following code (64-byte cache line, 4-byte int):

   int a[1024], b[1024], c[1024];
   for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];

8. How many bits per set are needed to implement an exact LRU policy in an 8-way set associative cache?
9. Explain the principle by which a Victim Cache reduces Conflict Misses.

Answers¶

1. Number of cache lines = 32KB / 64B = 512 lines.
2. 4 locations (4-way = 4 candidate ways within its one set).
3. Write-Through updates cache and memory simultaneously, guaranteeing consistency; Write-Back updates only the cache and writes memory back on replacement, giving better write performance.
4. 16KB / 64B / 2 ways = 128 sets, so Offset = 6 bits (64B = 2^6), Index = 7 bits (128 = 2^7), Tag = 32 − 6 − 7 = 19 bits.
   0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
   - Offset: 111000 (0x38)
   - Index: 1011001 (0x59)
   - Tag: 000 1001 0001 1010 0010 (0x91A2)
5. A: Cold Miss, B: Cold Miss, C: Cold Miss, D: Cold Miss (first accesses, each evicting its predecessor from the shared line), A: Conflict Miss (A was evicted earlier; the line now holds D), E: Cold Miss (first access, evicts A), A: Conflict Miss (replaced by E).
6. AMAT = 2 + 0.10 × (10 + 0.05 × 100) = 2 + 0.10 × 15 = 2 + 1.5 = 3.5 cycles.
7. A 64-byte line holds 16 ints; each array occupies 1024 / 16 = 64 cache lines. Sequential access → 1 miss per 16 accesses per array. Cold Misses: 64 × 3 = 192; Hits: (1024 − 64) × 3 = 2880; Hit rate: 2880 / 3072 = 93.75% (assuming the three arrays do not conflict in the cache).
8. Exact LRU must distinguish all 8! = 40320 orderings of the 8 ways, so ⌈log2(40320)⌉ = 16 bits per set. (In practice, approximate LRU is used.)
9. Blocks evicted from the main cache are stored in the small fully associative victim cache. When a block evicted by a conflict is needed again soon, it is found there (and swapped back) instead of being re-fetched from memory, so the victim cache absorbs conflict misses up to its size.

Next Steps¶
- 16_Virtual_Memory.md - Virtual address space, page tables, TLB
References¶
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- Cache Memory Visualization
- Intel Optimization Manual - Cache
- What Every Programmer Should Know About Memory (Ulrich Drepper)