Superscalar and Out-of-Order Execution
Superscalar and Out-of-Order Execution¶
Overview¶
Modern high-performance processors cannot achieve significant performance gains by simply increasing clock speed. Superscalar and Out-of-Order Execution are techniques that exploit Instruction-Level Parallelism (ILP) to execute multiple instructions simultaneously in a single cycle. This lesson covers ILP concepts, superscalar architecture, and the principles and implementation of out-of-order execution.
Difficulty: ββββ
Prerequisites: Pipelining, Branch Prediction, CPU Architecture Basics
Table of Contents¶
- Instruction-Level Parallelism (ILP)
- Superscalar Processors
- Need for Out-of-Order Execution
- Register Renaming
- Tomasulo Algorithm
- Reorder Buffer (ROB)
- Modern Processor Implementations
- Practice Problems
1. Instruction-Level Parallelism (ILP)¶
1.1 ILP Concept¶
Instruction-Level Parallelism (ILP) refers to the potential parallelism of instructions within a program that can be executed simultaneously.
Sequential vs Parallel Execution:
Sequential Execution:
Time β
t1 t2 t3 t4 t5 t6
βββββββΌββββββΌββββββΌββββββΌββββββΌββββββ€
β I1 β I2 β I3 β I4 β I5 β I6 β
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
Total 6 cycles
Parallel Execution (ILP = 2):
Time β
t1 t2 t3
βββββββΌββββββΌββββββ€
β I1 β I3 β I5 β
β I2 β I4 β I6 β
βββββββ΄ββββββ΄ββββββ
Total 3 cycles
1.2 Data Dependence¶
The most important factor limiting ILP is data dependence.
RAW (Read After Write) - True Dependence¶
I1: ADD R1, R2, R3 ; R1 = R2 + R3
I2: SUB R4, R1, R5 ; R4 = R1 - R5 (uses R1, needs I1's result)
I1 βββββββββ I2
R1 dependence
I2 can only execute after I1 writes to R1
WAR (Write After Read) - Anti-dependence¶
I1: ADD R1, R2, R3 ; Read R2
I2: SUB R2, R4, R5 ; Write R2 (must write after I1 reads R2)
I2 must not overwrite R2 before I1 reads it
β Can be resolved by register renaming
WAW (Write After Write) - Output Dependence¶
I1: ADD R1, R2, R3 ; Write R1
I2: SUB R1, R4, R5 ; Write R1 (writing to same register)
If I2 completes before I1, final R1 value is wrong
β Can be resolved by register renaming
1.3 Dependence Graph¶
Program:
I1: LD R1, 0(R10) ; Load from memory
I2: ADD R2, R1, R3 ; RAW on R1
I3: LD R4, 8(R10) ; Independent
I4: MUL R5, R4, R6 ; RAW on R4
I5: ADD R7, R2, R5 ; RAW on R2, R5
I6: ST R7, 16(R10) ; RAW on R7
Dependence graph:
I1 I3
β β
βΌ βΌ
I2 I4
β β
βββββββ¬ββββββ
β
βΌ
I5
β
βΌ
I6
Parallelizable groups:
- Level 1: I1, I3 (can execute simultaneously)
- Level 2: I2, I4 (can execute simultaneously)
- Level 3: I5
- Level 4: I6
Minimum execution time: 4 levels (4 cycles) vs sequential: 6 cycles
ILP = 6/4 = 1.5
1.4 Control Dependence¶
BEQ R1, R2, LABEL ; Branch instruction
ADD R3, R4, R5 ; Execute if branch not taken
...
LABEL:
SUB R6, R7, R8 ; Execute if branch taken
Don't know which instruction to execute before branch result
β Resolved by branch prediction
2. Superscalar Processors¶
2.1 Superscalar Concept¶
A superscalar processor can fetch, decode, and execute multiple instructions per cycle.
Scalar vs Superscalar:
Scalar Pipeline (IPC β€ 1):
βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ
β IF β ID β EX β MEM β WB β I1
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ
β IF β ID β EX β MEM β WB β I2
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
2-way Superscalar (IPC β€ 2):
βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ
β IF β ID β EX β MEM β WB β I1
βββββββΌββββββΌββββββΌββββββΌββββββ€
β IF β ID β EX β MEM β WB β I2
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ
β IF β ID β EX β MEM β WB β I3
βββββββΌββββββΌββββββΌββββββΌββββββ€
β IF β ID β EX β MEM β WB β I4
βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ
2.2 Superscalar Structure¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 4-way Superscalar Processor β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Instruction Fetch Unit β β
β β ββββββββββββ ββββββββββββββββββββββββββββββββββββ β β
β β β PC ββββββ Instruction Cache (I-Cache) β β β
β β ββββββββββββ ββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β 4 instructions/cycle β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Instruction Decode Unit β β
β β ββββββββββ ββββββββββ ββββββββββ ββββββββββ β β
β β βDecoder1β βDecoder2β βDecoder3β βDecoder4β β β
β β ββββββββββ ββββββββββ ββββββββββ ββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Issue Unit β β
β β (Dependence checking and execution unit alloc) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββ¬βββββββββββΌβββββββββββ¬ββββββββββββ β
β βΌ βΌ βΌ βΌ βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ALU 1 ββ ALU 2 ββ FPU ββ Load ββ Store β β
β β ββ ββ ββ Unit ββ Unit β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β β β β
β βββββββββββββ΄βββββββββββ΄βββββββββββ΄ββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Write-back / Commit β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
2.3 Issue Policies¶
In-Order Issue¶
Instructions issued to execution units in program order
Program:
I1: ADD R1, R2, R3
I2: MUL R4, R1, R5 ; Depends on I1 (stall)
I3: SUB R6, R7, R8 ; Independent but waits behind I2
Timeline:
Cycle 1: Issue I1
Cycle 2: I1 executing, I2 waiting (RAW hazard)
Cycle 3: Issue I2 (after I1 completes)
Cycle 4: Issue I3
Problem: I3 is independent but delayed by I2
Out-of-Order Issue¶
Independent instructions issued regardless of order
Program:
I1: ADD R1, R2, R3
I2: MUL R4, R1, R5 ; Depends on I1
I3: SUB R6, R7, R8 ; Independent
Timeline:
Cycle 1: Issue I1, I3 (simultaneous)
Cycle 2: I1 completes, Issue I2
Cycle 3: I2 executing
Performance gain: I3 doesn't wait for I2
2.4 Diverse Execution Units¶
Modern processor execution units example:
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Execution Units (Intel Core) β
ββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ€
β Port 0 β ALU, FP MUL, FP DIV, Branch β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 1 β ALU, FP ADD, LEA β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 2 β Load (Address Gen) β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 3 β Load (Address Gen) β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 4 β Store Data β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 5 β ALU, Vector Shuffle, Branch β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 6 β ALU, Branch β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β Port 7 β Store (Address Gen) β
ββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ
Total: 8 ports, max 8 micro-ops per cycle
3. Need for Out-of-Order Execution¶
3.1 Limitations of In-Order Execution¶
Program:
I1: LD R1, 0(R10) ; Cache miss - 100 cycles
I2: ADD R2, R1, R3 ; Depends on I1
I3: LD R4, 8(R10) ; Independent - cache hit 4 cycles
I4: MUL R5, R4, R6 ; Depends on I3
I5: ADD R7, R8, R9 ; Completely independent
In-Order Execution:
Cycle 1-100: I1 executing (cache miss wait)
Cycle 101: I2 executes
Cycle 102-105: I3 executes
Cycle 106: I4 executes
Cycle 107: I5 executes
Total: 107 cycles
Out-of-Order Execution:
Cycle 1: I1 starts (cache miss)
Cycle 2-5: I3 executes (parallel with I1)
Cycle 6: I4 executes
Cycle 7: I5 executes
...
Cycle 100: I1 completes
Cycle 101: I2 executes
Total: 101 cycles
Speedup: 107/101 = 1.06x (simple example)
Actual difference is much larger
3.2 Three Stages of OoO Execution¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Out-of-Order Execution Pipeline β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β In-Order Out-of-Order In-Order β
β Front-end Execution Back-end β
β β
β βββββββββββ βββββββββββββββ βββββββββββββββ β
β β Fetch β β Issue/ β β Commit/ β β
β β Decode ββββββ Execute ββββββ Retire β β
β β Rename β β (OoO) β β (In-Order) β β
β βββββββββββ βββββββββββββββ βββββββββββββββ β
β β
β Program order Data flow order Program order β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Stage 1: Front-end (In-Order)
- Instruction fetch, decode
- Register renaming
- Insert instructions into Issue Queue
Stage 2: Execution (Out-of-Order)
- Execute instructions when operands ready
- Execution order determined by data dependencies
- Parallel execution across multiple units
Stage 3: Back-end (In-Order)
- Commit results in program order
- Ensures precise exception handling
- Update architectural registers
4. Register Renaming¶
4.1 Need for Renaming¶
Program:
I1: ADD R1, R2, R3 ; Write R1
I2: MUL R4, R1, R5 ; Read R1 (RAW)
I3: ADD R1, R6, R7 ; Write R1 (WAW with I1)
I4: SUB R8, R1, R9 ; Read R1 (RAW with I3)
Problem:
- I3 should be able to execute independently of I1, I2
- But they share R1, causing WAW dependence
- If I3 completes before I1, I2 reads wrong value
4.2 Renaming Operation¶
Architectural Registers: R1-R8 (programmer-visible)
Physical Registers: P1-P64 (actual hardware registers)
Before renaming:
I1: ADD R1, R2, R3
I2: MUL R4, R1, R5
I3: ADD R1, R6, R7
I4: SUB R8, R1, R9
After renaming:
I1: ADD P10, P2, P3 ; R1 β P10
I2: MUL P11, P10, P5 ; R4 β P11, R1 β P10
I3: ADD P12, P6, P7 ; R1 β P12 (new physical register!)
I4: SUB P13, P12, P9 ; R8 β P13, R1 β P12
Result:
- WAW dependence between I1 and I3 eliminated (different physical regs)
- I2 reads P10 (I1's result)
- I4 reads P12 (I3's result)
- I1 and I3 can now execute in parallel!
4.3 Register Alias Table (RAT)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Register Alias Table (RAT) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Architectural Reg β Physical Reg β Valid β
βββββββββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββββ€
β R0 β P0 β 1 β
β R1 β P12 β 0 (pending) β
β R2 β P2 β 1 β
β R3 β P3 β 1 β
β R4 β P11 β 0 (pending) β
β R5 β P5 β 1 β
β R6 β P6 β 1 β
β R7 β P7 β 1 β
β R8 β P13 β 0 (pending) β
β ... β ... β ... β
βββββββββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββββ
Free List: P14, P15, P16, ... (available physical registers)
4.4 Renaming Algorithm¶
Renaming process (instruction: ADD Rd, Rs1, Rs2):
1. Source register renaming:
- Look up physical registers for Rs1, Rs2 in RAT
2. Destination register renaming:
- Allocate new physical register from Free List
- Update RAT mapping for Rd to new physical register
3. Record dependence info:
- Check if source physical registers are still being produced
- Establish links to producer instructions
Example:
Instruction: ADD R1, R2, R3
Before:
RAT[R1] = P5, RAT[R2] = P2, RAT[R3] = P3
Free List: P10, P11, P12, ...
Renaming:
1. Rs1(R2) β P2, Rs2(R3) β P3
2. Rd(R1) β P10 (new allocation)
3. RAT[R1] = P10
After:
RAT[R1] = P10, RAT[R2] = P2, RAT[R3] = P3
Free List: P11, P12, ...
Renamed instruction: ADD P10, P2, P3
5. Tomasulo Algorithm¶
5.1 Background¶
Algorithm developed by Robert Tomasulo in 1967 for the IBM 360/91 floating-point unit. Forms the basis of modern out-of-order execution processors.
5.2 Key Components¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Tomasulo Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Instruction Queue β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Issue Logic β β
β β (Reservation Station allocation) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββ΄ββββββββββββββββββ β
β βΌ βΌ β
β ββββββββββββββββββββββββ ββββββββββββββββββββββββ β
β β Reservation Stations β β Reservation Stationsβ β
β β (Add/Sub) β β (Mul/Div) β β
β ββββββββββββββββββββββββ€ ββββββββββββββββββββββββ€ β
β β RS1: Op Vj Vk Qj Qk β β RS4: Op Vj Vk Qj Qk β β
β β RS2: Op Vj Vk Qj Qk β β RS5: Op Vj Vk Qj Qk β β
β β RS3: Op Vj Vk Qj Qk β β RS6: Op Vj Vk Qj Qk β β
β ββββββββββββ¬ββββββββββββ ββββββββββββ¬ββββββββββββ β
β β β β
β βΌ βΌ β
β ββββββββββββββββββββββββ ββββββββββββββββββββββββ β
β β FP Adder β β FP Multiplier β β
β β (2 cycles) β β (10 cycles) β β
β ββββββββββββ¬ββββββββββββ ββββββββββββ¬ββββββββββββ β
β β β β
β βββββββββββββββββββ¬βββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Common Data Bus (CDB) β β
β β (Result broadcast) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Register File β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
5.3 Reservation Station Structure¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Reservation Station Entry β
ββββββββ¬βββββββ¬βββββββ¬βββββββ¬βββββββ¬βββββββ¬βββββββ¬ββββββββββββ€
β Busy β Op β Vj β Vk β Qj β Qk β A β Dest β
ββββββββΌβββββββΌβββββββΌβββββββΌβββββββΌβββββββΌβββββββΌββββββββββββ€
β 1 β ADD β 3.5 β - β - β RS4 β - β F2 β
ββββββββ΄βββββββ΄βββββββ΄βββββββ΄βββββββ΄βββββββ΄βββββββ΄ββββββββββββ
Field descriptions:
- Busy: Entry in use
- Op: Operation to perform
- Vj, Vk: Source operand values (if already available)
- Qj, Qk: RS that will produce value (if not ready yet)
- A: Memory address (for Load/Store)
- Dest: Destination register for result
5.4 Operation Process¶
Three-stage processing:
1. Issue
- Get instruction from Instruction Queue
- Allocate appropriate Reservation Station
- Record operand values or producer RS tags
2. Execute
- Start execution when all operands ready
- Execute when Qj = 0 AND Qk = 0
- Perform operation in execution unit
3. Write Result
- Broadcast result via CDB
- Waiting RSs receive result
- Update Register File
5.5 Example: Tomasulo Execution Trace¶
Program:
I1: LD F6, 34(R2)
I2: LD F2, 45(R3)
I3: MUL F0, F2, F4
I4: SUB F8, F6, F2
I5: DIV F10, F0, F6
I6: ADD F6, F8, F2
Initial state:
F4 = 2.5 (available)
Load: 2 cycles, Mul: 10 cycles, Add/Sub: 2 cycles, Div: 40 cycles
=== Cycle 1 ===
Issue I1: LD F6, 34(R2)
Load1: Busy=1, A=34+R2, Dest=F6
Register[F6]: Qi=Load1
=== Cycle 2 ===
Issue I2: LD F2, 45(R3)
Load2: Busy=1, A=45+R3, Dest=F2
Register[F2]: Qi=Load2
Execute I1: Memory access starts
=== Cycle 3 ===
Issue I3: MUL F0, F2, F4
Mult1: Busy=1, Op=MUL, Vk=2.5, Qj=Load2, Dest=F0
Register[F0]: Qi=Mult1
Execute I2: Memory access starts
Write I1: Broadcast F6 value on CDB
Load1: Busy=0
Register[F6]: Qi=0, Value=M[34+R2]
=== Cycle 4 ===
Issue I4: SUB F8, F6, F2
Add1: Busy=1, Op=SUB, Vj=M[34+R2], Qk=Load2, Dest=F8
Write I2: Broadcast F2 value on CDB
Mult1: Vj=M[45+R3], Qj=0 (value received)
Add1: Vk=M[45+R3], Qk=0 (value received)
=== Cycle 5 ===
Issue I5: DIV F10, F0, F6
Mult2: Busy=1, Op=DIV, Vk=M[34+R2], Qj=Mult1, Dest=F10
Execute I3: MUL starts (Vj, Vk ready)
Execute I4: SUB starts (Vj, Vk ready)
=== Cycle 6 ===
Issue I6: ADD F6, F8, F2
Add2: Busy=1, Op=ADD, Vk=M[45+R3], Qj=Add1, Dest=F6
Register[F6]: Qi=Add2
=== Cycle 7 ===
Write I4: Broadcast F8 value on CDB
Add2: Vj=(F6-F2), Qj=0 (value received)
=== Cycle 8 ===
Execute I6: ADD starts
... (continues)
=== Cycle 15 ===
Write I3: MUL complete (cycle 5+10)
Mult2: Vj=(F2*F4), Qj=0 (value received)
=== Cycle 16 ===
Execute I5: DIV starts
=== Cycle 56 ===
Write I5: DIV complete (cycle 16+40)
6. Reorder Buffer (ROB)¶
6.1 Need for ROB¶
Problem: Precise exception handling impossible with OoO execution
Example:
I1: LD R1, 0(R2) ; May cause page fault
I2: ADD R3, R4, R5 ; Completes before I1
If I2 completes first and I1 page faults:
- I2's result already written to R3
- State inconsistent after exception handling and restart
Solution: Reorder Buffer
- Temporarily store results
- Commit in program order
- Discard uncommitted results on exception
6.2 ROB Structure¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Reorder Buffer β
βββββββ¬ββββββββββ¬βββββββββββ¬ββββββββββ¬βββββββββ¬βββββββββββββββ€
βEntryβ Busy β State β Dest β Value β Instruction β
βββββββΌββββββββββΌβββββββββββΌββββββββββΌβββββββββΌβββββββββββββββ€
β 1 β 1 β Commit β F6 β 10.5 β LD F6,34(R2) β
β 2 β 1 β Commit β F2 β 5.0 β LD F2,45(R3) β
β 3 β 1 β Execute β F0 β - β MUL F0,F2,F4 β
β 4 β 1 β Write β F8 β 5.5 β SUB F8,F6,F2 β
β 5 β 1 β Issue β F10 β - β DIV F10,F0,F6β
β 6 β 1 β Issue β F6 β - β ADD F6,F8,F2 β
β 7 β 0 β - β - β - β - β
β 8 β 0 β - β - β - β - β
βββββββ΄ββββββββββ΄βββββββββββ΄ββββββββββ΄βββββββββ΄βββββββββββββββ
β β
Head Tail
(Commit) (Issue)
State:
- Issue: Issued, waiting to execute
- Execute: Executing
- Write: Execution complete, result recorded
- Commit: Ready to commit
6.3 Integrating ROB with Tomasulo¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Modern Out-of-Order Processor β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Instruction Fetch/Decode β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Register Rename β β
β β (RAT + Physical Register File) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Reorder Buffer (ROB) Allocation β β
β β (Entry allocation, order tracking) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Issue Queue / Reservation Stations β β
β β (Dependence wait, issue control) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββββ ββββββββββββββ ββββββββββββββ β
β β Execution β β Execution β β Memory β β
β β Unit 1 β β Unit 2 β β Unit β β
β βββββββ¬βββββββ βββββββ¬βββββββ βββββββ¬βββββββ β
β β β β β
β βββββββββββββββββββββΌββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Write Back to ROB β β
β β (Record results in ROB) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Commit (In-Order) β β
β β (Update architectural registers from ROB Head) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
6.4 Commit Process¶
Commit rules:
1. Only ROB Head instruction can commit
2. Instruction must be in complete state
3. All previous instructions must be committed
4. No exception present
On exception:
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β I1 β I2 β I3 β I4 β I5 β I6 β β
β Done β Done β Done β Exc! β Done β Done β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
Head
Exception at I4:
1. Commit I1, I2, I3 complete
2. Exception detected at I4
3. Discard I5, I6 results (not committed yet)
4. Jump to exception handler at I4's address
5. Precise exception state maintained
6.5 Branch Misprediction Recovery¶
Recovery on branch misprediction:
ROB state:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β I1 β I2 β BR β I4 β I5 β I6 β β
β Done β Done βMis-Pβ Done β Done β Done β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β
Head branch
Misprediction confirmed at BR:
1. Commit I1, I2
2. Detect misprediction at BR
3. Flush I4, I5, I6 (speculative execution results)
4. Restore RAT to BR checkpoint (checkpoint or ROB reverse walk)
5. Restart fetch at correct branch target
Recovery mechanisms:
- Checkpoint: Save RAT snapshot at each branch
- Gradual recovery: Walk ROB backwards to restore RAT
7. Modern Processor Implementations¶
7.1 Intel Core Architecture (Skylake)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Intel Skylake Microarchitecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Front-end (In-Order) β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - Branch prediction: 32KB direction predictor, ~4K BTB β β
β β - L1 I-Cache: 32KB, 8-way β β
β β - Decode: 4-wide (complex instr β micro-ops) β β
β β - Micro-op Cache: ~1.5K micro-ops β β
β β - Allocation: 4 micro-ops/cycle β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Out-of-Order Engine β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - ROB: 224 entries β β
β β - Scheduler: 97 entries β β
β β - Physical Registers: 180 integer + 168 vector β β
β β - Load Buffer: 72 entries β β
β β - Store Buffer: 56 entries β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Execution Units (8 Ports) β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Port 0: ALU, FMA, FP Div/Sqrt, Branch β β
β β Port 1: ALU, FMA, AES β β
β β Port 2: Load AGU β β
β β Port 3: Load AGU β β
β β Port 4: Store Data β β
β β Port 5: ALU, Shuffle, Branch β β
β β Port 6: ALU, Branch β β
β β Port 7: Store AGU β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Memory Subsystem β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - L1 D-Cache: 32KB, 8-way, 4 cycles β β
β β - L2 Cache: 256KB, 4-way, 12 cycles β β
β β - L3 Cache: 2MB/core, 16-way, ~40 cycles β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Performance Metrics β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - Theoretical max: 6 micro-ops execution/cycle β β
β β - Actual IPC: 2-4 depending on workload β β
β β - Pipeline depth: 14-19 stages β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7.2 ARM Cortex-A77 Architecture¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ARM Cortex-A77 Microarchitecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Front-end β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - Fetch: 4 instructions/cycle β β
β β - Decode: 4-wide β β
β β - Macro-op Cache: 1.5K entries β β
β β - Branch prediction: TAGE-based β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Out-of-Order Engine β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - ROB: 160 entries β β
β β - Dispatch: 6 micro-ops/cycle β β
β β - Issue Queue: 120 entries β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Execution Units (10 pipelines) β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β - 2x Branch β β
β β - 3x Integer ALU β β
β β - 2x Integer Multi-Cycle β β
β β - 2x FP/NEON β β
β β - 2x Load + 1x Store β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7.3 Performance Comparison¶
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Modern High-Performance Processor Comparison (2024) β
βββββββββββββββββ¬βββββββββββββββββ¬βββββββββββββββββ¬βββββββββββββββββ€
β Feature β Intel Golden β AMD Zen 4 β Apple M2 P-coreβ
β β Cove β β β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β Decode Width β 6-wide β 4-wide β 8-wide β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β ROB Size β 512 entry β 320 entry β ~600 entry β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β Issue Width β 12 ports β 10 ports β ~13 ports β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β L1 I-Cache β 32KB β 32KB β 192KB β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β L1 D-Cache β 48KB β 32KB β 128KB β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β L2 Cache β 1.25MB β 1MB β 16MB β
βββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββ€
β Actual IPC β ~3-4 β ~3-4 β ~4-5 β
βββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββ
7.4 Limitations of ILP¶
Limiting factors for ILP exploitation:
1. True Data Dependence (RAW)
- Cannot be resolved by renaming
- Long dependency chains determine performance
2. Control Dependence
- Branch prediction accuracy limits (~95-97%)
- Pipeline flush on misprediction
3. Memory Dependence
- Load-Store dependence detection difficulty
- Constraints on memory instruction reordering
4. Window Size Limits
- ROB, Issue Queue size constraints
- Cannot exploit ILP far apart
Actual program ILP:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Program Type β Avg ILP β Limiting Factor β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Integer (SPEC INT) β 1.5-2.5 β Dependency chainsβ
β FP (SPEC FP) β 2.0-4.0 β Memory bandwidth β
β Media processing β 3.0-6.0 β SIMD utilization β
β Database β 1.0-2.0 β Branch predictionβ
β Web browser β 1.5-2.5 β Control flow β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
8. Practice Problems¶
Basic Problems¶
-
Classify the data dependence types:
assembly I1: ADD R1, R2, R3 I2: SUB R4, R1, R5 I3: MUL R1, R6, R7 I4: DIV R8, R1, R9 -
What is the theoretical maximum IPC in a 3-way superscalar processor?
-
Which dependence types can register renaming resolve?
- (a) RAW
- (b) WAR
- (c) WAW
- (d) Control
Intermediate Problems¶
-
Apply register renaming to this program:
assembly I1: ADD R1, R2, R3 I2: MUL R4, R1, R5 I3: ADD R1, R6, R7 I4: SUB R8, R1, R4(Use physical registers starting from P10) -
List 3 reasons why ROB is needed.
-
What is the role of CDB (Common Data Bus) in Tomasulo algorithm?
Advanced Problems¶
-
Trace Reservation Station states for this Tomasulo execution:
assembly I1: LD F2, 0(R1) ; 3 cycles I2: MUL F4, F2, F0 ; 5 cycles I3: ADD F6, F4, F2 ; 2 cycles(Initial: F0 = 2.0, RS: Load1, Mult1, Add1, trace cycles 1-10) -
Explain ROB-based recovery process on branch misprediction.
-
Why does Intel Skylake have 224 ROB entries? What are the effects of making it larger or smaller?
Answers
1. Dependence classification: - I1βI2: RAW (R1) - I1βI3: WAW (R1) - I2βI3: WAR (R1 - I2 reads, I3 writes) - I3βI4: RAW (R1) 2. Theoretical max IPC of 3-way superscalar = 3 3. (b) WAR, (c) WAW - RAW is true dependence, cannot be resolved - Control resolved by branch prediction 4. Register renaming: ```assembly I1: ADD P10, P2, P3 ; R1 β P10 I2: MUL P11, P10, P5 ; R4 β P11 I3: ADD P12, P6, P7 ; R1 β P12 (new register) I4: SUB P13, P12, P11 ; R8 β P13 ``` I1 and I3 can now execute in parallel 5. ROB needed for: - Precise exception handling (in-order commit) - Branch misprediction recovery - Precise interrupt handling 6. CDB role: - Broadcast completed results to all RSs - Waiting instructions receive operands - Update register file 7. Tomasulo execution trace: ``` Cycle 1: Issue I1 β Load1 Cycle 2: Execute I1, Issue I2 β Mult1 (Qj=Load1) Cycle 3: Execute I1 continues, Issue I3 β Add1 (Qj=Mult1, Qk=Load1) Cycle 4: Write I1, Mult1: Vj update, Add1: Vk update Cycle 5: Execute I2 starts ... Cycle 9: Write I2, Add1: Vj update Cycle 10: Execute I3 starts Cycle 11-12: Execute I3 completes ``` 8. Branch misprediction recovery: - Invalidate results after branch in ROB - Restore RAT to branch checkpoint state - Restart fetch at correct branch target - Pipeline flush 9. ROB size tradeoffs: - Larger: More ILP exploitation, tolerates longer memory latency - Smaller: Reduced area/power, faster recovery - 224 is balance point for memory latency (~100 cycles) and 6-wide issueNext Steps¶
- 14_Memory_Hierarchy.md - Memory System Hierarchy and Locality Principles
References¶
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- Intel Optimization Manual
- WikiChip - Microarchitectures
- Agner Fog's microarchitecture
- Tomasulo, R.M. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" (1967)