Parallel Processing and Multicore

Overview
As single-processor performance improvements reach physical limits, modern computers are achieving higher performance through multicore and parallel processing. This lesson covers the fundamental concepts of parallel processing, multiprocessor/multicore architectures, cache coherence problems, synchronization mechanisms, and parallel computing using GPUs.
Difficulty: ★★★★
Prerequisites: CPU architecture, cache memory, memory hierarchy
Table of Contents
- The Need for Parallel Processing
- Flynn's Taxonomy
- Multiprocessor and Multicore
- Cache Coherence Problem
- Snooping Protocol (MESI)
- Amdahl's Law and Gustafson's Law
- Synchronization and Locks
- GPU and Parallel Computing
- Practice Problems
1. The Need for Parallel Processing

1.1 Single Core Limitations

[Figure: CPU clock speed evolution, 1995-2020 — clock frequency rose steadily through the 1990s and early 2000s, then plateaued after about 2005.]

Clock speed plateaued after 2005 because of:
- Power Wall: power consumption (and required voltage) grows faster than performance at higher frequencies
- Heat dissipation issues
- Memory Wall: memory speed lags far behind processor speed
- ILP Wall: diminishing returns from extracting more instruction-level parallelism
1.2 Moore's Law and Dennard Scaling

Moore's Law:
- Transistor count doubles roughly every 2 years
- Still holds, though it is slowing down

Dennard Scaling:
- As transistors shrink, power density stays constant
- Broke down around 2006

After the end of Dennard Scaling, transistor counts kept rising while clock speeds plateaued. The question became how to use the extra transistors, and the answer was multicore: instead of one large high-performance core (the pre-2005 design), chips now integrate multiple smaller, power-efficient cores (Core1, Core2, Core3, ...).
1.3 Benefits of Parallel Processing

Performance improvement:
- Process multiple tasks simultaneously
- Divide large problems into smaller parts solved in parallel

Energy efficiency:
- Multiple cores at a lower clock are more efficient than a single core at a high clock
- Dynamic power ∝ Voltage² × Frequency, and lower frequencies permit lower voltages

Reliability:
- Operation can continue on the remaining cores if one core fails (fault tolerance)

Scalability:
- Performance scales with increased core count
- Vertical scaling (add cores) + horizontal scaling (add nodes)
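The power relation above can be made concrete with a small calculation. This is an illustrative sketch only: it assumes supply voltage scales linearly with frequency, which is a simplification, and the function names are invented for this example.

```c
#include <assert.h>
#include <math.h>

/* Dynamic power model: P = C * V^2 * f (C is a constant capacitance factor). */
static double dynamic_power(double c, double voltage, double freq) {
    return c * voltage * voltage * freq;
}

/* Relative power of spreading a workload over `cores` cores, each running at
 * freq/cores (same total throughput), versus one core at full frequency.
 * Illustrative assumption: voltage scales linearly with frequency. */
double multicore_power_ratio(int cores) {
    double single = dynamic_power(1.0, 1.0, 1.0);
    double per_core_freq = 1.0 / cores;
    double per_core_volt = 1.0 / cores;  /* assumed linear V-f scaling */
    double multi = cores * dynamic_power(1.0, per_core_volt, per_core_freq);
    return multi / single;  /* < 1 means the multicore option uses less power */
}
```

Under this (idealized) model, two half-speed cores use a quarter of the power of one full-speed core, which is why "more cores at lower clocks" wins on energy.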
2. Flynn's Taxonomy

2.1 Classification System

Flynn classifies architectures by the number of concurrent instruction streams and data streams:

                          Single Data (SD)    Multiple Data (MD)
    Single Instr. (SI)    SISD                SIMD
    Multiple Instr. (MI)  MISD (rarely used)  MIMD
2.2 SISD (Single Instruction, Single Data)

The traditional von Neumann computer: a single CPU (single core) consumes one instruction stream (I1, I2, I3, ...) operating on one data stream (D1, D2, D3, ...).

Examples: early microprocessors, simple embedded systems
2.3 SIMD (Single Instruction, Multiple Data)

One instruction (e.g., ADD) is applied simultaneously to multiple data elements (D1, D2, D3, ...) by an array of processing units (PU1, PU2, PU3, ...), producing multiple results (R1, R2, R3, ...) at once.

Examples:
- Intel SSE (128-bit), AVX (256-bit), AVX-512 (512-bit) vectors
- GPU warps/wavefronts
- Image processing, scientific computing

Code example (AVX):

    __m256 a = _mm256_load_ps(arr1);
    __m256 b = _mm256_load_ps(arr2);
    __m256 c = _mm256_add_ps(a, b);  // 8 floats added at once
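For readers without AVX intrinsics at hand, the same 8-wide add can be sketched portably; a vectorizing compiler can turn this loop into a single `vaddps` instruction. The function name `vec8_add` is invented for this sketch.

```c
#include <assert.h>

#define LANES 8  /* one 256-bit AVX register holds 8 single-precision floats */

/* Portable sketch of the 8-wide SIMD add shown above: the same operation
 * applied to every lane, with no cross-lane dependencies. */
void vec8_add(const float *a, const float *b, float *c) {
    for (int i = 0; i < LANES; i++)
        c[i] = a[i] + b[i];
}
```

The key SIMD property is visible in the loop body: every lane executes the identical operation on different data.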
2.4 MISD (Multiple Instruction, Single Data)

Multiple instruction streams (e.g., I1: Encrypt, I2: Compress, I3: Checksum) process the same data stream through a pipeline of units.

Real-world usage:
- Some systolic arrays
- Fault-tolerant systems (the same computation performed multiple times and compared)
- Very rarely used in practice
2.5 MIMD (Multiple Instruction, Multiple Data)

The most common form of parallel computer: multiple processors each execute their own instruction stream (func_A(), func_B(), func_C(), func_D()) on their own data items (D1, D5, D2, D8, ...).

Examples:
- Multicore processors
- Multiprocessor servers
- Clusters, supercomputers

MIMD classification by memory organization:
- Shared memory: SMP, NUMA
- Distributed memory: clusters
3. Multiprocessor and Multicore

3.1 SMP (Symmetric Multi-Processing)

Several identical CPUs (e.g., CPU 0 through CPU 3), each with its own cache, share a single system bus and a single main memory that every CPU can access equally.

Characteristics:
- All processors have equal memory access latency (UMA: Uniform Memory Access)
- Any CPU can run the same code
- Limited scalability due to bus contention
- Typically 2-8 CPUs
3.2 NUMA (Non-Uniform Memory Access)

CPUs are grouped into nodes, each with its own local memory; nodes are connected by an interconnect. A CPU reaches its own node's memory quickly, and other nodes' memory more slowly.

Memory access time (illustrative):
- Local memory (same node): ~100 cycles
- Remote memory (other node): ~300 cycles (about 3x slower)

Characteristics:
- Local memory access is faster than remote access
- Excellent scalability (hundreds of CPUs possible)
- Requires NUMA-aware programming (keep data near the CPUs that use it)
- The standard architecture for modern multi-socket servers
3.3 Multicore Processor

A modern multicore CPU integrates everything on one die: each core has private L1 instruction/data caches and a private L2 cache; all cores share a large L3 cache (8-64 MB) and an on-die memory controller plus PCIe controller.

Advantages:
- Fast inter-core communication (on-chip)
- Power efficient
- Efficient data sharing through the shared L3 cache
3.4 Hyper-Threading (SMT)

Simultaneous Multi-Threading (SMT) executes multiple threads on a single physical core. The per-thread architectural state (registers, PC, stack pointer) is duplicated, while the execution resources (ALUs, caches, branch predictor) are shared.

Operation:
- When Thread 0 stalls waiting for memory, Thread 1 executes
- Improves execution-unit utilization
- Typically a 15-30% performance improvement

OS view:
- A 4-core/8-thread CPU appears as 8 logical CPUs
4. Cache Coherence Problem

4.1 What is Cache Coherence?

Problem scenario:

1. Initial state: Core 0's cache, Core 1's cache, and main memory all hold X = 10.
2. Core 0 modifies X to 20 in its own cache (Modified). Core 1's cache still holds X = 10, which is now stale.
3. What value does Core 1 read for X?
   - 10 (from its own cache) → a stale value: coherence violation!
   - 20 (Core 0's value) → coherence maintained
4.2 Coherence Definition

Cache coherence requires three conditions:

1. Program order preservation:
   - A read by a processor of an address it previously wrote (with no intervening write by another processor) returns the written value
2. Write propagation:
   - Writes by one processor eventually become visible to all processors
3. Write serialization:
   - Writes to the same address appear in the same order to all processors

Example:

    Initial: X = 0
    Core 0: X = 1
    Core 1: X = 2

All processors must observe the values of X in the same order: either everyone sees 0 → 1 → 2, or everyone sees 0 → 2 → 1. Some processors seeing 1 → 2 while others see 2 → 1 violates write serialization.
4.3 Coherence Protocol Overview

Two families of coherence protocols:

1. Snooping protocols:
- Suited to bus-based systems
- Each cache controller monitors ("snoops") all traffic on the shared bus
- Takes appropriate action when it sees a transaction for an address it caches
- Examples: MESI, MOESI

2. Directory protocols:
- Better scalability (suited to NUMA)
- A central directory tracks which caches hold each block and in what state
- Messages are sent only to the caches that actually hold the block
5. Snooping Protocol (MESI)

5.1 MESI States

M (Modified):
- This cache holds the only copy
- The line has been modified (inconsistent with memory)
- Must be written back when another cache accesses the line
- Writes proceed without further bus traffic

E (Exclusive):
- This cache holds the only copy
- Not modified (consistent with memory)
- A write transitions to M with no invalidation broadcast needed

S (Shared):
- Multiple caches may hold copies
- Consistent with memory
- A write must invalidate the other caches' copies, then transition to M

I (Invalid):
- The line holds no valid data (cache line empty)
- Any access is a cache miss
5.2 MESI State Transitions

Key transitions:
- I → E: read miss when no other cache holds the line
- I → S: read miss when the line already exists in another cache (Shared)
- I → M: write miss (other copies are invalidated)
- E → M: local write (no invalidation broadcast needed)
- S → M: local write plus invalidation of the other caches' copies
- E → S, M → S: another cache reads the line (M writes back first)
- M/E/S → I: invalidated by another cache's write
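The transitions above can be sketched as a small state machine, seen from one cache's point of view. This is a minimal illustrative model, not a real controller: the event names (`LOCAL_READ_MISS_EXCLUSIVE`, etc.) are invented for this sketch.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum {
    LOCAL_READ_MISS_EXCLUSIVE, /* read miss, no other cache has the line */
    LOCAL_READ_MISS_SHARED,    /* read miss, another cache has the line  */
    LOCAL_WRITE,               /* this core writes the line              */
    REMOTE_READ,               /* another cache reads our line           */
    REMOTE_WRITE               /* another cache writes the line          */
} event_t;

/* Next state of one cache line, as tracked by one cache. */
mesi_t mesi_next(mesi_t s, event_t e) {
    switch (e) {
    case LOCAL_READ_MISS_EXCLUSIVE: return s == INVALID ? EXCLUSIVE : s;
    case LOCAL_READ_MISS_SHARED:    return s == INVALID ? SHARED : s;
    case LOCAL_WRITE:               return MODIFIED;  /* S->M also invalidates others */
    case REMOTE_READ:               return s == INVALID ? INVALID : SHARED; /* M->S writes back */
    case REMOTE_WRITE:              return INVALID;
    }
    return s;
}
```

Tracing the Section 5.3 example through `mesi_next` reproduces the E → S → (M on write) → S sequence.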
5.3 MESI Operation Example

Initial state: variable X is in no cache; memory holds X = 0.

Step 1: Core 0 reads X

    Core 0 cache: X=0 (E)    Core 1 cache: - (I)    Memory: X=0
    Exclusive, because no other cache holds the line.

Step 2: Core 1 reads X

    Core 0 cache: X=0 (S)    Core 1 cache: X=0 (S)    Memory: X=0
    Core 0 transitions E→S because another cache read the line.

Step 3: Core 0 writes X = 10

    Core 0 broadcasts an invalidation message; Core 1 marks its copy Invalid.
    Core 0 cache: X=10 (M)    Core 1 cache: - (I)    Memory: X=0
    Modified: the cached copy is now inconsistent with memory.

Step 4: Core 1 reads X

    Core 1 misses; Core 0 supplies the data and writes it back to memory.
    Core 0 cache: X=10 (S)    Core 1 cache: X=10 (S)    Memory: X=10
    Core 0 transitions M→S and memory is updated.
5.4 MOESI Protocol

MOESI = MESI + an Owned state:

O (Owned):
- This cache owns a modified line and is responsible for supplying it
- Other caches may hold Shared copies
- The line is inconsistent with memory (the owner has the latest data)
- The owner responds to other caches' requests for the line

MOESI advantages:
- More efficient cache-to-cache transfers (dirty data can be shared without first writing it to memory)
- Write-back to memory can be deferred
- Used notably in AMD processors
6. Amdahl's Law and Gustafson's Law

6.1 Amdahl's Law

Amdahl's Law gives the limit on speedup from parallelization:

    Speedup = 1 / ((1 - P) + P/N)

    P: parallelizable fraction of the program
    N: number of processors

Example with P = 90%:

    N=2: Speedup = 1/(0.1 + 0.45)   = 1.82x
    N=4: Speedup = 1/(0.1 + 0.225)  = 3.08x
    N=8: Speedup = 1/(0.1 + 0.1125) = 4.71x
    N=∞: Speedup = 1/0.1            = 10x  ← upper limit!

Even with 90% of the program parallelized, the maximum speedup is 10x: the serial fraction dominates as N grows. Plotting speedup against processor count, each curve saturates at 1/(1-P): 4x for P=75%, 10x for P=90%, 20x for P=95%, and 100x for P=99%.
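The formula above can be checked directly; a one-line helper (the function name is chosen for this sketch) reproduces the table's numbers.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's Law: speedup of a program whose fraction p (0..1) is
 * parallelizable, run on n processors. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For P = 0.9 the speedup approaches but never reaches 1/0.1 = 10, no matter how large n becomes.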
6.2 Gustafson's Law

If the problem size is allowed to grow with the machine, more parallelism becomes available:

    Scaled Speedup = N - S × (N - 1)

    N: number of processors
    S: sequential fraction (of the fixed execution time)

Key idea:

- Amdahl: "Fix the problem size, add processors" → the sequential part becomes the bottleneck
- Gustafson: "Fix the time, scale the problem size" → solve a larger problem in the same time, so the parallel fraction of the work grows

Example:
- A simulation must run in 1 hour
- Adding processors allows a more detailed simulation within that hour
- The sequential part (initialization, etc.) stays constant while the parallel computation volume grows
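Gustafson's formula is equally easy to check; with a 10% sequential fraction, 8 processors give a scaled speedup of 7.3x rather than Amdahl's 4.71x, because the parallel work has grown to fill the time. The function name is chosen for this sketch.

```c
#include <assert.h>
#include <math.h>

/* Gustafson's Law: scaled speedup on n processors when the sequential
 * fraction s (0..1) of the fixed execution time stays constant. */
double gustafson_speedup(double s, int n) {
    return n - s * (n - 1);
}
```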
6.3 Actual Parallelization Efficiency

    Efficiency = Speedup / N

Ideal: efficiency = 1 (100%). In practice it is below 1 because of overhead:

- Communication time (inter-processor data transfer)
- Synchronization wait time
- Load imbalance (unequal work distribution)
- Cache coherence traffic
- Memory contention

Efficiency generally falls as the processor count rises (diminishing returns): near 100% on a handful of processors, it typically drops well below 50% by the time hundreds of processors are in use.
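The diminishing-returns effect follows directly from the definitions; a small sketch combining the efficiency formula with Amdahl's Law (overhead terms ignored, function names invented here) makes it visible.

```c
#include <assert.h>

/* Parallel efficiency: achieved speedup divided by processor count.
 * 1.0 means perfect linear scaling. */
double efficiency(double speedup, int n) {
    return speedup / n;
}

/* Efficiency predicted by Amdahl's Law for parallel fraction p on n cores,
 * ignoring communication/synchronization overhead. */
double amdahl_efficiency(double p, int n) {
    double speedup = 1.0 / ((1.0 - p) + p / n);
    return efficiency(speedup, n);
}
```

Even without any overhead, efficiency decays with core count whenever p < 1; real overheads only steepen the decline.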
7. Synchronization and Locks

7.1 Need for Synchronization

Race condition example with a shared variable counter = 0:

    Thread 0            Thread 1
    load counter        load counter      // both read 0
    add 1               add 1             // both compute 1
    store counter       store counter     // both store 1

Expected result: counter = 2
Actual result: counter = 1 → one increment is lost!
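The lost update disappears once the load-add-store becomes a single indivisible operation. A minimal sketch using C11 `<stdatomic.h>` and POSIX threads (thread and iteration counts are arbitrary illustration values):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define ITERS   100000

static atomic_int counter = 0;

/* Each thread performs ITERS atomic increments; no update can be lost
 * because load-add-store happens as one indivisible operation. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1);
    return NULL;
}

int run_counters(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```

With a plain `int` and `counter++` instead, the final value would be unpredictable and usually less than THREADS × ITERS.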
7.2 Atomic Operations

Hardware-supported atomic operations (pseudocode — each function body executes as one atomic step):

    Test-and-Set (TAS):

        int TAS(int *lock) {
            int old = *lock;   // read
            *lock = 1;         // write
            return old;
        }

    Compare-and-Swap (CAS):

        bool CAS(int *addr, int expected, int new_val) {
            if (*addr == expected) {
                *addr = new_val;
                return true;
            }
            return false;
        }

    Fetch-and-Add (FAA):

        int FAA(int *addr, int val) {
            int old = *addr;
            *addr = old + val;
            return old;
        }

x86 instruction examples:
- LOCK XCHG (Test-and-Set)
- LOCK CMPXCHG (Compare-and-Swap)
- LOCK XADD (Fetch-and-Add)
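In portable C these primitives are exposed through C11 `<stdatomic.h>`; a brief sketch of the correspondence (the wrapper names `tas`/`cas`/`faa` are invented here to match the pseudocode above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-Set via atomic exchange: returns the previous value. */
int tas(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}

/* Compare-and-Swap: succeeds only if *addr still equals expected. */
bool cas(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/* Fetch-and-Add: returns the value held before the addition. */
int faa(atomic_int *addr, int val) {
    return atomic_fetch_add(addr, val);
}
```

On x86, compilers lower these to the LOCK-prefixed instructions listed above.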
7.3 Spinlock

Simple spinlock (Test-and-Set):

    void lock(int *lock) {
        while (TAS(lock) == 1) {
            // spin (busy wait)
        }
    }

    void unlock(int *lock) {
        *lock = 0;
    }

Problem: every TAS attempt generates bus traffic.

Test-and-Test-and-Set (TTAS):

    void lock(int *lock) {
        while (1) {
            while (*lock == 1) {
                // spin on the locally cached copy (no bus traffic)
            }
            if (TAS(lock) == 0) {
                return;  // lock acquired
            }
        }
    }

Improvement: waiters spin in their local cache and attempt TAS only once the lock looks free.
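A working version of the simple TAS spinlock can be built on C11's `atomic_flag`, whose test-and-set is guaranteed lock-free. A minimal sketch (counts and names are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define ITERS   50000

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared = 0;  /* protected by the spinlock */

static void spin_lock(void) {
    /* atomic_flag_test_and_set is the hardware TAS: spin until it returns
     * false, meaning the flag was clear and we have just set it. */
    while (atomic_flag_test_and_set(&lock_flag))
        ;  /* busy wait */
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        spin_lock();
        shared++;          /* critical section */
        spin_unlock();
    }
    return NULL;
}

long run_spinlock_demo(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return shared;
}
```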
7.4 Mutex and Semaphore

Mutex:
- Mutual Exclusion
- Only one thread may hold it at a time
- Has ownership (only the acquiring thread may release it)

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, NULL);

    pthread_mutex_lock(&mutex);
    // critical section
    pthread_mutex_unlock(&mutex);

Semaphore:
- Counter-based synchronization
- Can allow up to N threads concurrent access
- No ownership

    sem_t sem;
    sem_init(&sem, 0, 3);  // at most 3 concurrent accesses

    sem_wait(&sem);  // decrement the counter; block if it is 0
    // critical section
    sem_post(&sem);  // increment the counter; wake a waiter
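The mutex snippet above expands into a complete program as follows; this sketch fixes the Section 7.1 race with a pthread mutex (thread and iteration counts are illustrative).

```c
#include <assert.h>
#include <pthread.h>

#define THREADS 4
#define ITERS   100000

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* The lock/unlock pair makes the read-modify-write of `counter`
 * a critical section, so no increment is lost. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&mutex);
        counter++;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

long run_mutex_demo(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Unlike a spinlock, a contended pthread mutex puts waiters to sleep instead of burning CPU, which suits longer critical sections.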
7.5 Lock-Free Algorithms

Lock-free counter example (C11 atomics):

    void increment(atomic_int *counter) {
        int old_val, new_val;
        do {
            old_val = atomic_load(counter);
            new_val = old_val + 1;
        } while (!atomic_compare_exchange_weak(counter, &old_val, new_val));
    }

Operation:
1. Read the current value
2. Compute the new value
3. Attempt an atomic update with CAS
4. If it fails (another thread modified the value), retry

Advantages:
- No waiting on locks
- Deadlock is impossible
- No priority inversion

Disadvantages:
- Complex to implement correctly
- Must guard against the ABA problem
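The CAS retry loop above can be exercised end-to-end; a minimal sketch running it from several threads (counts and harness names are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define ITERS   100000

static atomic_int counter = 0;

/* Lock-free increment: retry the CAS until no other thread has changed
 * the value between our load and our update attempt. */
static void increment(atomic_int *c) {
    int old_val, new_val;
    do {
        old_val = atomic_load(c);
        new_val = old_val + 1;
    } while (!atomic_compare_exchange_weak(c, &old_val, new_val));
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        increment(&counter);
    return NULL;
}

int run_lockfree_demo(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```

No thread ever blocks: a failed CAS simply means some other thread made progress, which is the defining property of a lock-free algorithm.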
8. GPU and Parallel Computing

8.1 GPU Architecture

CPU (latency-optimized):
- A few (4-16) complex cores with out-of-order execution (OoO) and branch prediction (BP)
- Large caches
- Goal: minimize the latency of a single thread

GPU (throughput-optimized):
- Thousands of simple cores (CUDA cores) grouped into Streaming Multiprocessors (SMs)
- Small caches per SM
- Goal: maximize aggregate throughput across many threads

SM (Streaming Multiprocessor) structure (representative figures):
- 32 CUDA cores execute one warp (32 threads); every core runs the same instruction (SIMT)
- Shared memory (48 KB)
- Register file (65,536 × 32-bit registers)
- Warp scheduler
8.2 CUDA Programming Model

Execution hierarchy:
- Grid: the whole kernel launch, an array of blocks, e.g. Block (0,0) ... Block (3,1)
- Block: a group of threads (e.g., 256 threads T0...T255) that share shared memory and synchronize with __syncthreads()
- Thread: the basic unit of execution

Memory hierarchy:
- Global memory: accessible to all threads, slow (~500 cycles)
- Shared memory: shared within a block, fast (~5 cycles)
- Registers: thread-private, fastest
- Constant memory: read-only, cached
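Each thread maps its position in this hierarchy to a global index as blockIdx.x × blockDim.x + threadIdx.x, and the launch rounds the block count up so all N elements are covered. A host-side C sketch of both calculations (function names invented here):

```c
#include <assert.h>

/* Host-side model of CUDA thread indexing: thread threadIdx in block
 * blockIdx, with blockDim threads per block, gets this global index. */
int global_index(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

/* Rounding-up division used to size the grid so that N elements are
 * covered even when N is not a multiple of the block size. */
int blocks_per_grid(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

The `if (idx < N)` guard in the kernel exists precisely because this rounding-up can create a partial final block whose extra threads must do nothing.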
8.3 CUDA Code Example¶
// Vector addition CUDA kernel
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
// Calculate global thread index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
C[idx] = A[idx] + B[idx];
}
}
int main() {
int N = 1000000;
size_t size = N * sizeof(float);
// Allocate host memory
float *h_A = (float*)malloc(size);
float *h_B = (float*)malloc(size);
float *h_C = (float*)malloc(size);
// Allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
// Copy data (Host β Device)
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Kernel execution configuration
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
// Execute kernel
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result (Device -> Host)
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);
return 0;
}
8.4 GPU vs CPU Use Cases¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β GPU vs CPU Suitable Tasks β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β GPU suitable (Data parallel): β
β - Matrix operations β
β - Image/video processing β
β - Deep learning training/inference β
β - Physics simulation β
β - Cryptocurrency mining β
β - Scientific computing (CFD, molecular dynamics) β
β β
β CPU suitable (Task parallel, complex control flow): β
β - Operating systems β
β - Databases β
β - Web servers β
β - Compilers β
β - General applications β
β β
β Performance comparison (approximate): β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Task β CPU β GPU β Ratio β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β
β  β Matrix mult (4K×4K)β 10s  β 0.1s  β 100x  β             β
β β Image filter β 2s β 0.05s β 40x β β
β β Neural net trainingβ 100s β 2s β 50x β β
β β Sorting algorithm β 5s β 4s β 1.25x β β
β β Branch-heavy code β 1s β 10s β 0.1x β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
9. Practice Problems¶
Basic Problems¶
1. Explain the four classifications of Flynn's Taxonomy.
2. What is the difference between SMP and NUMA?
3. Explain the four states of the MESI protocol.
Intermediate Problems¶
4. When 80% of a program is parallelizable, what is the maximum performance improvement on an 8-core system according to Amdahl's Law?
5. Explain the race condition that can occur in the following code and provide a solution:

   int counter = 0;
   void increment() { counter++; }

6. Explain the difference between Test-and-Set and Compare-and-Swap.
Advanced Problems¶
7. Track the state transitions in the MESI protocol for the following scenario:
   - Core 0 reads X (from memory)
   - Core 1 reads X
   - Core 0 writes X
   - Core 1 reads X
8. Explain why GPUs are faster than CPUs for matrix multiplication.
9. Explain the advantages and disadvantages of lock-free algorithms and the ABA problem.
Answers

1. Flynn's Taxonomy:
   - SISD: Single Instruction, Single Data (traditional CPU)
   - SIMD: Single Instruction, Multiple Data (vector operations, GPU)
   - MISD: Multiple Instruction, Single Data (rarely used)
   - MIMD: Multiple Instruction, Multiple Data (multicore)
2. SMP vs NUMA:
   - SMP: all CPUs have uniform memory access (UMA)
   - NUMA: local memory access is faster than remote; better scalability
3. MESI states:
   - Modified: modified, sole copy
   - Exclusive: not modified, sole copy
   - Shared: not modified, multiple copies possible
   - Invalid: invalid
4. Amdahl's Law calculation: Speedup = 1 / (0.2 + 0.8/8) = 1 / (0.2 + 0.1) = 1/0.3 ≈ 3.33x
5. Race condition solution:
   - Problem: counter++ is not atomic (read-modify-write)
   - Solution: protect it with a mutex, or use an atomic operation (atomic_fetch_add)
6. TAS vs CAS:
   - TAS: always sets the value to 1 and returns the previous value
   - CAS: sets the new value only if the current value matches the expected value
7. MESI state transitions:
   - Core 0 read: Core0=E, Core1=I
   - Core 1 read: Core0=S, Core1=S
   - Core 0 write: Core0=M, Core1=I (invalidated)
   - Core 1 read: Core0=S, Core1=S (Core0 provides the data)
8. Why GPUs are faster for matrix multiplication:
   - High data parallelism (each element computed independently)
   - Thousands of cores executing simultaneously
   - High memory bandwidth
   - Matrix multiplication maps well onto the GPU architecture
9. Lock-free algorithms:
   - Advantages: no waiting on locks, deadlock impossible
   - Disadvantages: complex implementation, difficult debugging
   - ABA problem: a value changes A → B → A, yet CAS still succeeds
   - Solutions: tag/version counters, hazard pointers

References¶
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- The Art of Multiprocessor Programming (Herlihy & Shavit)
- NVIDIA CUDA Programming Guide
- Intel Optimization Manual
- Memory Consistency Models