Computer Organization CS224 Fall 2012 Lesson 52
Introduction Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability, availability, power efficiency Job-level (process-level) parallelism l High throughput for independent jobs Parallel processing program l Single program run on multiple processors Multicore microprocessors l Chips with multiple processors (cores) §9.1 Introduction
Types of Parallelism Data-Level Parallelism (DLP) Time Thread-Level Parallelism (TLP) Time Instruction-Level Parallelism (ILP) Pipelining Time
Hardware and Software Hardware l Serial: e.g., Pentium 4 l Parallel: e.g., quad-core Xeon e5345 Software l Sequential: e.g., matrix multiplication l Concurrent: e.g., operating system Sequential/concurrent software can run on serial/parallel hardware l Challenge: making effective use of parallel hardware
What We’ve Already Covered §2.11: Parallelism and Instructions l Synchronization §3.6: Parallelism and Computer Arithmetic l Associativity §4.10: Parallelism and Advanced Instruction-Level Parallelism §5.8: Parallelism and Memory Hierarchies l Cache Coherence (actually, we skipped this) §6.9: Parallelism and I/O: l Redundant Arrays of Inexpensive Disks
Parallel Programming Parallel software is the problem Need to get significant performance improvement l Otherwise, just use a faster uniprocessor, since it’s easier! Difficulties l Partitioning l Coordination l Communications overhead §7.2 The Difficulty of Creating Parallel Processing Programs
Amdahl’s Law Sequential part can limit speedup Example: 100 processors, 90× speedup? l T new = T parallelizable /100 + T sequential l l Solving: F parallelizable = Need sequential part to be 0.1% of original time (99.9% needs to be parallelizable) Obviously, less-than-expected speedups are common!
Scaling Example Workload: sum of 10 scalars, and 10 × 10 matrix sum l Speed up from 10 to 100 processors Single processor: Time = ( ) × t add 10 processors l Time = 10 × t add + 100/10 × t add = 20 × t add l Speedup = 110/20 = 5.5 (55% of potential) 100 processors l Time = 10 × t add + 100/100 × t add = 11 × t add l Speedup = 110/11 = 10 (10% of potential) Assumes load can be balanced across processors
Scaling Example (cont) What if matrix size is 100 × 100? Single processor: Time = ( ) × t add 10 processors l Time = 10 × t add /10 × t add = 1010 × t add l Speedup = 10010/1010 = 9.9 (99% of potential) 100 processors l Time = 10 × t add /100 × t add = 110 × t add l Speedup = 10010/110 = 91 (91% of potential) Assuming load is balanced
Strong vs Weak Scaling Strong scaling: problem size fixed l As in previous example Weak scaling: problem size proportional to number of processors l 10 processors, 10 × 10 matrix -Time = 20 × t add l 100 processors, 32 × 32 matrix -Time = 10 × t add /100 × t add = 20 × t add l Constant performance in this example
Shared Memory SMP: shared memory multiprocessor l Hardware provides single physical address space for all processors l Synchronize shared variables using locks l Memory access time -UMA (uniform) vs. NUMA (nonuniform) §7.3 Shared Memory Multiprocessors
Example: Sum Reduction Sum 100,000 numbers on 100 processor UMA l Each processor has ID: 0 ≤ Pn ≤ 99 l Partition 1000 numbers per processor l Initial summation on each processor sum[Pn] = 0; for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1) sum[Pn] = sum[Pn] + A[i]; Now need to add these partial sums l Reduction: divide and conquer l Half the processors add pairs, then quarter, … l Need to synchronize between reduction steps
Example: Sum Reduction half = 100; repeat synch(); if (half%2 != 0 && Pn == 0) sum[0] = sum[0] + sum[half-1]; /* Conditional sum needed when half is odd; Processor0 gets missing element */ half = half/2; /* dividing line on who sums */ if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half]; until (half == 1);