CPS 258 Announcements
http://www.cs.duke.edu/~nikos/cps258
– Lecture calendar with slides
– Pointers to related material
Parallel Architectures (continued)
Parallelism Levels
– Job
– Program
– Instruction
– Bit
Parallel Architectures
– Pipelining
– Multiple execution units
  – Superscalar
  – VLIW
– Multiple processors
Pipelining Example

for i = 1:n
  z(i) = x(i) + y(i);
end

              Load Unit      ALU                        Store
Prologue      load x(i)
              load y(i)
Loop body     load x(i+1)    add z(i),x(i),y(i)
              load y(i+1)                               store z(i)
Epilogue                     add z(i+1),x(i+1),y(i+1)
                                                        store z(i+1)
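In source form the same idea looks roughly like the C sketch below (the names are ours; a real compiler applies this at the instruction level rather than in source code):

    /* Software-pipelined version of z(i) = x(i) + y(i): the loads for
       iteration i+1 are issued while the add and store for iteration i
       complete, matching the schedule above. */
    void vadd(double *z, const double *x, const double *y, int n) {
        if (n <= 0) return;
        double xi = x[0], yi = y[0];      /* prologue: first loads      */
        for (int i = 0; i < n - 1; i++) {
            double zi = xi + yi;          /* add for iteration i        */
            xi = x[i + 1];                /* loads for iteration i+1    */
            yi = y[i + 1];
            z[i] = zi;                    /* store for iteration i      */
        }
        z[n - 1] = xi + yi;               /* epilogue: last add + store */
    }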
Generic Computer
– CPU and Memory connected by a Bus
Memory Organization
– Distributed memory
– Shared memory
Shared Memory
Distributed Memory
Interleaved Memory
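Interleaving spreads consecutive words across banks so that a sequential sweep touches the banks round-robin and their access latencies overlap. A minimal sketch of the standard low-order scheme (the constants are ours, purely illustrative):

    #include <stdint.h>

    /* Low-order interleaving of 8-byte words across 8 banks:
       word w lives in bank w mod 8, at offset w / 8 inside the bank. */
    #define NBANKS 8

    static unsigned  bank_of(uintptr_t addr)        { return (addr / 8) % NBANKS; }
    static uintptr_t offset_in_bank(uintptr_t addr) { return (addr / 8) / NBANKS; }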
Network Topologies
– Ring
– Torus
– Tree
– Star
– Hypercube
– Cross-bar
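The hypercube is the easiest of these to compute with: in a d-dimensional hypercube each of the 2^d nodes gets a d-bit label and is linked to exactly the d nodes whose labels differ from its own in one bit. A small sketch (the function name is ours):

    #include <stdio.h>

    /* Print the d neighbors of `node` in a d-dimensional hypercube:
       each neighbor differs from `node` in exactly one bit. */
    void hypercube_neighbors(unsigned node, int d) {
        for (int bit = 0; bit < d; bit++)
            printf("%u ", node ^ (1u << bit));
        printf("\n");
    }

    int main(void) {
        hypercube_neighbors(5, 3);   /* node 101 in a 3-cube: 4 7 1 */
        return 0;
    }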
Flynn’s Taxonomy
– SISD (single instruction, single data)
– SIMD (single instruction, multiple data)
– MISD (multiple instruction, single data)
– MIMD (multiple instruction, multiple data)
Programming Modes
– Data parallel
– Message passing
– Shared memory
– Multithreaded (control parallelism)
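As a concrete instance of the message-passing mode, a minimal sketch in MPI (the slide names the mode, not a library, so MPI is our choice here):

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one integer to rank 1; run with at least 2 processes. */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }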
Instruction Processing Stages
– Fetch
– Decode
– Execute
– Post
Performance Measures
– FLOPS
  – Theoretical vs. actual
  – MFLOPS, GFLOPS, TFLOPS
– Speedup(P) = execution time on 1 processor / execution time on P processors
– Benchmarks
  – LINPACK
  – LAPACK
  – SPEC (System Performance Evaluation Cooperative)
Speedup
– Speedup(P) = best execution time on 1 processor / execution time on P processors
– Parallel efficiency(P) = Speedup(P) / P
Example
Suppose a program runs in 10 seconds and 80% of that time is spent in a subroutine F that can be perfectly parallelized. What is the best speedup I can achieve?
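Worked answer: the serial 20% costs 0.2 × 10 = 2 s on any number of processors, while the parallel 80% costs 8/P s on P processors. The total time 2 + 8/P s approaches 2 s as P grows, so the best achievable speedup is 10/2 = 5.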
Amdahl’s Law
Speedup is limited by the fraction of the code that must be executed sequentially.
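In symbols: if a fraction s of the work is inherently sequential, then Speedup(P) = 1 / (s + (1 − s)/P), which is at most 1/s no matter how many processors are used. In the example above s = 0.2, so the speedup can never exceed 5.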
“Secrets” to Success
– Overlap communication with computation
– Communicate minimally
– Avoid synchronizations
T = t_comp + t_comm + t_sync
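One common way to get the overlap is with non-blocking communication; a minimal MPI sketch (the function and variable names are ours):

    #include <mpi.h>

    /* Start a non-blocking send, do local work that does not touch the
       send buffer, then wait for the send to complete. */
    void exchange_and_work(double *halo, int n, int neighbor,
                           double *local, int m) {
        MPI_Request req;
        MPI_Isend(halo, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < m; i++)   /* computation overlapped with */
            local[i] *= 2.0;          /* the send in flight          */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }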
Processors
– CISC
  – Many complex, multicycle instructions
  – Few registers
  – Direct access to memory
– RISC
  – Few “orthogonal” instructions
  – Large register files
  – Access to memory only through load/store units
Common μProcessors
– Intel x86
– Advanced Micro Devices (AMD)
– Transmeta Crusoe
– PowerPC
– SPARC
– MIPS
Cache Memory Hierarchies
– Memory speed improves far more slowly than processor speed
– Memory locality
  – Spatial
  – Temporal
– Data placement
  – Direct mapping
  – Set associative
– Data replacement
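As a concrete picture of direct mapping, a small sketch (the sizes are ours, purely illustrative):

    #include <stdint.h>

    /* Direct-mapped placement with 64-byte lines and 512 lines (32 KiB):
       an address determines exactly one line it can occupy. */
    #define LINE_BITS 6
    #define NUM_LINES 512

    static uint32_t cache_line(uintptr_t addr) {
        return (addr >> LINE_BITS) % NUM_LINES;
    }
    /* Two addresses 32 KiB apart map to the same line and evict each
       other (conflict misses); set associativity gives each index a
       choice of several lines and softens this. */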
Example
Matrix multiplication
– As dot products
– As sub-matrix products
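A sketch of the two organizations in C (array sizes and names are ours). The dot-product form walks B column-wise in the inner loop, which has poor spatial locality; the blocked form does the same arithmetic on sub-matrices small enough to stay in cache:

    #define N  512
    #define BS 64   /* block size: pick it so three BSxBS blocks fit in cache */

    /* Dot products: C[i][j] = row i of A times column j of B. */
    void matmul_dot(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Sub-matrix products: the same result, computed block by block so
       each block of A, B, and C is reused while still in cache. */
    void matmul_blocked(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int kk = 0; kk < N; kk += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }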
Vector Architectures
– Single instruction, multiple data (SIMD)
– Exploit uniformity of operations
– Multiple execution units
– Pipelining
– Hardware-assisted loops
– Vectorizing compilers
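The kind of loop such hardware executes well is a unit-stride, dependence-free map; for instance (restrict is our addition, telling the compiler the arrays do not overlap so it can prove vectorization safe):

    /* With optimization enabled, most vectorizing compilers map this
       uniform loop directly onto SIMD units. */
    void saxpy(int n, float alpha,
               const float *restrict x, float *restrict y) {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }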
Compiler Techniques for Vectorization
– Scalar expansion
– Statement reordering
– Loop transformations
  – Distribution
  – Reordering
  – Merging
  – Splitting
  – Skewing
  – Unrolling
  – Peeling
  – Collapsing
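Two of these illustrated in C (a sketch; compilers apply the transformations internally rather than in source). Scalar expansion turns a reused scalar temporary into an array so iterations become independent; loop distribution then splits the loop into separately vectorizable pieces:

    /* Before: the scalar t is written and read in every iteration,
       which obscures the loop's independence. */
    void before(const double *a, const double *b, double *c, int n) {
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] + b[i];
            c[i] = t * t;
        }
    }

    /* After scalar expansion (t becomes t[i]) and loop distribution:
       each loop is a simple independent map the compiler can vectorize. */
    void after(const double *a, const double *b, double *c,
               double *t, int n) {
        for (int i = 0; i < n; i++)
            t[i] = a[i] + b[i];
        for (int i = 0; i < n; i++)
            c[i] = t[i] * t[i];
    }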
Epilogue
– Distributed memory systems win
– The memory hierarchy is critical to performance
– Compilers do a good job of exploiting ILP, but programmers still matter
– System modeling is inadequate for tuning to optimal performance