Slide 1: CPS 258 Announcements
http://www.cs.duke.edu/~nikos/cps258
–Lecture calendar with slides
–Pointers to related material

Slide 2: Parallel Architectures (continued)

Slide 3: Parallelism Levels
–Job
–Program
–Instruction
–Bit

Slide 4: Parallel Architectures
–Pipelining
–Multiple execution units
  –Superscalar
  –VLIW
–Multiple processors

Slide 5: Pipelining Example

    for i = 1:n
      z(i) = x(i) + y(i);
    end

Pipeline schedule (columns: Load unit | ALU | Store unit):

    Load unit                | ALU                      | Store        | Phase
    load x(i), load y(i)     |                          |              | Prologue
    load x(i+1), load y(i+1) | add z(i),x(i),y(i)       |              | Loop body
    ...                      | add z(i+1),x(i+1),y(i+1) | store z(i)   | Loop body
                             |                          | store z(i+1) | Epilogue
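The same idea expressed in C, hand-pipelined for illustration (a compiler or the hardware would normally do this; the function and variable names here are my own, not from the slide):

    #include <stddef.h>

    /* Software-pipelined vector add: the load for iteration i+1
       overlaps the add/store for iteration i. */
    void vadd_pipelined(double *z, const double *x, const double *y, size_t n)
    {
        if (n == 0) return;
        /* Prologue: load the first iteration's operands. */
        double xi = x[0], yi = y[0];
        for (size_t i = 0; i + 1 < n; i++) {
            /* Load next operands while the current add proceeds. */
            double xn = x[i + 1], yn = y[i + 1];
            z[i] = xi + yi;    /* ALU and store for iteration i */
            xi = xn; yi = yn;
        }
        /* Epilogue: finish the final iteration. */
        z[n - 1] = xi + yi;
    }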

Slide 6: Generic Computer
–CPU
–Memory
–Bus

Slide 7: Memory Organization
–Distributed memory
–Shared memory

Slide 8: Shared Memory

Slide 9: Distributed Memory

Slide 10: Interleaved Memory

Slide 11: Network Topologies
–Ring
–Torus
–Tree
–Star
–Hypercube
–Cross-bar

Slide 12: Flynn’s Taxonomy
–SISD
–SIMD
–MISD
–MIMD

Slide 13: Programming Modes
–Data parallel
–Message passing
–Shared memory
–Multithreaded (control parallelism)
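Of these modes, shared memory is the easiest to show compactly. A minimal sketch using OpenMP (OpenMP is not named on the slide; it is simply one common shared-memory API):

    #include <stdio.h>

    #define N 1000000
    static double x[N], y[N], z[N];

    int main(void)
    {
        /* All threads see the same arrays; the pragma splits the
           iteration space among them (shared-memory model). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
        printf("z[0] = %f\n", z[0]);
        return 0;
    }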

Slide 14: Instruction Processing Stages
–Fetch
–Decode
–Execute
–Post

Slide 15: Performance Measures
–FLOPS
  –Theoretical vs actual
  –MFLOPS, GFLOPS, TFLOPS
–Speedup(P) = execution time on 1 processor / execution time on P processors
–Benchmarks
  –LINPACK
  –LAPACK
  –SPEC (System Performance Evaluation Cooperative)
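As a worked illustration of theoretical vs actual rates (the numbers are hypothetical, not from the slide): a processor that retires 2 floating-point operations per cycle at 500 MHz has a theoretical peak of 2 × 500×10^6 = 1 GFLOPS; measured rates on real codes are typically only a fraction of that peak.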

Slide 16: Speedup
–Speedup(P) = best execution time on 1 processor / execution time on P processors
–Parallel efficiency(P) = Speedup(P) / P

Slide 17: Example
Suppose a program runs in 10 sec and 80% of the time is spent in subroutine F, which can be perfectly parallelized. What is the best speedup I can achieve?
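A worked solution (not spelled out on the slide): the sequential part takes 0.2 × 10 = 2 sec and the parallelizable part 0.8 × 10 = 8 sec, so the time on P processors is 2 + 8/P sec and Speedup(P) = 10 / (2 + 8/P). As P → ∞ this approaches 10/2 = 5, so the best achievable speedup is 5.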

Slide 18: Amdahl’s Law
Speedup is limited by the percentage of the code that has to be executed sequentially.
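The slide states the law in words only; in its standard form, if a fraction f of the work parallelizes perfectly and 1 − f must run sequentially, then Speedup(P) = 1 / ((1 − f) + f/P) ≤ 1/(1 − f). The example above is the case f = 0.8, which gives the bound 1/0.2 = 5.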

Slide 19: “Secrets” to Success
–Overlap communication with computation
–Communicate minimally
–Avoid synchronizations
T = t_comp + t_comm + t_sync
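A minimal sketch of the first "secret" using nonblocking MPI (MPI_Irecv, MPI_Isend, and MPI_Waitall are standard MPI calls; the halo-exchange setting and the names are illustrative assumptions, not from the slide):

    #include <mpi.h>

    /* Start the exchange, compute on data that does not depend on the
       incoming message, then wait before using the received value. */
    void exchange_and_compute(double *halo, double *interior, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(halo, 1, MPI_DOUBLE, left, 0, comm, &reqs[0]);
        MPI_Isend(&interior[0], 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        for (int i = 1; i < n; i++)   /* t_comp: independent of the halo */
            interior[i] *= 2.0;

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* t_comm hidden above */
        interior[0] += *halo;         /* now safe to use received data */
    }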

Slide 20: Processors
CISC
–Many complex, multicycle instructions
–Few registers
–Direct access to memory
RISC
–Few “orthogonal” instructions
–Large register files
–Access to memory only through load/store (L/S) units

Slide 21: Common μProcessors
–Intel x86
–Advanced Micro Devices (AMD)
–Transmeta Crusoe
–PowerPC
–SPARC
–MIPS

Slide 22: Cache Memory Hierarchies
–Memory speed improves far more slowly than processor speed
–Memory locality
  –Spatial
  –Temporal
–Data placement
  –Direct mapping
  –Set associative
–Data replacement
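A small C illustration of spatial locality (my own example, not from the slide): C stores 2-D arrays row-major, so traversing by rows touches consecutive addresses while traversing by columns strides through memory.

    #define N 1024
    static double a[N][N];

    /* Cache-friendly: the inner loop walks consecutive addresses,
       so each fetched cache line is fully used (spatial locality). */
    double sum_rowwise(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: the inner loop strides by N*sizeof(double),
       touching a new cache line on nearly every access. */
    double sum_colwise(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }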

Slide 23: Example
Matrix multiplication
–As dot products
–As sub-matrix products
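A sketch of the two formulations in C (the block size and names are illustrative): the dot-product form computes each c[i][j] as a row-column inner product; the sub-matrix form multiplies cache-sized blocks so operands are reused while resident.

    #define N 512
    #define B 64   /* block size; assumes N is a multiple of B */

    /* c[i][j] = dot product of row i of a with column j of b. */
    void matmul_dot(double c[N][N], double a[N][N], double b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    /* Same product computed block by block for temporal locality. */
    void matmul_blocked(double c[N][N], double a[N][N], double b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = 0.0;
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }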

Slide 24: Vector Architectures
–Single Instruction Multiple Data (SIMD)
–Exploit uniformity of operations
–Multiple execution units
–Pipelining
–Hardware-assisted loops
–Vectorizing compilers

Slide 25: Compiler Techniques for Vectorization
–Scalar expansion
–Statement reordering
–Loop transformations
  –Distribution
  –Reordering
  –Merging
  –Splitting
  –Skewing
  –Unrolling
  –Peeling
  –Collapsing
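Two of these transformations, hand-applied in C for illustration (a vectorizing compiler would do this automatically; the functions are my own examples):

    /* Scalar expansion: the scalar t serializes iterations; expanding
       it into an array makes the iterations independent, so the loop
       can be distributed and vectorized. */
    void before(double *x, double *y, int n) {
        for (int i = 0; i < n; i++) {
            double t = x[i] * 2.0;
            y[i] = t + 1.0;
        }
    }

    void after_scalar_expansion(double *x, double *y, int n) {
        double t[n];                 /* C99 VLA: one copy per iteration */
        for (int i = 0; i < n; i++)
            t[i] = x[i] * 2.0;
        for (int i = 0; i < n; i++)
            y[i] = t[i] + 1.0;
    }

    /* Loop unrolling by 4: fewer branches and more independent work
       per iteration for multiple execution units. */
    void unrolled_axpy(double a, double *x, double *y, int n) {
        int i = 0;
        for (; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)           /* epilogue for leftover iterations */
            y[i] += a * x[i];
    }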

Slide 26: Epilogue
–Distributed-memory systems win
–The memory hierarchy is critical to performance
–Compilers do a good job with ILP, but programmers are still important
–System modeling is inadequate for tuning to optimal performance

