Slide 1: CPS 258 Announcements
http://www.cs.duke.edu/~nikos/cps258
–Lecture calendar with slides
–Pointers to related material

Slide 2: Parallel Architectures (continued)

Slide 3: Parallelism Levels
–Job
–Program
–Instruction
–Bit

Slide 4: Parallel Architectures
–Pipelining
–Multiple execution units
  –Superscalar
  –VLIW
–Multiple processors

Slide 5: Pipelining Example

    for i = 1:n
      z(i) = x(i) + y(i);
    end

Pipeline schedule (columns: Load unit | ALU | Store unit):

    Load unit                | ALU                      | Store        | Phase
    load x(i), load y(i)     |                          |              | Prologue
    load x(i+1), load y(i+1) | add z(i),x(i),y(i)       |              | Loop body
    ...                      | add z(i+1),x(i+1),y(i+1) | store z(i)   | Loop body
                             |                          | store z(i+1) | Epilogue
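The same idea expressed in C, hand-pipelined for illustration (a compiler or the hardware would normally do this; the function and variable names here are my own, not from the slide):

    #include <stddef.h>

    /* Software-pipelined vector add: the load for iteration i+1
       overlaps the add/store for iteration i. */
    void vadd_pipelined(double *z, const double *x, const double *y, size_t n)
    {
        if (n == 0) return;
        /* Prologue: load the first iteration's operands. */
        double xi = x[0], yi = y[0];
        for (size_t i = 0; i + 1 < n; i++) {
            /* Load next operands while the current add proceeds. */
            double xn = x[i + 1], yn = y[i + 1];
            z[i] = xi + yi;    /* ALU and store for iteration i */
            xi = xn; yi = yn;
        }
        /* Epilogue: finish the final iteration. */
        z[n - 1] = xi + yi;
    }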

Slide 6: Generic Computer
–CPU
–Memory
–Bus

Slide 7: Memory Organization
–Distributed memory
–Shared memory

Slide 8: Shared Memory

Slide 9: Distributed Memory

Slide 10: Interleaved Memory

Slide 11: Network Topologies
–Ring
–Torus
–Tree
–Star
–Hypercube
–Cross-bar

Slide 12: Flynn’s Taxonomy
–SISD
–SIMD
–MISD
–MIMD

Slide 13: Programming Modes
–Data parallel
–Message passing
–Shared memory
–Multithreaded (control parallelism)
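Of these modes, shared memory is the easiest to show compactly. A minimal sketch using OpenMP (OpenMP is not named on the slide; it is simply one common shared-memory API):

    #include <stdio.h>

    #define N 1000000
    static double x[N], y[N], z[N];

    int main(void)
    {
        /* All threads see the same arrays; the pragma splits the
           iteration space among them (shared-memory model). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
        printf("z[0] = %f\n", z[0]);
        return 0;
    }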

Slide 14: Instruction Processing Stages
–Fetch
–Decode
–Execute
–Post

Slide 15: Performance Measures
–FLOPS
  –Theoretical vs actual
  –MFLOPS, GFLOPS, TFLOPS
–Speedup(P) = execution time on 1 processor / execution time on P processors
–Benchmarks
  –LINPACK
  –LAPACK
  –SPEC (System Performance Evaluation Cooperative)
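As a worked illustration of theoretical vs actual rates (the numbers are hypothetical, not from the slide): a processor that retires 2 floating-point operations per cycle at 500 MHz has a theoretical peak of 2 × 500×10^6 = 1 GFLOPS; measured rates on real codes are typically only a fraction of that peak.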

Slide 16: Speedup
–Speedup(P) = best execution time on 1 processor / execution time on P processors
–Parallel efficiency(P) = Speedup(P) / P

Slide 17: Example
Suppose a program runs in 10 sec and 80% of the time is spent in subroutine F, which can be perfectly parallelized. What is the best speedup I can achieve?
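A worked solution (not spelled out on the slide): the sequential part takes 0.2 × 10 = 2 sec and the parallelizable part 0.8 × 10 = 8 sec, so the time on P processors is 2 + 8/P sec and Speedup(P) = 10 / (2 + 8/P). As P → ∞ this approaches 10/2 = 5, so the best achievable speedup is 5.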

Slide 18: Amdahl’s Law
Speedup is limited by the percentage of the code that has to be executed sequentially.
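The slide states the law in words only; in its standard form, if a fraction f of the work parallelizes perfectly and 1 − f must run sequentially, then Speedup(P) = 1 / ((1 − f) + f/P) ≤ 1/(1 − f). The example above is the case f = 0.8, which gives the bound 1/0.2 = 5.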

Slide 19: “Secrets” to Success
–Overlap communication with computation
–Communicate minimally
–Avoid synchronizations
T = t_comp + t_comm + t_sync
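A minimal sketch of the first "secret" using nonblocking MPI (MPI_Irecv, MPI_Isend, and MPI_Waitall are standard MPI calls; the halo-exchange setting and the names are illustrative assumptions, not from the slide):

    #include <mpi.h>

    /* Start the exchange, compute on data that does not depend on the
       incoming message, then wait before using the received value. */
    void exchange_and_compute(double *halo, double *interior, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(halo, 1, MPI_DOUBLE, left, 0, comm, &reqs[0]);
        MPI_Isend(&interior[0], 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        for (int i = 1; i < n; i++)   /* t_comp: independent of the halo */
            interior[i] *= 2.0;

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* t_comm hidden above */
        interior[0] += *halo;         /* now safe to use received data */
    }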

Slide 20: Processors
CISC
–Many complex, multicycle instructions
–Few registers
–Direct access to memory
RISC
–Few “orthogonal” instructions
–Large register files
–Access to memory only through load/store (L/S) units

Slide 21: Common μProcessors
–Intel x86
–Advanced Micro Devices (AMD)
–Transmeta Crusoe
–PowerPC
–SPARC
–MIPS

Slide 22: Cache Memory Hierarchies
–Memory speed improves far more slowly than processor speed
–Memory locality
  –Spatial
  –Temporal
–Data placement
  –Direct mapping
  –Set associative
–Data replacement
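A small C illustration of spatial locality (my own example, not from the slide): C stores 2-D arrays row-major, so traversing by rows touches consecutive addresses while traversing by columns strides through memory.

    #define N 1024
    static double a[N][N];

    /* Cache-friendly: the inner loop walks consecutive addresses,
       so each fetched cache line is fully used (spatial locality). */
    double sum_rowwise(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: the inner loop strides by N*sizeof(double),
       touching a new cache line on nearly every access. */
    double sum_colwise(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }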

Slide 23: Example
Matrix multiplication
–As dot products
–As sub-matrix products
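A sketch of the two formulations in C (the block size and names are illustrative): the dot-product form computes each c[i][j] as a row-column inner product; the sub-matrix form multiplies cache-sized blocks so operands are reused while resident.

    #define N 512
    #define B 64   /* block size; assumes N is a multiple of B */

    /* c[i][j] = dot product of row i of a with column j of b. */
    void matmul_dot(double c[N][N], double a[N][N], double b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    /* Same product computed block by block for temporal locality. */
    void matmul_blocked(double c[N][N], double a[N][N], double b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = 0.0;
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }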

Slide 24: Vector Architectures
–Single Instruction Multiple Data (SIMD)
–Exploit uniformity of operations
–Multiple execution units
–Pipelining
–Hardware-assisted loops
–Vectorizing compilers

Slide 25: Compiler Techniques for Vectorization
–Scalar expansion
–Statement reordering
–Loop transformations
  –Distribution
  –Reordering
  –Merging
  –Splitting
  –Skewing
  –Unrolling
  –Peeling
  –Collapsing
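Two of these transformations, hand-applied in C for illustration (a vectorizing compiler would do this automatically; the functions are my own examples):

    /* Scalar expansion: the scalar t serializes iterations; expanding
       it into an array makes the iterations independent, so the loop
       can be distributed and vectorized. */
    void before(double *x, double *y, int n) {
        for (int i = 0; i < n; i++) {
            double t = x[i] * 2.0;
            y[i] = t + 1.0;
        }
    }

    void after_scalar_expansion(double *x, double *y, int n) {
        double t[n];                 /* C99 VLA: one copy per iteration */
        for (int i = 0; i < n; i++)
            t[i] = x[i] * 2.0;
        for (int i = 0; i < n; i++)
            y[i] = t[i] + 1.0;
    }

    /* Loop unrolling by 4: fewer branches and more independent work
       per iteration for multiple execution units. */
    void unrolled_axpy(double a, double *x, double *y, int n) {
        int i = 0;
        for (; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)           /* epilogue for leftover iterations */
            y[i] += a * x[i];
    }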

Slide 26: Epilogue
–Distributed-memory systems win
–The memory hierarchy is critical to performance
–Compilers do a good job with ILP, but programmers are still important
–System modeling is inadequate for tuning to optimal performance

