1
The Coming Multicore Revolution: What Does it Mean for Programming?
Kathy Yelick NERSC Director, LBNL EECS Professor, U.C. Berkeley
2
Software Issues at Scale
Power concerns will dominate all others; concurrency is the most significant knob we have: lower the clock, increase parallelism. Power density and facility energy are key concerns.
Summary issues for software:
- 1 EF system: billion-way concurrency, O(1K) cores per chip; 1 PF system: millions of threads and O(1K) cores per chip
- The memory capacity/core ratio may drop significantly
- Faults will become more prevalent
- Flops are cheap relative to data movement
- "Core" is a retro term: a "sea of functional units" with 1 thread per functional unit, many functional units per thread (SIMD, vectors), or many threads per functional unit (Niagara)
3
DRAM component density is only doubling every 3 years
Source: IBM, 1 May 2008, Sequoia Programming Models
4
Two Parallel Language Questions
What is the parallel control model? Options include data parallel (a single thread of control), dynamic threads, and single program multiple data (SPMD).
What is the model for sharing/communication? Either shared memory (communication through loads and stores) or message passing (communication through sends and receives); message passing implies synchronization, shared memory does not.
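A minimal sketch of the message-passing side of that contrast (variable names and values are illustrative): both ends participate explicitly, so the data transfer doubles as synchronization. In the shared-memory model the same exchange is just a store by one thread and a load by another, with any required ordering supplied separately (locks, barriers, flags).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double buf = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                 /* "send": both ends name the transfer   */
        buf = 3.14;
        MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {          /* "receive": matches the send, so the   */
                                     /* transfer is also a synchronization    */
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", buf);
    }
    MPI_Finalize();
    return 0;
}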
5
Sharing / Communication Models
6
What’s Wrong with MPI Everywhere
We can run 1 MPI process per core. This works now (for CMPs) and will work for a while, but bandwidth within a chip is roughly 100x higher and latency much lower than across the network. How long will it keep working? For 4-8 cores, probably; for many more, probably not. It depends on bandwidth and memory scaling.
What is the problem?
- Latency: some copying is required by MPI semantics
- Memory utilization: partitioning data into separate address spaces requires some replication. How big is your per-core subgrid? At 10x10x10, over half of the points are surface points, probably replicated (see the sketch after this list)
- Memory bandwidth: the extra state means extra bandwidth
- Weak scaling: the success model of the "cluster era" will not be the model of the manycore era; there is not enough memory per core
- Heterogeneity: an MPI process per CUDA thread block?
Advantage: no new application work, and only modest infrastructure work (a multicore-optimized MPI).
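To put a rough number on the replication point (a back-of-the-envelope sketch; the exact accounting depends on halo depth and on how ghost cells and MPI buffers are counted), a one-deep ghost layer around an n x n x n subgrid replicates (n+2)^3 - n^3 points:

#include <stdio.h>

/* Sketch: ghost-cell replication for an n^3 per-process subgrid with a
 * 1-deep halo.  A rough model of the slide's 10x10x10 example; real MPI
 * codes differ in halo depth and buffering. */
int main(void) {
    int n = 10;
    long owned  = (long)n * n * n;                      /* points this rank owns  */
    long stored = (long)(n + 2) * (n + 2) * (n + 2);    /* owned + ghost copies   */
    long ghost  = stored - owned;                       /* replicated halo points */
    printf("owned=%ld ghost=%ld ghost/owned=%.2f\n",
           owned, ghost, (double)ghost / owned);        /* 1000, 728, 0.73        */
    return 0;
}

For n = 10 this gives 728 replicated points per 1000 owned, so the halo alone adds well over half again the owned data, before counting MPI's internal buffers.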
7
What about Mixed MPI and Threads?
Threads: OpenMP, Pthreads, …
Problems:
- Will OpenMP performance scale with the number of cores per chip?
- It requires more investment in infrastructure than MPI, but it can leverage existing technology
- Do people want two programming models?
- It doesn't go far enough: a thread context is a large amount of state compared to vectors/streams (one op per instruction vs. 64 ops per instruction)
8
PGAS Languages: Why use 2 Programming Models when 1 will do?
Global address space: any thread may directly read/write remote data. Partitioned: data is designated as local or global.
(Figure: a partitioned global address space spanning processors p0 … pn, with private variables l and shared variables x, y, g on each.)
- Remote put and get: you never have to say "receive"
- Remote function invocation? See the HPCS languages
- No less scalable than MPI (see previous talks)
- Permits sharing, whereas MPI rules it out
- One model rather than two; and if you insist on two, you can call UPC from MPI and vice versa (tested and used)
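For concreteness, a minimal UPC-style sketch of these ideas (the array size and names are illustrative, and the fixed array size assumes compiling with a static thread count): data declared shared lives in the global address space, each element has an owning thread, and any thread may read or write it directly, with no matching receive anywhere.

#include <upc.h>
#include <stdio.h>

#define N 1024
shared double x[N];    /* global address space; elements distributed cyclically */

int main(void) {
    int i;

    /* upc_forall gives iteration i to the thread with affinity to x[i]. */
    upc_forall (i = 0; i < N; i++; &x[i])
        x[i] = i;

    upc_barrier;

    /* Any thread may also access remote elements directly (a one-sided "get"). */
    if (MYTHREAD == 0)
        printf("thread 0 read x[N-1] = %f, owned by thread %d\n",
               x[N - 1], (int)upc_threadof(&x[N - 1]));
    return 0;
}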
9
Sharing and Communication Models: PGAS vs. MPI
A one-sided put/get message carries the destination memory address along with the data payload, so it can be handled directly by a network interface with RDMA support, avoiding interrupting the CPU or staging data through it (preposted buffers).
A two-sided message carries a message id and data payload and must be matched with a receive on the host to identify the memory address where the data should be put. The matching can be offloaded to the network interface on networks like Quadrics, but the match tables then need to be downloaded to the interface from the host.
Joint work with Dan Bonachea.
10
Performance Advantage of One-Sided Communication
The put/get operations in PGAS languages (remote read/write) are one-sided: no interaction is required from the remote processor. This makes them faster for pure data transfers than two-sided send/receive.
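The same contrast shows up in MPI's own one-sided (RMA) interface; a minimal sketch (window size, values, and the fence-based synchronization are illustrative choices): the target exposes a window of memory and the origin deposits data with MPI_Put, with no receive posted at the target.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local = 2.71, window_buf = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one double; rank 1's buffer is the target here. */
    MPI_Win_create(&window_buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                      /* open the access epoch       */
    if (rank == 0)                              /* one-sided put into rank 1   */
        MPI_Put(&local, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                      /* close epoch: put complete   */

    if (rank == 1)
        printf("rank 1 window now holds %f\n", window_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}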
11
3D FFT on BlueGene/P
12
PGAS Languages for Multicore
PGAS languages are a good fit to shared memory machines, including multicore: the global address space is implemented as ordinary reads and writes. It may also be exploited for processors with an explicit local store rather than a cache (e.g., Cell, GPUs), where remote accesses become DMA transfers.
Open question in architecture: cache-coherent shared memory, software-controlled local memory, or a hybrid?
(Figure: private on-chip memory, shared partitioned on-chip memory, and shared off-chip DRAM.)
13
Parallelism / Control Models
14
To Virtualize or Not
The fundamental question facing parallel programming models is: what should be virtualized? Hardware has finite resources:
- The processor count is finite
- The register count is finite
- Fast local memory (cache and DRAM) is finite in size
- The links in the network topology are generally fewer than n²
Does the programming model (language + libraries) expose this or hide it? E.g., one thread per core, or many? Many threads have advantages for load balancing, fault tolerance, and latency hiding, but one thread per core is better for deep memory hierarchies. How do you get the most out of your machine?
15
Avoid Global Synchronization
Bulk-synchronous programming has too much synchronization. It is bad for performance, as seen in Linpack performance on multicore/SMP (left, Dongarra et al.) and on distributed memory (right, UPC), and it is also a bad direction for fault tolerance. Dynamic load balancing: use it when the problem demands it, not gratuitously.
16
Lessons Learned One-sided communication is faster than 2-sided
It is all about overlap and overhead (see the FFT example). A global address space can ease programming. Dynamic scheduling can tolerate latencies, and it is more adaptive and easier to use (fewer knobs); principled scheduling should take into account the critical path, memory use, cache behavior, etc. Combining dynamic load balancing with locality control raises new challenges: previous work solves load balancing (Cilk) or locality (MPI), but not both together.
17
Autotuning: Extreme Performance Programming
Automatic performance tuning:
- Use machine time in place of human time for tuning
- Search over possible implementations
- Use performance models to restrict the search space
Programmers should write programs that generate code, not the code itself. Autotuning finds a good-performing implementation by heuristics or exhaustive search: a Perl script generates many versions, including SIMD-optimized kernels, and the autotuner analyzes and runs the kernels using search and heuristics. The PERI SciDAC project is incorporating some of these ideas into compilers and into domain-specific code generators and libraries, e.g., OSKI for sparse matrices (the multicore optimizations are not yet in the OSKI release).
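A minimal sketch of the search side of this process (the kernel below is a stand-in; in a real autotuner each candidate blocking would be a separately generated and compiled code variant): time every candidate on the target machine and keep the fastest, rather than trusting a static model.

#include <stdio.h>
#include <float.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N];

/* Stand-in "kernel variant": a blocked transpose whose speed depends on the
 * (bx, by) blocking.  A real autotuner would generate one kernel per variant. */
static void kernel_variant(int bx, int by) {
    for (int jj = 0; jj < N; jj += by)
        for (int ii = 0; ii < N; ii += bx)
            for (int j = jj; j < jj + by && j < N; j++)
                for (int i = ii; i < ii + bx && i < N; i++)
                    B[i][j] = A[j][i];
}

static double time_variant(int bx, int by) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel_variant(bx, by);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    int blocks[] = {8, 16, 32, 64, 128};        /* candidate block sizes        */
    int nb = sizeof(blocks) / sizeof(blocks[0]);
    int best_bx = 0, best_by = 0;
    double best = DBL_MAX;

    /* Exhaustive search over a small, model-pruned space; heuristics would
     * prune further for larger spaces. */
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++) {
            double t = time_variant(blocks[i], blocks[j]);
            if (t < best) { best = t; best_bx = blocks[i]; best_by = blocks[j]; }
        }
    printf("best blocking: %d x %d (%.4f s)\n", best_bx, best_by, best);
    return 0;
}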
18
Stencil Code Overview: 3D, 7-point Jacobi iteration on a 256³ grid
/* S0, S1 are the stencil coefficients (defined elsewhere); A is the input
 * grid, B the output.  Interior points only; row-major (x unit-stride) layout. */
void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  for (int k = 1; k < nz - 1; k++)              /* grid indices in z-dim */
    for (int j = 1; j < ny - 1; j++)            /* grid indices in y-dim */
      for (int i = 1; i < nx - 1; i++) {        /* grid indices in x-dim */
        int c = i + nx * (j + ny * k);          /* linearized center     */
        B[c] = S0 * A[c]
             + S1 * (A[c + nx*ny] + A[c - nx*ny]   /* top, bottom   */
                   + A[c - 1]     + A[c + 1]       /* left, right   */
                   + A[c - nx]    + A[c + nx]);    /* front, back   */
      }
}
3D, 7-point Jacobi iteration on a 256³ grid. Flop:byte ratio: 0.33 (write allocate), 0.5 (ideal).
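Where those two ratios come from (a back-of-the-envelope count, assuming the six neighbor reads hit in cache): each output point performs 8 flops (2 multiplies and 6 adds) and, ideally, moves 16 bytes of DRAM traffic (read A[center], write B[center]), giving 8/16 = 0.5. With a write-allocate cache, B[center] is also read in before being overwritten, so the traffic is 24 bytes per point and the ratio drops to 8/24 ≈ 0.33.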
19
Naive Stencil Performance
Platforms: Intel Xeon (Clovertown), AMD Opteron (Barcelona), Sun Niagara2 (Victoria Falls). For this simple code, all cache-based platforms show poor efficiency and scalability, which could lead a programmer to believe they are approaching a resource limit.
20
Fully-Tuned Performance
Platforms: Intel Xeon (Clovertown), AMD Opteron (Barcelona), Sun Niagara2 (Victoria Falls), with speedups over the naive code of 1.9x, 5.4x, and 12.5x. Optimizations include NUMA-aware allocation, padding, unroll/reordering, thread/cache blocking, prefetching, SIMDization, and cache bypass. Different optimizations have dramatic effects on different architectures, and the largest optimization benefit is seen at the largest core count.
21
Stencil Results
(Figures: performance and power efficiency, in both single and double precision.)
22
Autotuning gives better performance… but is it good performance?
23
The Roofline Performance Model
(Roofline plot axes: attainable Gflop/s from 1 to 128, flop:DRAM byte ratio from 1/8 to 16.)
24
The Roofline Performance Model
Note that both axes use a log scale. (Roofline plot axes: attainable Gflop/s from 1 to 128, flop:DRAM byte ratio from 1/8 to 16.)
25
The Roofline Performance Model
(Roofline plot for a generic machine: attainable Gflop/s vs. actual flop:byte ratio, with in-core ceilings labeled peak DP, mul/add imbalance, w/out SIMD, and w/out ILP.) The top of the roof is determined by the peak computation rate (double-precision floating point, DP, for these algorithms). The instruction mix, lack of SIMD operations, lack of ILP, or failure to use other features needed for peak will lower the attainable ceiling.
26
The Roofline Performance Model
(Roofline plot for a generic machine, adding bandwidth ceilings labeled peak BW, w/out SW prefetch, and w/out NUMA.) The sloped part of the roof is determined by peak DRAM bandwidth (STREAM). Lack of proper prefetch, ignoring NUMA, or other issues will reduce the attainable bandwidth.
27
The Roofline Performance Model
(Roofline plot for a generic machine with both the compute and bandwidth ceilings.) The locations of the "posts" in the building are determined by arithmetic intensity; they vary across algorithms and with bandwidth-reducing optimizations such as better cache reuse (tiling) or compression techniques.
28
Naïve Roofline Model
An unrealistically optimistic model: the only ceilings are peak DP and the hand-optimized Stream BW benchmark, so the bound is simply
attainable Gflop/s(AI) = min(peak Gflop/s, Stream BW × AI)
(Roofline plots of attainable Gflop/s vs. flop:DRAM byte ratio for four machines: Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, each showing the peak DP and hand-optimized Stream BW ceilings.)
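The bound itself is just the minimum of the two ceilings; a one-function sketch for evaluating it (the peak and STREAM numbers below are placeholders, not measurements for any of the four machines):

#include <math.h>
#include <stdio.h>

/* Roofline bound: attainable Gflop/s = min(peak Gflop/s, stream BW * AI). */
static double roofline(double peak_gflops, double stream_gbs, double ai) {
    return fmin(peak_gflops, stream_gbs * ai);
}

int main(void) {
    double peak = 75.0, bw = 20.0;                  /* placeholder machine numbers */
    for (double ai = 1.0 / 16; ai <= 8.0; ai *= 2)  /* flop:DRAM byte ratio        */
        printf("AI = %6.4f  ->  bound = %6.2f Gflop/s\n",
               ai, roofline(peak, bw, ai));
    return 0;
}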
29
Roofline model for Stencil (out-of-the-box code)
(Roofline plots for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, each with its own compute ceilings, e.g. mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, and bandwidth ceilings, e.g. w/out SW prefetch, w/out NUMA; the Clovertown panel also marks where the dataset fits in the snoop filter.)
The stencil as written: large datasets, two unit-stride streams, no NUMA awareness, little ILP, no DLP, and far more adds than multiplies (imbalance). The ideal flop:byte ratio is 1/3, but the high locality requirements mean capacity and conflict misses will severely impair the actual flop:byte ratio.
30
Roofline model for Stencil (out-of-the-box code)
(Same four roofline plots and notes as the previous slide; the Cell Blade panel adds that there is no naïve SPE implementation.)
32
Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)
(Same four roofline plots: Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), IBM QS20 Cell Blade.) Cache blocking helps ensure the flop:byte ratio stays as close as possible to 1/3. Clovertown has huge caches but is pinned to a lower bandwidth ceiling. Cache management is essential when the capacity per thread is low.
33
Roofline model for Stencil (SIMDization + cache bypass)
(Same four roofline plots: Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), IBM QS20 Cell Blade.) Make the SIMDization explicit and use the cache-bypass store instruction movntpd; this increases the flop:byte ratio to roughly 0.5 on x86 and Cell.
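A sketch of what explicit SIMDization plus a cache-bypass (streaming) store looks like on x86 with SSE2 intrinsics. The stencil body is deliberately reduced to a 1D slice with two neighbors, and the array and coefficient names are illustrative; _mm_stream_pd is the intrinsic that compiles to movntpd.

#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_stream_pd (movntpd) */

/* Simplified inner loop: two doubles per iteration.  The streaming store
 * writes B around the cache, avoiding the write-allocate read and pushing
 * the flop:byte ratio toward 0.5.  Assumes A and B are 16-byte aligned and
 * n is even. */
void stencil_row_simd(const double *A, double *B, int n, double s0, double s1) {
    __m128d vs0 = _mm_set1_pd(s0);
    __m128d vs1 = _mm_set1_pd(s1);
    for (int i = 2; i < n - 2; i += 2) {
        __m128d center = _mm_load_pd(&A[i]);        /* aligned load            */
        __m128d left   = _mm_loadu_pd(&A[i - 1]);   /* unaligned neighbor load */
        __m128d right  = _mm_loadu_pd(&A[i + 1]);
        __m128d result = _mm_add_pd(_mm_mul_pd(vs0, center),
                                    _mm_mul_pd(vs1, _mm_add_pd(left, right)));
        _mm_stream_pd(&B[i], result);               /* cache-bypass store      */
    }
    _mm_sfence();        /* make the non-temporal stores globally visible      */
}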
34
Autotuning: Extreme Performance Programming
Automatic performance tuning, as summarized earlier: use machine time in place of human time, search over possible implementations, use performance models to restrict the search space, and have programmers write programs that generate code rather than the code itself. The autotuner finds a good-performing implementation by heuristics or exhaustive search; the PERI SciDAC project is bringing some of these ideas into compilers and into domain-specific code generators and libraries, e.g., OSKI for sparse matrices. Example search parameter: the block size (n0 × m0) for dense matrix-matrix multiply.
35
Rethinking Algorithms: Count data movement, not flops
36
Latency- and Bandwidth-Avoiding
Communication is the limiting factor for scalability: movement of data within the memory hierarchy, and movement of data between cores and nodes. Communication has two aspects: latency (the number of discrete events, such as cache misses or messages) and bandwidth (the volume of data moved).
Algorithmically efficient methods, e.g., O(n) ops on O(n) data, tend to achieve a lower percentage of machine peak. At scale we cannot afford to be algorithmically inefficient, so we need to run low-compute-intensity algorithms well, or flops are wasted. This extends to preconditioning, for either sparse or hierarchically semi-separable preconditioners.
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden.
37
Avoiding Communication in Sparse Iterative Solvers
Consider sparse iterative methods for Ax=b, using Krylov subspace methods such as GMRES and CG; the inner loop is a sparse matrix-vector multiply, Ax (a nearest-neighbor computation on a graph). Can we lower the communication costs?
- Latency of communication: reduce the number of messages by computing [x, Ax, …, A^k x] with one read of the remote parts of x
- Bandwidth to the memory hierarchy: partition A into cache blocks (or by processor) and take multiple steps per read of A
Simple example, for simplicity of presentation only: A is the matrix of a 1D mesh, a "3-point stencil".
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden.
38
Locally Dependent Entries for [x,Ax], A tridiagonal
(Figure: with 2 processors, the entries of x, Ax, A²x, …, A⁸x that depend only on locally owned entries of x; these can be computed without communication.) Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden.
39
Locally Dependent Entries for [x, Ax, A²x], A tridiagonal
(Figure: the same picture for [x, Ax, A²x]: the entries that can be computed without communication.) Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden.
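A sketch of how the preceding pictures translate into code for the tridiagonal ("3-point stencil") case: each processor holds its block of x plus k ghost entries on each side, then computes the owned entries of x, Ax, …, A^k x entirely locally, trading a little redundant work near the block boundaries for a single round of communication. The partitioning, the ghost exchange, and the names here are schematic.

#include <stdlib.h>
#include <string.h>

/* Matrix powers kernel [x, Ax, ..., A^k x] for a tridiagonal A.
 * v holds this processor's m owned entries of x plus k ghost entries on each
 * side, filled by one neighbor exchange beforehand (not shown).
 * V[s] receives the owned entries of A^s x, for s = 0..k. */
void akx_tridiag(double *v, int m, int k,
                 double dl, double d, double du,   /* sub-, main, super-diagonal */
                 double **V) {
    int len = m + 2 * k;
    double *tmp = malloc(len * sizeof(double));

    memcpy(V[0], v + k, m * sizeof(double));        /* A^0 x = x                 */
    for (int s = 1; s <= k; s++) {
        /* After s local applications only indices [s, len-s) remain valid;
         * since s <= k this always covers the owned range [k, k+m). */
        for (int j = s; j < len - s; j++)
            tmp[j] = dl * v[j - 1] + d * v[j] + du * v[j + 1];
        memcpy(v + s, tmp + s, (len - 2 * s) * sizeof(double));
        memcpy(V[s], v + k, m * sizeof(double));    /* owned entries of A^s x    */
    }
    free(tmp);
}

One communication (filling the 2k ghost entries) thus replaces the k separate nearest-neighbor exchanges of the conventional loop, at the cost of O(k²) redundant flops near each block boundary.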
40
Avoiding Communication in Sparse Linear Algebra - Summary
Take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi), assuming the matrix is "well-partitioned" with a modest surface-to-volume ratio.
- Parallel implementation: conventional, O(k log p) messages; "new", O(log p) messages (optimal)
- Serial implementation: conventional, O(k) moves of data from slow to fast memory; "new", O(1) moves of data (optimal)
Some preconditioners can be incorporated (hierarchical, semiseparable matrices, …). Lots of speedup is possible (modeled and measured); the price is some redundant computation.
41
Why all our problems are solved for dense linear algebra, in theory
Thm (Demmel, Dumitriu, Holtz, Kleinberg; Numer. Math. 2007): Given any matmul algorithm running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0. Current record: ω ≈ 2.38.
Thm (Demmel, Dumitriu, Holtz; Numer. Math. 2008): Given any stable matmul running in O(n^(ω+η)) ops, one can do backward-stable dense linear algebra in O(n^(ω+η)) ops: GEPP, QR (recursive), rank-revealing QR (randomized), (generalized) Schur decomposition, SVD (randomized). Also reduces communication to O(n^(ω+η)). But what about the constants?
42
How to Waste an Exascale Machine
- Ignore Little's Law, wasting bandwidth (see the worked example below)
- Over-synchronize (unnecessary barriers)
- Over-synchronize communication (two-sided instead of one-sided)
- Waste bandwidth by ignoring locality
- Use algorithms that minimize flops rather than data movement
- Add a high-overhead runtime system when you don't need it
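To make the Little's Law point concrete (illustrative numbers only): to sustain a bandwidth B against an access latency L, roughly B × L bytes must be in flight at all times. For example, sustaining 20 GB/s of memory bandwidth with 100 ns of memory latency requires about 20 GB/s × 100 ns = 2,000 bytes, i.e. around 31 outstanding 64-byte cache lines; a code that exposes less concurrency than that leaves bandwidth unused no matter how fast the memory system is.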