The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?
Dave Patterson Parallel Computing Laboratory (Par Lab) & Reliable Adaptive Distributed systems Lab (RAD Lab) U.C. Berkeley
Outline
What Caused the Revolution? Is it Too Late to Stop It?
Projected Hardware/Software Context?
Why Might We Succeed (This Time)?
Example Coordinated Attack: Par Lab at UCB
Roofline: An Insightful Visual Performance Model
Conclusion
Why Multicore Performance Model?
No consensus on multicore architecture: number of cores, thick vs. thin cores, SIMD/vector or not, caches vs. local stores, homogeneous or not, … Must the programmer become an expert on both the application and every new computer to deliver good performance on multicore? If you don't care about performance, why use multicore? Perhaps 1% of programmers know both?
Multicore SMPs (All Dual Sockets)
[Block diagrams of the four systems; the key parameters:]
Intel Xeon E5345 (Clovertown): 2.33 GHz, 8 fat cores; 4MB shared L2; front-side bus to chipset (4x64b controllers) and 667MHz FBDIMMs, 21.33 GB/s read / 10.66 GB/s write.
AMD Opteron 2356 (Barcelona): 2.30 GHz, 8 fat cores; 512KB victim L2 per core, 2MB shared quasi-victim cache (32-way); SRI/crossbar; 2x64b memory controllers to 667MHz DDR2 DIMMs (10.66 GB/s); HyperTransport (4 GB/s each direction).
Sun T2+ T5140 (Victoria Falls): 1.17 GHz, 16 thin 8-way MT SPARC cores; 4MB shared L2 (16-way, 64b interleaved); crossbar (179 GB/s); 4 coherency hubs; 2x128b controllers to 667MHz FBDIMMs, 21.33 GB/s read / 10.66 GB/s write.
IBM QS20 Cell Blade: 3.20 GHz, 16 thin SIMD cores (SPEs with 256K local store and MFC) plus VMT PPEs (512K L2); EIB ring network; BIF; XDR memory controllers to 512MB XDR DRAM (25.6 GB/s).
Assumptions for New Model
Focus on the 13 Dwarfs; use the floating-point versions here, others in the future. A bound-and-bottleneck model (in the spirit of Amdahl's Law) is good enough: we don't need 10% accuracy to understand which optimizations to try to reach the next level of performance. Assume the SPMD programming model, so parallelization and load balancing are not issues. (We have looked at accommodating load balancing and parallelization, but that is not shown today.)
Bounds/Bottlenecks? For floating-point dwarfs: a peak floating-point performance bound (non-floating-point dwarfs are discussed later). For dwarfs that don't fit entirely in cache: a DRAM memory bandwidth bound. Operational Intensity: the average number of floating-point operations per byte of DRAM traffic (traffic between cache and DRAM, not between the processor and cache). It varies by multicore design (cache organization) and by dwarf.
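As a concrete (and idealized) illustration of operational intensity, the sketch below counts flops and DRAM bytes for a double-precision 7-point stencil sweep. Perfect caching is assumed, so each value crosses the DRAM interface exactly once; the per-point flop and byte counts are illustrative assumptions, not measurements from the talk.

```python
# Idealized operational-intensity estimate for one kernel. Assumes
# perfect caching, so each double crosses the DRAM interface exactly
# once; per-point flop and byte counts are illustrative assumptions.

def operational_intensity(flops, dram_bytes):
    """FLOPs per byte of DRAM traffic."""
    return flops / dram_bytes

# Double-precision 7-point stencil sweep over an n^3 grid:
# ~8 flops per point; one 8-byte read of the source grid and one
# 8-byte write of the destination per point.
n = 256
points = n ** 3
flops = 8 * points
dram_bytes = (8 + 8) * points

oi = operational_intensity(flops, dram_bytes)
print(f"stencil operational intensity = {oi:.2f} flops/byte")  # 0.50
```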
Can Graph Performance Model?
For floating-point dwarfs, the Y axis is performance in GFLOPs/second, on a log scale.
Can Graph Performance Model?
Suppose the X axis is Operational Intensity (FLOPs per byte to DRAM), also on a log scale? Then we can plot a memory bandwidth bound (GBytes/sec) as a diagonal line, since GFLOPs/sec = (FLOPs/Byte) × (GBytes/sec). The "Roofline" combines this bandwidth line with the horizontal peak-performance line.
Roofline Visual Performance Model
"Ridge Point": the minimum operational intensity needed to reach peak performance. Operational intensity is the average FLOPs/byte for a dwarf. Kernels to the left of the ridge point are memory bound; kernels to the right are compute bound. What do real rooflines look like? Where do real dwarfs map onto the roofline?
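The roofline bound itself is simple enough to sketch in a few lines. The peak and bandwidth values below are illustrative assumptions (the bandwidth is chosen so the ridge point lands near the Barcelona figure later in the talk), not measured numbers.

```python
# A minimal sketch of the Roofline bound: attainable performance is
# the minimum of the compute roof and the memory roof at a given
# operational intensity. Peak and bandwidth values are illustrative
# assumptions, not measurements.

def roofline(oi, peak_gflops, peak_bw_gbs):
    """Attainable GFLOP/s at operational intensity oi (flops/byte)."""
    return min(peak_gflops, peak_bw_gbs * oi)

peak_gflops = 74.0   # assumed peak, close to the Barcelona number
peak_bw_gbs = 16.8   # assumed sustained DRAM bandwidth

# Ridge point: the OI where the memory roof meets the compute roof.
ridge_point = peak_gflops / peak_bw_gbs
print(f"ridge point = {ridge_point:.1f} flops/byte")

for oi in (0.25, 1.0, 4.0, 8.0):
    print(f"OI {oi:5.2f}: bound = {roofline(oi, peak_gflops, peak_bw_gbs):5.1f} GFLOP/s")
```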
Roofline Models of Real Multicores
[Log-log rooflines for the four machines: attainable GFlop/s (1 to 128) vs. flop:DRAM byte ratio (1/16 to 8), with ceilings such as peak DP, 25% FP, 12% FP, w/out SW prefetch, and w/out NUMA.]
Intel Clovertown: Peak 75 GFLOPS, Ridge Point 6.7
AMD Barcelona: Peak 74 GFLOPS, Ridge Point 4.4
IBM Cell Blade: Peak 29 GFLOPS, Ridge Point 0.65
Sun Victoria Falls: Peak 19 GFLOPS, Ridge Point 0.33
Roofline Models of Real Multicores
[The same rooflines, now with the four dwarfs marked at their operational intensities.]
Intel Clovertown: Peak 75 GFLOPS, Ridge Point 6.7
AMD Barcelona: Peak 74 GFLOPS, Ridge Point 4.4
IBM Cell Blade: Peak 29 GFLOPS, Ridge Point 0.65
Sun Victoria Falls: Peak 19 GFLOPS, Ridge Point 0.33
Dwarfs: SpMV (Op. Int. 1/4), Stencil (1/2), LBMHD (1.07), 3-D FFT (1.64)
What if Performance < Roofline?
Measure the benefit of computational and memory optimizations in advance. Order the computational and memory optimizations as "ceilings" below the Roofline: an optimization must be performed to break through its ceiling. Use operational intensity to pick whether to pursue memory optimizations, computational optimizations, or both.
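A minimal sketch of how ceilings lower the two roofs. The halving factors for missing SIMD, missing ILP, and missing NUMA placement are illustrative assumptions chosen for the example, not measured values.

```python
# Sketch of optimization "ceilings" below the roofline. Each missed
# optimization scales down one of the two roofs; the 0.5 factors for
# missing SIMD, ILP, and NUMA placement are illustrative assumptions.

def attainable(oi, peak_gflops, peak_bw_gbs,
               compute_ceils=(), memory_ceils=()):
    """Bound with the listed un-broken ceilings applied."""
    compute_roof = peak_gflops
    for factor in compute_ceils:   # e.g., no SIMD -> 0.5, no ILP -> 0.5
        compute_roof *= factor
    memory_roof = peak_bw_gbs
    for factor in memory_ceils:    # e.g., no NUMA placement -> 0.5
        memory_roof *= factor
    return min(compute_roof, memory_roof * oi)

# A high-OI kernel is held back by the computational ceilings:
print(attainable(8.0, 64.0, 16.0, compute_ceils=(0.5, 0.5)))   # 16.0
# A low-OI kernel is held back by the memory ceiling instead:
print(attainable(0.5, 64.0, 16.0, memory_ceils=(0.5,)))        # 4.0
```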
Adding Computational Ceilings
[Roofline of attainable GFlop/s (1 to 128) vs. flop:DRAM byte ratio (1/8 to 16), with computational ceilings added below the peak.]
mul/add imbalance: does the machine have separate multipliers and adders?
w/out SIMD: does it issue SIMD instructions (2 flops/instruction)?
w/out ILP: can it issue 4 instructions per clock cycle?
Memory+Comp. Ceilings
Memory optimizations:
Prefetching
NUMA optimizations (use DRAM local to the socket)
[Roofline with both computational ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and memory ceilings (w/out NUMA optimizations, w/out prefetching).]
Memory+Comp. Ceilings
Together, the ceilings partition expected performance into 3 optimization regions: compute only, memory only, and compute+memory.
[Roofline with the computational ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and memory ceilings (NUMA optimizations, prefetching) delimiting the three regions.]
Status of Roofline Model
Used for 2 other kernels on 4 other multicores: evaluated 2 financial PDE solvers on Intel Penryn & Larrabee + NVIDIA G80 & GTX280. Version 1 fit in the L1 cache, with enough bandwidth for peak throughput; Version 2 didn't fit, and Roofline helped figure out the cache blocking needed to reach peak throughput. We're looking at non-floating-point kernels, e.g., Sort (potential exchanges/sec vs. GB/s) and Graph Traversal (nodes traversed/sec vs. GB/s). Opportunities for others to help investigate: many kernels, multicores, metrics.
For example, Jike Chong ported two financial PDE solvers to four other multicore computers: the Intel Penryn and Larrabee and the NVIDIA G80 and GTX280.[9] He used the Roofline model to keep track of the platforms' peak arithmetic throughput and L1, L2, and DRAM bandwidths. By analyzing an algorithm's working set and operational intensity, he was able to use the Roofline model to quickly estimate the need for algorithmic improvements. Specifically, for the option-pricing problem with an implicit PDE solver, the working set is small enough to fit into L1 and the L1 bandwidth is sufficient to support peak arithmetic throughput, so the Roofline model indicates that no optimization is necessary. For option pricing with an explicit PDE formulation, the working set is too large to fit into cache, and the Roofline model helps to indicate the extent to which cache blocking is necessary to extract peak arithmetic performance.
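The working-set reasoning above can be sketched as a quick feasibility check. The cache size, grid size, and array count below are hypothetical stand-ins for illustration, not Chong's actual parameters.

```python
# Quick feasibility check in the spirit of the working-set analysis
# above. Cache size, grid size, and array count are hypothetical
# assumptions, not the solvers' actual parameters.

def fits_in_cache(working_set_bytes, cache_bytes):
    """Does the solver's working set fit in the given cache level?"""
    return working_set_bytes <= cache_bytes

l1_bytes = 32 * 1024               # assumed 32 KB L1 data cache
grid_points = 1000                 # assumed implicit-solver grid size
working_set = grid_points * 8 * 3  # three double-precision arrays

if fits_in_cache(working_set, l1_bytes):
    print("working set fits in L1: no cache blocking needed")
else:
    block = l1_bytes // (8 * 3)    # grid points per cache block
    print(f"doesn't fit: block the grid into chunks of ~{block} points")
```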
Final Performance

dwarf      Op. Int.   Clovertown   Barcelona   Victoria Falls   Cell Blade
SpMV       0.25       2.8 GF/s     4.2 GF/s    7.3 GF/s         11.8 GF/s
Stencil    0.50       2.5 GF/s     8.0 GF/s    6.8 GF/s         14.2 GF/s
LBMHD      1.07       5.6 GF/s     11.4 GF/s   10.5 GF/s        16.7 GF/s
3-D FFT    1.64       9.7 GF/s     14.0 GF/s   9.2 GF/s         15.7 GF/s
Peak GF               75.0 GF/s    74.0 GF/s   19.0 GF/s        29.0 GF/s
Ridge Pt              6.7 Fl/B     4.6 Fl/B    0.3 Fl/B         0.5 Fl/B
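As a sanity check, each measured result should sit at or below the roofline bound implied by that machine's peak and ridge point (bandwidth = peak / ridge point). A small script using only the table's own numbers:

```python
# Sanity check of the final-performance table: compare each measured
# GF/s to the roofline bound implied by that machine's peak and ridge
# point (bandwidth = peak / ridge point). Ridge points are quoted to
# one decimal place, so a ratio can land a hair above 100%.

machines = {                      # name: (peak GF/s, ridge point Fl/B)
    "Clovertown":     (75.0, 6.7),
    "Barcelona":      (74.0, 4.6),
    "Victoria Falls": (19.0, 0.3),
    "Cell Blade":     (29.0, 0.5),
}
kernels = {                       # dwarf: (Op. Int., GF/s in table order)
    "SpMV":    (0.25, [2.8, 4.2, 7.3, 11.8]),
    "Stencil": (0.50, [2.5, 8.0, 6.8, 14.2]),
    "LBMHD":   (1.07, [5.6, 11.4, 10.5, 16.7]),
    "3-D FFT": (1.64, [9.7, 14.0, 9.2, 15.7]),
}

for dwarf, (oi, measured) in kernels.items():
    for (name, (peak, ridge)), gf in zip(machines.items(), measured):
        bound = min(peak, peak / ridge * oi)  # roofline bound at this OI
        print(f"{dwarf:8s} {name:14s} {gf:5.1f} of {bound:5.1f} GF/s bound"
              f" ({100 * gf / bound:.0f}%)")
```

Memory-bound dwarfs like SpMV land close to their bandwidth-derived bound on every machine, which is exactly the behavior the roofline predicts for kernels left of the ridge point.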