
The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem? Dave Patterson, Parallel Computing Laboratory (Par Lab) & Reliable Adaptive Distributed Systems Lab (RAD Lab), U.C. Berkeley

Outline: What Caused the Revolution? Is it Too Late to Stop It? Projected Hardware/Software Context? Why Might We Succeed (this time)? Example Coordinated Attack: Par Lab @ UCB. Roofline: An Insightful Visual Performance Model. Conclusion.

Why a Multicore Performance Model? There is no consensus on multicore architecture: number of cores, thick vs. thin cores, SIMD/vector or not, caches vs. local stores, homogeneous or not, … Must the programmer become an expert on both the application and every new computer to deliver good performance on multicore? (If you don't care about performance, why use multicore?) Do even 1% of programmers know both?

Multicore SMPs Used (All Dual Socket): Intel Xeon E5345 (Clovertown), 2.33 GHz, 8 fat cores; AMD Opteron 2356 (Barcelona), 2.30 GHz, 8 fat cores; Sun T2+ T5140 (Victoria Falls), 1.17 GHz, 16 thin 8-way multithreaded cores; IBM QS20 Cell Blade, 3.20 GHz, 16 thin SIMD cores. (Slide shows block diagrams of each machine's cores, caches, interconnect, and memory subsystem.)

Assumptions for the New Model: Focus on the 13 dwarfs; floating-point versions are used here, others in the future. A bound-and-bottleneck model (e.g., Amdahl's Law) is good enough; we don't need 10% accuracy to understand which optimizations to try for the next level of performance. We can use the SPMD programming model, so parallelization and load balancing are not the issues here. We have looked at accommodating load balancing and parallelization, but that is not shown today.

Bounds/Bottlenecks? For floating-point dwarfs, peak floating-point performance is one bound (non-floating-point dwarfs are discussed later). For dwarfs whose data don't fit entirely in cache, DRAM memory bandwidth is the other bound. Operational Intensity: the average number of floating-point operations per byte of traffic to DRAM; that is, traffic between the caches and DRAM, not between the processor and the caches. It varies with the multicore design (cache organization) and with the dwarf.
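
The definition above can be written as a short formula. Below is a minimal sketch (Python is used for all illustrations here); the kernel and its flop and byte counts are hypothetical, chosen only so the result lands near the stencil value used later in the talk:

```python
# Operational intensity = floating-point operations / bytes of DRAM traffic
# (traffic between the last-level cache and DRAM, not processor <-> cache).

def operational_intensity(total_flops, dram_bytes):
    return total_flops / dram_bytes

# Hypothetical example: a stencil sweep over an n^3 grid of doubles doing
# 8 flops per point and, with ideal caching, moving 16 bytes per point
# (one 8-byte read + one 8-byte write) to and from DRAM.
n = 256
flops = 8 * n**3
dram_bytes = 16 * n**3
print(operational_intensity(flops, dram_bytes))   # 0.5 flops per byte
```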

Can We Graph the Performance Model? For the floating-point dwarfs, the Y-axis is performance (GFLOPs/second), on a log scale.

Can We Graph the Performance Model? Suppose the X-axis is Operational Intensity (FLOPs per byte to DRAM), also on a log scale. Then the memory bandwidth bound (GBytes/sec) can be plotted on the same axes, since (GFLOPs/sec) / (FLOPs/Byte) = GBytes/sec; together the two bounds form the "Roofline".

Roofline Visual Performance Model. The "ridge point" is the minimum operational intensity needed to reach peak performance; operational intensity is the average FLOPs/Byte of a dwarf. To the left of the ridge point a kernel is memory bound, to the right it is compute bound. What do real rooflines look like, and where do real dwarfs map onto the roofline?
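
As a concrete restatement of the model, here is a minimal sketch of the Roofline bound and the ridge point; the peak and bandwidth numbers in the example are placeholders, roughly in the range of the machines on the next slide, not measured values:

```python
# Roofline bound: attainable GFLOP/s for a kernel is the lower of the
# compute roof and the memory roof at its operational intensity.

def roofline(oi, peak_gflops, peak_gb_per_s):
    return min(peak_gflops, peak_gb_per_s * oi)

def ridge_point(peak_gflops, peak_gb_per_s):
    # Minimum operational intensity needed to reach peak performance.
    return peak_gflops / peak_gb_per_s

# Placeholder machine: 75 GFLOP/s peak, ~11 GB/s sustained DRAM bandwidth.
print(roofline(0.5, 75.0, 11.0))    # 5.5  -> memory bound at OI = 0.5
print(roofline(16.0, 75.0, 11.0))   # 75.0 -> compute bound at OI = 16
print(ridge_point(75.0, 11.0))      # ~6.8 flops/byte
```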

Roofline Models of Real Multicores. Intel Clovertown: peak 75 GFLOPS, ridge point 6.7 FLOPs/Byte. AMD Barcelona: peak 74 GFLOPS, ridge point 4.4. Sun Victoria Falls: peak 19 GFLOPS, ridge point 0.33. IBM Cell Blade: peak 29 GFLOPS, ridge point 0.65. (Log-log plot of attainable GFLOP/s vs. flop:DRAM byte ratio for the four machines, with ceilings such as peak double precision, without SW prefetch, without NUMA, 25% FP, and 12% FP.)

Roofline Models of Real Multicores (continued). The same four rooflines, now with the dwarfs placed at their operational intensities: SpMV at 1/4 FLOPs/Byte, Stencil at 1/2, LBMHD at 1.07, and 3-D FFT at 1.64.
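
For readers who want to reproduce the figure, a minimal plotting sketch follows, using only the peak GFLOP/s, ridge points, and dwarf operational intensities quoted on this slide; the DRAM bandwidth is inferred as peak divided by ridge point, so it is an approximation, and the ceilings are omitted:

```python
import numpy as np
import matplotlib.pyplot as plt

# name: (peak GFLOP/s, ridge point in flops/byte), from the slide above
machines = {
    "Clovertown":     (75.0, 6.7),
    "Barcelona":      (74.0, 4.4),
    "Victoria Falls": (19.0, 0.33),
    "Cell Blade":     (29.0, 0.65),
}
dwarfs = {"SpMV": 0.25, "Stencil": 0.50, "LBMHD": 1.07, "3-D FFT": 1.64}

oi = np.logspace(-4, 3, 200, base=2)             # 1/16 ... 8 flops/byte
for name, (peak, ridge) in machines.items():
    bw = peak / ridge                            # implied DRAM bandwidth, GB/s
    plt.loglog(oi, np.minimum(peak, bw * oi), label=name)
for dwarf, x in dwarfs.items():
    plt.axvline(x, linestyle=":", linewidth=0.8) # mark each dwarf's intensity
    plt.text(x, 1.1, dwarf, rotation=90, fontsize=8)
plt.xlabel("flop:DRAM byte ratio")
plt.ylabel("attainable GFLOP/s")
plt.legend()
plt.show()
```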

What if Performance < Roofline? Measure the benefit of computational and memory optimizations in advance, and order them as "ceilings" below the Roofline; an optimization must be performed to break through its ceiling. Use operational intensity to pick whether to do memory optimizations, computational optimizations, or both.

Adding Computational Ceilings. Does the core have separate multipliers and adders (so multiplies and adds must balance)? Does it have SIMD (2 FLOPs per instruction)? Can it issue 4 instructions per clock cycle? Each in-core optimization the code fails to exploit becomes a ceiling below the compute roof: mul/add imbalance, without SIMD, without ILP. (Plot of attainable GFLOP/s vs. flop:DRAM byte ratio with these ceilings drawn in.)
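
A minimal sketch of these in-core ceilings, assuming each feature the code fails to exploit costs a fixed factor: 2x for SIMD (2 flops/instruction, per the slide), 2x for mul/add balance, and an illustrative 2x for instruction-level parallelism; the real factors depend on the machine:

```python
# Compute ceilings: each in-core optimization the code does not exploit
# lowers the attainable compute roof by some factor.

def compute_ceilings(peak_gflops, simd=2.0, mul_add=2.0, ilp=2.0):
    c = {"peak (all in-core optimizations)": peak_gflops}
    c["w/out SIMD"] = peak_gflops / simd
    c["w/out mul/add balance"] = c["w/out SIMD"] / mul_add
    c["w/out ILP"] = c["w/out mul/add balance"] / ilp
    return c

for label, gf in compute_ceilings(75.0).items():   # 75 GFLOP/s placeholder peak
    print(f"{label:34s} {gf:6.1f} GFLOP/s")
```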

Memory + Computational Ceilings. Memory optimizations: software prefetching and NUMA optimizations (use the DRAM local to each socket). These become ceilings below the memory-bandwidth roof ("without prefetching", "without NUMA optimizations"), alongside the computational ceilings (mul/add imbalance, without SIMD, without ILP). (Same attainable GFLOP/s vs. flop:DRAM byte ratio plot.)

Memory + Computational Ceilings. Together the ceilings partition expected performance into three optimization regions: compute only, memory only, and compute + memory.
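
A hedged sketch of that three-way partition: a kernel needs only memory optimizations if its bandwidth roof sits below even the lowest compute ceiling, only compute optimizations if even the lowest bandwidth ceiling clears the compute roof, and both otherwise. The ceiling values below are placeholders, not measurements:

```python
def optimization_region(oi, peak_gflops, lowest_compute_ceiling,
                        peak_gb_per_s, lowest_bw_ceiling):
    if oi * peak_gb_per_s < lowest_compute_ceiling:
        return "memory optimizations only"
    if oi * lowest_bw_ceiling > peak_gflops:
        return "compute optimizations only"
    return "compute + memory optimizations"

# Placeholder machine: 75 GFLOP/s peak (9.4 with no in-core optimizations),
# 11 GB/s peak bandwidth (2.75 GB/s with neither prefetching nor NUMA).
for oi in (0.25, 1.0, 32.0):
    print(oi, optimization_region(oi, 75.0, 9.4, 11.0, 2.75))
```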

Status of the Roofline Model: It has been used for 2 other kernels on 4 other multicores, evaluating 2 financial PDE solvers on Intel Penryn and Larrabee plus NVIDIA G80 and GTX280. Version 1 fit in the L1 cache, which had enough bandwidth for peak throughput; Version 2 didn't fit, and the Roofline model helped figure out the cache blocking needed to reach peak throughput. We're also looking at non-floating-point kernels, e.g., Sort (potential exchanges/sec vs. GB/s) and Graph Traversal (nodes traversed/sec vs. GB/s). There are opportunities for others to help investigate: many kernels, multicores, and metrics. For example, Jike Chong ported two financial PDE solvers to four other multicore computers: the Intel Penryn and Larrabee and the NVIDIA G80 and GTX280.[9] He used the Roofline model to keep track of the platforms' peak arithmetic throughput and L1, L2, and DRAM bandwidths. By analyzing an algorithm's working set and operational intensity, he was able to use the Roofline model to quickly estimate the need for algorithmic improvements. Specifically, for the option-pricing problem with an implicit PDE solver, the working set is small enough to fit into L1 and the L1 bandwidth is sufficient to support peak arithmetic throughput, so the Roofline model indicates that no optimization is necessary. For option pricing with an explicit PDE formulation, the working set is too large to fit into cache, and the Roofline model helps to indicate the extent to which cache blocking is necessary to extract peak arithmetic performance.
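
The working-set reasoning in the PDE-solver example boils down to a simple capacity check. A hedged sketch follows, with entirely placeholder sizes (not the actual parameters of those solvers):

```python
# Does the working set fit in a cache level? If not, roughly how many
# blocks would a cache-blocking transformation need?

def fits(working_set_bytes, cache_bytes):
    return working_set_bytes <= cache_bytes

def min_blocks(working_set_bytes, cache_bytes):
    return -(-working_set_bytes // cache_bytes)   # ceiling division

L1_BYTES = 32 * 1024                  # placeholder 32 KB L1 data cache
implicit_ws = 24 * 1024               # hypothetical implicit-solver working set
explicit_ws = 6 * 1024 * 1024         # hypothetical explicit-solver working set

print(fits(implicit_ws, L1_BYTES))        # True  -> no blocking needed
print(fits(explicit_ws, L1_BYTES))        # False -> cache blocking needed
print(min_blocks(explicit_ws, L1_BYTES))  # 192 blocks in this example
```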

Final Performance (GFLOP/s achieved, with each dwarf's operational intensity in FLOPs/Byte):

dwarf      Op. Int.   Clovertown   Barcelona   Victoria Falls   Cell Blade
SpMV       0.25       2.8 GF/s     4.2 GF/s    7.3 GF/s         11.8 GF/s
Stencil    0.50       2.5 GF/s     8.0 GF/s    6.8 GF/s         14.2 GF/s
LBMHD      1.07       5.6 GF/s     11.4 GF/s   10.5 GF/s        16.7 GF/s
3-D FFT    1.64       9.7 GF/s     14.0 GF/s   9.2 GF/s         15.7 GF/s
Peak GF               75.0 GF/s    74.0 GF/s   19.0 GF/s        29.0 GF/s
Ridge Pt              6.7 Fl/B     4.6 Fl/B    0.3 Fl/B         0.5 Fl/B