Slide 1: The Roofline Model: A pedagogical tool for program analysis and optimization
ParLab Summer Retreat
Samuel Williams, David Patterson
Parallel Computing Laboratory (Par Lab), EECS, UC Berkeley

Slide 2: Motivation
 Performance and scalability of multicore architectures can be extremely non-intuitive to novice programmers
 Success of the multicore paradigm should be premised on augmenting the abilities of the world's programmers

Slide 3: Goals & Audience
 Focused on rates and efficiencies (Gflop/s, % of peak)
 Goals for Roofline:
   Provide everyone with a graphical aid that gives realistic expectations of performance and productivity
   Show inherent hardware limitations for a given kernel
   Show potential benefit and priority of optimizations
 Who's not the audience for the Roofline:
   Not for those interested in fine tuning (+5%)
   Not for those challenged by parallel kernel correctness

Slide 4: Principal Components of Performance

Slide 5: Components
 There are three principal components to performance:
   Computation
   Communication
   Locality
 Each architecture has a different balance between these
 Each kernel has a different balance between these
 Performance is a question of how well a kernel's characteristics map to an architecture's characteristics

Slide 6: Computation
 For us, floating-point performance (Gflop/s) is the metric of interest (typically double precision)
 Peak in-core performance can only be attained if:
   ILP, DLP, FMA, etc. are fully exploited
   non-FP instructions don't sap instruction bandwidth
   threads don't diverge (GPUs)
   transcendental / non-pipelined instructions are used sparingly
   branch mispredictions are rare
 To exploit a form of in-core parallelism, it must be:
   Inherent in the algorithm
   Expressed in the high-level implementation
   Explicit in the generated code
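To make the last point concrete, here is a small illustrative sketch (mine, not from the talk): a reduction written with a single accumulator serializes the floating-point pipeline, while a modest unroll with independent accumulators expresses the ILP (and SIMD-friendly structure) the slide is asking for. The function name and unroll factor are arbitrary.

/* One accumulator would serialize the FP adds; four independent accumulators
 * expose ILP that the hardware (and a vectorizing compiler) can exploit. */
double dot_unrolled(const double *x, const double *y, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {      /* 4-way unroll: independent chains */
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                    /* remainder loop */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}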

Slide 7: Communication
 For us, DRAM bandwidth (GB/s) is the metric of interest
 Peak bandwidth can only be attained if certain optimizations are employed:
   Few unit-stride streams
   NUMA allocation and usage
   SW prefetching
   Memory coalescing (GPU)

Slide 8: Locality
 Computation is free, communication is expensive
 Maximize locality to minimize communication
 There is a lower limit to communication: compulsory traffic
 Hardware changes can help minimize communication:
   Larger cache capacities minimize capacity misses
   Higher cache associativities minimize conflict misses
   Non-allocating caches minimize compulsory traffic
 Software optimization can also help minimize communication:
   Padding avoids conflict misses
   Blocking avoids capacity misses
   Non-allocating stores minimize compulsory traffic

Slide 9: Roofline Model

Slide 10: Integrating Components…
 Goal: integrate in-core performance, memory bandwidth, and locality into a single, readily understandable performance figure
 Also, must graphically show the penalty associated with not including certain software optimizations
 The Roofline model will be unique to each architecture
 The coordinates of a kernel are ~unique to each architecture

Slide 11: What relates GB/s to GFlop/s?
 Through dimensional analysis, it's clear that Flops:Bytes is the parameter that allows us to convert bandwidth (GB/s) to performance (GFlop/s)
 This is a well-known quantity: Arithmetic Intensity (discussed later)
 When we measure total bytes, we incorporate all cache behavior (the 3 C's) and locality
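Spelled out, the dimensional analysis is simply (notation added here, not on the slide):

\[
\frac{\text{GBytes}}{\text{s}} \times \frac{\text{Flops}}{\text{Byte}} \;=\; \frac{\text{GFlops}}{\text{s}}.
\]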

Slide 12: Basic Roofline
 Performance is upper-bounded by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio:

  Attainable Gflop/s = min( Peak Gflop/s, Stream BW x actual flop:byte ratio )
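A minimal sketch of this bound as code, useful for tracing out a roofline numerically. The peak and bandwidth values in main() are made-up placeholders, not measurements from the talk.

#include <stdio.h>

/* Attainable Gflop/s = min(peak Gflop/s, stream bandwidth * arithmetic intensity) */
static double roofline_gflops(double peak_gflops, double stream_gbs, double flops_per_byte)
{
    double bw_bound = stream_gbs * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    /* Hypothetical machine: 75 Gflop/s peak SP, 15 GB/s stream bandwidth */
    for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
        printf("AI = %6.3f  ->  %6.1f Gflop/s\n", ai, roofline_gflops(75.0, 15.0, ai));
    return 0;
}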

Slide 13: Notes
 Bandwidth numbers collected via micro-benchmarks
 Computation numbers derived from optimization manuals (pencil and paper)
 Assume complete overlap of either communication or computation

Slide 14: Roofline model for Opteron (adding ceilings)
[Figure: log-log roofline for the AMD Opteron 2356 (Barcelona); x-axis: flop:DRAM byte ratio, y-axis: attainable Gflop/s; lines: peak SP and peak stream bandwidth.]
 Peak roofline performance:
   based on the manual for single-precision peak
   and a hand-tuned stream read for bandwidth
 Note the log scale!

Slide 15: Roofline model for Opteron (adding ceilings)
[Same figure and bullets as Slide 14: peak SP roofline and peak stream bandwidth.]

Slide 16: Roofline model for Opteron (adding ceilings)
[Figure: ceilings added: peak SP, mul/add imbalance, peak stream bandwidth.]
 Opterons have separate multipliers and adders ('functional unit parallelism')
 This is a ceiling beneath the roofline

Slide 17: Roofline model for Opteron (adding ceilings)
[Figure: ceilings: peak SP, mul/add imbalance, w/out SIMD, peak stream bandwidth.]
 In single precision, SIMD is 4 x 32b
 If only the _ss versions are used, performance is 1/4

Slide 18: Roofline model for Opteron (adding ceilings)
[Figure: ceilings: peak SP, mul/add imbalance, w/out SIMD, w/out ILP, peak stream bandwidth.]
 If 4 independent instructions are not kept in the pipeline (insufficient ILP), performance will fall

Slide 19: Roofline model for Opteron (adding ceilings)
[Figure: bandwidth ceilings added: peak stream bandwidth, w/out SW prefetching.]
 If SW prefetching is not used, performance will degrade
 These act as ceilings below the bandwidth roofline

Slide 20: Roofline model for Opteron (adding ceilings)
[Figure: bandwidth ceilings: peak stream bandwidth, w/out SW prefetching, w/out NUMA optimizations.]
 Without NUMA optimizations, the memory controllers on the second socket can't be used

Slide 21: Roofline model for Opteron (adding ceilings)
[Figure: bandwidth ceilings: peak stream bandwidth, w/out SW prefetching, w/out NUMA optimizations, w/out unit-stride streams.]
 Bandwidth is much lower without unit-stride streams

Slide 22: Roofline model for Opteron (adding ceilings)
[Figure: raw DRAM bandwidth added alongside the attainable bandwidth ceilings.]
 It's difficult for any architecture to reach the raw DRAM bandwidth

Slide 23: Roofline model for Opteron (adding ceilings)
[Figure: full roofline with all in-core and bandwidth ceilings.]
 Partitions the regions of expected performance into three optimization regions:
   Compute only
   Memory only
   Compute + Memory

Slide 24: Uniqueness
 There is no single ordering or roofline model
 The order of ceilings is generally (bottom up):
   What is inherent in the algorithm
   What a compiler is likely to provide
   What a programmer could provide
   What can never be exploited for this kernel
 For example:
   FMA or mul/add balance is inherent in many linear algebra routines and should be placed at the bottom
   However, many stencils are dominated by adds, and thus the multipliers and FMA go underutilized

Slide 25: Arithmetic Intensity in HPC
 Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
 Some HPC kernels have an arithmetic intensity that's constant, but on others it scales with problem size (increasing temporal locality)
 Actual arithmetic intensity is capped by cache/local store capacity
[Figure: spectrum of arithmetic intensity:
  O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods;
  O(log N): FFTs;
  O(N): dense linear algebra (BLAS3), particle methods.]
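As an illustrative worked example (mine, not the slide's) of a constant arithmetic intensity, take the double-precision BLAS1 update y <- alpha*x + y and count only compulsory DRAM traffic:

\[
\mathrm{AI}
= \frac{2N\ \text{flops}}{\underbrace{8N + 8N}_{\text{read } x,\,y} + \underbrace{8N}_{\text{write } y}\ \text{bytes}}
= \frac{2}{24} \approx 0.083\ \text{flops/byte},
\]

independent of the problem size N. By contrast, a blocked BLAS3 kernel performs O(N^3) flops on O(N^2) data, so its intensity grows with problem size.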

Slide 26: Accurately Determining the True Flop:DRAM Byte Ratio
 Remember the 3 C's of caches
 Calculating the flop:DRAM byte ratio:
   Compulsory misses: straightforward
   Capacity misses: pencil and paper (maybe performance counters)
   Conflict misses: must use performance counters
 Flop : actual DRAM byte ratio < Flop : compulsory DRAM byte ratio
 One might place a range on the arithmetic intensity ratio
 Thus performance is limited to an area between the ceilings and between the upper (compulsory) and lower bounds on arithmetic intensity
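One way to express the resulting performance band in code (a sketch; the flop count and compulsory byte count come from pencil-and-paper analysis and the measured byte count from DRAM performance counters, none of which are shown here):

/* The kernel's attainable performance lies between the roofline evaluated at
 * the lower (measured) and upper (compulsory) arithmetic intensities. */
struct perf_band { double gflops_lower, gflops_upper; };

static double roofline(double peak_gflops, double stream_gbs, double ai)
{
    double bw_bound = stream_gbs * ai;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

static struct perf_band expected_perf(double peak_gflops, double stream_gbs,
                                      double flops,
                                      double compulsory_bytes, /* pencil & paper  */
                                      double measured_bytes)   /* DRAM counters   */
{
    struct perf_band b;
    b.gflops_lower = roofline(peak_gflops, stream_gbs, flops / measured_bytes);
    b.gflops_upper = roofline(peak_gflops, stream_gbs, flops / compulsory_bytes);
    return b;
}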

Slide 27: Roofline model for Opteron (PowerPoint doodle)
[Figure: final roofline for the AMD Opteron 2356 (Barcelona) with all ceilings.]
 Final Roofline

Slide 28: Roofline model for Opteron (PowerPoint doodle)
[Figure: vertical line at the kernel's compulsory flop:byte ratio overlaid on the roofline.]
 Some arbitrary kernel has a flop:compulsory byte ratio of 4
 Overlaid on the roofline
 Defines the upper bound on the range of expected performance
 Also shows which optimizations are likely

Slide 29: Roofline model for Opteron (PowerPoint doodle)
[Figure: capacity misses shift the kernel's arithmetic intensity left of the compulsory limit.]
 Capacity misses reduce the actual flop:byte ratio
 They also reduce attainable performance
 AI is unique to each combination of kernel and architecture

Slide 30: Roofline model for Opteron (PowerPoint doodle)
[Figure: conflict misses shift the arithmetic intensity still further left.]
 Conflict misses may destroy performance
 AI is unique to each combination of kernel and architecture

Slide 31: Roofline model for Opteron (PowerPoint doodle)
[Figure: range of arithmetic intensity between the conflict/capacity-limited and compulsory values.]
 Conflict misses may destroy performance
 Doubling cache capacity doesn't double the arithmetic intensity!

Slide 32: Three Categories of Software Optimization

Slide 33: Maximizing Attained In-core Performance
[Figure: optimizations punch through the horizontal (in-core) ceilings.]
 Software optimizations such as explicit SIMDization can punch through the horizontal ceilings (what can be expected from a compiler)
 Other examples include loop unrolling, reordering, and long running loops

Slide 34: Maximizing Attained Memory Bandwidth
[Figure: optimizations punch through the diagonal (bandwidth) ceilings.]
 Compilers won't give great out-of-the-box bandwidth
 Punch through the bandwidth ceilings:
   Maximize MLP
   Long unit-stride accesses
   NUMA-aware allocation and parallelization
   SW prefetching
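A sketch of two of these optimizations in C with OpenMP (illustrative only; array names, the kernel, and the prefetch distance are arbitrary): first-touch initialization places pages on the socket whose threads will later use them, and explicit software prefetch helps hide DRAM latency. Compile with OpenMP enabled (e.g. -fopenmp).

#include <stdlib.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* First-touch initialization: pages are mapped to the NUMA node of the thread
 * that first writes them, so use the same static partitioning here as in the
 * compute loops that will later consume the array. */
static double *alloc_init_numa(long n)
{
    double *p = malloc((size_t)n * sizeof *p);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        p[i] = 0.0;
    return p;
}

#define PF_DIST 64   /* prefetch distance in elements; a tuning parameter */

/* Streaming kernel with explicit software prefetch of the read stream. */
static void scaled_copy(double *restrict dst, const double *restrict src,
                        double alpha, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&src[i + PF_DIST], _MM_HINT_T0);
        dst[i] = alpha * src[i];    /* long unit-stride access */
    }
}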

Slide 35: Minimizing Memory Traffic
[Figure: reducing misses moves the kernel's arithmetic intensity toward the compulsory limit.]
 Use performance counters to measure the flop:byte ratio (AI)
 Out-of-the-box code may have an AI ratio much less than the compulsory ratio
 Be cognizant of cache capacities, associativities, and threads sharing them
 Pad structures to avoid conflict misses
 Use cache blocking to avoid capacity misses
 These optimizations can be imperative
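A sketch of padding plus cache blocking for a classic reuse-limited operation, a square matrix transpose (sizes, names, and the tile size are illustrative). Without tiling, the strided accesses touch a new cache line on every iteration and each line is evicted before its remaining elements are used, inflating DRAM traffic well above the compulsory minimum.

#define N   4096
#define PAD 8            /* pad the leading dimension so rows of a power-of-two
                            sized matrix don't collide in the same cache sets
                            (conflict misses)                                  */
#define LD  (N + PAD)
#define B   64           /* tile size chosen so a BxB tile of src and dst fits
                            in cache (capacity misses); N is divisible by B    */

/* Cache-blocked transpose: dst = src^T, both stored with leading dimension LD. */
void blocked_transpose(double *dst, const double *src)
{
    for (long jj = 0; jj < N; jj += B)
        for (long ii = 0; ii < N; ii += B)
            for (long j = jj; j < jj + B; j++)
                for (long i = ii; i < ii + B; i++)
                    dst[i * LD + j] = src[j * LD + i];
}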

Slide 36: Effective Roofline (before)
[Figure: shaded region between the lowest ceilings and the conflict/capacity-limited arithmetic intensity.]
 Before optimization, excess memory traffic and the absence of bandwidth optimizations limit performance to a very narrow window

Slide 37: Effective Roofline (after)
[Figure: shaded region expands up to the peak ceilings and out to the compulsory arithmetic intensity.]
 After optimization, ideally, performance is significantly better

Slide 38: Applicable to Other Architectural Paradigms?

Slide 39: Four Architectures
[Figure: block diagrams of the four platforms: AMD Barcelona (dual-socket Opteron 2356, DDR2, HyperTransport), Sun Victoria Falls (dual-socket Niagara2, FBDIMM, multithreaded SPARC cores), IBM Cell Blade (VMT PPEs plus eight SPEs with 256 KB local stores per socket, XDR DRAM), and NVIDIA G80; caches, memory controllers, and peak bandwidths annotated.]

Slide 40: Four Architectures
[Same block diagrams, annotated to contrast architectures with one thread per core versus multithreaded cores.]

Slide 41: Four Architectures
[Same block diagrams, annotated to contrast cache-based architectures with local-store-based architectures.]

Slide 42: 32b Rooflines for the Four (in-core parallelism)
[Figure: single-precision log-log rooflines for AMD Barcelona, Sun Victoria Falls, NVIDIA G80, and IBM Cell Blade; in-core ceilings include mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, and w/out memory coalescing; bandwidth ceilings include w/out NUMA, w/out SW prefetch, and w/out DMA concurrency.]
 Single-precision Roofline models for the SMPs used in this work
 Based on micro-benchmarks, experience, and manuals
 Ceilings = in-core parallelism
 Can the compiler find all this parallelism?
 NOTE:
   log-log scale
   Assumes perfect SPMD

Slide 43: 32b Rooflines for the Four (diverged threads)
[Figure: same four rooflines; on the G80 the ceilings are labeled 4, 8, 16, and 32 unique PCs.]
 G80 dynamically finds DLP (shared instruction fetch): SIMT
 If the threads of a warp diverge from SIMD execution, performance is limited by instruction issue bandwidth
 Ceilings on G80 = number of unique PCs when threads diverge

Slide 44: 32b Rooflines for the Four (FP fraction of dynamic instructions)
[Figure: same four rooflines; ceilings are labeled FP = 50%, 25%, 12%, 6% of the dynamic instruction mix.]
 Some kernels have large numbers of non-FP instructions
 This saps instruction issue bandwidth
 Ceilings = FP fraction of the dynamic instruction mix
 NOTE: assumes perfect in-core parallelism

Slide 45: 32b Rooflines for the Four (ridge point)
[Figure: same four rooflines, highlighting where each ridge point falls.]
 Some architectures have drastically different ridge points
 Victoria Falls may be compute-bound on many kernels
 Clovertown has 1/3 the bandwidth of Barcelona, so its ridge point is further to the right

Slide 46: Using Roofline when Auto-tuning HPC Kernels

Slide 47: Multicore SMPs Used
[Figure: block diagrams of the four SMPs used in the auto-tuning studies: AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade; caches, memory controllers, and peak bandwidths annotated.]

Slide 48: Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Slide 49: Sparse Matrix Vector Multiplication
[Figure: y = Ax with a sparse matrix A and dense vectors x and y.]
 Sparse matrix:
   Most entries are 0.0
   Performance advantage in only storing/operating on the nonzeros
   Requires significant metadata
 Evaluate y = Ax
   A is a sparse matrix
   x and y are dense vectors
 Challenges:
   Difficult to exploit ILP (bad for superscalar)
   Difficult to exploit DLP (bad for SIMD)
   Irregular memory access to the source vector
   Difficult to load balance
   Very low arithmetic intensity (often < 0.166 flops/byte) = likely memory bound
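For reference, a minimal (unoptimized) compressed sparse row implementation of y = Ax, not the auto-tuned code from the paper. It exhibits exactly the problems listed above: one multiply-add per nonzero, an indexed (irregular) load of x, and, counting only the 8-byte value and 4-byte column index per nonzero, at most 2/12 ≈ 0.166 flops/byte even before the row pointers and vectors are included.

/* CSR SpMV: rowptr has nrows+1 entries; colidx and val have nnz entries each. */
void spmv_csr(long nrows, const long *rowptr, const int *colidx,
              const double *val, const double *x, double *y)
{
    for (long r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (long k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += val[k] * x[colidx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}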

Slide 50: Roofline model for SpMV
[Figure: double-precision rooflines for the four platforms; no naïve Cell implementation.]
 Double-precision roofline models
 FMA is inherent in SpMV (place it at the bottom)

Slide 51: Roofline model for SpMV
[Figure: the naïve kernel's arithmetic intensity marked on each roofline; no naïve Cell implementation.]
 Two unit-stride streams
 Inherent FMA
 No ILP
 No DLP
 FP is 12-25% of the instruction mix
 Naïve compulsory flop:byte < 0.166

Slide 52: Roofline model for SpMV (out-of-the-box parallel)
[Figure: out-of-the-box parallel performance plotted on the rooflines; no naïve Cell implementation.]
 Two unit-stride streams
 Inherent FMA
 No ILP
 No DLP
 FP is 12-25% of the instruction mix
 Naïve compulsory flop:byte < 0.166
 For simplicity: dense matrix stored in sparse format

Slide 53: Roofline model for SpMV (NUMA & SW prefetch)
[Figure: performance after NUMA and software-prefetch optimizations; no naïve Cell implementation.]
 Compulsory flop:byte ~ 0.166
 Utilize all memory channels

Slide 54: Roofline model for SpMV (matrix compression)
[Figure: performance after matrix compression plotted on the rooflines.]
 Inherent FMA
 Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions

Slide 55: Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS). Best Paper, Application Track.

Slide 56: LBMHD
[Figure: lattice diagrams of the momentum distribution (27 scalar components), the magnetic distribution (15 vector components), and the macroscopic variables.]
 Plasma turbulence simulation via the Lattice Boltzmann Method
 Two distributions:
   momentum distribution (27 scalar components)
   magnetic distribution (15 vector components)
 Three macroscopic quantities:
   Density
   Momentum (vector)
   Magnetic field (vector)
 Must read 73 doubles and update 79 doubles per point in space
 Requires about 1300 floating-point operations per point in space
 Just over 1.0 flops/byte (ideal)
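The quoted intensity follows from those counts (my arithmetic, assuming only the compulsory traffic of the 73 reads and 79 updates at 8 bytes per double):

\[
\mathrm{AI}_{\text{ideal}} \approx \frac{1300\ \text{flops}}{(73 + 79)\times 8\ \text{bytes}}
= \frac{1300}{1216} \approx 1.07\ \text{flops/byte}.
\]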

Slide 57: Roofline model for LBMHD
[Figure: double-precision rooflines for the four platforms; no naïve Cell implementation.]
 Huge datasets
 NUMA allocation/access
 Little ILP
 No DLP
 Far more adds than multiplies (imbalance)
 Essentially random access to memory
 Flop:byte ratio ~0.7
 High conflict misses

Slide 58: Roofline model for LBMHD (out-of-the-box code)
[Figure: out-of-the-box performance plotted on the rooflines; no naïve Cell implementation.]
 Huge datasets
 NUMA allocation/access
 Little ILP
 No DLP
 Far more adds than multiplies (imbalance)
 Essentially random access to memory
 Flop:byte ratio ~0.7
 High conflict misses
 Peak Victoria Falls performance with 64 threads (out of 128) due to high conflict misses

Slide 59: Roofline model for LBMHD (padding, vectorization, unrolling, reordering)
[Figure: performance after these optimizations; no naïve Cell implementation.]
 Vectorize the code to eliminate TLB capacity misses
 Ensures unit-stride access (bottom bandwidth ceiling)
 Tune for the optimal vector length (VL)
 Clovertown is pinned to the lower bandwidth ceiling

Slide 60: Roofline model for LBMHD (SIMDization + cache bypass)
[Figure: performance after explicit SIMDization and cache-bypass stores.]
 Make SIMDization explicit
 Technically, this swaps the ILP and SIMD ceilings
 Use the cache-bypass instruction movntpd
 Increases the flop:byte ratio to ~1.0 on x86/Cell
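A sketch of the cache-bypass idea with SSE2 intrinsics (illustrative only; the real auto-tuned code generators are far more involved). _mm_stream_pd compiles to movntpd, so the stores do not allocate lines in the cache and the write-allocate read traffic disappears, which is what raises the effective flop:byte ratio.

#include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd; _mm_sfence via xmmintrin */

/* Scale src into dst using non-temporal (cache-bypassing) stores.
 * Assumes dst is 16-byte aligned and n is even, for brevity. */
void scale_stream(double *dst, const double *src, double alpha, long n)
{
    __m128d va = _mm_set1_pd(alpha);
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(va, _mm_loadu_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);       /* movntpd: bypass the cache */
    }
    _mm_sfence();                        /* order the streamed stores */
}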

Slide 61: The Heat Equation Stencil
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", submitted to Supercomputing (SC), 2008.

Slide 62: The Heat Equation Stencil
[Figure: regular PDE grid and the 7-point stencil touching (x,y,z) and its six neighbors x±1, y±1, z±1.]
 Explicit heat equation on a regular grid
 Jacobi
 One double per point in space
 7-point nearest-neighbor stencil
 Must:
   read every point from DRAM
   perform 8 flops (a linear combination)
   write every point back to DRAM
 Just over 0.5 flops/byte (ideal)
 Cache locality is important
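For reference, a naïve version of the kernel being modeled (a sketch; the array names, coefficients c0/c1, and index macro are mine). Each point is read, combined with its six neighbors using 8 flops, and written back: 8 flops per 16 bytes of compulsory traffic gives the quoted ~0.5 flops/byte, and the 1/3 used on the next slide presumably also counts the write-allocate fill of next[].

/* Naive 7-point heat-equation (Jacobi) sweep on an nx x ny x nz grid
 * with a one-point halo; the grids are stored in row-major order. */
void heat_sweep(double *next, const double *curr, double c0, double c1,
                long nx, long ny, long nz)
{
    #define IDX(x, y, z) ((z) * ny * nx + (y) * nx + (x))
    for (long z = 1; z < nz - 1; z++)
        for (long y = 1; y < ny - 1; y++)
            for (long x = 1; x < nx - 1; x++)
                next[IDX(x, y, z)] =
                    c0 * curr[IDX(x, y, z)]
                  + c1 * ( curr[IDX(x-1, y, z)] + curr[IDX(x+1, y, z)]
                         + curr[IDX(x, y-1, z)] + curr[IDX(x, y+1, z)]
                         + curr[IDX(x, y, z-1)] + curr[IDX(x, y, z+1)] );
    #undef IDX
}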

Slide 63: Roofline model for Stencil (out-of-the-box code)
[Figure: double-precision rooflines for the four platforms; no naïve Cell implementation.]
 Large datasets
 2 unit-stride streams
 No NUMA
 Little ILP
 No DLP
 Far more adds than multiplies (imbalance)
 Ideal flop:byte ratio of 1/3
 High locality requirements
 Capacity and conflict misses will severely impair the flop:byte ratio

Slide 64: Roofline model for Stencil (out-of-the-box code)
[Same bullets and figure caption as Slide 63; no naïve Cell implementation.]

Slide 65: Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)
[Figure: performance after these optimizations; no naïve Cell implementation.]
 Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3
 Clovertown has huge caches but is pinned to the lower bandwidth ceiling
 Cache management is essential when capacity per thread is low

Slide 66: Roofline model for Stencil (SIMDization + cache bypass)
[Figure: performance after explicit SIMDization and cache-bypass stores.]
 Make SIMDization explicit
 Technically, this swaps the ILP and SIMD ceilings
 Use the cache-bypass instruction movntpd
 Increases the flop:byte ratio to ~0.5 on x86/Cell

Slide 67: Refining the Roofline

Slide 68: Other performance metrics
 There is no reason floating point (Gflop/s) must be the performance metric
 Could also use:
   Graphics (pixels, vertices, textures)
   Crypto
   Integer
   Bitwise
   etc.

Slide 69: Other bandwidths
 For our kernels, DRAM bandwidth is the key communication component
 For other kernels, other bandwidths might be more appropriate:
   L2 bandwidth (e.g. DGEMM)
   PCIe bandwidth (offload to GPU)
   Network bandwidth
 The example below shows zero-overhead, double-buffered transfers to/from a GPU over PCIe x16
 How bad is a SP stencil?
 What about SGEMM?
 No overlap / high overhead tends to smooth performance
 Performance is half at the ridge point
[Figure: roofline for the NVIDIA G80 drawn against PCIe bandwidth (attainable Gflop/s (32b) vs. flop:PCIe byte ratio), with FP-fraction ceilings (50%, 25%, 12%, 6%) and full-duplex vs. half-duplex bandwidth lines.]

Slide 70: Mix and match
 In general, you can mix and match as the kernel/architecture requires
 e.g. the set of all possibilities is the cross product of performance metrics with bandwidths:
   {Gflop/s, GIPS, crypto, …} x {L2, DRAM, PCIe, Network}

Slide 71: Conclusions
 The Roofline Model provides an intuitive graph for kernel analysis and optimization
 Easily extendable to other architectural paradigms
 Easily extendable to other communication or computation metrics

Slide 72: Questions?

Slide 73: BACKUP SLIDES

Slide 74: Trends
[Figure: roofline with the y-axis as % of peak Gflop/s (1.6% to 100%); in-core ceilings: peak SP, mul/add imbalance, w/out SIMD, w/out ILP; bandwidth lines: peak stream bandwidth, w/out memory optimizations.]
 ILP is decreasing (shorter pipelines, multithreading)
 SIMD is becoming wider
 Bandwidth isn't keeping up with the number of cores
 Application flop:byte is decreasing
 Cache/local store management is becoming critical

Slides 75-78: Trends (build slides repeating the bullets and figure of Slide 74)

Slide 79: Alternate View (Barcelona)
[Figure: the same ceilings plotted on different axes: attainable Gflop/s (32b) versus attainable GB/s.]
 The same formalism can be plotted on different axes
 It is difficult to plot AI on these axes

Slide 80: No Overlap
 What if computation or communication isn't totally overlapped?
 At the ridge point, 50% of the time is spent in each, so performance is cut in half
 In effect, the curves are smoothed
 Common for bulk-synchronous MPI communication, atypical for DRAM access on modern architectures

Slide 81: Roofline model for Opteron (non-overlapped)
[Figure: double-precision roofline for the AMD Opteron (rev.F) with smoothed, non-overlapped curves; ceilings: peak DP, mul/add imbalance, w/out ILP or SIMD; bandwidth: peak stream bandwidth, w/out SW prefetching, w/out NUMA optimizations.]
 Not typical of multi-thread/core architectures
 Not typical of architectures with out-of-order execution or HW prefetchers
 More common in network accesses

Slide 82: Auto-tuned Performance (+Cell/SPE version)
[Figure: stacked-bar performance for Intel Xeon (Clovertown), AMD Opteron (rev.F), Sun Niagara2 (Huron), and IBM Cell Blade (SPEs); legend of cumulative optimizations: Naïve, Pthreads Naïve, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), FW fix, array padding (N2), etc.]
 Wrote a double-precision Cell/SPE version
 DMA, local-store blocked, NUMA aware, etc.
 Only 2x1 and larger BCOO
 Only the SpMV-proper routine changed
 About 12x faster (median) than using the PPEs alone

Slide 83: Auto-tuned Performance (Local Store Implementation)
[Figure: stacked-bar performance for Intel Xeon (Clovertown), AMD Opteron (rev.F), Sun Niagara2 (Huron), and IBM Cell Blade (SPEs); legend of cumulative optimizations: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization; * = collision() only.]
 First attempt at a Cell implementation
 VL, unrolling, and reordering fixed
 No NUMA
 Exploits DMA and double buffering to load vectors
 Straight to SIMD intrinsics
 Despite the relative performance, Cell's DP implementation severely impairs performance

Slide 84: Where do GPUs fit in?
[Figure: taxonomy with axes "expressed at compile time vs. discovered at run time" and "instruction-level parallelism vs. data-level parallelism", placing VLIW, SIMD/vector, superscalar/SMT, and the G80.]
 GPUs discover data-level parallelism from thread-level parallelism at runtime