1 A Vision for Integrating Performance Counters into the Roofline Model
Samuel Williams 1,2 with Andrew Waterman 1, Heidi Pan 1,3, David Patterson 1, Krste Asanovic 1, Jim Demmel 1
1 University of California, Berkeley   2 Lawrence Berkeley National Laboratory   3 Massachusetts Institute of Technology

2 Outline
- Auto-tuning
  - Introduction to auto-tuning
  - BeBOP's previous performance counter experience
  - BeBOP's current tuning efforts
- Roofline Model
  - Motivating example - SpMV
  - Roofline model
  - Performance counter enhanced Roofline model

3 Motivation (folded into Jim's talk)

4 Gini Coefficient
- In economics, the Gini coefficient is a measure of the distribution of wealth within a society
- As wealth becomes concentrated, the value of the coefficient increases and the curve departs from a straight line. It is just an assessment of the distribution, not a commentary on what it should be.
[Figure: Lorenz curve - cumulative fraction of the total wealth vs. cumulative fraction of the total population (0% to 100% on both axes); the diagonal represents a uniform distribution of wealth]

5 What's the Gini Coefficient for our Society?
- By our society, I mean those working in the performance optimization and analysis world (tuners, profilers, counters)
- Our wealth is the knowledge of such tools and the benefit gained from them
[Figure: Lorenz-style curve - cumulative fraction of the value of performance tools vs. cumulative fraction of the total programmer population; the diagonal would mean value uniform across the population, while the actual curve shows the entire benefit going to a select few]

6 Why is it so low?
- Apathy
  - Performance only matters after correctness
  - Scalability has won out over efficiency
  - The timescale of Moore's law has been shorter than that of optimization
- Ignorance / lack of specialized education
  - Tools assume broad and deep architectural knowledge
  - Optimization may require detailed application knowledge
  - Significant SysAdmin support
  - Cryptic tools/presentation
  - Erroneous data
- Frustration

7 To what value should we aspire?
- Certainly unreasonable for every programmer to be cognizant of performance counters
- Equally unreasonable for the benefit to be uniform
- Making performance tools more intuitive, more robust, easier to use (always on?), and essential in a multicore era will motivate more users to exploit them
- Transparently to programmers, compilers, architectures, and middleware may exploit performance counters to improve performance
[Figure: target Lorenz-style curve - cumulative fraction of the value of performance tools vs. cumulative fraction of the total programmer population, closer to value uniform across the population]

8 I. Auto-tuning & performance counter experience

9 Introduction to Auto-tuning

10 Out-of-the-box Code Problem
- Out-of-the-box code has (unintentional) assumptions on:
  - cache sizes (>10MB)
  - functional unit latencies (~1 cycle)
  - etc.
- These assumptions may result in poor performance when they exceed the machine's characteristics

11 Auto-tuning?
- Trade an up-front loss in productivity for continued reuse of automated kernel optimization on other architectures
- Given existing optimizations, auto-tuning automates the exploration of the optimization and parameter space
- Two components:
  - a parameterized code generator (we wrote ours in Perl)
  - an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search)
- Auto-tuners that generate C code provide performance portability across the existing breadth of architectures
- Can be extended with ISA-specific optimizations (e.g. DMA, SIMD)
[Figure: breadth of existing architectures - Intel Clovertown, AMD Santa Rosa, Sun Niagara2, IBM QS20 Cell Blade]
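A minimal sketch of the exploration-benchmark component described above (not BeBOP's actual Perl-generated code; the kernel, its blocking-factor knob, and the candidate values are hypothetical): each parameterized variant is timed, and the best configuration is kept.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double x[N], y[N];

    /* Hypothetical parameterized kernel: the blocking factor 'b' is the tuning knob. */
    static void kernel(int b) {
        for (int i = 0; i < N; i += b)
            for (int j = i; j < i + b && j < N; j++)
                y[j] += 2.0 * x[j];
    }

    static double seconds(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + 1e-9 * t.tv_nsec;
    }

    int main(void) {
        const int candidates[] = {1, 2, 4, 8, 16, 32};  /* exhaustive search over a small space */
        int best = candidates[0];
        double best_time = 1e30;
        for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++) {
            double t0 = seconds();
            kernel(candidates[c]);
            double dt = seconds() - t0;
            if (dt < best_time) { best_time = dt; best = candidates[c]; }
        }
        printf("best blocking factor: %d (%.3f ms)\n", best, 1e3 * best_time);
        return 0;
    }

A real tuner would also average repeated runs and sweep many interacting parameters (register block sizes, prefetch distances, thread counts, and so on), which is why the search space explodes.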

12 BeBOP's Previous Performance Counter Experience (2-5 years ago)

13 Performance Counter Usage
- Perennially, performance counters have been used:
  - as a post-mortem to validate auto-tuning heuristics
  - to bound the remaining performance improvement
  - to understand unexpectedly poor performance
- However, this requires:
  - significant kernel and architecture knowledge
  - creation of a performance model specific to each kernel
  - calibration of the model
- Summary: we've experienced a progressively lower benefit and confidence in their use due to the variation in the quality and documentation of performance counter implementations

14 Experience (1)
- Sparse Matrix-Vector Multiplication (SpMV)
- "Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply"
  - Applied to older SPARC, Pentium III, and Itanium machines
  - Modeled cache misses (compulsory matrix, or compulsory matrix + vector)
  - Counted cache misses via PAPI
  - Generally well bounded (but with a large performance bound)
- "When Cache Blocking Sparse Matrix Vector Multiply Works and Why"
  - Similar architectures
  - Adds a fully associative TLB model (with a benchmarked TLB miss penalty)
  - Counts TLB misses (as well as cache misses)
  - Much better correlation to actual performance trends
- Only modeled and counted the total number of misses (bandwidth only)
- Performance counters didn't distinguish between 'slow' and 'fast' misses (i.e. didn't account for exposed memory latency)
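The flavor of the bounds model above can be sketched as follows. The byte counts assume a CSR layout with 8-byte values, 4-byte column indices, and 4-byte row pointers; the bandwidth argument stands in for a measured Stream-like figure and the matrix dimensions in main are placeholders, not numbers from the papers.

    #include <stdio.h>

    /* Compulsory-traffic lower bound for CSR SpMV: every matrix entry, index,
     * row pointer, and vector element must cross the memory bus at least once. */
    static double spmv_bound_gflops(long rows, long cols, long nnz, double stream_GBs) {
        double flops = 2.0 * nnz;                          /* one multiply + one add per nonzero */
        double bytes = 8.0 * nnz        /* values   */
                     + 4.0 * nnz        /* col idx  */
                     + 4.0 * (rows + 1) /* rowStart */
                     + 8.0 * cols       /* read x   */
                     + 8.0 * rows;      /* write y  */
        double seconds = bytes / (stream_GBs * 1e9);       /* time if traffic moves at stream BW */
        return flops / seconds / 1e9;                      /* attainable GFlop/s upper bound */
    }

    int main(void) {
        /* hypothetical matrix and bandwidth, for illustration only */
        printf("bound: %.2f GFlop/s\n", spmv_bound_gflops(1000000, 1000000, 5000000, 10.0));
        return 0;
    }

Comparing counted misses against a compulsory model of this kind is what the papers used to decide whether measured performance was well bounded.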

15 Experience (2)
- MSPc/SIREV papers
- Stencils (heat equation on a regular grid)
- Used newer architectures (Opteron, Power5, Itanium2)
- Attempted to model slow and fast misses (e.g. engaged prefetchers)
- Modeling generally bounds performance and captures the trends
- Attempted to use performance counters to understand the quirks
  - Opteron and Power5 performance counters didn't count prefetched data
  - Itanium performance counter trends correlated well with performance

16 BeBOP's Current Tuning Efforts (last 2 years)

17 BeBOP's Current Tuning Efforts
- Multicore (and distributed) oriented
- Throughput-oriented kernels on multicore architectures:
  - Dense linear algebra (LU, QR, Cholesky, ...)
  - Sparse linear algebra (SpMV, iterative solvers, ...)
  - Structured grids (LBMHD, stencils, ...)
  - FFTs
  - SW/HW co-tuning
  - Collectives (e.g. block transfers)
- Latency-oriented kernels:
  - Collectives (barriers, scalar transfers)

18 (Re)design for Evolution
- Design auto-tuners for an arbitrary number of threads
- Design auto-tuners to address the limitations of the multicore paradigm
- This will provide performance portability across both the existing breadth of multicore architectures and their evolution
[Figure: breadth of existing architectures (Intel Clovertown, AMD Santa Rosa, Sun Niagara2, IBM QS20 Cell Blade) evolving toward manycore (Nehalem, Barcelona, Victoria Falls, QS22 Cell Blade, Sandy Bridge, Sao Paulo, ...)]

19 II. Roofline Model: Facilitating Program Analysis and Optimization

20 Motivating Example: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

21 Sparse Matrix Vector Multiplication
- What's a sparse matrix?
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant metadata to reconstruct the matrix structure
- What's SpMV?
  - Evaluate y = Ax
  - A is a sparse matrix; x and y are dense vectors
- Challenges
  - Very low arithmetic intensity (often < 0.166 flops/byte)
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
[Figure: (a) algebraic view of y = Ax; (b) CSR data structure: A.val[], A.rowStart[], A.col[]]
(c) CSR reference code:
    for (r=0; r<A.rows; r++) {
      double y0 = 0.0;
      for (i=A.rowStart[r]; i<A.rowStart[r+1]; i++) {
        y0 += A.val[i] * x[A.col[i]];
      }
      y[r] = y0;
    }
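The reference loop in (c), made self-contained and runnable; the 3x3 example matrix is invented purely for illustration. It also shows where the low arithmetic intensity comes from: about 2 flops per nonzero against at least 12 bytes of matrix data (8-byte value plus 4-byte column index), i.e. at most roughly 0.166 flops/byte before even counting vector traffic.

    #include <stdio.h>

    /* Minimal CSR container matching the slide's A.val / A.rowStart / A.col fields. */
    typedef struct { int rows; int *rowStart; int *col; double *val; } csr_t;

    static void spmv(const csr_t *A, const double *x, double *y) {
        for (int r = 0; r < A->rows; r++) {
            double y0 = 0.0;
            for (int i = A->rowStart[r]; i < A->rowStart[r + 1]; i++)
                y0 += A->val[i] * x[A->col[i]];
            y[r] = y0;
        }
    }

    int main(void) {
        /* 3x3 example matrix: [[4,0,1],[0,3,0],[2,0,5]] */
        int rowStart[] = {0, 2, 3, 5};
        int col[]      = {0, 2, 1, 0, 2};
        double val[]   = {4.0, 1.0, 3.0, 2.0, 5.0};
        csr_t A = {3, rowStart, col, val};
        double x[] = {1.0, 1.0, 1.0}, y[3];
        spmv(&A, x, y);
        printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   /* expect [5, 3, 7] */
        return 0;
    }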

22 SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices
- Scalability isn't great
- Is this performance good?
[Figure: SpMV performance on the four platforms; series: Naïve Pthreads, Naïve]

23 SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices
- Scalability isn't great
- Is this performance good?
[Figure: same plots, annotated with parallel speedups - 1.9x using 8 threads, 43x using 128 threads, 3.4x using 4 threads, 2.5x using 8 threads; series: Naïve Pthreads, Naïve]

24 SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices
- Scalability isn't great
- Is this performance good?
- Parallelism resulted in better performance, but did it result in good performance?
[Figure: SpMV performance on the four platforms; series: Naïve Pthreads, Naïve]

25 Auto-tuned SpMV Performance (portable C)
- Fully auto-tuned SpMV performance across the suite of matrices
- Why do some optimizations work better on some architectures?
[Figure: auto-tuned SpMV performance; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]

26 Auto-tuned SpMV Performance (architecture-specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices
- Included an SPE/local-store optimized version
- Why do some optimizations work better on some architectures?
[Figure: auto-tuned SpMV performance; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]

27 Auto-tuned SpMV Performance (architecture-specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices
- Included an SPE/local-store optimized version
- Why do some optimizations work better on some architectures?
- Performance is better, but is performance good?
- Auto-tuning resulted in even better performance, but did it result in good performance?
[Figure: auto-tuned SpMV performance; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]

28 Auto-tuned SpMV Performance (architecture-specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices
- Included an SPE/local-store optimized version
- Why do some optimizations work better on some architectures?
- Performance is better, but is performance good?
- Should we spend another month optimizing it?
[Figure: auto-tuned SpMV performance; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]

29 How should the bulk of programmers analyze performance?

30 A spreadsheet of performance counters?

31 VTune?

32 Roofline Model
- It would be great if we could always get peak performance
[Figure: attainable Gflop/s with a horizontal line at peak flops]

33 Roofline Model (2)
- Machines have finite memory bandwidth
- Apply a bound and bottleneck analysis
- Still an unrealistically optimistic model
- Gflop/s(AI) = min(Peak Gflop/s, StreamBW * AI)
[Figure: log-log plot of attainable Gflop/s vs. flop:DRAM byte ratio (1/8 ... 16), with the flat 'peak flops' ceiling, the diagonal 'Stream Bandwidth' line, and the kernel's arithmetic intensity marked on the x-axis]
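The min() above written out as code; the peak and bandwidth figures in the example sweep are placeholders rather than numbers for any of the machines on the next slide.

    #include <stdio.h>

    /* Roofline: attainable GFlop/s is the lesser of peak compute and bandwidth x intensity. */
    static double roofline(double peak_gflops, double stream_GBs, double ai /* flops per DRAM byte */) {
        double bw_bound = stream_GBs * ai;
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void) {
        for (double ai = 0.125; ai <= 16.0; ai *= 2.0)   /* sweep the x-axis of the plot */
            printf("AI %6.3f -> %6.2f GFlop/s\n", ai, roofline(75.0, 20.0, ai));
        return 0;
    }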

34 Naïve Roofline Model (applied to four architectures)
- Bound and bottleneck analysis
- Unrealistically optimistic model
- Hand-optimized Stream BW benchmark
- Gflop/s(AI) = min(Peak Gflop/s, StreamBW * AI)
[Figure: rooflines (attainable Gflop/s vs. flop:DRAM byte ratio, 1/8 ... 16) for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade; each shows the peak DP ceiling and the hand-optimized Stream BW diagonal]

35 Roofline Model for SpMV
- Delineate performance by architectural paradigm = 'ceilings'
- In-core optimizations 1..i
- DRAM optimizations 1..j
- FMA is inherent in SpMV (place at bottom)
- GFlops_i,j(AI) = min(InCoreGFlops_i, StreamBW_j * AI)
[Figure: rooflines with ceilings (peak DP, w/out SIMD, w/out ILP, w/out FMA, mul/add imbalance, 25% FP, 12% FP, w/out SW prefetch, w/out NUMA, bank conflicts, dataset fits in snoop filter) for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade]

36 Roofline Model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams
- Inherent FMA
- No ILP
- No DLP
- FP is 12-25% of the instruction mix
- Naïve compulsory flop:byte < 0.166
[Figure: rooflines with ceilings for the four platforms, with the naïve arithmetic intensity overlaid; the Cell plot notes that there is no naïve SPE implementation]

37 Roofline Model for SpMV (out-of-the-box parallel)
- Two unit-stride streams
- Inherent FMA
- No ILP
- No DLP
- FP is 12-25% of the instruction mix
- Naïve compulsory flop:byte < 0.166
- For simplicity: dense matrix in sparse format
[Figure: rooflines with ceilings for the four platforms, with the out-of-the-box parallel performance plotted at the naïve arithmetic intensity; no naïve SPE implementation on the Cell blade]

38 Roofline Model for SpMV (NUMA & SW prefetch)
- Compulsory flop:byte ~0.166
- Utilize all memory channels
[Figure: rooflines with ceilings for the four platforms, with performance after NUMA-aware allocation and software prefetching overlaid]

39 Roofline Model for SpMV (matrix compression)
- Inherent FMA
- Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions
[Figure: rooflines with ceilings for the four platforms, with performance after matrix compression overlaid]

40 Roofline Model for SpMV (matrix compression)
- Inherent FMA
- Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions
- Performance is bandwidth limited
[Figure: same rooflines, with the auto-tuned performance points lying along the bandwidth diagonal]

41 A Vision for BeBOP's Future Performance Counter Usage

42 Deficiencies of the Roofline
- The Roofline and its ceilings are architecture specific
  - They are not execution (runtime) specific
- It requires the user to calculate the true arithmetic intensity, including cache conflict and capacity misses
- Although the roofline is extremely visually intuitive, it only says what must be done by some agent (by compilers, by hand, by libraries)
  - It does not state in what respect the code was deficient

43 Performance Counter Roofline (understanding performance)
- In the worst case, without performance counter data, performance analysis can be extremely non-intuitive
- (delete the ceilings)
[Figure: architecture-specific roofline (peak DP, w/out ILP, w/out SIMD, mul/add imbalance, w/out SW prefetch, w/out NUMA, Stream BW) with only the compulsory arithmetic intensity marked]

44 Performance Counter Roofline (execution-specific roofline)
- Transition from an architecture-specific roofline to an execution-specific roofline
[Figure: two panels - the architecture-specific roofline with its ceilings (peak DP, w/out ILP, w/out SIMD, mul/add imbalance, w/out SW prefetch, w/out NUMA, Stream BW) and the execution-specific roofline showing only peak DP, Stream BW, and the compulsory arithmetic intensity]

45 Performance Counter Roofline (true arithmetic intensity)
- Performance counters tell us the true memory traffic
- Algorithmic analysis tells us the useful flops
- Combined, we can calculate the true arithmetic intensity
[Figure: execution-specific roofline with both the compulsory and the true arithmetic intensity marked; the gap between them is performance lost from low arithmetic intensity (AI)]

46 Performance Counter Roofline (true memory bandwidth)
- Given the total memory traffic and total kernel time, we may also calculate the true memory bandwidth
- Must include the 3 C's + speculative loads
[Figure: execution-specific roofline with a 'true bandwidth' diagonal below Stream BW; the gap shows performance lost from low AI and low bandwidth]
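A sketch of the arithmetic behind slides 45 and 46: given a counter-measured DRAM byte count (which, as noted, must include conflict/capacity misses and speculative loads), the algorithmically derived useful flop count, and the measured runtime, the execution-specific point follows directly. Reading the byte count out of the counters is platform specific and is simply passed in here; the values in main are hypothetical.

    #include <stdio.h>

    typedef struct {
        double true_ai;        /* useful flops per measured DRAM byte           */
        double true_bw_GBs;    /* measured DRAM bytes over measured kernel time */
        double gflops;         /* delivered performance                         */
    } roofline_point_t;

    /* useful_flops comes from algorithmic analysis; dram_bytes from performance
     * counters (must include the 3 C's plus speculative/prefetched traffic). */
    static roofline_point_t execution_point(double useful_flops, double dram_bytes, double seconds) {
        roofline_point_t p;
        p.true_ai     = useful_flops / dram_bytes;
        p.true_bw_GBs = dram_bytes / seconds / 1e9;
        p.gflops      = useful_flops / seconds / 1e9;
        return p;
    }

    int main(void) {
        /* hypothetical measurements, for illustration only */
        roofline_point_t p = execution_point(2.0e9, 16.0e9, 1.25);
        printf("true AI %.3f flops/byte, true BW %.1f GB/s, %.2f GFlop/s\n",
               p.true_ai, p.true_bw_GBs, p.gflops);
        return 0;
    }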

47 Performance Counter Roofline (bandwidth ceilings)
- Every idle bus cycle diminishes memory bandwidth
- Use performance counters to bin memory stall cycles
[Figure: execution-specific roofline with bandwidth ceilings between Stream BW and the true bandwidth, attributed to failed prefetching, stalls from TLB misses, and NUMA asymmetry]

48 Performance Counter Roofline (in-core ceilings)
- Measure the imbalance between FP add/mul issue rates, as well as stalls from lack of ILP and the ratio of scalar to SIMD instructions
- Must be modified by the compulsory work
  - e.g. placing a 0 in a SIMD register to execute the _PD form increases the SIMD rate but not the useful execution rate
[Figure: execution-specific roofline with in-core ceilings below peak DP attributed to mul/add imbalance, lack of SIMD, and lack of ILP, plus a band labeled 'performance from optimizations by the compiler']

49 Relevance to the Typical Programmer
- Visually intuitive
- With performance counter data, it's clear which optimizations should be attempted and what the potential benefit is (one must still be familiar with the possible optimizations)

50 Relevance to Auto-tuning?
- Exhaustive search is intractable (search space explosion)
- Propose using performance counters to guide tuning:
  - Generate an execution-specific roofline to determine which optimization(s) should be attempted next
  - From the roofline, it is clear what doesn't limit performance
  - Select the optimization that provides the largest potential gain, e.g. bandwidth, arithmetic intensity, or in-core performance
  - ... and iterate
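One way the proposed loop could be structured, purely as a sketch: the 90% thresholds and the three optimization categories below are invented for illustration, not taken from an existing tuner.

    /* Counter-guided tuning step: build the execution-specific roofline from
     * measured data, then attempt only the optimization with the largest gap. */
    typedef enum { OPT_BANDWIDTH, OPT_INTENSITY, OPT_INCORE, OPT_DONE } next_opt_t;

    next_opt_t choose_next(double true_ai, double compulsory_ai,
                           double true_bw, double stream_bw,
                           double gflops, double roof) {
        if (gflops > 0.9 * roof)            return OPT_DONE;       /* close enough to the roof   */
        if (true_ai < 0.9 * compulsory_ai)  return OPT_INTENSITY;  /* conflict/capacity misses   */
        if (true_bw < 0.9 * stream_bw)      return OPT_BANDWIDTH;  /* idle bus cycles, NUMA, TLB */
        return OPT_INCORE;                                         /* ILP/SIMD/mul-add balance   */
    }

The tuner would re-measure after applying the chosen class of optimizations and call this again, iterating until the roof (or the optimization repertoire) is exhausted.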

51 Summary

52 Concluding Remarks
- Existing performance counter tools miss the bulk of programmers
- The Roofline provides a nice (albeit imperfect) approach to performance/architectural visualization
- We believe that performance counters can be used to generate execution-specific rooflines that will facilitate optimization
- However, real applications will run concurrently with other applications, sharing resources; this will complicate performance analysis
- next speaker...

53 Acknowledgements
- Research supported by:
  - Microsoft and Intel funding (Award # )
  - DOE Office of Science under contract number DE-AC02-05CH11231
  - NSF contract CNS
- Sun Microsystems - Niagara2 / Victoria Falls machines
- AMD - access to quad-core Opteron (Barcelona)
- Forschungszentrum Jülich - access to QS20 Cell blades
- IBM - virtual loaner program access to QS20/QS22 Cell blades

54 Questions?

55 BACKUP SLIDES

56 What's a Memory Intensive Kernel?

57 Arithmetic Intensity in HPC
- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
- Arithmetic intensity is:
  - ultimately limited by compulsory traffic
  - diminished by conflict or capacity misses
[Figure: arithmetic intensity spectrum - O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log(N)): FFTs; O(N): dense linear algebra (BLAS3), particle methods]

58 Memory Intensive
- A kernel is memory intensive when the kernel's arithmetic intensity < the machine's balance (flop:byte)
- If so, then we expect: Performance ~ Stream BW * Arithmetic Intensity
- Technology allows peak flops to improve faster than bandwidth, so more and more kernels will be considered memory intensive
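A worked example (the machine numbers here are hypothetical; the 0.166 figure is the SpMV intensity quoted earlier): a machine with 75 GFlop/s of peak double-precision performance and 15 GB/s of stream bandwidth has a balance of 5 flops per DRAM byte, so SpMV at roughly 0.166 flops/byte is firmly memory intensive and should be expected to deliver about 15 GB/s * 0.166 flops/byte ≈ 2.5 GFlop/s, i.e. only a few percent of peak.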

59 The Roofline Model
[Figure: empty roofline axes - attainable Gflop/s vs. flop:DRAM byte ratio (1/8 to 16); note the log scale!]

60 Deficiencies of Auto-tuning
- There has been an explosion in the optimization parameter space
  - This complicates the generation of kernels and their exploration
- Currently we either:
  - exhaustively search the space (increasingly intractable), or
  - apply very high-level heuristics to eliminate much of it
- We need a guided search that is cognizant of both the architecture and performance counters

61 Deficiencies in the Usage of Performance Counters
- Only counted the number of cache/TLB misses
  - We didn't count exposed memory stalls (e.g. prefetchers)
  - We didn't count NUMA asymmetry in memory traffic
  - We didn't count coherency traffic
- Tools can be buggy or not portable
- Even worse is just handing over a spreadsheet filled with numbers and cryptic event names
- In-core events are less interesting as more and more kernels become memory bound