
Parallel Computing Laboratory
EECS Electrical Engineering and Computer Sciences, Berkeley Par Lab

Auto-tuning Stencil Codes for Cache-Based Multicore Platforms
Kaushik Datta
Dissertation Talk, December 4, 2009

Motivation
- The multicore revolution has produced a wide variety of architectures
- Compilers alone fail to fully exploit multicore resources
- Hand-tuning has become infeasible
- We need a better solution!

Contributions
- We have created an automatic stencil tuner (auto-tuner) that achieves up to 5.4x speedups over naïvely threaded stencil code
- We have developed an "Optimized Stream" benchmark for determining a system's highest attainable memory bandwidth
- We have bounded stencil performance using the Roofline Model and in-cache performance

Outline
- Stencil Code Overview
- Cache-based Architectures
- Auto-tuning Description
- Stencil Auto-tuning Results

Stencil Code Overview
- For a given point, a stencil is a fixed subset of nearest neighbors
- A stencil code updates every point in a regular grid by "applying a stencil"
- Used in iterative PDE solvers like Jacobi, Multigrid, and AMR
- Also used in areas like image processing and geometric modeling
- This talk will focus on three stencil kernels: the 3D 7-point stencil, the 3D 27-point stencil, and the 3D Helmholtz kernel
[Figures: a 3D regular grid; the 3D 7-point stencil at (x,y,z) with neighbors at x±1, y±1, z±1; an Adaptive Mesh Refinement (AMR) grid hierarchy]
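To make "applying a stencil" concrete, here is a minimal C sketch of a naïve out-of-place sweep of the 3D 7-point stencil; the IDX macro, grid dimensions, and the coefficient names alpha and beta are illustrative assumptions, not the dissertation's exact code:

```c
/* Flatten (x,y,z) into a 1-D index for an nx*ny*nz grid,
 * with x as the unit-stride dimension. */
#define IDX(x, y, z, nx, ny) ((x) + (nx) * ((y) + (ny) * (z)))

/* One out-of-place sweep of the 3D 7-point stencil over the interior
 * points of the grid (the one-deep boundary layer is left untouched).
 * Each point costs 8 flops: 2 multiplies and 6 adds. */
void stencil7(const double *in, double *out,
              int nx, int ny, int nz, double alpha, double beta)
{
    for (int z = 1; z < nz - 1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++)
                out[IDX(x, y, z, nx, ny)] =
                    alpha * in[IDX(x, y, z, nx, ny)] +
                    beta  * (in[IDX(x - 1, y, z, nx, ny)] +
                             in[IDX(x + 1, y, z, nx, ny)] +
                             in[IDX(x, y - 1, z, nx, ny)] +
                             in[IDX(x, y + 1, z, nx, ny)] +
                             in[IDX(x, y, z - 1, nx, ny)] +
                             in[IDX(x, y, z + 1, nx, ny)]);
}
```

Note the long unit-stride accesses in x and the small, fixed amount of work per point, which is what makes this kernel bandwidth-bound on most machines.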

Arithmetic Intensity
- Arithmetic intensity (AI), the ratio of flops to DRAM bytes, is a rough indicator of whether a kernel is memory-bound or compute-bound
- Counting only compulsory misses, stencil codes are usually (but not always) bandwidth-bound:
  - Long unit-stride memory accesses
  - Little reuse of each grid point
  - Few flops per grid point
- Actual AI values are typically lower (due to other types of cache misses)
[Figure: AI spectrum from memory-bound to compute-bound; Stencil and SpMV at O(1), FFT at O(log n), DGEMM at O(n)]


Cache-Based Architectures
The five platforms studied: Intel Clovertown, Intel Nehalem, AMD Barcelona, IBM Blue Gene/P, and Sun Niagara2.
- ISA: x86 (Clovertown, Nehalem, Barcelona); PowerPC (Blue Gene/P); SPARC (Niagara2)
- Core type: superscalar, out-of-order (the three x86 platforms); dual-issue, in-order (Blue Gene/P, Niagara2)
- Socket/core/thread count: Clovertown 2 sockets x 4 cores x 1 thread; Nehalem 2 sockets x 4 cores x 2 threads; Barcelona 2 sockets x 4 cores x 1 thread; Blue Gene/P 1 socket x 4 cores x 1 thread; Niagara2 2 sockets x 8 cores x 8 threads
- Total HW thread count: 8, 16, 8, 4, and 128, respectively
[Charts: Stream copy bandwidth (GB/s) and peak double-precision computation rate (GFlop/s) for each platform; the numeric values are not preserved in this transcript]


General Compiler Deficiencies
- Historically, compilers have had problems with domain-specific transformations, ranging from the easy to the hard: register allocation (explicit temps), loop unrolling, software pipelining, tiling, SIMDization, common subexpression elimination, data structure transformations, and algorithmic transformations
- Compilers typically use heuristics (not actual runs) to determine the best code for a platform
- It is difficult to generate optimal code across many diverse multicore architectures

Rise of Automatic Tuning
- Auto-tuning became popular because:
  - Domain-specific transformations could be included
  - It runs experiments instead of relying on heuristics
  - The diversity of systems (and now increasing core counts) made performance portability vital
- Auto-tuning is:
  - Portable (to an extent)
  - Scalable
  - Productive (if tuning for multiple architectures)
  - Applicable to many metrics of merit (e.g., performance, power efficiency)
- We let the machine search the parameter space intelligently to find a (near-)optimal configuration
- Serial processor success stories: FFTW, Spiral, Atlas, OSKI, and others

Outline
- Stencil Code Overview
- Cache-based Architectures
- Auto-tuning Description
  - Identify motif-specific optimizations
  - Generate code variants based on these optimizations
  - Traverse the parameter space for the best configuration
- Stencil Auto-tuning Results

Problem Decomposition
- Core blocking: the NX x NY x NZ grid (+X is the unit-stride dimension) is decomposed into CX x CY x CZ core blocks, allowing domain decomposition and cache blocking; addresses parallelization and capacity misses
- Thread blocking: core blocks are cut into TX x TY thread blocks, exploiting caches shared among threads within a core (and across the SMP); addresses low cache capacity per thread
- Register blocking: thread blocks are unrolled into RX x RY x RZ register blocks (loop unrolling in any of the three dimensions), making DLP/ILP explicit; addresses poor register and functional unit usage
- This decomposition is universal across all examined architectures
- The decomposition does not change the data structure
- We need to choose the best block sizes for each hierarchy level
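The core-blocking level of this hierarchy can be sketched as a loop-tiling transformation of the naïve sweep; this is a minimal illustration assuming tunable block sizes cx, cy, cz (thread and register blocking, which subdivide these blocks further, are omitted for brevity):

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Core-blocked out-of-place 3D 7-point sweep: the interior of the
 * nx*ny*nz grid is cut into cx*cy*cz blocks, each swept completely
 * before moving on, so that a working set fits in cache. The block
 * sizes are tuning parameters, not fixed choices. */
void stencil7_blocked(const double *in, double *out,
                      int nx, int ny, int nz,
                      int cx, int cy, int cz,
                      double alpha, double beta)
{
    for (int zz = 1; zz < nz - 1; zz += cz)
    for (int yy = 1; yy < ny - 1; yy += cy)
    for (int xx = 1; xx < nx - 1; xx += cx)
        /* sweep one core block */
        for (int z = zz; z < MIN(zz + cz, nz - 1); z++)
        for (int y = yy; y < MIN(yy + cy, ny - 1); y++)
        for (int x = xx; x < MIN(xx + cx, nx - 1); x++)
            out[x + nx * (y + ny * z)] =
                alpha * in[x + nx * (y + ny * z)] +
                beta  * (in[(x - 1) + nx * (y + ny * z)] +
                         in[(x + 1) + nx * (y + ny * z)] +
                         in[x + nx * ((y - 1) + ny * z)] +
                         in[x + nx * ((y + 1) + ny * z)] +
                         in[x + nx * (y + ny * (z - 1))] +
                         in[x + nx * (y + ny * (z + 1))]);
}
```

The transformation changes only the iteration order, not the data structure or the results, which is exactly why block sizes can be searched freely by the auto-tuner.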

Data Allocation
- NUMA-aware allocation: ensures that the data is co-located on the same socket as the threads processing it; addresses poor data placement
- Array padding: alters the data placement (by a tunable pad) so as to minimize conflict misses
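A small sketch of the array-padding idea, under the assumption that a pad is added to the unit-stride dimension so successive planes do not map to the same cache sets; the function name and pad parameter are illustrative, and NUMA placement would be handled separately (e.g., by first-touch initialization from the thread that will later process each chunk):

```c
#include <stdlib.h>

/* Allocate an nx*ny*nz grid with `pad` extra elements appended to each
 * unit-stride row. The padded row length ("pitch") is returned through
 * pitch_out and must be used for all index arithmetic. `pad` is a
 * hypothetical tuning parameter searched by the auto-tuner. */
double *alloc_padded(int nx, int ny, int nz, int pad, int *pitch_out)
{
    int pitch = nx + pad;    /* padded unit-stride dimension */
    *pitch_out = pitch;
    /* calloc zero-initializes, which also serves as a (non-NUMA-aware)
     * first touch; a NUMA-aware version would have each thread touch
     * its own slab instead. */
    return calloc((size_t)pitch * ny * nz, sizeof(double));
}
```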

Bandwidth Optimizations
- Software prefetching: while A[i] is being processed, A[i+dist] is being retrieved from DRAM; helps mask memory latency by adjusting the look-ahead distance, and the number of software prefetch requests can also be tuned
- Cache bypass: eliminates the cache-line fill on a write miss, cutting traffic from 24 B/point (8 B read array, 8 B write-allocate read, 8 B write) to 16 B/point, a 50% reduction in write-miss traffic; only available on x86 machines
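The prefetching bullet can be sketched on a simple streaming loop; this assumes the GCC/Clang builtin __builtin_prefetch, with `dist` as the tunable look-ahead distance in elements. Cache-bypass ("streaming") stores would use x86 intrinsics such as _mm_stream_pd and are omitted here for portability:

```c
/* Streaming copy with software prefetch: while dst[i] = src[i] is being
 * processed, the cache line holding src[i + dist] is already being
 * fetched from DRAM. `dist` is the look-ahead distance the auto-tuner
 * would search over. */
void copy_with_prefetch(const double *src, double *dst, int n, int dist)
{
    for (int i = 0; i < n; i++) {
        if (i + dist < n)
            /* args: address, 0 = prefetch for read, 0 = low temporal locality */
            __builtin_prefetch(&src[i + dist], 0, 0);
        dst[i] = src[i];
    }
}
```

Prefetching is a hint only: the loop computes the same result with any `dist`, so the tuner is free to search for the distance that best hides latency on each platform.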

In-Core Optimizations
- Register blocking (RX x RY x RZ): addresses poor register and functional unit usage
- Explicit SIMDization: a single (x86 SIMD) instruction processes multiple data items; needed because the compiler does not exploit the ISA, at the cost of non-portable code; 16 B-aligned accesses are legal and fast, while 8 B-misaligned accesses are legal but slow
- Common subexpression elimination: reduces flops by removing redundant expressions (e.g., c = a+b; d = a+b; e = c+d; becomes c = a+b; e = c+c;); icc and gcc often fail to do this
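The slide's CSE snippet, made compilable: both functions compute 2*(a+b), but the second evaluates the shared subexpression a+b only once. On the 27-point stencil the same idea factors out neighbor sums that adjacent grid points have in common:

```c
/* Before CSE: a+b is evaluated twice (3 adds total). */
double sum_redundant(double a, double b)
{
    double c = a + b;    /* a+b computed here ...           */
    double d = a + b;    /* ... and redundantly again here  */
    return c + d;
}

/* After CSE: a+b is evaluated once (2 adds total). */
double sum_cse(double a, double b)
{
    double c = a + b;    /* shared subexpression, computed once */
    return c + c;
}
```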


Stencil Code Evolution
Naïve code, hand-tuned code, Perl code generator (Kaushik), intelligent code generator (Shoaib)
- Hand-tuned code only performs well on a single platform
- The Perl code generator can produce many different code variants for performance portability
- The intelligent code generator can take pseudo-code and a specified set of transformations to produce code variants (a type of domain-specific compiler)


Traversing the Parameter Space
- We introduced 9 different optimizations, each of which has its own set of parameters
- An exhaustive search is impossible
- To make the problem tractable, we used expert knowledge to order the optimizations and applied them consecutively (fixing the best parameters of optimization #1 before tuning optimization #2, and so on)
- Every platform had its own set of best parameters
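The consecutive (greedy) search can be sketched as follows; the benchmark callback, candidate-table layout, and toy objective are illustrative assumptions, not the dissertation's harness. The key property is that it visits the sum of the parameter-set sizes rather than their product:

```c
#include <float.h>

typedef double (*bench_fn)(const int *params, int nparams);

/* Tune the optimizations one at a time in a fixed, expert-chosen order.
 * cand[o] holds ncand[o] candidate values for optimization o (up to 8
 * each, for simplicity); the best value found is frozen before moving
 * to the next optimization. */
void greedy_tune(bench_fn bench, int *params, int nopt,
                 const int *ncand, int cand[][8])
{
    for (int o = 0; o < nopt; o++) {          /* optimization #1, #2, ... */
        double best = -DBL_MAX;
        int best_v = cand[o][0];
        for (int c = 0; c < ncand[o]; c++) {  /* try each candidate value */
            params[o] = cand[o][c];
            double metric = bench(params, nopt);   /* run an experiment */
            if (metric > best) { best = metric; best_v = cand[o][c]; }
        }
        params[o] = best_v;                   /* fix this parameter */
    }
}

/* Toy objective for demonstration only: its peak is at {2, 3}. */
double toy_bench(const int *p, int n)
{
    (void)n;
    return -(double)((p[0] - 2) * (p[0] - 2) + (p[1] - 3) * (p[1] - 3));
}
```

Because each experiment is an actual run (not a heuristic), the same driver works unchanged on every platform, which is how each machine ends up with its own best parameter set.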

Outline
- Stencil Code Overview
- Cache-based Architectures
- Auto-tuning Description
- Stencil Auto-tuning Results
  - 3D 7-Point Stencil (Memory-Intensive Kernel)
  - 3D 27-Point Stencil (Compute-Intensive Kernel)
  - 3D Helmholtz Kernel

3D 7-Point Stencil Problem
- The 3D 7-point stencil performs 8 flops per point and moves 16 or 24 bytes of memory traffic per point
- Its AI is therefore either 0.33 or 0.5 (with cache bypass)
- This kernel should be memory-bound on most architectures
- We will perform a single out-of-place sweep of this stencil over the grid
[Figure: ideal arithmetic intensities of the 7-point stencil, 27-point stencil, and Helmholtz kernel on the memory-bound/compute-bound spectrum]
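The two AI values above follow directly from the byte counts; a small sketch (the function name is illustrative):

```c
/* Compulsory-miss arithmetic intensity of the 3D 7-point stencil:
 * 8 flops per point against 8 B read + 8 B written, plus an extra
 * 8 B write-allocate fill when cache bypass is unavailable. */
double ai_7pt(int cache_bypass)
{
    const double flops_per_point = 8.0;
    double bytes_per_point = cache_bypass ? 16.0 : 24.0;
    return flops_per_point / bytes_per_point;   /* 0.5 or 0.33 */
}
```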

Naïve Stencil Code
- We wish to exploit multicore resources
- First attempt at writing parallel stencil code:
  - Use pthreads
  - Parallelize in the least contiguous grid dimension
  - Set thread affinity for scaling: multithreading, then multicore, then multisocket
[Figure: a regular grid with x (unit-stride), y, and z axes, sliced among threads 0 through n]
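A sketch of this first parallel attempt: the least-contiguous (z) dimension is sliced into one contiguous slab of planes per thread. The slab struct, the hard-coded coefficients (1.0 and 0.5), and the cap of 16 threads are illustrative assumptions; affinity pinning is omitted:

```c
#include <pthread.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

typedef struct {
    const double *in;
    double *out;
    int nx, ny, z0, z1;          /* this thread updates planes [z0, z1) */
} slab_t;

static void *sweep_slab(void *arg)
{
    slab_t *s = (slab_t *)arg;
    int nx = s->nx, ny = s->ny;
    for (int z = s->z0; z < s->z1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++) {
                int i = x + nx * (y + ny * z);
                s->out[i] = 1.0 * s->in[i] +
                            0.5 * (s->in[i - 1]       + s->in[i + 1] +
                                   s->in[i - nx]      + s->in[i + nx] +
                                   s->in[i - nx * ny] + s->in[i + nx * ny]);
            }
    return NULL;
}

/* Naive pthreads sweep; assumes nthreads <= 16. Slabs are disjoint, so
 * no synchronization is needed beyond the final joins. */
void stencil7_pthreads(const double *in, double *out,
                       int nx, int ny, int nz, int nthreads)
{
    pthread_t tid[16];
    slab_t s[16];
    int span = (nz - 2 + nthreads - 1) / nthreads;   /* planes per thread */
    for (int t = 0; t < nthreads; t++) {
        s[t].in = in;  s[t].out = out;
        s[t].nx = nx;  s[t].ny = ny;
        s[t].z0 = 1 + t * span;
        s[t].z1 = MIN(1 + (t + 1) * span, nz - 1);
        if (s[t].z0 > s[t].z1) s[t].z0 = s[t].z1;    /* empty slab */
        pthread_create(&tid[t], NULL, sweep_slab, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

Slicing only in z keeps each thread's slab contiguous in memory, but it does nothing about cache capacity or NUMA placement, which is why this naïve version leaves so much performance on the table.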

Naïve Performance (3D 7-Point Stencil)
[Charts: naïve performance on each of the five platforms; they reach only 47%, 19%, 17%, 23%, and 16% of their respective performance limits]

Auto-tuned Performance (3D 7-Point Stencil)
[Charts: auto-tuned performance on Intel Clovertown, Intel Nehalem, AMD Barcelona, IBM Blue Gene/P, and Sun Niagara2]

Scalability? (3D 7-Point Stencil)
[Charts: parallel-scaling speedup over single-core performance; 1.9x, 4.5x, and 4.4x on the three 8-core x86 systems, 3.9x on the 4-core Blue Gene/P, and 8.6x on the 16-core Niagara2]

How much improvement is there? (3D 7-Point Stencil)
[Charts: tuning speedup over the best naïve performance on each platform: 1.9x, 4.9x, 5.4x, 4.4x, and 4.7x]

How well can we do? (3D 7-Point Stencil)


3D 27-Point Stencil Problem
- The 3D 27-point stencil performs 30 flops per point and moves 16 or 24 bytes of memory traffic per point
- Its AI is therefore either 1.25 or 1.88 (with cache bypass), and CSE can reduce the flops per point
- This kernel should be compute-bound on most architectures
- We will perform a single out-of-place sweep of this stencil over the grid
[Figure: ideal arithmetic intensities of the 7-point stencil, 27-point stencil, and Helmholtz kernel on the memory-bound/compute-bound spectrum]

Naïve Performance (3D 27-Point Stencil)
[Charts: naïve performance on each of the five platforms; they reach only 47%, 33%, 17%, 35%, and 47% of their respective performance limits]

Auto-tuned Performance (3D 27-Point Stencil)
[Charts: auto-tuned performance on all five platforms]

Scalability? (3D 27-Point Stencil)
[Charts: parallel-scaling speedup over single-core performance; 2.7x, 8.1x, and 5.7x on the three 8-core x86 systems, 4.0x on the 4-core Blue Gene/P, and 12.8x on the 16-core Niagara2]

How much improvement is there? (3D 27-Point Stencil)
[Charts: tuning speedup over the best naïve performance on each platform: 1.9x, 3.0x, 3.8x, 2.9x, and 1.8x]

How well can we do? (3D 27-Point Stencil)


3D Helmholtz Kernel Problem
- The 3D Helmholtz kernel is very different from the previous kernels:
  - Gauss-Seidel Red-Black ordering
  - 25 flops per stencil
  - 7 arrays (6 are read-only; 1 is read and written)
  - Many small subproblems; no longer one large problem
  - Ideal AI is about 0.20
- This kernel should be memory-bound on most architectures

3D Helmholtz Kernel Problem
- Chombo (an AMR framework) deals with many small subproblems of varying dimensions
- To mimic this, we varied the subproblem sizes
- We also varied the total memory footprint: 0.5 GB, 1 GB, 2 GB, and 4 GB
- We also introduced a new parameter: the number of threads per subproblem

Single Iteration (3D Helmholtz Kernel)
- 1-2 threads per subproblem is optimal in cases where load balancing is not an issue
- If this trend continues, load balancing will be an even larger issue in the manycore era
[Charts: results on Intel Nehalem and AMD Barcelona]

Multiple Iterations (3D Helmholtz Kernel)
- This is the performance of 16^3 subproblems in a 0.5 GB memory footprint
- Performance gets worse with more threads per subproblem
[Charts: results on Intel Nehalem and AMD Barcelona]

Conclusions
- Compilers alone achieve poor performance:
  - They typically reach a low fraction of peak performance
  - They exhibit little parallel scaling
- Auto-tuning is essential to achieving good performance:
  - 1.9x-5.4x speedups across diverse architectures
  - Automatic tuning is necessary for scalability
  - With few exceptions, the same code was used on every platform
- Ultimately, we are limited by the hardware:
  - We can only do as well as Stream or in-cache performance
  - The memory wall will continue to push stencil codes to be bandwidth-bound
- When dealing with many small subproblems, fewer threads per subproblem performs best; however, load balancing becomes a major issue, and an even larger one in the manycore era

Future Work
- Better productivity: the current Perl scripts are primitive; we need to develop an auto-tuning framework that has semantic knowledge of the stencil code (S. Kamil)
- Better performance: we currently make no data structure changes other than array padding; it may be beneficial to store the grids in a recursive format using space-filling curves for better locality (S. Williams?)
- Better search: our current search method requires expert knowledge to order the optimizations appropriately; machine learning offers the opportunity for tuning with little domain knowledge and many more parameters (A. Ganapathi)

Acknowledgements
- Kathy and Jim, for sure-handed guidance and knowledge during all these years
- Sam Williams, for always being available to discuss research (and being an unofficial thesis reader)
- Rajesh Nishtala, for being a great friend and officemate
- Jon Wilkening, for being my outside thesis reader
- The Bebop group, including Shoaib Kamil, Karl Fuerlinger, and Mark Hoemmen
- The scientists at LBL, including Lenny Oliker, John Shalf, Jonathan Carter, Terry Ligocki, and Brian Van Straalen
- The members of the Parlab and Radlab, including Dave Patterson and Archana Ganapathi
- Many others that I don't have space to mention here
- I'll miss you all! Please contact me anytime.

Supplemental Slides

Applications of this work
- Lawrence Berkeley National Laboratory (LBL) is using stencil auto-tuning as a building block of its Green Flash supercomputer (Google: Green Flash LBL)
- Dr. Franz-Josef Pfreundt (head of IT at Fraunhofer-ITWM) used stencil tuning to improve the performance of oil exploration code

3D Helmholtz Kernel Problem