Computational Research Division / BIPS
Implicit and Explicit Optimizations for Stencil Computations
Shoaib Kamil 1,2, Kaushik Datta 1, Samuel Williams 1,2, Leonid Oliker 1,2, John Shalf 2, and Katherine A. Yelick 1,2
1 University of California, Berkeley; 2 Lawrence Berkeley National Laboratory
What are stencil codes?
- For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
- A stencil code updates every point in a regular grid with a weighted subset of its neighbors ("applying a stencil")
(figures: 2D stencil, 3D stencil)
Stencil Applications
- Stencils are critical to many scientific applications:
  - Diffusion, electromagnetics, computational fluid dynamics
  - Both explicit and implicit iterative methods (e.g., multigrid)
  - Both uniform and adaptive block-structured meshes
- Many types of stencils:
  - 1D, 2D, 3D meshes
  - Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, ...)
  - Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
- Our study focuses on 3D, 7-point, Jacobi iteration
Naïve Stencil Pseudocode (One Iteration)

    void stencil3d(double A[], double B[], int nx, int ny, int nz) {
        /* z (index k) is the unit-stride dimension */
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                for (int k = 1; k < nz - 1; k++) {
                    int c = (i * ny + j) * nz + k;              /* center */
                    B[c] = S0 * A[c]
                         + S1 * (A[c - ny*nz] + A[c + ny*nz]    /* front, back */
                               + A[c - nz]    + A[c + nz]       /* left, right */
                               + A[c - 1]     + A[c + 1]);      /* top, bottom */
                }
    }
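Since Jacobi iteration reads one mesh and writes the other, a time loop only needs to swap the two pointers between sweeps. A minimal runnable sketch of such a driver (the concrete weights s0/s1 and the fixed-boundary convention are our illustrative assumptions, not from the slides):

```c
/* One Jacobi sweep: read src, write dst (z, index k, is unit-stride).
 * Boundary points are never written, so both meshes must be
 * initialized with the fixed boundary values. */
void sweep(const double *src, double *dst, int nx, int ny, int nz,
           double s0, double s1) {
    for (int i = 1; i < nx - 1; i++)
        for (int j = 1; j < ny - 1; j++)
            for (int k = 1; k < nz - 1; k++) {
                int c = (i * ny + j) * nz + k;
                dst[c] = s0 * src[c]
                       + s1 * (src[c - ny*nz] + src[c + ny*nz]
                             + src[c - nz]    + src[c + nz]
                             + src[c - 1]     + src[c + 1]);
            }
}

/* T Jacobi iterations, swapping the two meshes each step;
 * returns whichever buffer holds the final values. */
double *jacobi(double *A, double *B, int nx, int ny, int nz,
               int T, double s0, double s1) {
    for (int t = 0; t < T; t++) {
        sweep(A, B, nx, ny, nz, s0, s1);
        double *tmp = A; A = B; B = tmp;
    }
    return A;
}
```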
Potential Optimizations
- Performance is limited by memory bandwidth and latency
  - Reuse is limited to the number of neighbors in a stencil
- For large meshes, cache blocking helps
  - For smaller meshes, stencil time is roughly the time to read the mesh once from main memory
- Tradeoff of blocking: reduces cache misses (bandwidth), but increases prefetch misses (latency)
  - See previous paper for details [Kamil et al., MSP '05]
- We look at merging across iterations to improve reuse
  - Three techniques with varying levels of control
- We vary architecture types
- Significant work (not shown) on low-level optimizations
Optimization Strategies

                                     Software: Oblivious (Implicit) | Software: Conscious (Explicit)
  Hardware: Cache (Implicit)         Cache Oblivious                | Cache Conscious
  Hardware: Local Store (Explicit)   N/A                            | Cache Conscious on Cell

- Two software techniques
  - Cache oblivious algorithm recursively subdivides
  - Cache conscious has an explicit block size
- Two hardware techniques
  - Fast memory (cache) is managed by hardware
  - Fast memory (local store) is managed by application software
- If hardware forces control, software cannot be oblivious
Opt. Strategy #1: Cache Oblivious
- Software: the cache oblivious algorithm recursively subdivides
  - An elegant solution: no explicit block size, so no need to tune the block size
- Hardware: cache is managed by hardware
  - Less programmer effort
Cache Oblivious Algorithm
- By Matteo Frigo et al.
- The recursive algorithm consists of space cuts, time cuts, and a base case
- Operates on a well-defined trapezoid (x0, dx0, x1, dx1, t0, t1)
- The trapezoid shown is for a 1D problem; our experiments are for 3D (a shrinking cube)
(figure: trapezoid in the space-time plane spanning x0 to x1 and t0 to t1, with edge slopes dx0 and dx1)
Cache Oblivious Algorithm - Base Case
- If the height = 1, then we have a line of points (x0:x1, t0)
- At this point, we stop the recursion and perform the stencil on this set of points
- Order does not matter, since there are no inter-dependencies
Cache Oblivious Algorithm - Space Cut
- If trapezoid width >= 2 * height, cut with slope = -1 through the center
- Since no point in Tr1 depends on Tr2, execute Tr1 first and then Tr2
- In multiple dimensions, we try space cuts in each dimension before proceeding
Cache Oblivious Algorithm - Time Cut
- Otherwise, cut the trapezoid in half in the time dimension
- Again, since no point in Tr1 depends on Tr2, execute Tr1 first and then Tr2
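Putting the three cases together, the 1D version of the recursion can be written directly from Frigo and Strumpen's formulation. The sketch below pairs it with a 3-point Jacobi kernel on two parity-selected buffers; the grid size, equal weights, and fixed boundaries are our assumptions for illustration:

```c
#define N 64                 /* grid points, incl. 2 fixed boundary points */
double gridbuf[2][N];        /* gridbuf[t % 2] holds time step t */

/* Advance point x from time step t to t+1 (3-point Jacobi average).
 * x = 0 and x = N-1 are never written, so both parities keep the
 * fixed boundary values. */
void kernel1d(int t, int x) {
    gridbuf[(t + 1) % 2][x] = (gridbuf[t % 2][x - 1] + gridbuf[t % 2][x]
                             + gridbuf[t % 2][x + 1]) / 3.0;
}

/* Recursive walk over the trapezoid (x0, dx0, x1, dx1, t0, t1). */
void walk(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int dt = t1 - t0;
    if (dt == 1) {                          /* base case: a single row */
        for (int x = x0; x < x1; x++)
            kernel1d(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* space cut: slope -1 line through the center; Tr1 (left)
             * never depends on Tr2 (right), so run Tr1 first */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk(t0, t1, x0, dx0, xm, -1);
            walk(t0, t1, xm, -1, x1, dx1);
        } else {
            /* time cut: lower half first, then the upper half */
            int s = dt / 2;
            walk(t0, t0 + s, x0, dx0, x1, dx1);
            walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}
```

Calling walk(0, T, 1, 0, N-1, 0) performs T time steps over the interior, visiting points in an order that keeps the working set shrinking with each level of recursion, without ever naming a block size.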
Poor Itanium 2 Cache Oblivious Performance
(figures: cycle comparison and L3 cache miss comparison)
- Fewer cache misses, BUT longer running time
Poor Cache Oblivious Performance
(figures: Power5 and Opteron cycle comparisons)
- Much slower on Opteron and Power5 too
Improving Cache Oblivious Performance
Fewer cache misses did NOT translate to better performance:

  Problem                                    Solution
  Extra function calls                       Inlined kernel
  Poor prefetch behavior                     No cuts in unit-stride dimension
  Recursion stack overhead                   Maintain explicit stack
  Modulo operator                            Pre-computed lookup array
  Recursion even after block fits in cache   Early cut-off of recursion
Cache Oblivious Performance
- Only the Opteron shows any benefit
Opt. Strategy #2: Cache Conscious
- Software: cache conscious has an explicit block size
  - Easier to visualize
  - Tunable block size
  - No recursion stack overhead
- Hardware: cache is managed by hardware
  - Less programmer effort
Cache Conscious Algorithm
- Like the cache oblivious algorithm, we have space cuts
- However, cache conscious is NOT recursive, and explicitly requires the cache block dimension c as a parameter
- Again, the trapezoid shown is for a 1D problem
(figure: trapezoid split into width-c blocks Tr1, Tr2, Tr3)
Cache Blocking with Time Skewing
(animation: blocked sweep over the grid; z is the unit-stride dimension)
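The same idea in 1D makes the skewing explicit: each width-b block is carried through all T time steps before moving on, shifted left by one point per step so that everything a block reads has already been produced by its own previous step or by the block to its left. This is our own minimal sketch (sizes, equal weights, and fixed boundaries are illustrative), not the paper's 3D code:

```c
/* Time-skewed, cache-conscious sweep: T steps of a 3-point Jacobi
 * stencil on n points with fixed boundaries and block width b >= 2.
 * g[t % 2] holds time step t; blocks run left to right, each shifted
 * left by one point per time step (skew slope -1). */
void skewed(double *g0, double *g1, int n, int T, int b) {
    double *g[2] = { g0, g1 };
    /* block bases continue past the right edge so the rightmost
     * points stay covered after the regions shift left */
    for (int xb = 1; xb < n - 1 + T; xb += b) {
        for (int t = 0; t < T; t++) {
            int lo = xb - t, hi = xb - t + b;
            if (lo < 1) lo = 1;              /* clip to the interior */
            if (hi > n - 1) hi = n - 1;
            const double *src = g[t % 2];
            double *dst = g[(t + 1) % 2];
            for (int x = lo; x < hi; x++)
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0;
        }
    }
}
```

With b chosen so that a block's few planes fit in cache, the block's working set stays resident across all T sweeps instead of being reloaded from main memory once per sweep.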
Cache Conscious - Optimal Block Size Search
- Reduced memory traffic does correlate with higher GFlop rates
Cache Conscious Performance
- Cache conscious is measured with the optimal block size on each platform
- Itanium 2 and Opteron both improve
Creating the Performance Model
- GOAL: find the optimal cache block size without exhaustive search
- Most important factors: memory traffic and prefetching
- First, count the number of cache misses
  - Inputs: cache size, cache line size, and grid size
  - The model then classifies block sizes into 5 cases
  - Misses are classified as either "fast" or "slow"
- Then predict memory performance by factoring in prefetching
  - An STriad microbenchmark determines the cost of "fast" and "slow" misses
  - Combine with the cache miss model to compute running time
- If memory time is less than compute time, use compute time
  - Tells us we are compute-bound for that iteration
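The final combination step is simple arithmetic. A skeleton of it is below; the per-miss costs and miss counts here are illustrative stand-ins, whereas the real model derives them from the cache size, line size, grid size, and the five block-size cases:

```c
typedef struct {
    double fast_misses;   /* misses the prefetch engine hides */
    double slow_misses;   /* misses that pay full latency */
} miss_counts;

/* Combine the miss model with STriad-calibrated per-miss costs.
 * If predicted memory time is below compute time, the iteration is
 * compute-bound and compute time is the prediction. */
double predicted_cycles(miss_counts m, double cost_fast,
                        double cost_slow, double compute_cycles) {
    double mem = m.fast_misses * cost_fast + m.slow_misses * cost_slow;
    return mem > compute_cycles ? mem : compute_cycles;
}
```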
Memory Read Traffic Model
Performance Model
Performance Model Benefits
- Avoids exhaustive search
- Identifies performance bottlenecks, allowing us to tune appropriately
- Eliminates poor block sizes
- But it does not choose the best block size (lack of accuracy)
  - We still need to search over the pruned parameter space
Opt. Strategy #3: Cache Conscious on Cell
- Software: cache conscious, with an explicit block size
- Hardware: local store is managed by software
  - Eliminates extraneous reads/writes
Cell Processor
- A PowerPC core that controls 8 simple SIMD cores ("SPEs")
- Memory hierarchy consists of: registers, local memory, external DRAM
- The application explicitly controls memory:
  - Explicit DMA operations are required to move data from DRAM to each SPE's local memory
  - Effective for predictable data access patterns
- Cell code contains more low-level intrinsics than the prior code
Cell Local Store Blocking
(figure: planes streamed in from the source grid to the SPE local store, with result planes streamed out to the target grid)
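On Cell, the planes move via explicit DMA with enough local-store buffers that the next source plane can be fetched while the current one is computed. The sketch below keeps that buffer rotation but substitutes memcpy for the DMA engine and a trivial three-plane average for the stencil; the plane sizes and the four-buffer choice are our assumptions:

```c
#include <string.h>

#define NXY 256       /* points per plane (illustrative) */
#define NPLANES 32    /* planes along the streamed dimension */

/* Local-store stand-ins: 4 rotating source-plane buffers (previous,
 * current, next, plus the one being prefetched) and 1 output plane. */
double ls_src[4][NXY], ls_out[NXY];

void stream_planes(double src[NPLANES][NXY], double dst[NPLANES][NXY]) {
    /* preload the first three planes ("DMA in") */
    for (int p = 0; p < 3; p++)
        memcpy(ls_src[p], src[p], sizeof ls_src[p]);
    /* boundary planes k = 0 and NPLANES-1 are left untouched */
    for (int k = 1; k < NPLANES - 1; k++) {
        /* begin fetching plane k+2 into the spare buffer; on Cell this
         * transfer would overlap the compute below */
        if (k + 2 < NPLANES)
            memcpy(ls_src[(k + 2) % 4], src[k + 2], sizeof ls_src[0]);
        const double *below = ls_src[(k - 1) % 4];
        const double *cur   = ls_src[k % 4];
        const double *above = ls_src[(k + 1) % 4];
        for (int i = 0; i < NXY; i++)       /* stand-in plane update */
            ls_out[i] = (below[i] + cur[i] + above[i]) / 3.0;
        memcpy(dst[k], ls_out, sizeof ls_out);  /* stream plane out */
    }
}
```

Plane p lives in buffer p % 4 and is only overwritten (by plane p + 4, at step k = p + 2) after its last use at step k = p + 1, which is what makes the four-buffer rotation safe.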
Excellent Cell Processor Performance
- Double-precision (DP) performance: 7.3 GFlop/s
  - DP performance is still relatively weak: only 1 floating-point instruction every 7 cycles
  - The problem becomes computation-bound when cache-blocked
- Single-precision (SP) performance: 65.8 GFlop/s!
  - The problem is now memory-bound even when cache-blocked
- If Cell had better DP performance, or ran in SP, it could take further advantage of cache blocking
Summary - Computation Rate Comparison
Summary - Algorithmic Peak Comparison
Stencil Code Conclusions
- Cache blocking performs better when explicit
  - But we need to choose the right cache block size for the architecture
  - Performance modeling can be very effective for this optimization
- Software-controlled memory boosts stencil performance
  - Caters memory accesses to the given algorithm
  - Works especially well due to predictable data access patterns
- Low-level code gets closer to algorithmic peak
  - Eradicates compiler code generation issues
  - Application knowledge allows for better use of the functional units
Future Work
- Evaluate stencil performance on leading multi-core platforms and develop multi-core-specific stencil optimizations
- Implement an auto-tuner for high-performance stencil codes
- Confirm the usefulness of the system via benchmarking and application performance
Publications
- K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, "Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors", SIAM Review, to appear.
- S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, K. Yelick, "Implicit and Explicit Optimizations for Stencil Computations", Memory Systems Performance and Correctness (MSPC).
- S. Kamil, P. Husbands, L. Oliker, J. Shalf, K. Yelick, "Impact of Modern Memory Subsystems on Cache Optimizations for Stencil Computations", 3rd Annual ACM SIGPLAN Workshop on Memory Systems Performance (MSP), 2005.