Optimizing Stencil Computations March 18, 2013

Administrative
Midterm coming April 3? In class March 25; can bring one page of notes
Review notes, readings, and review lecture; prior exams are posted
Design Review: intermediate assessment of progress on project, oral and short; in class on April 1
Final projects: poster session April 24 (dry run April 22); final report due May 1

Stencil Computations
A stencil defines the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t.
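For instance (a hedged illustration using my own symbols rather than notation from the slides), the explicit 1-D heat-equation update is a 3-point stencil in which the new value at point i depends on its own previous value and its two neighbors at the previous time step:

u_i^{t+1} \;=\; u_i^{t} \;+\; \frac{\alpha\,\Delta t}{\Delta x^{2}}\,\bigl(u_{i-1}^{t} - 2\,u_i^{t} + u_{i+1}^{t}\bigr)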

Stencil Computations: Performance Issues
Bytes-per-flop ratio is O(1); most machines cannot supply data at this rate, leading to memory-bound computation
Some reuse, but difficult to exploit fully, and it interacts with parallelization
How to maximize performance:
- Avoid extraneous memory traffic, such as cache capacity/conflict misses
- Bandwidth optimizations to maximize the utility of memory transfers
- Maximize in-core performance
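A rough back-of-the-envelope estimate (my own numbers, not from the slides: double precision, about 8 flops per point for the 7-point stencil below, and perfect cache reuse of neighbor values so that each point is read once and written once per sweep):

\frac{\text{bytes}}{\text{flop}} \;\approx\; \frac{8\,\text{B read} + 8\,\text{B written}}{8\ \text{flops}} \;=\; 2\ \text{bytes per flop}

A machine whose sustainable memory bandwidth per flop is well below this ratio will therefore run the stencil memory bound, regardless of how fast its floating-point units are.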

Learn More: StencilProbe
See the StencilProbe benchmark
Several variations of the heat equation, to be discussed
Can instantiate to measure performance impact

Example: Heat Equation

for (t=0; t<timesteps; t++) {        // time step loop
  for (k=1; k<nz-1; k++) {
    for (j=1; j<ny-1; j++) {
      for (i=1; i<nx-1; i++) {
        // 3-d 7-point stencil
        B[i][j][k] = A[i][j][k+1] + A[i][j][k-1] +
                     A[i][j+1][k] + A[i][j-1][k] +
                     A[i+1][j][k] + A[i-1][j][k]
                     - 6.0 * A[i][j][k] / (fac*fac);
      }
    }
  }
  temp_ptr = A; A = B; B = temp_ptr;  // swap input/output grids
}

What if nx, ny, nz large?

Heat Equation, Add Tiling

for (t=0; t<timesteps; t++) {        // time step loop
  for (jj = 1; jj < ny-1; jj += TJ) {
    for (ii = 1; ii < nx-1; ii += TI) {
      for (k=1; k<nz-1; k++) {
        for (j = jj; j < MIN(jj+TJ, ny-1); j++) {
          for (i = ii; i < MIN(ii+TI, nx-1); i++) {
            // 3-d 7-point stencil
            B[i][j][k] = A[i][j][k+1] + A[i][j][k-1] +
                         A[i][j+1][k] + A[i][j-1][k] +
                         A[i+1][j][k] + A[i-1][j][k]
                         - 6.0 * A[i][j][k] / (fac*fac);
          }
        }
      }
    }
  }
  temp_ptr = A; A = B; B = temp_ptr;  // swap input/output grids
}

Note the reuse across time steps!

Heat Equation, Time Skewing

for (kk=1; kk < nz-1; kk += tz) {
  for (jj = 1; jj < ny-1; jj += ty) {
    for (ii = 1; ii < nx-1; ii += tx) {
      for (t=0; t<timesteps; t++) {   // time step loop
        // ... calculate bounds from t and slope ...
        for (k = blockMin_z; k < blockMax_z; k++) {
          for (j = blockMin_y; j < blockMax_y; j++) {
            for (i = blockMin_x; i < blockMax_x; i++) {
              // 3-d 7-point stencil
              B[i][j][k] = A[i][j][k+1] + A[i][j][k-1] +
                           A[i][j+1][k] + A[i][j-1][k] +
                           A[i+1][j][k] + A[i-1][j][k]
                           - 6.0 * A[i][j][k] / (fac*fac);
            }
          }
        }
        temp_ptr = A; A = B; B = temp_ptr;  // swap input/output grids
      }
    }
  }
}

Heat Equation, Circular Queue
See probe_heat_circqueue.c
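The slide's figure is not reproduced here, so the following is only a minimal sketch of the circular-queue idea for a single time step (the function and helper names are mine, not StencilProbe's): z-planes of the input grid are streamed through a small ring of buffers that fits in cache, and each output plane is computed from the three buffered input planes. The full circular-queue optimization keeps enough planes resident to advance several time steps before a plane is written back.

#include <stdlib.h>

#define NPLANES 3
#define IDX(i, j, nx) ((j) * (nx) + (i))

/* Copy z-plane k of the flat 3-D array A (x fastest, then y, then z) into buf. */
static void copy_plane_in(double *buf, const double *A, int k, int nx, int ny)
{
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            buf[IDX(i, j, nx)] = A[((size_t)k * ny + j) * nx + i];
}

/* Sketch only, not probe_heat_circqueue.c: one sweep of the 7-point stencil
 * using a 3-entry circular queue of z-plane buffers. */
void heat_circqueue_sketch(const double *A, double *B,
                           int nx, int ny, int nz, double fac)
{
    double *plane[NPLANES];                  /* small ring of z-plane buffers */
    for (int q = 0; q < NPLANES; q++)
        plane[q] = malloc(sizeof(double) * nx * ny);

    copy_plane_in(plane[0], A, 0, nx, ny);   /* prime planes k-1 and k */
    copy_plane_in(plane[1], A, 1, nx, ny);

    for (int k = 1; k < nz - 1; k++) {
        /* Stream the next input plane (k+1) into the oldest buffer slot. */
        copy_plane_in(plane[(k + 1) % NPLANES], A, k + 1, nx, ny);

        double *below = plane[(k - 1) % NPLANES];
        double *curr  = plane[ k      % NPLANES];
        double *above = plane[(k + 1) % NPLANES];

        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                B[((size_t)k * ny + j) * nx + i] =
                    above[IDX(i, j, nx)] + below[IDX(i, j, nx)] +
                    curr[IDX(i, j + 1, nx)] + curr[IDX(i, j - 1, nx)] +
                    curr[IDX(i + 1, j, nx)] + curr[IDX(i - 1, j, nx)] -
                    6.0 * curr[IDX(i, j, nx)] / (fac * fac);
    }

    for (int q = 0; q < NPLANES; q++) free(plane[q]);
}

The key property is that only three nx*ny planes need to stay resident, so the working set fits in cache even when the full 3-D grid does not.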

Heat Equation, Cache Oblivious
See probe_heat_oblivious.c
Idea: recursive decomposition down to a cutoff point
Implicit tiling, in both space and time
Simpler code than complex tiling, but introduces overhead
Encapsulated in the Pochoir DSL (next slide)
(Figure: recursive space cut and time cut of the space-time domain)
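To make the space-cut / time-cut idea concrete, here is a 1-D sketch in the style of Frigo and Strumpen's cache-oblivious trapezoidal recursion; it is not the code in probe_heat_oblivious.c, and the array, kernel, and constants are my own choices. The recursion walks trapezoids in the (x, t) plane; real implementations stop at a size cutoff and fall back to plain loops there.

#define NX 1024
static double u[2][NX];              /* even/odd time copies of the 1-D grid */

/* One 3-point smoothing/heat-like update at time t, point x. */
static void kernel(int t, int x)
{
    u[(t + 1) & 1][x] = 0.25 * (u[t & 1][x - 1] + 2.0 * u[t & 1][x] + u[t & 1][x + 1]);
}

/* Walk the trapezoid with base [x0, x1) at time t0, top at time t1, and
 * side slopes dx0, dx1 (points per time step). */
static void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    const int ds = 1;                /* stencil slope: +/-1 point per step */
    int dt = t1 - t0;

    if (dt == 1) {                   /* base case: one time step over the interval */
        for (int x = x0; x < x1; x++)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * ds * dt) {
            /* Space cut: split a wide trapezoid into two narrower ones. */
            int xm = (2 * (x0 + x1) + (2 * ds + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -ds);
            walk1(t0, t1, xm, -ds, x1, dx1);
        } else {
            /* Time cut: do the bottom half of the time range, then the top. */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}

/* Example: advance the interior points 1..NX-2 by T time steps. */
void heat1d_oblivious(int T) { walk1(0, T, 1, 0, NX - 1, 0); }

Whenever a trapezoid is wide relative to its height, a space cut splits it along a line that respects the +/-1 dependence slope; otherwise a time cut halves the time range. Tiles of every scale emerge automatically, with no tile-size parameter.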

Example Pochoir Stencil Compiler Specification

Parallel Stencils in Pochoir

General Approach to Parallel Stencils
Always safe to parallelize within a time step (see the sketch below)
Circular queue and time skewing encapsulate “tiles” that are independent
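As a hedged sketch of within-time-step parallelization (this OpenMP version is mine, not from the slides), the outer spatial loop of the baseline heat-equation sweep can be divided across threads, since within one time step every write goes to B and every read comes from A:

#include <omp.h>

/* One sweep of the 3-D 7-point stencil, parallelized across the k dimension.
 * A and B are distinct grids indexed as [i][j][k], exactly as in the slides,
 * so all iterations of the spatial loops are independent. */
void heat_step_omp(double ***A, double ***B,
                   int nx, int ny, int nz, double fac)
{
    #pragma omp parallel for
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                B[i][j][k] = A[i][j][k+1] + A[i][j][k-1] +
                             A[i][j+1][k] + A[i][j-1][k] +
                             A[i+1][j][k] + A[i-1][j][k] -
                             6.0 * A[i][j][k] / (fac * fac);
}

The time loop itself stays sequential; only the iterations within a single sweep are independent, which is why the tiled variants above are needed to exploit reuse across time steps.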

Results for Heat Equation
Reference: K. Datta et al., "Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures", SC08.

What about GPUs?
Two recent papers:
- "Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters," Zhang and Mueller, CGO 2012.
- "High-Performance Code Generation for Stencil Computations on GPU Architectures," Holewinski et al., ICS 2012.
Key issues:
- Exploit reuse in shared memory.
- Avoid fetching from global memory.
- Thread decomposition to support global memory coalescing.

Overlapped Tiling for Halo Regions (or Ghost Zones)
Input data exceeds output result (as in Sobel)
A halo region or ghost zone extends the per-thread data decomposition to encompass additional input
For example, an (n+2)x(n+2) halo region is needed to compute an nxn block if subscript expressions are of the form ±1
By expanding the halo region, we can trade redundant computation for reduced accesses to global memory when parallelizing across time steps (see the kernel sketch below)
(Figure: 2-D 5-point stencil example showing the computed region, the halo region, and the extra redundant computation)
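As a minimal CUDA sketch of the basic halo pattern for a single time step (the kernel name, TILE size, and flat row-major layout are my assumptions, not code from either paper), each thread block stages a (TILE+2)x(TILE+2) region of the input in shared memory and computes a TILExTILE block of output. The overlapped-tiling variant described on this slide would load a wider halo and advance several time steps in shared memory before writing back.

#define TILE 16   /* hypothetical block size; nx and ny assumed multiples of TILE */

/* One Jacobi-style sweep of a 2-D 5-point stencil with a shared-memory halo. */
__global__ void stencil5pt_halo(const float *in, float *out, int nx, int ny)
{
    __shared__ float s[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x;   /* global column this thread computes */
    int gy = blockIdx.y * TILE + threadIdx.y;   /* global row */
    int lx = threadIdx.x + 1;                   /* local indices shifted past the halo */
    int ly = threadIdx.y + 1;

    /* Stage the tile; threads on the block boundary also stage the halo cells. */
    s[ly][lx] = in[gy * nx + gx];
    if (threadIdx.x == 0        && gx > 0)      s[ly][0]        = in[gy * nx + gx - 1];
    if (threadIdx.x == TILE - 1 && gx < nx - 1) s[ly][TILE + 1] = in[gy * nx + gx + 1];
    if (threadIdx.y == 0        && gy > 0)      s[0][lx]        = in[(gy - 1) * nx + gx];
    if (threadIdx.y == TILE - 1 && gy < ny - 1) s[TILE + 1][lx] = in[(gy + 1) * nx + gx];
    __syncthreads();

    /* Compute interior points only. */
    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
        out[gy * nx + gx] = 0.25f * (s[ly][lx - 1] + s[ly][lx + 1] +
                                     s[ly - 1][lx] + s[ly + 1][lx]);
}

A launch of the form stencil5pt_halo<<<dim3(nx/TILE, ny/TILE), dim3(TILE, TILE)>>>(d_in, d_out, nx, ny) assigns one output tile per block.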

2.5D Decomposition
Partition such that each thread block sweeps over the z-axis and processes one plane at a time.

Resulting Code
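The slide's code image is not reproduced here; the following is only a hedged sketch of what a 2.5D sweep can look like in CUDA C (the kernel name, block sizes, and x-fastest flat layout are my assumptions). Each thread owns one (x, y) column, keeps the k-1, k, and k+1 values in registers, and shares the current plane through shared memory so its x/y neighbors need no extra global loads.

#define BLOCK_X 32   /* hypothetical block dimensions */
#define BLOCK_Y 8

/* One time step of the 3-D 7-point heat stencil, 2.5D style. The grid is
 * stored flat with x fastest, index (k*ny + j)*nx + i. Every thread handles
 * the column of interior points at (i, j); nx-2 and ny-2 are assumed to be
 * multiples of BLOCK_X and BLOCK_Y so every thread maps to an interior column. */
__global__ void heat7pt_25d(const float *A, float *B,
                            int nx, int ny, int nz, float fac)
{
    __shared__ float plane[BLOCK_Y + 2][BLOCK_X + 2];

    int i  = blockIdx.x * BLOCK_X + threadIdx.x + 1;  /* interior x index */
    int j  = blockIdx.y * BLOCK_Y + threadIdx.y + 1;  /* interior y index */
    int lx = threadIdx.x + 1;
    int ly = threadIdx.y + 1;

    float below = A[(0 * ny + j) * nx + i];           /* plane k-1 */
    float curr  = A[(1 * ny + j) * nx + i];           /* plane k   */

    for (int k = 1; k < nz - 1; k++) {
        float above = A[((k + 1) * ny + j) * nx + i]; /* plane k+1 */

        /* Stage the current plane (plus its x/y halo) in shared memory. */
        plane[ly][lx] = curr;
        if (threadIdx.x == 0)           plane[ly][0]           = A[(k * ny + j) * nx + i - 1];
        if (threadIdx.x == BLOCK_X - 1) plane[ly][BLOCK_X + 1] = A[(k * ny + j) * nx + i + 1];
        if (threadIdx.y == 0)           plane[0][lx]           = A[(k * ny + j - 1) * nx + i];
        if (threadIdx.y == BLOCK_Y - 1) plane[BLOCK_Y + 1][lx] = A[(k * ny + j + 1) * nx + i];
        __syncthreads();

        B[(k * ny + j) * nx + i] =
            above + below +
            plane[ly][lx + 1] + plane[ly][lx - 1] +
            plane[ly + 1][lx] + plane[ly - 1][lx] -
            6.0f * curr / (fac * fac);

        below = curr;                                 /* slide the register window up in z */
        curr  = above;
        __syncthreads();  /* keep the plane intact until every thread has read it */
    }
}

Only the x/y neighbors go through shared memory; the z neighbors travel in registers as the block sweeps upward, which is the point of the 2.5D decomposition.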

Other Optimizations
X dimension delivers coalesced global memory accesses
- Pad to multiples of 32 stencil elements
- Halo regions are aligned to 128-bit boundaries
- Input (parameter) arrays are padded to match the halo region, to share indexing
BlockSize.x is maximized to avoid non-coalesced accesses to the halo region
Blocks are square to reduce the area of redundancy
Use of shared memory for input
Use of texture fetch for input

Performance Across Stencils and Architectures

GPU Cluster Performance