CS 252 Project Presentation


1 CS 252 Project Presentation
Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors
Amik Singh, ParLab, EECS
CS 252 Project Presentation, 05/04/2012
May the 4th be with you

2 Outline
Introduction
Experimental Setup
Challenges
Optimizations
Results & Conclusions
Future Work

3 Multigrid Method
Multilevel technique to accelerate the convergence of iterative solvers
A conventional iterative solver operates on the grid at full resolution and requires many iterations to converge
Multigrid iterates towards convergence via a hierarchy of grid resolutions

4 Multigrid Method
Finer Level → Coarser Level
Coarsened grids damp out low-frequency (long-wavelength) errors
Fine grids damp out high-frequency errors

5 Multigrid Method
The multigrid method operates in what is called a V-cycle
It consists of three main phases
Smooth: a relaxation such as Jacobi or Gauss-Seidel Red-Black (GSRB); GSRB is used in our study

6 Multigrid Method
Smooth (defined above)
Restrict: copy information from the finest grid to progressively coarsened grids
Interpolate: the reverse of restrict; copy the correction from a coarse grid back to a finer grid
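A minimal C skeleton of how these three phases compose into one V-cycle is sketched below. Every name here (Grid, smooth_gsrb, restrict_residual, interpolate_and_correct, bottom_solve) and the two-sweep smoother depth are illustrative assumptions, not the benchmark's actual code.

```c
/* Illustrative V-cycle skeleton (assumed names, not the actual benchmark code). */
typedef struct { int n; double *u, *f, *r; } Grid;        /* solution, right-hand side, residual */

void smooth_gsrb(Grid *g, int sweeps);                     /* GSRB relaxation on this level       */
void compute_residual(Grid *g);                            /* r = f - A u                         */
void restrict_residual(const Grid *fine, Grid *coarse);    /* coarse f = R(fine r), coarse u = 0  */
void interpolate_and_correct(const Grid *coarse, Grid *fine); /* fine u += P(coarse u)            */
void bottom_solve(Grid *g);                                /* cheap solve on the coarsest grid    */

void vcycle(Grid *level, int l, int num_levels) {
    if (l == num_levels - 1) { bottom_solve(&level[l]); return; }
    smooth_gsrb(&level[l], 2);                     /* pre-smooth: damp high-frequency error      */
    compute_residual(&level[l]);
    restrict_residual(&level[l], &level[l + 1]);   /* Restrict: move residual to the coarser grid */
    vcycle(level, l + 1, num_levels);              /* recurse down the "V"                        */
    interpolate_and_correct(&level[l + 1], &level[l]); /* Interpolate: bring the correction back  */
    smooth_gsrb(&level[l], 2);                     /* post-smooth                                 */
}
```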

7 Team
Amik Singh (ParLab, EECS)
Samuel Williams, Brian Van Straalen, Ann Almgren, John Shalf, Leonid Oliker (Computational Research Division, Lawrence Berkeley National Laboratory)
Dhiraj D. Kalamkar, Anand M. Deshpande, Mikhail Smelyanskiy, Pradeep Dubey (Intel Corporation)

8 Different Architectures used

9 Different Architectures used

10 Problem Specification
Au = f
Variable-coefficient, finite-volume discretization of the canonical Helmholtz (Laplacian minus identity) operator
Right-hand side for our benchmarking is sin(πx) sin(πy) sin(πz) on the [0,1] cubical domain
Problem size fixed to a 256³ discretization for time-to-solution comparison across the different architectures
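Written out, the benchmark problem is roughly the following; the placement of the variable coefficients (a cell-centered α, face-centered β, and scalars a, b) is an assumption consistent with the "variable-coefficient Helmholtz" description on the slide, not something the slide states explicitly.

```latex
A u \;=\; \bigl(\,a\,\alpha(\mathbf{x})\,I \;-\; b\,\nabla\!\cdot\beta(\mathbf{x})\nabla\,\bigr)\,u \;=\; f,
\qquad
f(x,y,z) \;=\; \sin(\pi x)\,\sin(\pi y)\,\sin(\pi z),
\qquad
(x,y,z)\in[0,1]^3 .
```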

11 Smooth Pseudo-code
Read in 7 arrays, write out 1 array
25 flops per update
Flops/Byte = 0.2 << 3.6 for GPUs, so the smooth is heavily memory-bound
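To make the "read 7 arrays, write 1" count concrete, here is a minimal C sketch of what one GSRB sub-sweep could look like. The array names (phi, rhs, alpha, beta_i/j/k, lambda) and the exact stencil coefficients are assumptions in the spirit of a variable-coefficient 7-point operator, not the actual benchmark kernel.

```c
/* Illustrative GSRB relaxation sub-sweep.  Reads 7 arrays (phi, rhs, alpha,
 * beta_i, beta_j, beta_k, lambda) and writes 1 (phi), roughly matching the
 * flop/byte count on the slide.  Names and coefficients are assumptions. */
#define IDX(i,j,k,n) ((k)*(n)*(n) + (j)*(n) + (i))

void smooth_gsrb_sweep(int n, int color,               /* color = 0 (red) or 1 (black) */
                       double *phi, const double *rhs,
                       const double *alpha,
                       const double *beta_i, const double *beta_j, const double *beta_k,
                       const double *lambda,           /* precomputed 1/diagonal        */
                       double a, double b, double h2inv) {
  for (int k = 1; k < n - 1; k++)
    for (int j = 1; j < n - 1; j++)
      for (int i = 1; i < n - 1; i++) {
        if (((i + j + k) & 1) != color) continue;      /* red-black update pattern */
        int x = IDX(i, j, k, n);
        /* variable-coefficient 7-point "Helmholtz" style operator applied at x */
        double Ax = a * alpha[x] * phi[x]
                  - b * h2inv * ( beta_i[IDX(i+1,j,k,n)] * (phi[IDX(i+1,j,k,n)] - phi[x])
                                - beta_i[x]              * (phi[x] - phi[IDX(i-1,j,k,n)])
                                + beta_j[IDX(i,j+1,k,n)] * (phi[IDX(i,j+1,k,n)] - phi[x])
                                - beta_j[x]              * (phi[x] - phi[IDX(i,j-1,k,n)])
                                + beta_k[IDX(i,j,k+1,n)] * (phi[IDX(i,j,k+1,n)] - phi[x])
                                - beta_k[x]              * (phi[x] - phi[IDX(i,j,k-1,n)]) );
        phi[x] = phi[x] - lambda[x] * (Ax - rhs[x]);   /* Gauss-Seidel style correction */
      }
}
```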

12 Challenges on GPU
No SIMDization due to red-black updates
Very small shared memory (48 KB)
Expensive inter-thread-block communication
Red-Black Update Pattern (figure)

13 Baseline Implementation
Only 1 ghost zone
Communicate amongst the different sub-domains after each smoothing operation

14 Baseline Implementation
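A minimal sketch of this baseline schedule in C: with only one ghost zone, each red or black sub-sweep must be preceded by a ghost-zone exchange with the neighbouring sub-domains. The function names below are assumptions, not the benchmark's actual API.

```c
/* Illustrative baseline schedule (assumed helper names). */
void exchange_ghost_zones(double *phi, int depth);  /* halo exchange between sub-domains    */
void gsrb_sub_sweep(double *phi, int color);        /* one red (0) or black (1) sub-sweep   */

void smooth_baseline(double *phi, int gsrb_iterations) {
  for (int it = 0; it < gsrb_iterations; it++) {
    exchange_ghost_zones(phi, 1);   /* communicate before the red update   */
    gsrb_sub_sweep(phi, 0);
    exchange_ghost_zones(phi, 1);   /* communicate before the black update */
    gsrb_sub_sweep(phi, 1);
  }
}
```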

15 Communication Avoiding!
Use more ghost zones: we do 2 red/black iterations, i.e. 4 sub-sweeps, per smooth
With 4 ghost zones we need not communicate after each update
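A minimal sketch of the communication-avoiding schedule, reusing the hypothetical helpers from the baseline sketch above. One deep exchange covers the whole smooth; each sub-sweep redundantly recomputes into the ghost region, consuming roughly one ghost layer per sub-sweep, so depth-4 ghost zones cover 4 back-to-back sub-sweeps. The shrinking of the redundantly computed region is omitted for brevity.

```c
/* Illustrative communication-avoiding schedule: one 4-deep halo exchange,
 * then 2 red/black iterations (4 sub-sweeps) with no intervening communication. */
void smooth_comm_avoiding(double *phi) {
  exchange_ghost_zones(phi, 4);   /* single exchange, 4-deep ghost zones */
  gsrb_sub_sweep(phi, 0);         /* red   */
  gsrb_sub_sweep(phi, 1);         /* black */
  gsrb_sub_sweep(phi, 0);         /* red   */
  gsrb_sub_sweep(phi, 1);         /* black */
}
```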

16 Wavefront Approach
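The slide itself carries only the title. Assuming the standard wavefront blocking idea (pipelining successive relaxation sweeps plane by plane so data is reused while still in cache rather than re-streamed from DRAM), a minimal sketch might look like the following; all names, the 2-sweep depth, and the per-plane granularity are assumptions.

```c
/* Illustrative wavefront schedule (assumed): plane k-1 receives its 2nd
 * relaxation sweep right after plane k receives its 1st, so both sweeps
 * operate on data that is still resident in cache. */
void relax_plane(double *phi, int k, int sweep);   /* hypothetical per-plane relaxation */

void smooth_wavefront(double *phi, int nz) {
  for (int k = 1; k <= nz - 1; k++) {              /* last iteration drains the pipeline */
    if (k <= nz - 2) relax_plane(phi, k,     1);   /* 1st sweep on the leading plane     */
    if (k >= 2)      relax_plane(phi, k - 1, 2);   /* 2nd sweep trails one plane behind  */
  }
}
```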

17 GPU Baseline vs GPU optimized
                      Baseline   Optimized
Smooth (sec)           3.841      3.767
Restriction (sec)      0.031      0.029
Interpolation (sec)    0.071      0.069
Communication (sec)    1.928      0.886
Total (sec)            6.135      5.008

18 Different Architectures

19 Conclusions
CPUs:
Hardware prefetchers decouple memory access through speculative loads
Sufficient on-chip memory for communication-avoiding implementations
GPUs:
Parallelism achieved through a multi-threaded paradigm
Limited on-chip memory hampers realization of the communication-avoiding benefits

20 Future Work
Develop a multi-GPU, MPI-enabled implementation of the solver
Explore the use of communication-avoiding techniques in matrix-free Krylov subspace methods such as BiCGStab for fast bottom solves

21 Future Work
Submit to SC’12 tonight!

22 Thank You Questions?

