CS 252 Project Presentation


1 CS 252 Project Presentation
Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors
Amik Singh, ParLab, EECS
CS 252 Project Presentation, 05/04/2012
May the 4th be with you

2 Outline
Introduction
Experimental Setup
Challenges
Optimizations
Results & Conclusions
Future Work

3 Multigrid Method
Multilevel technique to accelerate the convergence of iterative solvers
A conventional iterative solver operates on the grid at full resolution and requires many iterations to converge
Multigrid iterates towards convergence via a hierarchy of grid resolutions

4 Multigrid Method
Finer Level → Coarser Level
Coarsened grids damp out low-frequency (long-wavelength) errors
Fine grids damp out high-frequency errors

5 Multigrid Method
The multigrid method operates in what is called a V-cycle
It consists of three main phases
Smooth: a relaxation such as Jacobi or Gauss-Seidel Red-Black (GSRB); GSRB is used in our study

6 Multigrid Method
Smooth (defined above)
Restrict: copy information from the finest grid to progressively coarsened grids
Interpolate: the reverse of restrict; copy the correction from a coarse grid back to a finer grid
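A minimal C skeleton of how these three phases compose into one V-cycle is sketched below. Every name here (Grid, smooth_gsrb, restrict_residual, interpolate_and_correct, bottom_solve) and the two-sweep smoother depth are illustrative assumptions, not the benchmark's actual code.

```c
/* Illustrative V-cycle skeleton (assumed names, not the actual benchmark code). */
typedef struct { int n; double *u, *f, *r; } Grid;        /* solution, right-hand side, residual */

void smooth_gsrb(Grid *g, int sweeps);                     /* GSRB relaxation on this level       */
void compute_residual(Grid *g);                            /* r = f - A u                         */
void restrict_residual(const Grid *fine, Grid *coarse);    /* coarse f = R(fine r), coarse u = 0  */
void interpolate_and_correct(const Grid *coarse, Grid *fine); /* fine u += P(coarse u)            */
void bottom_solve(Grid *g);                                /* cheap solve on the coarsest grid    */

void vcycle(Grid *level, int l, int num_levels) {
    if (l == num_levels - 1) { bottom_solve(&level[l]); return; }
    smooth_gsrb(&level[l], 2);                     /* pre-smooth: damp high-frequency error      */
    compute_residual(&level[l]);
    restrict_residual(&level[l], &level[l + 1]);   /* Restrict: move residual to the coarser grid */
    vcycle(level, l + 1, num_levels);              /* recurse down the "V"                        */
    interpolate_and_correct(&level[l + 1], &level[l]); /* Interpolate: bring the correction back  */
    smooth_gsrb(&level[l], 2);                     /* post-smooth                                 */
}
```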

7 Team
Amik Singh (ParLab, EECS)
Samuel Williams, Brian Van Straalen, Ann Almgren, John Shalf, Leonid Oliker (Computational Research Division, Lawrence Berkeley National Laboratory)
Dhiraj D. Kalamkar, Anand M. Deshpande, Mikhail Smelyanskiy, Pradeep Dubey (Intel Corporation)

8 Different Architectures used

9 Different Architectures used

10 Problem Specification
Au = f
Variable-coefficient, finite-volume discretization of the canonical Helmholtz (Laplacian minus identity) operator
Right-hand side for our benchmarking is sin(πx) sin(πy) sin(πz) on the [0,1] cubical domain
Problem size fixed to a 256³ discretization for time-to-solution comparison across the different architectures
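Written out, the benchmark problem is roughly the following; the placement of the variable coefficients (a cell-centered α, face-centered β, and scalars a, b) is an assumption consistent with the "variable-coefficient Helmholtz" description on the slide, not something the slide states explicitly.

```latex
A u \;=\; \bigl(\,a\,\alpha(\mathbf{x})\,I \;-\; b\,\nabla\!\cdot\beta(\mathbf{x})\nabla\,\bigr)\,u \;=\; f,
\qquad
f(x,y,z) \;=\; \sin(\pi x)\,\sin(\pi y)\,\sin(\pi z),
\qquad
(x,y,z)\in[0,1]^3 .
```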

11 Smooth Pseudo-code
Read in 7 arrays, write out 1 array
25 flops per update
Flops/Byte = 0.2 << 3.6 for GPUs, so the smooth is heavily memory-bound
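To make the "read 7 arrays, write 1" count concrete, here is a minimal C sketch of what one GSRB sub-sweep could look like. The array names (phi, rhs, alpha, beta_i/j/k, lambda) and the exact stencil coefficients are assumptions in the spirit of a variable-coefficient 7-point operator, not the actual benchmark kernel.

```c
/* Illustrative GSRB relaxation sub-sweep.  Reads 7 arrays (phi, rhs, alpha,
 * beta_i, beta_j, beta_k, lambda) and writes 1 (phi), roughly matching the
 * flop/byte count on the slide.  Names and coefficients are assumptions. */
#define IDX(i,j,k,n) ((k)*(n)*(n) + (j)*(n) + (i))

void smooth_gsrb_sweep(int n, int color,               /* color = 0 (red) or 1 (black) */
                       double *phi, const double *rhs,
                       const double *alpha,
                       const double *beta_i, const double *beta_j, const double *beta_k,
                       const double *lambda,           /* precomputed 1/diagonal        */
                       double a, double b, double h2inv) {
  for (int k = 1; k < n - 1; k++)
    for (int j = 1; j < n - 1; j++)
      for (int i = 1; i < n - 1; i++) {
        if (((i + j + k) & 1) != color) continue;      /* red-black update pattern */
        int x = IDX(i, j, k, n);
        /* variable-coefficient 7-point "Helmholtz" style operator applied at x */
        double Ax = a * alpha[x] * phi[x]
                  - b * h2inv * ( beta_i[IDX(i+1,j,k,n)] * (phi[IDX(i+1,j,k,n)] - phi[x])
                                - beta_i[x]              * (phi[x] - phi[IDX(i-1,j,k,n)])
                                + beta_j[IDX(i,j+1,k,n)] * (phi[IDX(i,j+1,k,n)] - phi[x])
                                - beta_j[x]              * (phi[x] - phi[IDX(i,j-1,k,n)])
                                + beta_k[IDX(i,j,k+1,n)] * (phi[IDX(i,j,k+1,n)] - phi[x])
                                - beta_k[x]              * (phi[x] - phi[IDX(i,j,k-1,n)]) );
        phi[x] = phi[x] - lambda[x] * (Ax - rhs[x]);   /* Gauss-Seidel style correction */
      }
}
```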

12 Challenges on GPU
No SIMDization due to red-black updates
Very small shared memory (48 KB)
Expensive inter-thread-block communication
Red-Black Update Pattern (figure)

13 Baseline Implementation
Only 1 ghost zone
Communicate amongst the different sub-domains after each smoothing operation

14 Baseline Implementation
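A minimal sketch of this baseline schedule in C: with only one ghost zone, each red or black sub-sweep must be preceded by a ghost-zone exchange with the neighbouring sub-domains. The function names below are assumptions, not the benchmark's actual API.

```c
/* Illustrative baseline schedule (assumed helper names). */
void exchange_ghost_zones(double *phi, int depth);  /* halo exchange between sub-domains    */
void gsrb_sub_sweep(double *phi, int color);        /* one red (0) or black (1) sub-sweep   */

void smooth_baseline(double *phi, int gsrb_iterations) {
  for (int it = 0; it < gsrb_iterations; it++) {
    exchange_ghost_zones(phi, 1);   /* communicate before the red update   */
    gsrb_sub_sweep(phi, 0);
    exchange_ghost_zones(phi, 1);   /* communicate before the black update */
    gsrb_sub_sweep(phi, 1);
  }
}
```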

15 Communication Avoiding!
Use more ghost zones: we do 2 red/black iterations, i.e. 4 sub-sweeps, per smooth
With 4 ghost zones we need not communicate after each update
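A minimal sketch of the communication-avoiding schedule, reusing the hypothetical helpers from the baseline sketch above. One deep exchange covers the whole smooth; each sub-sweep redundantly recomputes into the ghost region, consuming roughly one ghost layer per sub-sweep, so depth-4 ghost zones cover 4 back-to-back sub-sweeps. The shrinking of the redundantly computed region is omitted for brevity.

```c
/* Illustrative communication-avoiding schedule: one 4-deep halo exchange,
 * then 2 red/black iterations (4 sub-sweeps) with no intervening communication. */
void smooth_comm_avoiding(double *phi) {
  exchange_ghost_zones(phi, 4);   /* single exchange, 4-deep ghost zones */
  gsrb_sub_sweep(phi, 0);         /* red   */
  gsrb_sub_sweep(phi, 1);         /* black */
  gsrb_sub_sweep(phi, 0);         /* red   */
  gsrb_sub_sweep(phi, 1);         /* black */
}
```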

16 Wavefront Approach
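The slide itself carries only the title. Assuming the standard wavefront blocking idea (pipelining successive relaxation sweeps plane by plane so data is reused while still in cache rather than re-streamed from DRAM), a minimal sketch might look like the following; all names, the 2-sweep depth, and the per-plane granularity are assumptions.

```c
/* Illustrative wavefront schedule (assumed): plane k-1 receives its 2nd
 * relaxation sweep right after plane k receives its 1st, so both sweeps
 * operate on data that is still resident in cache. */
void relax_plane(double *phi, int k, int sweep);   /* hypothetical per-plane relaxation */

void smooth_wavefront(double *phi, int nz) {
  for (int k = 1; k <= nz - 1; k++) {              /* last iteration drains the pipeline */
    if (k <= nz - 2) relax_plane(phi, k,     1);   /* 1st sweep on the leading plane     */
    if (k >= 2)      relax_plane(phi, k - 1, 2);   /* 2nd sweep trails one plane behind  */
  }
}
```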

17 GPU Baseline vs GPU optimized
                      Baseline   Optimized
Smooth (sec)           3.841      3.767
Restriction (sec)      0.031      0.029
Interpolation (sec)    0.071      0.069
Communication (sec)    1.928      0.886
Total (sec)            6.135      5.008

18 Different Architectures

19 Conclusions
CPUs:
Hardware prefetchers decouple memory access through speculative loads
Sufficient on-chip memory for communication-avoiding implementations
GPUs:
Parallelism achieved through a multi-threaded paradigm
Limited on-chip memory hampers realization of the communication-avoiding benefits

20 Future Work
Develop a multi-GPU, MPI-enabled implementation of the solver
Explore the use of communication-avoiding techniques in matrix-free Krylov subspace methods such as BiCGStab for fast bottom solves

21 Future Work
Submit to SC’12 tonight!

22 Thank You Questions?

