Effective Methods in Reducing Communication Overheads in Solving PDE Problems on Distributed-Memory Computer Architectures
Chris Ding and Yun (Helen) He, Lawrence Berkeley National Laboratory
GHC 2002, 10/10/2002

Outline
  Introduction
  Traditional Method
  Ghost Cell Expansion (GCE) Method
  GCE Algorithm
  Diagonal Communication Elimination (DCE) Technique
  Analysis of GCE Method
    Message Volume
    Communication Time
    Memory Usage and Computational Cost
  Performance
    Test Problem
    Test Results
  Conclusions

Background
On a distributed-memory system, each processor holds a problem subdomain that includes one or more boundary layers (usually called ghost cells). The ghost cells hold the most recent values from the neighboring processors, and in the traditional approach they are updated at every time step. The messages exchanged between processors are usually small, so one way to speed up the communication is to combine the messages of several time steps and exchange larger messages less frequently, reducing the latency cost.
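The gain from combining messages follows from the usual linear cost model with per-message latency T_L and bandwidth B (the same parameters used later on the Communication Time slide); here S and k are illustrative symbols for the per-step message size and the number of steps whose communication is combined:

\[ k\left(T_L + \frac{S}{B}\right) \quad\text{vs.}\quad T_L + \frac{kS}{B}, \]

so exchanging once instead of k times saves (k-1) T_L of latency per boundary.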

Traditional Field Update

L = 2
        ghost |  active   | ghost
t:      O O   | O O O O   | O O
t+1:          | x x x x   |

L = 1
        ghost |  active   | ghost
t:      O     | O O O O   | O
t+1:          | x x x x   |

where L, the number of ghost layers required, depends on the order of accuracy of the numerical discretization.

Ghost Cell Expansion (GCE) Method

L = 2, e = 3
        ghost       |  active   | ghost
t:      O O O O O   | O O O O   | O O O O O
t+1:      S x x x   | x x x x   | x x x S
t+2:        S x x   | x x x x   | x x S
t+3:          S x   | x x x x   | x S
t+4:                | x x x x   |

L = 1, e = 3
        ghost     |  active   | ghost
t:      O O O O   | O O O O   | O O O O
t+1:      x x x   | x x x x   | x x x
t+2:        x x   | x x x x   | x x
t+3:          x   | x x x x   | x
t+4:              | x x x x   |

where e is the expansion level.

Algorithm for GCE Method

do istep = 1, total_steps
   j = mod(istep-1, e+1)
   if (j == 0) update ghost_cells
   y_start = 1 - e + j
   y_end   = ny + e - j
   x_start = 1 - e + j
   x_end   = nx + e - j
   if subdomain touches real boundaries
      set y_start, y_end to 1 or ny
      set x_start, x_end to 1 or nx
   endif
   do iy = y_start, y_end
      do ix = x_start, x_end
         update field(ix, iy)
      enddo
   enddo
enddo
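A minimal sketch of this loop in C with MPI, assuming a 1D decomposition along y, a 5-point stencil (L = 1), and an expanded ghost width of L + e; the array layout, the exchange_ghosts helper, and all names are illustrative, not the authors' code. Ghost cells are refreshed only when j == 0, and the computed region shrinks by one layer per step, as in the pseudocode above.

#include <mpi.h>

/* w = L + e is the expanded ghost width; ix, iy are 0-based, active cells are
   0..nx-1 and 0..ny-1, ghost cells run from -w to nx-1+w (resp. ny-1+w). */
#define IDX(ix, iy, nx, w) (((iy) + (w)) * ((nx) + 2*(w)) + ((ix) + (w)))

static void exchange_ghosts(double *u, int nx, int ny, int w,
                            int down, int up, MPI_Comm comm)
{
    /* One exchange refreshes the full expanded ghost region of width w = L + e.
       Rows are contiguous, so whole slabs of w rows are sent at once. */
    int count = w * (nx + 2*w);
    MPI_Sendrecv(&u[IDX(-w, 0, nx, w)],      count, MPI_DOUBLE, down, 0,
                 &u[IDX(-w, ny, nx, w)],     count, MPI_DOUBLE, up,   0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[IDX(-w, ny - w, nx, w)], count, MPI_DOUBLE, up,   1,
                 &u[IDX(-w, -w, nx, w)],     count, MPI_DOUBLE, down, 1,
                 comm, MPI_STATUS_IGNORE);
}

void gce_time_loop(double *u, double *unew, int nx, int ny, int e,
                   int total_steps, int down, int up, MPI_Comm comm)
{
    int w = 1 + e;                          /* L = 1 for the 5-point stencil   */
    for (int istep = 1; istep <= total_steps; istep++) {
        int j = (istep - 1) % (e + 1);      /* steps since the last exchange   */
        if (j == 0)
            exchange_ghosts(u, nx, ny, w, down, up, comm);

        /* Computed region shrinks by one layer per step: [1-e+j, ny+e-j] in y
           (1-based, as on the slide); physical boundaries are never expanded. */
        int y_start = 1 - e + j, y_end = ny + e - j;
        if (down == MPI_PROC_NULL) y_start = 1;
        if (up   == MPI_PROC_NULL) y_end   = ny;

        for (int iy = y_start; iy <= y_end; iy++)
            for (int ix = 1; ix <= nx; ix++)   /* x is not decomposed here     */
                unew[IDX(ix-1, iy-1, nx, w)] = 0.25 *
                    ( u[IDX(ix-2, iy-1, nx, w)] + u[IDX(ix,   iy-1, nx, w)]
                    + u[IDX(ix-1, iy-2, nx, w)] + u[IDX(ix-1, iy,   nx, w)] );

        double *tmp = u; u = unew; unew = tmp;  /* swap time levels            */
    }
    /* Note: after an odd number of steps the newest data is in the caller's unew. */
}

For L = 1 no stale values are ever read between exchanges; the x direction is left undecomposed in this sketch, with its ghost columns assumed to hold the Dirichlet boundary values in both arrays.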

Diagonal Communication Elimination (DCE) Technique
With a regular 2D domain decomposition, each subdomain has 4 immediate neighbors (left, right, up, and down) and 4 second-nearest neighbors (diagonal).
With a regular 3D domain decomposition, each subdomain has 6 immediate neighbors (horizontal and vertical), 12 second-nearest neighbors (planar corners), and 8 third-nearest neighbors (cubic corners).

Diagonal Communication Elimination (DCE) Technique (cont’d)
Send blocks 5, 7, and 3 to the right; send blocks 6, 8, and 4 to the left.
The processor that owns block 1 now also has blocks 5 and 6; the processor that owns block 2 now also has blocks 7 and 8.
Send blocks 5, 6, and 1 down; send blocks 7, 8, and 2 up.
The active domain is updated correctly, with no diagonal messages needed.
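One common way to realize this ordering is a two-phase exchange: swap ghost data in x first, then in y, letting the y messages carry the freshly received x ghost columns along with them so the corner blocks reach the diagonal neighbors in two hops. A sketch in C with MPI, consistent with the slide but with assumed names, array layout, and ghost width w:

#include <mpi.h>

#define IDX(ix, iy, nx, w) (((iy) + (w)) * ((nx) + 2*(w)) + ((ix) + (w)))

void dce_exchange(double *u, int nx, int ny, int w,
                  int left, int right, int down, int up, MPI_Comm comm)
{
    /* Phase 1: exchange in x.  Only the active rows (height ny) are needed;
       the non-contiguous columns are described with an MPI vector type. */
    MPI_Datatype col;
    MPI_Type_vector(ny, w, nx + 2*w, MPI_DOUBLE, &col);
    MPI_Type_commit(&col);
    MPI_Sendrecv(&u[IDX(nx - w, 0, nx, w)], 1, col, right, 0,
                 &u[IDX(-w,     0, nx, w)], 1, col, left,  0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[IDX(0,      0, nx, w)], 1, col, left,  1,
                 &u[IDX(nx,     0, nx, w)], 1, col, right, 1, comm, MPI_STATUS_IGNORE);
    MPI_Type_free(&col);

    /* Phase 2: exchange in y.  The rows now span the full width nx + 2w,
       including the x ghost columns just received, so the corner blocks are
       forwarded to the diagonal neighbors without any extra messages. */
    int row = w * (nx + 2*w);
    MPI_Sendrecv(&u[IDX(-w, ny - w, nx, w)], row, MPI_DOUBLE, up,   2,
                 &u[IDX(-w, -w,     nx, w)], row, MPI_DOUBLE, down, 2, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[IDX(-w, 0,      nx, w)], row, MPI_DOUBLE, down, 3,
                 &u[IDX(-w, ny,     nx, w)], row, MPI_DOUBLE, up,   3, comm, MPI_STATUS_IGNORE);
}

With this ordering a 2D exchange needs only 4 messages per processor instead of 8, matching the count given in the conclusions.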

Message Volume

Traditional:
GCE Method:
Ratio:

L = 2, e = 4  →  ratio = 3/5
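A plausible reconstruction of the comparison (the symbols are assumptions: N is the number of cells along one subdomain edge, and volumes are counted per boundary over a group of e+1 time steps): the traditional method exchanges L layers at every step, while GCE exchanges L+e layers once, giving

\[ V_{\mathrm{trad}} = (e+1)\,L\,N, \qquad V_{\mathrm{GCE}} = (L+e)\,N, \qquad \frac{V_{\mathrm{GCE}}}{V_{\mathrm{trad}}} = \frac{L+e}{L\,(e+1)} . \]

For L = 2 and e = 4 this ratio is 6/10 = 3/5, matching the figure on the slide.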

Communication Time

Traditional:
GCE Method:

L = 2, e = 8, Nx = Ny = 800:
IBM SP:   B = 133 MB/sec, T_L = 26 μsec  →  ratio = 2.15
Cray T3E: B = 300 MB/sec, T_L = 17 μsec  →  ratio = 2.31
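Under the same latency/bandwidth model, a plausible reconstruction of the comparison (assuming one message per boundary per exchange, with s the word size and N the edge length) is

\[ T_{\mathrm{trad}} = (e+1)\left(T_L + \frac{L\,N\,s}{B}\right), \qquad T_{\mathrm{GCE}} = T_L + \frac{(L+e)\,N\,s}{B} . \]

With L = 2, e = 8, N = 800, and s = 8 bytes, the ratio T_trad / T_GCE works out to roughly 2.2 for the IBM SP parameters and roughly 2.3 for the Cray T3E parameters, in line with the 2.15 and 2.31 quoted above.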

Theoretical Speedup

Memory Usage and Computational Cost

Decomposition   ΔM/M                        ΔC/C
1D              2e/Nx                       e/(2Nx)
2D              2e/Nx + 2e/Ny               e/(2Nx) + e/(2Ny)
3D              2e/Nx + 2e/Ny + 2e/Nz       e/(2Nx) + e/(2Ny) + e/(2Nz)

e = 4, Nx = Ny = 800  →  ΔM/M = 2%, ΔC/C = 0.5%
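Plugging the 2D row into the quoted example (e = 4, Nx = Ny = 800):

\[ \frac{\Delta M}{M} = \frac{2\cdot 4}{800} + \frac{2\cdot 4}{800} = 2\%, \qquad \frac{\Delta C}{C} = \frac{4}{2\cdot 800} + \frac{4}{2\cdot 800} = 0.5\% . \]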

Test Problem

2D Laplace equation with Dirichlet boundary conditions:
With 2nd-order accuracy, using a 5-point stencil:
With 4th-order accuracy, using a 9-point stencil:
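As a sketch of the standard forms consistent with the stated accuracies and with L = 1 and L = 2 ghost layers (assumed, not quoted from the slide):

\[ u_{xx} + u_{yy} = 0 \ \text{in}\ \Omega, \qquad u = g \ \text{on}\ \partial\Omega . \]

2nd order, 5-point stencil (L = 1):

\[ u_{i,j} = \tfrac{1}{4}\,\bigl(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\bigr) . \]

4th order, 9-point stencil (L = 2), with two neighbors in each coordinate direction:

\[ u_{i,j} = \tfrac{1}{60}\,\bigl[\,16\,(u_{i+1,j}+u_{i-1,j}+u_{i,j+1}+u_{i,j-1}) - (u_{i+2,j}+u_{i-2,j}+u_{i,j+2}+u_{i,j-2})\,\bigr] . \]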

Performance Analysis

Test 1: global size 3200x3200, P = 16, L = 1
Test 2: global size 3200x3200, P = 16, L = 2
Test 3: global size 6400x6400, P = 64, L = 1
Test 4: global size 6400x6400, P = 64, L = 2
The local domain size is always 800x800.

Performance on IBM SP

Performance on IBM SP (cont’d)

Performance on IBM SP (cont’d)

Performance on Cray T3E

Performance on Cray T3E (cont’d)

Performance on Cray T3E (cont’d)

Conclusions

With the new ghost cell expansion method, processor subdomain boundaries are updated once every e+1 time steps instead of at every time step. Both the total message volume and the total number of messages are reduced, and therefore so is the total communication time. The overheads in memory usage and computational cost are minimal.
The Diagonal Communication Elimination technique reduces the number of messages exchanged between processors from 8 to 4 in a 2D domain decomposition and from 26 to 6 in a 3D domain decomposition.
Systematic experiments on the IBM SP and Cray T3E both show communication speedups of up to 170%.