Multi-Grid Esteban Pauli 4/25/06

Overview
Problem Description
Implementation
 – Shared Memory
 – Distributed Memory
 – Other
Performance
Conclusion

Problem Description
Same input and output as Jacobi
Try to speed up the algorithm by spreading boundary values faster
Coarsen to a small problem, then successively solve and refine
Algorithm:
1. for i in 1 .. levels – 1
2.   coarsen level i to i + 1
3. for i in levels .. 2
4.   solve level i
5.   refine level i to i – 1
6. solve level 1
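Since the slides give only this outline, the following is a minimal sequential sketch of the same control flow in C. Everything concrete in it is an illustrative assumption rather than the author's code: grids are flat (n+2)x(n+2) arrays with a one-cell boundary layer, level 1 is the finest grid with n[i+1] = n[i]/2, coarsening averages 2x2 blocks, solving is a plain Jacobi sweep, and refinement is simple injection.

    #define IDX(i, j, n) ((i) * ((n) + 2) + (j))

    /* Restriction: average each 2x2 block of the fine grid into one coarse cell. */
    static void coarsen(const double *fine, double *coarse, int nf)
    {
        int nc = nf / 2;
        for (int i = 1; i <= nc; i++)
            for (int j = 1; j <= nc; j++)
                coarse[IDX(i, j, nc)] =
                    0.25 * (fine[IDX(2*i - 1, 2*j - 1, nf)] + fine[IDX(2*i - 1, 2*j, nf)]
                          + fine[IDX(2*i,     2*j - 1, nf)] + fine[IDX(2*i,     2*j, nf)]);
    }

    /* Jacobi-style relaxation: each cell becomes the average of its four neighbours. */
    static void solve(double *g, double *tmp, int n, int iters)
    {
        for (int it = 0; it < iters; it++) {
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    tmp[IDX(i, j, n)] = 0.25 * (g[IDX(i - 1, j, n)] + g[IDX(i + 1, j, n)]
                                              + g[IDX(i, j - 1, n)] + g[IDX(i, j + 1, n)]);
            /* copy back for the next sweep */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    g[IDX(i, j, n)] = tmp[IDX(i, j, n)];
        }
    }

    /* Interpolation by injection: copy each coarse cell onto the four fine cells
     * it covers, giving the finer level a better starting guess. */
    static void refine(const double *coarse, double *fine, int nc)
    {
        int nf = 2 * nc;
        for (int i = 1; i <= nf; i++)
            for (int j = 1; j <= nf; j++)
                fine[IDX(i, j, nf)] = coarse[IDX((i + 1) / 2, (j + 1) / 2, nc)];
    }

    /* Steps 1-6 of the outline: coarsen all the way down, then solve and refine
     * back up, finishing with a solve on the finest level. */
    void multigrid(double *grid[], double *tmp[], const int n[], int levels, int iters)
    {
        for (int i = 1; i <= levels - 1; i++)
            coarsen(grid[i], grid[i + 1], n[i]);
        for (int i = levels; i >= 2; i--) {
            solve(grid[i], tmp[i], n[i], iters);
            refine(grid[i], grid[i - 1], n[i]);
        }
        solve(grid[1], tmp[1], n[1], iters);
    }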

Problem Description [diagram: the grid is repeatedly coarsened, solved at the coarsest level, then alternately refined and solved back up to the finest level]

Implementation – Key Ideas
Assign a chunk to each processor
Coarsen, refine operations done locally
Solve steps done like Jacobi

Shared Memory Implementations
1. for i in 1 .. levels – 1
2.   coarsen level i to i + 1 (in parallel)
3.   barrier
4. for i in levels .. 2
5.   solve level i (in parallel)
6.   refine level i to i – 1 (in parallel)
7.   barrier
8. solve level 1 (in parallel)

Shared Memory Details
Solve is like shared memory Jacobi – there is true sharing:
1. /* my_* are all locals */
2. for my_i = my_start_i .. my_end_i
3.   for my_j = my_start_j .. my_end_j
4.     current[my_i][my_j][level] = …
Coarsen and refine access only local data – only false sharing is possible:
1. for my_i = my_start_i .. my_end_i
2.   for my_j = my_start_j .. my_end_j
3.     current[my_i][my_j][level] = …[level ± 1]
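As a concrete illustration of the shared-memory version, here is a hedged OpenMP sketch of the solve step. It reuses the flat-array layout and IDX macro from the sequential sketch above; the function name and the copy-back strategy are assumptions, not taken from the slides. The implicit barrier at the end of each omp for provides the synchronization that steps 3 and 7 of the outline call for, and with static scheduling each thread owns a contiguous block of rows, so coarsen and refine loops written the same way can suffer at most false sharing at the block edges.

    /* OpenMP solve for one level: threads split the rows of the grid.
     * The implicit barrier at the end of each "omp for" keeps the sweeps in
     * lockstep, so no thread reads a half-updated array. */
    static void solve_omp(double *g, double *tmp, int n, int iters)
    {
        for (int it = 0; it < iters; it++) {
            /* true sharing: each thread reads one row above/below its own block */
            #pragma omp parallel for schedule(static)
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    tmp[IDX(i, j, n)] = 0.25 * (g[IDX(i - 1, j, n)] + g[IDX(i + 1, j, n)]
                                              + g[IDX(i, j - 1, n)] + g[IDX(i, j + 1, n)]);
            /* copy back only after every thread has finished its rows */
            #pragma omp parallel for schedule(static)
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    g[IDX(i, j, n)] = tmp[IDX(i, j, n)];
        }
    }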

Shared Memory Paradigms
A barrier is all you really need, so this should be easy to program in any shared memory paradigm (UPC, OpenMP, HPF, etc.)
Being able to control distribution (CAF, GA) should help:
 – If the problem is small enough, you only have to worry about the initial misses
 – If it is larger, data will be pushed out of cache and have to be brought back over the network
 – If you have to switch to a different syntax to access remote memory, that is a minus on the "elegance" side, but a plus in that it makes communication explicit

Distributed Memory (MPI)
Almost all work is local; communication is needed only to solve a given level
Algorithm at each PE (looks very sequential):
1. for i in 1 .. levels – 1
2.   coarsen level i to i + 1   // local
3. for i in levels .. 2
4.   solve level i              // see next slide
5.   refine level i to i – 1    // local
6. solve level 1                // see next slide

MPI Solve Function
"Dumb" version:
1. send my edges
2. receive edges
3. compute
Smarter version:
1. send my edges
2. compute middle
3. receive edges
4. compute boundaries
Any other optimizations that can be done in Jacobi can be done here as well
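A hedged MPI sketch of the smarter variant, assuming a 1-D row decomposition in which each rank owns local_rows rows of the current level plus two ghost rows; all names here (solve_step, local_rows, up, down) are illustrative, and ranks on the physical boundary pass MPI_PROC_NULL so their sends and receives become no-ops.

    #include <mpi.h>

    /* One "smarter" solve iteration on this rank's block of a level: the block
     * is (local_rows + 2) x (n + 2) doubles, with row 0 and row local_rows + 1
     * acting as ghost rows.  up/down are the neighbouring ranks. */
    static void solve_step(double *g, double *tmp, int local_rows, int n,
                           int up, int down, MPI_Comm comm)
    {
        int w = n + 2;                       /* width of one padded row */
        MPI_Request req[4];

        /* 1. send my edge rows and post receives for the neighbours' edge rows */
        MPI_Isend(&g[1 * w],                w, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Isend(&g[local_rows * w],       w, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Irecv(&g[0],                    w, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Irecv(&g[(local_rows + 1) * w], w, MPI_DOUBLE, down, 0, comm, &req[3]);

        /* 2. compute the middle rows, which need no remote data */
        for (int i = 2; i <= local_rows - 1; i++)
            for (int j = 1; j <= n; j++)
                tmp[i * w + j] = 0.25 * (g[(i - 1) * w + j] + g[(i + 1) * w + j]
                                       + g[i * w + j - 1] + g[i * w + j + 1]);

        /* 3. wait for the ghost rows, then 4. compute the two boundary rows */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        int edge[2] = { 1, local_rows };
        for (int k = 0; k < 2; k++) {
            int i = edge[k];
            for (int j = 1; j <= n; j++)
                tmp[i * w + j] = 0.25 * (g[(i - 1) * w + j] + g[(i + 1) * w + j]
                                       + g[i * w + j - 1] + g[i * w + j + 1]);
        }
    }

Between iterations the caller swaps g and tmp; moving the MPI_Waitall up before the compute loops turns this into the "dumb" version.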

Distributed Memory (Charm++)
Again, do it like Jacobi
Flow of control is hard to show here
Can send just one message to do all coarsening (as in MPI)
Might get some benefit from overlapping computation and communication by waiting for smaller messages
No benefit from load balancing

Other Paradigms
BSP model (local computation, global communication, barrier): good fit
STAPL (parallel STL): not a good fit (could use a parallel for_each, but the lack of a 2D data structure would make this awkward)
TreadMarks, CID, CASHMERe (distributed shared memory): getting a whole page just to get the boundaries might be too expensive, probably not a good fit
Cilk (spawn processes for graph search): not a good fit

Performance
1024x1024 grid coarsened down to a 256x256 grid, 500 iterations at each level
Sequential time: [value missing] seconds
Left table: 4 PEs; right table: 16 PEs
[Two tables reporting Time (s) and Speed-Up for OpenMP, MPI, Charm++ (no virt.), and Charm++ (4x virt.); the numeric values are not present in this transcript]

Summary
Almost identical to Jacobi
Very predictable application
Easy load balancing
Good for shared memory, MPI
Charm++: virtualization helps; probably need more data points to see if it can beat MPI
DSM: false sharing might be too high a cost
Parallel paradigms for irregular programs: not a good fit