Published by Amice Paul. Modified over 8 years ago.
1
Multi-Grid
Esteban Pauli
4/25/06
2
Overview
• Problem Description
• Implementation
– Shared Memory
– Distributed Memory
– Other
• Performance
• Conclusion
3
Problem Description
• Same input, output as Jacobi
• Try to speed up algorithm by spreading boundary values faster
• Coarsen to small problem, successively solve, refine
• Algorithm:
    for i in 1 .. levels - 1
        coarsen level i to i + 1
    for i in levels .. 2, -1
        solve level i
        refine level i to i - 1
    solve level 1
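The algorithm above can be sketched sequentially as follows. This is a minimal illustration, not the presenter's code: the slides do not say which coarsen and refine operators were used, so this sketch assumes injection (keep every other point) for coarsening and bilinear interpolation for refinement, on square grids of size 2^k + 1 with fixed boundary values. All function names are hypothetical.

```python
def jacobi(grid, iterations):
    """Run a fixed number of Jacobi sweeps; the outer ring is a fixed boundary."""
    n = len(grid)
    for _ in range(iterations):
        new = [row[:] for row in grid]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
        grid = new
    return grid


def coarsen(fine):
    """Assumed operator: injection, keeping every other row and column."""
    return [row[::2] for row in fine[::2]]


def refine(coarse, fine):
    """Assumed operator: bilinear interpolation of the coarse solution
    into the interior of the finer grid; the fine boundary is untouched."""
    n = len(fine)
    out = [row[:] for row in fine]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            i0, i1 = i // 2, (i + 1) // 2   # equal when i is even
            j0, j1 = j // 2, (j + 1) // 2   # equal when j is even
            out[i][j] = 0.25 * (coarse[i0][j0] + coarse[i0][j1]
                                + coarse[i1][j0] + coarse[i1][j1])
    return out


def multigrid(grid, levels, iterations):
    """Coarsen down to `levels`, then solve and refine back up to level 1."""
    grids = [grid]                              # grids[0] is level 1 (finest)
    for _ in range(levels - 1):                 # coarsen level i to i + 1
        grids.append(coarsen(grids[-1]))
    for lvl in range(levels - 1, 0, -1):        # levels .. 2
        grids[lvl] = jacobi(grids[lvl], iterations)
        grids[lvl - 1] = refine(grids[lvl], grids[lvl - 1])
    return jacobi(grids[0], iterations)         # solve level 1
```

The coarse solves are cheap and move boundary information across the whole domain in few sweeps, which is the speed-up the slide refers to.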
4
Problem Description
[Diagram: Coarsen, Solve, Refine, Solve, Refine]
5
Implementation – Key Ideas
• Assign a chunk to each processor
• Coarsen, refine operations done locally
• Solve steps done like Jacobi
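One way to assign a chunk to each processor is a 2D block decomposition. The sketch below is an assumption for illustration: the slides do not say how chunks were laid out, and the names `block_bounds` and `chunk_for` are hypothetical. Ranges are half-open, whereas the slides' `my_start_i .. my_end_i` notation is inclusive.

```python
def block_bounds(n, parts, index):
    """Half-open range [start, end) of n items owned by part `index`.
    Any remainder is spread over the first `n % parts` parts."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


def chunk_for(rank, grid_n, pe_rows, pe_cols):
    """Map a PE rank to its (row range, column range) on a
    pe_rows x pe_cols arrangement of PEs over a grid_n x grid_n grid."""
    pi, pj = divmod(rank, pe_cols)
    return block_bounds(grid_n, pe_rows, pi), block_bounds(grid_n, pe_cols, pj)


if __name__ == "__main__":
    # 1024x1024 grid on 4 PEs arranged 2x2
    for rank in range(4):
        print(rank, chunk_for(rank, 1024, 2, 2))
```

If every level's size is divisible by the PE arrangement, the same PE owns the matching region at each level, which is what lets coarsen and refine stay local.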
6
Shared Memory Implementations
    for i in 1 .. levels - 1
        coarsen level i to i + 1 (in parallel)
        barrier
    for i in levels .. 2, -1
        solve level i (in parallel)
        refine level i to i - 1 (in parallel)
        barrier
    solve level 1 (in parallel)
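The control flow above can be sketched with Python threads and `threading.Barrier`. This is only a skeleton under stated assumptions: the barrier is taken to sit inside each loop, the `coarsen`, `solve`, and `refine` callbacks are hypothetical placeholders for per-chunk work, and the barrier is passed to `solve` because a Jacobi-style solve needs its own synchronization between sweeps.

```python
import threading


def run_shared_memory(levels, num_threads, coarsen, solve, refine):
    """Run the multigrid phases on num_threads workers, separated by barriers."""
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        for i in range(1, levels):          # 1 .. levels - 1
            coarsen(tid, i)                 # touches only this thread's chunk
            barrier.wait()
        for i in range(levels, 1, -1):      # levels .. 2
            solve(tid, i, barrier)          # may wait on the barrier per sweep
            refine(tid, i)                  # touches only this thread's chunk
            barrier.wait()
        solve(tid, 1, barrier)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Every worker reaches the barrier the same number of times, so the skeleton cannot deadlock as long as the callbacks themselves use the barrier symmetrically.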
7
Shared Memory Details
• Solve is like shared memory Jacobi: there is true sharing
    /* all my_* variables are local */
    for my_i = my_start_i .. my_end_i
        for my_j = my_start_j .. my_end_j
            current[my_i][my_j][level] = …
• Coarsen and refine access only local data, so only false sharing is possible
    for my_i = my_start_i .. my_end_i
        for my_j = my_start_j .. my_end_j
            current[my_i][my_j][level] = …[level ± 1]
8
Shared Memory Paradigms
• A barrier is all you really need, so this should be easy to program in any shared memory paradigm (UPC, OpenMP, HPF, etc.)
• Being able to control data distribution (CAF, GA) should help
– If the problem is small enough, you only have to worry about initial misses
– If it is larger, data will be pushed out of cache and has to be brought back over the network
– Having to switch to a different syntax to access remote memory is a minus on the “elegance” side, but a plus in that it makes communication explicit
9
Distributed Memory (MPI)
• Almost all work local, only communicate to solve a given level
• Algorithm at each PE (looks very sequential):
    for i in 1 .. levels - 1
        coarsen level i to i + 1    // local
    for i in levels .. 2, -1
        solve level i               // see next slide
        refine level i to i - 1     // local
    solve level 1                   // see next slide
10
MPI Solve Function
• “Dumb”
    1. send my edges
    2. receive edges
    3. compute
• Smarter
    1. send my edges
    2. compute middle
    3. receive edges
    4. compute boundaries
• Can do any other optimizations which can be done in Jacobi
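The difference between the two orderings can be shown without MPI. The sketch below assumes a row-strip decomposition, where each PE owns a block of rows and needs one halo row from the neighbor above and one from the neighbor below. `recv_halos` is a hypothetical stand-in for waiting on the incoming edges; in a real MPI code that role would typically be played by non-blocking receives completed with a wait call. Sending is omitted, since it comes first in both variants.

```python
def update_row(above, row, below):
    """One Jacobi update of a row; the first and last columns stay fixed."""
    new = row[:]
    for j in range(1, len(row) - 1):
        new[j] = 0.25 * (above[j] + below[j] + row[j - 1] + row[j + 1])
    return new


def solve_dumb(block, recv_halos):
    """Wait for both edges, then compute every row."""
    top, bottom = recv_halos()
    rows = [top] + block + [bottom]
    return [update_row(rows[r - 1], rows[r], rows[r + 1])
            for r in range(1, len(rows) - 1)]


def solve_smarter(block, recv_halos):
    """Compute the middle from local data, then wait for edges
    and finish the two boundary rows."""
    n = len(block)
    new = [None] * n
    for r in range(1, n - 1):               # middle rows need no remote data
        new[r] = update_row(block[r - 1], block[r], block[r + 1])
    top, bottom = recv_halos()              # edges are only needed from here on
    new[0] = update_row(top, block[0], block[1] if n > 1 else bottom)
    new[n - 1] = update_row(block[n - 2] if n > 1 else top, block[n - 1], bottom)
    return new
```

Both functions produce the same result for one sweep; the smarter version simply gives the messages time to arrive while the interior is being computed.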
11
Distributed Memory (Charm++)
• Again, do it like Jacobi
• Flow of control is hard to show here
• Can send just one message to do all coarsening (like in MPI)
• Might get some benefit from overlapping computation and communication by waiting for smaller messages
• No benefit from load balancing
12
Other Paradigms
• BSP model (local computation, global communication, barrier): good fit
• STAPL (parallel STL): not a good fit (could use parallel for_each, but lack of 2D data structure would make this awkward)
• Treadmarks, CID, CASHMERe (distributed shared memory): getting a whole page to get just the boundaries might be too expensive, probably not a good fit
• Cilk (spawn processes for graph search): not a good fit
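The page-granularity concern for distributed shared memory can be made concrete with a rough count. Everything numeric here is an assumption for illustration, not taken from the slides: 4096-byte pages, 8-byte values, a page-aligned row-major array, and a row size in bytes that either divides the page size or is a multiple of it. The function names are hypothetical.

```python
import math


def pages_for_row(cols, elem_bytes=8, page_bytes=4096):
    """Pages touched when reading one contiguous boundary row."""
    return math.ceil(cols * elem_bytes / page_bytes)


def pages_for_column(rows, cols, elem_bytes=8, page_bytes=4096):
    """Pages touched when reading one boundary column of a row-major array."""
    row_bytes = cols * elem_bytes
    if row_bytes >= page_bytes:
        return rows                          # each element sits on its own page
    rows_per_page = page_bytes // row_bytes
    return math.ceil(rows / rows_per_page)


if __name__ == "__main__":
    rows = cols = 256
    pages = pages_for_column(rows, cols)
    fetched = pages * 4096                   # bytes moved at page granularity
    useful = rows * 8                        # bytes actually needed
    print(pages, fetched // useful)          # pages touched, overhead factor
```

Under these assumptions a boundary row is cheap, but a boundary column drags in many pages for very little useful data.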
13
Performance
• 1024x1024 grid down to 256x256 grid, 500 iterations at each level
• Sequential time: 42.83 seconds

4 PEs
Paradigm              Time (s)   Speed-up
OpenMP                   0.674      63.55
MPI                     10.68        4.01
Charm++ (no virt.)      22.36        1.92
Charm++ (4x virt.)      10.49        4.08

16 PEs
Paradigm              Time (s)   Speed-up
OpenMP                   1.28       33.60
MPI                      2.67       16.07
Charm++ (no virt.)       5.87        7.30
Charm++ (4x virt.)       2.67       16.04
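The speed-up column can be cross-checked as sequential time divided by parallel time. In the sketch below the OpenMP entries are read as 0.674 s (4 PEs) and 1.28 s (16 PEs), the only split of the fused digits that is consistent with that ratio. The `efficiency` helper (speed-up divided by PE count) is an addition for illustration and does not appear on the slide.

```python
SEQUENTIAL_S = 42.83  # sequential time from the slide, in seconds

# (time in seconds, reported speed-up), keyed by PE count, then paradigm
RESULTS = {
    4: {
        "OpenMP": (0.674, 63.55),
        "MPI": (10.68, 4.01),
        "Charm++ (no virt.)": (22.36, 1.92),
        "Charm++ (4x virt.)": (10.49, 4.08),
    },
    16: {
        "OpenMP": (1.28, 33.60),
        "MPI": (2.67, 16.07),
        "Charm++ (no virt.)": (5.87, 7.30),
        "Charm++ (4x virt.)": (2.67, 16.04),
    },
}


def speedup(parallel_s):
    """Speed-up relative to the sequential run."""
    return SEQUENTIAL_S / parallel_s


def efficiency(parallel_s, pes):
    """Speed-up per PE; 1.0 means linear scaling."""
    return speedup(parallel_s) / pes


if __name__ == "__main__":
    for pes, rows in RESULTS.items():
        for name, (seconds, reported) in rows.items():
            print(f"{pes:>2} PEs  {name:<20} computed {speedup(seconds):6.2f}"
                  f"  reported {reported:6.2f}"
                  f"  efficiency {efficiency(seconds, pes):5.2f}")
```

The recomputed values agree with the reported ones up to rounding of the printed times. MPI and Charm++ with 4x virtualization come out close to linear at both PE counts, while the OpenMP ratios are well above the PE count.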
14
Summary
• Almost identical to Jacobi
• Very predictable application
• Easy load balancing
• Good for shared memory, MPI
• Charm++: virtualization helps, probably need more data points to see if it can beat MPI
• DSM: false sharing might be too high a cost
• Parallel paradigms for irregular programs not a good fit