Published by Agnes Gaines. Modified over 9 years ago.
1
Parallelizing Spacetime Discontinuous Galerkin Methods
Jonathan Booth
University of Illinois at Urbana/Champaign
In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan
This research made possible via NSF grant DMR 01-21695
http://charm.cs.uiuc.edu
2
Parallel Programming Lab
Led by Professor Laxmikant Kale
Application-oriented
–Research is driven by real applications and their needs: NAMD, CSAR Rocket Simulation (Roc*), Spacetime Discontinuous Galerkin, Petaflops Performance Prediction (Blue Gene)
–Focus on scalable performance for real applications
3
Charm++ Overview
In development for roughly ten years
Based on C++
Runs on many platforms
–Desktops
–Clusters
–Supercomputers
Overlays a C layer called Converse
–Allows multiple languages to work together
4
Charm++: Programmer View
System of objects
Asynchronous communication via method invocation
Use an object identifier to refer to an object
User sees each object execute its methods atomically
–As if on its own processor
[Figure legend: Processor, Object/Task]
5
Charm++: System View
Set of objects invoked by messages
Set of processors of the physical machine
Keeps track of object to processor mapping
Routes messages between objects
[Figure legend: Processor, Object/Task]
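The two views above can be illustrated with a toy sketch. This is not the Charm++ API: `ToyRuntime`, `Accumulator`, and every method name here are hypothetical, and the "processors" are just queues inside one Python process. It only shows the idea of sending a message to an object identifier while the runtime tracks where the object lives and routes the message there.

```python
from collections import deque


class ToyRuntime:
    """Toy message-driven runtime; an illustration, not the Charm++ API."""

    def __init__(self, num_processors):
        # One message queue per (simulated) processor.
        self.queues = [deque() for _ in range(num_processors)]
        self.objects = {}    # object id -> object
        self.location = {}   # object id -> processor index

    def register(self, obj_id, obj, processor):
        self.objects[obj_id] = obj
        self.location[obj_id] = processor

    def send(self, obj_id, method, *args):
        """Asynchronous invocation: enqueue the message and return at once."""
        self.queues[self.location[obj_id]].append((obj_id, method, args))

    def migrate(self, obj_id, new_processor):
        """Move an object and forward its pending messages with it."""
        old = self.location[obj_id]
        pending = [m for m in self.queues[old] if m[0] == obj_id]
        self.queues[old] = deque(m for m in self.queues[old] if m[0] != obj_id)
        self.queues[new_processor].extend(pending)
        self.location[obj_id] = new_processor

    def run(self):
        """Deliver messages until all queues are empty.

        Each method call runs to completion before the next message is
        delivered, which is the "atomic" execution the programmer sees.
        """
        while any(self.queues):
            for queue in self.queues:
                if queue:
                    obj_id, method, args = queue.popleft()
                    getattr(self.objects[obj_id], method)(*args)


class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount


runtime = ToyRuntime(num_processors=2)
acc = Accumulator()
runtime.register("acc", acc, processor=0)
runtime.send("acc", "add", 5)   # queued on processor 0
runtime.migrate("acc", 1)       # pending message follows the object
runtime.send("acc", "add", 7)   # sender still uses only the identifier
runtime.run()
print(acc.total)                # 12
```

The sender never names a processor, so the mapping can change without any change to the calling code.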
6
Charm++ Benefits
Program is not tied to a fixed number of processors
–No problem if the program needs 128 processors and only 45 are available
–Called processor virtualization
Load balancing accomplished automatically
–User writes a short routine to transfer an object between processors
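As a small illustration of the 128-versus-45 example, the sketch below places 128 objects on 45 processors. The round-robin rule and the name `place_round_robin` are assumptions made for illustration; the slides do not say how the runtime chooses the initial placement, and a load balancer may move objects afterwards.

```python
from collections import Counter


def place_round_robin(num_objects, num_processors):
    """Return a list mapping object index -> processor index.

    Hypothetical placement rule used only for illustration.
    """
    return [obj % num_processors for obj in range(num_objects)]


placement = place_round_robin(num_objects=128, num_processors=45)
objects_per_processor = Counter(placement)

# 128 = 2 * 45 + 38, so 38 processors hold 3 objects and 7 hold 2.
print(min(objects_per_processor.values()), max(objects_per_processor.values()))  # 2 3
```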
7
Load Balancing - Green Process Starts Heavy Computation
[Figure: processors A, B, C]
8
Yellow Processes Migrate Away – System Handles Message Routing
[Figure: processors A, B, C]
9
Load Balancing
Load balancing isn’t solely dependent on CPU usage
Balancers consider network usage as well
–Can move objects to lessen network bandwidth usage
Migrating an object to disk instead of another processor gives checkpoint/restart, out-of-core execution
10
Parallel Spacetime Discontinuous Galerkin
Mesh generation is an advancing front algorithm
–Adds an independent set of elements called patches to the mesh
Spacetime methods are set up in such a way that they are easy to parallelize
–Each patch depends only on inflow elements; the cone constraint ensures no other dependencies
–Amount of data per patch is small, so it is inexpensive to send a patch and its inflow elements to another processor
11
Mesh Generation
[Figure legend: Unsolved Patches]
12
Mesh Generation
[Figure legend: Solved Patches, Unsolved Patches]
13
Mesh Generation
[Figure legend: Solved Patches, Unsolved Patches, Refinement]
14
Parallelization Method (1D)
Master-Slave method
–Centralized mesh generation
–Distributed physics solver code
–Simplistic implementation, but fast to get running; provides an object migration sanity check
No “time-step”
–As soon as a patch returns, the master generates any new patches it can and sends them off to be solved
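The "no time-step" master loop can be sketched as a small simulation. Everything here is an assumption made for illustration: the name `run_master`, the readiness rule (a vertex of a 1D front may advance when no neighbor is lower and no neighbor is in flight), and the fixed solve cost. It is not the authors' mesh generator or solver, only the shape of the control flow.

```python
import heapq


def run_master(num_vertices, target_time, num_workers, solve_cost=1.0):
    """Simulate a master that hands out 1D patches with no global time step."""
    height = [0] * num_vertices      # front time at each vertex
    busy = [False] * num_vertices    # vertex belongs to an in-flight patch
    in_flight = []                   # heap of (finish_time, vertex)
    clock = 0.0
    solved = 0

    def ready(i):
        # Hypothetical dependency rule standing in for the cone constraint:
        # advance only a local minimum of the front whose neighbors are idle.
        if busy[i] or height[i] >= target_time:
            return False
        for j in (i - 1, i + 1):
            if 0 <= j < num_vertices and (busy[j] or height[j] < height[i]):
                return False
        return True

    def dispatch():
        # Send every patch that can currently be generated to a free worker.
        for i in range(num_vertices):
            if len(in_flight) >= num_workers:
                return
            if ready(i):
                busy[i] = True
                heapq.heappush(in_flight, (clock + solve_cost, i))

    dispatch()
    while in_flight:
        clock, i = heapq.heappop(in_flight)   # a solved patch returns
        busy[i] = False
        height[i] += 1
        solved += 1
        dispatch()                            # react at once, no barrier

    return height, solved, clock


height, solved, clock = run_master(num_vertices=8, target_time=3, num_workers=4)
print(height, solved, clock)
```

The master never waits for a whole "step" to finish: each returning patch triggers a new round of dispatching, so workers stay busy as long as independent patches exist.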
15
Results - Patches / Second
16
Scaling Problems
Speedup is ideal at 4 slave processors
After 4 slaves, diminishing speedup occurs
Possible sources:
–Network bandwidth overload
–Charm++ system overhead (grainsize control)
–Mesh generator overload
Problem doesn’t scale down
–More processors don’t slow the computation down
17
Network Bandwidth
Size of a patch to send both ways is 2048 bytes (very conservative estimate)
Can compute 36 patches/(second*CPU)
Each CPU needs 72 kbytes/second
100 Mbit Ethernet provides roughly 10 Mbyte/sec
Network can support ~130 CPUs
–So the bottleneck must not be a lack of network bandwidth
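The bandwidth estimate can be checked with a few lines of arithmetic. The constants are the figures quoted on the slide; the variable names are introduced here, and the 10 Mbyte/sec link rate is the slide's rounded figure rather than a measured value.

```python
PATCH_ROUND_TRIP_BYTES = 2048        # from the slide (conservative estimate)
PATCHES_PER_SECOND_PER_CPU = 36      # from the slide
LINK_BYTES_PER_SECOND = 10_000_000   # slide's rounded figure for 100 Mbit Ethernet

bytes_per_cpu = PATCH_ROUND_TRIP_BYTES * PATCHES_PER_SECOND_PER_CPU
max_cpus = LINK_BYTES_PER_SECOND // bytes_per_cpu

print(bytes_per_cpu)   # 73728 bytes/s, i.e. 72 * 1024
print(max_cpus)        # 135
```

This lands in the same range as the slide's ~130 CPUs, far above the handful of processors at which the speedup flattens.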
18
Charm++ System Overhead (Grainsize Control)
Grainsize is a measure of the smallest unit of work
Too small and overhead dominates
–Network latency overhead
–Object creation overhead
Each patch takes 1.7 ms to set up the connection to send (both ways)
Can send ~550 patches/sec to remote processors
–Again, higher than the observed patches/second rate
Per-patch overhead can be reduced by sending multiple patches at once (a larger grainsize)
–Speeds up the computation but speedup still flattens out after 8 processors
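A rough check of the send-rate figure, and of why batching helps, is sketched below. The function name and the assumption that the 1.7 ms setup cost is paid once per message regardless of how many patches it carries are introduced here for illustration; the slides do not state how the cost scales with batch size.

```python
SETUP_MS_PER_MESSAGE = 1.7   # from the slide


def patches_per_second(patches_per_message=1, setup_ms=SETUP_MS_PER_MESSAGE):
    """Upper bound on the patch rate if only message setup limits sending.

    Assumes (hypothetically) that setup cost does not grow with batch size.
    """
    return patches_per_message * 1000.0 / setup_ms


print(round(patches_per_second()))    # 588
print(round(patches_per_second(4)))   # 2353
```

The unbatched bound of about 588 patches/sec is slightly above the slide's more conservative ~550, and both are well above the observed rate.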
19
Mesh Generation
With 0 slave processors, 31 ms/patch
With 1 slave processor, 27 ms/patch
Geometry code takes 4 ms to generate a patch
–Mesh generator needs a bit more time due to Charm++ message sending overhead
Leads to fewer than 250 patches/second
Can’t trivially speed this up
–Would have to parallelize mesh generation
–Parallel mesh generation would also lighten network load if the mesh were fully distributed to slave nodes
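The mesh generator ceiling can be put into a simple idealized model: throughput grows with the number of slaves until the master cannot generate patches any faster. The model and the name `modeled_throughput` are assumptions for illustration, not measured results; the 36 patches/(second*CPU) rate is the figure from the network bandwidth slide, which is close to the 27 ms/patch quoted here.

```python
GEOMETRY_MS_PER_PATCH = 4.0       # from the slide
PATCHES_PER_SECOND_PER_CPU = 36   # from the network bandwidth slide

# Best case for the master: one patch every 4 ms, ignoring message overhead.
master_rate = 1000.0 / GEOMETRY_MS_PER_PATCH


def modeled_throughput(num_slaves):
    """Idealized patches/second: slaves scale linearly until the master saturates."""
    return min(num_slaves * PATCHES_PER_SECOND_PER_CPU, master_rate)


saturation_slaves = master_rate / PATCHES_PER_SECOND_PER_CPU

for n in (1, 4, 8, 16):
    print(n, modeled_throughput(n))
print(saturation_slaves)
```

Under these assumptions the master saturates at about 7 slaves, and sooner once message sending overhead is included, which is consistent with speedup tailing off after a handful of processors.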
20
Testing the Mesh Generator Bottleneck
Does speeding up the mesh generator give better results?
Leaves the question of how to speed up the mesh generator
–The cluster used is a P3 Xeon 500 MHz
–So run the mesh generator on something faster (a P4 2.8 GHz)
–Everything still on 100 Mbit network
21
Fast Mesh Generator Results
22
Future Directions
Parallelize geometry/mesh generation
–Easy to do in theory
–More complex in practice with refinement, coarsening
–Lessens network bandwidth consumption: only border elements of all meshes have to be sent, compared to all elements sent right now
–Better cache performance
23
More Future Directions
Send only necessary data
–Currently send everything, needed or not
Use migration to balance load rather than slaves
–Means we’ll also get checkpoint/restart and out-of-core execution for free
–Also means we can load balance away some of the network communication
Integrate 2D mesh generation/physics code
–Nothing in the parallel code knows the dimensionality