
Slide 1: Conclusion of Communication Optimizations and Start of Load Balancing Techniques
CS320 Spring 2003
Laxmikant Kale, http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign

Slide 2: Asynchronous reductions: Jacobi
Convergence check:
– At the end of each Jacobi iteration, we do a convergence check
– Via a scalar reduction (on maxError)
But note:
– Each processor can keep its old data for one extra iteration
So, use the result of the reduction one iteration later!
– The deposit of the reduction is separated from the use of its result.
– MPI_Ireduce(..) returns a handle (like MPI_Irecv)
– Later, MPI_Wait(handle) blocks only when you actually need the result.
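
Below is a minimal sketch of this pattern in C, assuming an MPI-3 implementation. The slide's MPI_Ireduce delivers the result only at the root, so the sketch uses MPI_Iallreduce since every rank needs the convergence flag; the variable names, TOLERANCE, and the decaying stand-in for the sweep's error are illustrative, not from the slides.

```c
/* Minimal sketch of a delayed convergence check with a non-blocking
 * reduction (MPI-3). The Jacobi sweep is elided; a decaying value stands
 * in for the local maximum error it would compute. */
#include <mpi.h>
#include <stdio.h>

#define TOLERANCE 1e-6

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local_max_error = 1.0, send_err = 0.0, global_err = 1.0;
    MPI_Request req = MPI_REQUEST_NULL;
    int iter = 0, converged = 0;

    while (!converged) {
        /* ... one Jacobi sweep would go here; this stand-in error decays ... */
        local_max_error *= 0.5;

        if (iter > 0) {
            /* Complete the reduction posted in the PREVIOUS iteration:
             * we use its result one iteration late, as the slide suggests. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            converged = (global_err < TOLERANCE);
        }
        /* Post this iteration's reduction. send_err is a separate buffer so
         * the next sweep never touches a pending operation's send buffer. */
        send_err = local_max_error;
        MPI_Iallreduce(&send_err, &global_err, 1, MPI_DOUBLE, MPI_MAX,
                       MPI_COMM_WORLD, &req);
        iter++;
    }
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (global_err < TOLERANCE)
        printf("converged after %d iterations\n", iter);
    MPI_Finalize();
    return 0;
}
```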

Slide 3: Asynchronous reductions in Jacobi
[Figure: two processor timelines alternating compute and reduction phases. With a synchronous reduction there is an idle gap between compute phases; with the asynchronous reduction that gap is avoided.]

Slide 4: Performance tool snapshots
From an ab initio molecular dynamics application
– (PPL in collaboration with Glenn Martyna and Mark Tuckerman)

Slide 5: Utilization graph [screenshot]

Slide 6: Profile view: processors on the x-axis, with a stacked bar chart of the time spent on each

Slide 7: Overview: processors on the y-axis, time along the x-axis; white = busy, black = idle

Slide 8: Timeline view [screenshot]

Slides 9 and 10: [additional performance tool screenshots, no captions]

Slide 11: Load Balancing Techniques
CS320 Spring 2003
Laxmikant Kale, http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign

Slide 12: How to diagnose load imbalance?
It is often hidden in statements such as:
– "MPI_Barrier is too slow"
– "MPI_Reduce is too slow"
– Very high synchronization overhead: most processors are waiting at a reduction
Count the total amount of computation (ops/flops) per processor
– In each phase!
– Because the balance may change from phase to phase
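
As a rough way to collect such per-phase numbers, here is a minimal MPI sketch; wall-clock time from MPI_Wtime() stands in for an op/flop count, and NUM_PHASES and the empty phase bodies are placeholders.

```c
/* Minimal sketch of per-phase load measurement: compare the max load to
 * the average load for every phase. Time stands in for op/flop counts. */
#include <mpi.h>
#include <stdio.h>

#define NUM_PHASES 3

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int phase = 0; phase < NUM_PHASES; phase++) {
        double t0 = MPI_Wtime();
        /* ... the work of this phase on this processor ... */
        double my_load = MPI_Wtime() - t0;

        double max_load, sum_load;
        MPI_Reduce(&my_load, &max_load, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&my_load, &sum_load, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            double avg = sum_load / nprocs;
            /* Finish time tracks the max load, not the average. */
            printf("phase %d: max=%.4fs avg=%.4fs ratio=%.2f\n",
                   phase, max_load, avg, avg > 0 ? max_load / avg : 0.0);
        }
    }
    MPI_Finalize();
    return 0;
}
```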

Slide 13: Golden Rule of Load Balancing
Golden Rule: it is OK if a few processors idle, but avoid having processors that are overloaded with work.
Finish time = max over i of {time on the i'th processor} (setting aside data dependence and communication overhead issues).
Example: 50,000 tasks of equal size on 500 processors:
– A: all processors get 99 tasks, except the last 5, which get 100 + 99 = 199
– B: all processors get 101 tasks, except the last 5, which get 1
Fallacy: the objective of load balancing is to minimize the variance in load across processors.
The two situations have identical variance, but situation A is much worse: it finishes in 199 task-times instead of 101.
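
To make the arithmetic concrete, this small self-contained program checks the two distributions from the example above (the task counts are taken directly from the slide): both have the same variance, but very different finish times.

```c
/* Check of the two load distributions from the slide: identical variance,
 * but the finish time (the max) differs by almost a factor of two. */
#include <stdio.h>

#define P 500

static void report(const char *name, const int load[P]) {
    long sum = 0;
    int max = 0;
    for (int i = 0; i < P; i++) {
        sum += load[i];
        if (load[i] > max) max = load[i];
    }
    double avg = (double)sum / P, var = 0.0;
    for (int i = 0; i < P; i++)
        var += (load[i] - avg) * (load[i] - avg);
    var /= P;
    printf("%s: total=%ld finish_time(max)=%d variance=%.1f\n", name, sum, max, var);
}

int main(void) {
    int a[P], b[P];
    for (int i = 0; i < P; i++) {
        a[i] = (i < P - 5) ? 99 : 199;   /* scenario A */
        b[i] = (i < P - 5) ? 101 : 1;    /* scenario B */
    }
    report("A", a);  /* finish time 199 */
    report("B", b);  /* finish time 101, same variance */
    return 0;
}
```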

Slide 14: Amdahl's Law and grainsize
Before we get to load balancing:
Original "law":
– If a program has a K% sequential section, then speedup is limited to 100/K, even if the rest of the program is parallelized completely.
Grainsize corollary:
– If any individual piece of work takes more than K time units, and the sequential program takes T_seq, then speedup is limited to T_seq / K.
So:
– Examine performance data via histograms to find the sizes of remappable work units
– If some are too big, change the decomposition method to make smaller units
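
The two bounds from the slide, written out explicitly; the symbols s = K/100 (sequential fraction), P (number of processors), and g_max (largest indivisible piece of work) are notation introduced here, not from the slide.

```latex
% Amdahl's law, with sequential fraction s = K/100 and P processors
% (the remaining fraction 1 - s is assumed perfectly parallelized):
\[
  S(P) \le \frac{1}{\,s + \frac{1-s}{P}\,} \le \frac{1}{s} = \frac{100}{K}.
\]
% Grainsize corollary: the parallel time can never drop below the largest
% indivisible piece of work g_max, so
\[
  S \le \frac{T_{\mathrm{seq}}}{g_{\max}}.
\]
```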

Slide 15: Grainsize example: molecular dynamics
In the molecular dynamics program NAMD:
– While trying to scale it to 2000 processors
– The sequential step time was 57 seconds
– To run on 2000 processors, no object should take more than about 28 msec (57 s / 2000)
– Analysis using Projections showed the following histogram:

Slide 16: Grainsize analysis via histograms
[Histogram of object grain sizes; a few objects, labeled "Problem", are much larger than the rest.]
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.

Slide 17: Grainsize reduced [histogram after splitting]

Slide 18: Grainsize: LeanMD for Blue Gene/L
BG/L is a planned IBM machine with 128k processors.
Here, we need even more objects:
– Generalize the hybrid decomposition scheme from 1-away to k-away
– 2-away: cubes are half the size.

Slide 19: [Decomposition snapshots with 5,000 vps, 76,000 vps, and 256,000 vps]

Slide 20: Load balancing strategies
Classified by when balancing is done:
– Initially
– Dynamic: periodically
– Dynamic: continuously
Classified by whether decisions are taken with global information:
– Fully centralized: quite a good choice when the load balancing period is long
– Fully distributed: each processor knows only about a constant number of neighbors; extreme case: totally local decisions (send work to a random destination processor, with some probability)
– Use aggregated global information, plus detailed neighborhood information

Slide 21: Load balancing: unrestricted exchange
This is an initial or periodic strategy.
Each processor reads (or has) N_i particles.
Before doing interesting things with the data, we want to distribute it equally across processors.
It doesn't matter where each piece of data goes
– No constraints
Issues:
– How to decide who sends data to whom
– How to minimize communication overhead in the process

Slide 22: Balancing the number of data items (contd.)
Find the average (avg) using a reduction
– Each processor now knows whether it is above or below avg
– Collect this information (the load vector) globally
Then:
– Sort all donors (L_i > avg) by decreasing L_i
– Sort all receivers (L_i < avg) by decreasing need (avg − L_i)
– For each donor, assign the destination(s) for its extra data, using the largest-need receiver first
– This tends to produce the fewest messages, but only as a heuristic
– Each processor can replicate this calculation! Assuming each has received the load vector, there is no need to broadcast the results.
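
Every processor can run the same matching on the gathered load vector and arrive at the same transfer plan. Below is a minimal sketch of that replicated calculation; the Entry struct, the exact greedy pairing, and the example loads are illustrative.

```c
/* Sketch of the replicated donor/receiver matching: every processor runs
 * this same code on the gathered load vector and derives the same set of
 * (donor, receiver, count) transfers. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int proc; long excess; } Entry;    /* surplus or need */

static int by_desc(const void *a, const void *b) {
    long d = ((const Entry *)b)->excess - ((const Entry *)a)->excess;
    return (d > 0) - (d < 0);
}

/* load[] has nprocs entries; prints the transfers each processor would do. */
static void plan_transfers(const long *load, int nprocs) {
    long total = 0;
    for (int i = 0; i < nprocs; i++) total += load[i];
    long avg = total / nprocs;                       /* ignore remainder for brevity */

    Entry *donors = malloc(nprocs * sizeof(Entry));
    Entry *recvs  = malloc(nprocs * sizeof(Entry));
    int nd = 0, nr = 0;
    for (int i = 0; i < nprocs; i++) {
        if (load[i] > avg) donors[nd++] = (Entry){ i, load[i] - avg };
        else if (load[i] < avg) recvs[nr++] = (Entry){ i, avg - load[i] };
    }
    qsort(donors, nd, sizeof(Entry), by_desc);       /* largest surplus first */
    qsort(recvs,  nr, sizeof(Entry), by_desc);       /* largest need first    */

    int r = 0;
    for (int d = 0; d < nd; d++) {
        while (donors[d].excess > 0 && r < nr) {
            long amount = donors[d].excess < recvs[r].excess
                        ? donors[d].excess : recvs[r].excess;
            printf("proc %d sends %ld items to proc %d\n",
                   donors[d].proc, amount, recvs[r].proc);
            donors[d].excess -= amount;
            recvs[r].excess  -= amount;
            if (recvs[r].excess == 0) r++;           /* move to next receiver */
        }
    }
    free(donors); free(recvs);
}

int main(void) {
    long load[] = { 120, 80, 100, 140, 60 };         /* example load vector */
    plan_transfers(load, 5);
    return 0;
}
```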

Slide 23: Balancing using dimensional exchange
log P phases: exchange information and then data with each hypercube neighbor
– Send a message saying how many items you have
– Compare your count with your neighbor's, calculate the average, and send the overage to them
– Load is balanced at the end of the log P phases
In each phase, the two halves of the machine are perfectly balanced with each other
– After the first phase, the two planes in the figure are equally loaded
– There is no need to return to exchanging data across the planes (the red links in the figure)
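
A minimal sketch of one way to code the phases, assuming the number of processors is a power of two; only the item counts are exchanged here, and the actual data transfer is left as comments.

```c
/* Sketch of dimensional exchange on a hypercube of P = 2^d processors:
 * in phase k, each rank pairs with rank ^ (1 << k), the two average their
 * item counts, and the heavier side would ship the overage. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* assumed to be a power of 2 */

    long my_items = 100 + 37 * rank;          /* stand-in for the real count */

    for (int bit = 1; bit < nprocs; bit <<= 1) {
        int partner = rank ^ bit;
        long their_items;
        MPI_Sendrecv(&my_items, 1, MPI_LONG, partner, 0,
                     &their_items, 1, MPI_LONG, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        long avg = (my_items + their_items) / 2;
        if (my_items > avg) {
            /* ... send (my_items - avg) items to partner here ... */
            my_items = avg;
        } else {
            /* ... receive items from partner here ... */
            my_items = my_items + their_items - avg;  /* partner keeps avg */
        }
    }
    printf("rank %d ends with %ld items\n", rank, my_items);
    MPI_Finalize();
    return 0;
}
```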

Slide 24: Dynamic load balancing scenarios
Examples representing typical classes of situations:
– Particles distributed over the simulation space (dynamic because the particles move). Cases:
  – Highly non-uniform distribution (cosmology)
  – Relatively uniform distribution
– Structured grids, with dynamic refinement/coarsening
– Unstructured grids with dynamic refinement/coarsening

Slide 25: Example case: particles
Orthogonal Recursive Bisection (ORB)
– At each stage, divide the particles equally
– The processor count doesn't need to be a power of 2: divide in proportion, e.g. 2:3 with 5 processors
– How to choose the dimension along which to cut? Choose the longest one
– How to draw the line? If all the data is on one processor, sort along each dimension; otherwise, run a distributed histogramming algorithm to find the line, recursively
– Find the entire tree first, and then do all the data movement at once (or in two or three steps); there is no reason to redistribute particles after drawing each line
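
A minimal sequential sketch of ORB under these rules: it recursively picks the longest dimension, sorts along it, and splits the particle array in proportion to the processor counts on each side. The Particle struct and the random test data are illustrative; the distributed histogramming variant and the actual data movement are omitted.

```c
/* Minimal sequential sketch of ORB: recursively split a particle range
 * along its longest dimension, in proportion to the processor counts. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x[3]; int proc; } Particle;

static int cut_dim;   /* dimension used by the comparator below */
static int cmp_dim(const void *a, const void *b) {
    double d = ((const Particle *)a)->x[cut_dim] - ((const Particle *)b)->x[cut_dim];
    return (d > 0) - (d < 0);
}

/* Assign particles p[0..n) to processors [p0, p0 + nprocs). */
static void orb(Particle *p, int n, int p0, int nprocs) {
    if (nprocs == 1) {
        for (int i = 0; i < n; i++) p[i].proc = p0;
        return;
    }
    /* Find the longest dimension of this group's bounding box. */
    double lo[3] = {1e300, 1e300, 1e300}, hi[3] = {-1e300, -1e300, -1e300};
    for (int i = 0; i < n; i++)
        for (int d = 0; d < 3; d++) {
            if (p[i].x[d] < lo[d]) lo[d] = p[i].x[d];
            if (p[i].x[d] > hi[d]) hi[d] = p[i].x[d];
        }
    cut_dim = 0;
    for (int d = 1; d < 3; d++)
        if (hi[d] - lo[d] > hi[cut_dim] - lo[cut_dim]) cut_dim = d;

    /* Split processors (not necessarily a power of 2) and particles in the
     * same proportion, e.g. 2:3 for 5 processors. */
    int left_procs = nprocs / 2;
    int left_n = (int)((long long)n * left_procs / nprocs);

    qsort(p, n, sizeof(Particle), cmp_dim);
    orb(p, left_n, p0, left_procs);
    orb(p + left_n, n - left_n, p0 + left_procs, nprocs - left_procs);
}

int main(void) {
    enum { N = 1000 };
    Particle *p = malloc(N * sizeof(Particle));
    for (int i = 0; i < N; i++)
        for (int d = 0; d < 3; d++) p[i].x[d] = rand() / (double)RAND_MAX;
    orb(p, N, 0, 5);                     /* 5 processors: a 2:3 top-level split */
    printf("particle 0 -> processor %d\n", p[0].proc);
    free(p);
    return 0;
}
```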

Slide 26: Particles: oct/quad trees
In ORB, each chunk is a brick with a non-square aspect ratio
– Oct trees (quad trees in 2D) lead to cubic boxes
How to distribute particle data into oct trees?
– Assume the data is distributed (randomly)
– Build a small top-level tree across processors, 2 or 3 levels deep
– Send particles to their box
– Let each box create children if it holds more than a threshold number of particles, and send its particles to them; continue recursively
– Note that the tree is non-uniform (unlike ORB)
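
A small sequential sketch of the split-on-threshold rule, done in 2D (a quad tree) for brevity; the THRESHOLD value, the Box layout, and the in-memory pushes standing in for messages to remote boxes are all assumptions for illustration.

```c
/* Sequential sketch of the split-on-threshold rule for a quad tree (2D for
 * brevity). A box that exceeds THRESHOLD particles creates four children
 * and pushes its particles down; in the distributed scheme those pushes
 * would be messages to boxes on other processors. */
#include <stdio.h>
#include <stdlib.h>

#define THRESHOLD 8

typedef struct { double x, y; } Particle;

typedef struct Box {
    double cx, cy, half;          /* square box: center and half-width */
    Particle *parts; int n, cap;  /* particles held while this box is a leaf */
    struct Box *child[4];         /* NULL while this box is a leaf */
} Box;

static Box *new_box(double cx, double cy, double half) {
    Box *b = calloc(1, sizeof(Box));
    b->cx = cx; b->cy = cy; b->half = half;
    return b;
}

static void insert(Box *b, Particle p) {
    if (b->child[0]) {                           /* interior box: route down */
        int q = (p.x > b->cx) + 2 * (p.y > b->cy);
        insert(b->child[q], p);
        return;
    }
    if (b->n == b->cap) {                        /* grow the leaf's array */
        b->cap = b->cap ? 2 * b->cap : 16;
        b->parts = realloc(b->parts, b->cap * sizeof(Particle));
    }
    b->parts[b->n++] = p;
    if (b->n > THRESHOLD) {                      /* too full: split the box */
        double h = b->half / 2;
        for (int q = 0; q < 4; q++)
            b->child[q] = new_box(b->cx + (q & 1 ? h : -h),
                                  b->cy + (q & 2 ? h : -h), h);
        for (int i = 0; i < b->n; i++) insert(b, b->parts[i]);
        free(b->parts); b->parts = NULL; b->n = b->cap = 0;
    }
}

int main(void) {
    Box *root = new_box(0.5, 0.5, 0.5);          /* unit square */
    for (int i = 0; i < 200; i++)
        insert(root, (Particle){ rand() / (double)RAND_MAX,
                                 rand() / (double)RAND_MAX });
    printf("root split: %s\n", root->child[0] ? "yes" : "no");
    return 0;
}
```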

Slide 27: Particles: space-filling curves
Sort all particles using a key that mixes the x, y, and z coordinates
– So particles with similar values in the most significant bits of their x, y, z coordinates are clustered together
Snip this linearized list into equal-size chunks
This is almost like an oct-tree,
– Except that nearby boxes have been collected together, for load balance
– Particles whose first 3k key bits are identical belong to the same oct-tree node at the k'th level
But:
– Sorting is relatively expensive to do every time
– Partitions don't have a regular shape
– Because the space-filling curve jumps around, there is no real guarantee of communication minimization
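
A small sketch of such a bit-mixing key (a Morton / Z-order key); 10 bits per coordinate and the chunking-by-index comment are illustrative choices, not the application's actual format.

```c
/* Sketch of a Morton (Z-order) key: interleave the bits of quantized
 * x, y, z so that particles sharing their leading 3k key bits fall in the
 * same oct-tree node at level k. */
#include <stdio.h>
#include <stdint.h>

#define BITS 10                       /* bits per coordinate: 30-bit key */

/* Quantize a coordinate in [0,1) to BITS bits. */
static uint32_t quantize(double c) {
    uint32_t q = (uint32_t)(c * (1u << BITS));
    return q >= (1u << BITS) ? (1u << BITS) - 1 : q;
}

static uint64_t morton_key(double x, double y, double z) {
    uint32_t qx = quantize(x), qy = quantize(y), qz = quantize(z);
    uint64_t key = 0;
    for (int b = BITS - 1; b >= 0; b--) {         /* most significant first */
        key = (key << 3)
            | ((uint64_t)((qx >> b) & 1) << 2)
            | ((uint64_t)((qy >> b) & 1) << 1)
            |  (uint64_t)((qz >> b) & 1);
    }
    return key;
}

int main(void) {
    /* Two nearby particles get keys that share many leading bits. */
    printf("%llx\n", (unsigned long long)morton_key(0.25, 0.50, 0.75));
    printf("%llx\n", (unsigned long long)morton_key(0.26, 0.51, 0.74));
    /* After sorting N particles by key, chunk i of P simply takes the
     * index range [i*N/P, (i+1)*N/P). */
    return 0;
}
```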

Slide 28: Particles: virtualization
You can apply virtualization to all the above methods:
– It becomes a two-level strategy
– Particles are grouped into a large number of boxes (many more than P): cubes (oct-tree) or bricks (ORB)
– The "system" maps these boxes to processors
Advantages:
– You can use a higher tolerance for imbalance (for both oct-tree and ORB) during tree formation
– Particles can migrate among the existing boxes, and load balancing can be done by just moving boxes across processors, with lower load balancing overhead
– Less frequently, you can re-form the tree if needed; you can also locally split and coarsen it

Slide 29: Structured and unstructured grids/meshes
Similar considerations apply here
– Libraries like METIS partition unstructured meshes
– ORB and space-filling curves are options for structured grids
Virtualization:
– Again, virtualization helps by reducing the cost of load balancing: use any scheme to partition the data into a large number of chunks, then use a dynamic load balancer to map chunks to processors
– The balancer can also decide whether communication costs are significant, and tune itself better to the communication patterns

Slide 30: Dynamic load balancing using objects
Object-based decomposition (i.e. virtualized decomposition) helps
– It allows the RTS to remap objects to balance load
– But how does the RTS decide where to map objects?
– Just move objects away from overloaded processors to underloaded processors
Just??

Slide 31: Measurement-based load balancing
Principle of persistence
– Object communication patterns and computational loads tend to persist over time
– Even in the presence of dynamic behavior: abrupt but infrequent changes, or slow and small changes
Runtime instrumentation
– Measures communication volume and computation time
Measurement-based load balancers
– Use the instrumented database periodically to make new decisions
– Many alternative strategies can use the database

Slide 32: Periodic load balancing strategies
Stop the computation?
Centralized strategies:
– The Charm RTS collects data (on one processor) about the computational load and the communication between each pair of objects
– If you are not using AMPI/Charm, you can do the same instrumentation and data collection yourself
– Partition the graph of objects across processors, taking communication into account
  – Point-to-point, as well as multicast over a subset
  – As you map an object, add to the load on both the sending and the receiving processor
  – The communication shown in red in the figure is free if it is a multicast

Slide 33: Object partitioning strategies
You can use graph partitioners like METIS or K-R
– BUT: these graphs are smaller, and the optimization criteria are different
Greedy strategies
– If communication costs are low: use a simple greedy strategy
  – Sort objects by decreasing load
  – Maintain processors in a heap (keyed by assigned load)
  – In each step, assign the heaviest remaining object to the least loaded processor
– With small-to-moderate communication cost: the same strategy, but add communication costs as you add an object to a processor
– Always add a refinement step at the end: swap work from the most heavily loaded processor to "some other processor"; repeat a few times or until there is no improvement
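
A minimal sketch of the pure greedy mapping described above (no communication terms, no refinement pass); the example object loads and processor count are made up.

```c
/* Sketch of the simple greedy mapping: sort objects by decreasing load and
 * repeatedly give the heaviest remaining object to the least loaded
 * processor, kept in a min-heap keyed by assigned load. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int proc; double load; } Slot;

static void sift_down(Slot *h, int n, int i) {
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].load < h[m].load) m = l;
        if (r < n && h[r].load < h[m].load) m = r;
        if (m == i) return;
        Slot t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

static int by_load_desc(const void *a, const void *b) {
    double d = *(const double *)b - *(const double *)a;
    return (d > 0) - (d < 0);
}

int main(void) {
    double obj[] = { 9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.0, 0.5 };
    int nobj = sizeof obj / sizeof obj[0], nprocs = 3;

    qsort(obj, nobj, sizeof(double), by_load_desc);   /* heaviest first */

    Slot *heap = malloc(nprocs * sizeof(Slot));       /* all start empty */
    for (int p = 0; p < nprocs; p++) heap[p] = (Slot){ p, 0.0 };

    for (int i = 0; i < nobj; i++) {
        /* heap[0] is always the least loaded processor */
        printf("object with load %.1f -> processor %d\n", obj[i], heap[0].proc);
        heap[0].load += obj[i];
        sift_down(heap, nprocs, 0);                   /* restore heap order */
    }
    free(heap);
    return 0;
}
```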

Slide 34: Object partitioning strategies (contd.)
When communication cost is significant:
– Still use the greedy strategy, but at each assignment step, choose between assigning object O to the least loaded processor and to the processor that already holds the objects that communicate most with O
  – Based on the degree of difference between the two metrics
  – Two-stage assignment: in the early stages, consider communication costs as long as the processors are in the same (broad) load "class"; in later stages, decide based on load
Branch-and-bound
– Searches for the optimal mapping, but can be stopped after a fixed time

Slide 35: Crack propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle.
As the computation progresses, the crack propagates and new elements are added, leading to more complex computations in some chunks.

Slide 36: Load balancer in action
[Plot: automatic load balancing in crack propagation. Annotations mark where (1) elements are added, (2) the load balancer is invoked, and (3) chunks are migrated.]

Slide 37: Distributed load balancing
Centralized strategies
– Still OK at 3000 processors for NAMD
Distributed balancing is needed when:
– The number of processors is large, and/or
– Load variation is rapid
Large machines:
– Need to handle locality of communication: topology-sensitive placement
– Need to work with scant global information: approximate or aggregated global information (average/max load), incomplete global information (only the "neighborhood"), work diffusion strategies (1980s work by the author and others!)
– Achieving global effects by local action...

Slide 38: Building on object-based load balancing
Application-induced load imbalances
Environment-induced performance issues:
– Dealing with extraneous loads on shared machines
– Vacating workstations
– Heterogeneous clusters
– Shrinking and expanding the set of processors allocated to a job!
Automatic checkpointing
– Restart on a different number of processors
Pre-fetch capability
– Out-of-core execution
– Optimizing cache performance

Slide 39: Electronic structures using CP
Car-Parrinello method, based on pinyMD
– Glenn Martyna, Mark Tuckerman
Data structures:
– A bunch of states (say 128), represented as 3D arrays of coefficients in g-space, and also as 3D arrays in real space
– The real-space probability density
– The S-matrix: one number for each pair of states, used for orthonormalization
– The nuclei
Computationally:
– Transformation from g-space to real space, using multiple parallel 3D FFTs
– Sums up the real-space densities
– Computes energies from the density
– Computes forces
– Normalizes the g-space wave functions

Slide 40: One iteration [flow diagram]

Slide 41: [diagram, no caption]

Slide 42: Points of interest
Parallelizing the 3D FFT
– Optimization of computation
– Optimization of communication
Normalization
– Involves all-to-all communication

Slide 43: 3D FFT [diagram]

Slide 44: Optimizing the FFTs
128 parallel FFTs
– Need to optimize: the 3D space is sparsely populated
– Reduce the number of FFTs
– Reduce the amount of data transported
  – Use run-length encoding
  – Use a communication library
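
A tiny illustrative sketch of run-length encoding one sparsely populated line of coefficients; the (start, length) run format and the sample data are assumptions, not the application's actual encoding.

```c
/* Tiny sketch of run-length encoding a sparsely populated line of
 * coefficients: only (start, length) runs of nonzeros plus their values
 * need to be sent, so the message carries far less data than the full line. */
#include <stdio.h>

typedef struct { int start, length; } Run;

/* Encode nonzero runs of line[0..n); returns the number of runs found. */
static int rle_encode(const double *line, int n, Run *runs, int max_runs) {
    int nruns = 0;
    for (int i = 0; i < n; ) {
        if (line[i] == 0.0) { i++; continue; }
        int start = i;
        while (i < n && line[i] != 0.0) i++;
        if (nruns < max_runs) runs[nruns++] = (Run){ start, i - start };
    }
    return nruns;
}

int main(void) {
    double line[16] = { 0, 0, 1.5, 2.5, 0, 0, 0, 3.0, 3.1, 3.2, 0, 0, 0, 0, 0, 0 };
    Run runs[8];
    int nruns = rle_encode(line, 16, runs, 8);
    for (int r = 0; r < nruns; r++)
        printf("run at %d, length %d\n", runs[r].start, runs[r].length);
    /* Only the runs and their 5 nonzero values would be sent, not all 16. */
    return 0;
}
```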

Slide 45: Optimizations
But is this the biggest problem we have?

