Dynamic Load Balancing for Tree-Structured Computations
When to send work away
Consider a processor (the "master") with k units of work, and P other processors.
–Assume a message takes 100 microseconds to reach another processor: 20 microseconds of send-processor overhead, 60 of network latency, and 20 of receive-processor overhead.
–If each task takes t units of time to complete, under what conditions should the master send tasks out to others (vs. doing them itself)?
–E.g., what if t = 100 microseconds? 50? 1000?
Key observation: the master spends only 40 microseconds of its own time coordinating a task, even though the round-trip latency is 200 microseconds.
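The break-even reasoning above can be sketched in a few lines. This is a minimal model, assuming the master pays only its own 40 microseconds of send/receive overhead per offloaded task (the 200-microsecond round-trip latency overlaps with other work); the function name is illustrative.

```python
# All times in microseconds, taken from the numbers on this slide.
SEND_OVERHEAD = 20   # master-side cost to send a task
RECV_OVERHEAD = 20   # master-side cost to receive the result
MASTER_COST = SEND_OVERHEAD + RECV_OVERHEAD   # 40 us of coordination per task

def worth_offloading(t):
    """Offloading pays if the master's 40 us of coordination costs less
    than the t microseconds it would spend doing the task itself."""
    return MASTER_COST < t

for t in (50, 100, 1000):
    print(t, worth_offloading(t))   # all three cases on the slide pay off
```

By this model even t = 50 is worth sending out, since 40 < 50; the latency matters only if the master would otherwise sit idle waiting for the result.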
Tree-structured computations
Examples:
–Divide-and-conquer
–State-space search
–Game-tree search
–Bidirectional search
–Branch-and-bound
Issues:
–Grainsize control
–Dynamic load balancing
–Prioritization
Divide and Conquer
The simplest situation among the above.
–Given a problem, a recursive algorithm divides it into one, two, or more subproblems, and the solutions to the subproblems are composed to create a solution to the original problem.
–Example: adaptive quadrature.
–Consider a simpler setting: Fib(n) = Fib(n-1) + Fib(n-2). Note: the Fibonacci algorithm itself is not important here.
–Issues: subtrees are of unequal size, so work can't be assigned a priori. Firing every recursive call as a parallel task is too fine-grained.
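The grainsize tension on this slide can be sketched with the Fib example: below a cutoff, compute sequentially instead of creating parallel tasks. The cutoff value and function names here are illustrative, not from the lecture.

```python
CUTOFF = 20   # illustrative threshold; below this, tasks are too fine-grained

def fib(n):
    """Plain sequential Fibonacci (the algorithm itself is unimportant;
    it just generates an irregular recursion tree)."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def parallel_fib(n):
    if n < CUTOFF:
        return fib(n)   # do small subtrees locally, no task overhead
    # In a real runtime these two calls would be fired as parallel tasks
    # of unpredictable and unequal size.
    return parallel_fib(n - 1) + parallel_fib(n - 2)

print(parallel_fib(25))   # 75025
```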
Dynamic load balancing formulation
Each PE is creating work randomly. How to redistribute work?
–Initial allocation
–Rebalancing
–Centralized vs. distributed
Reading Assignment
Adaptive grainsize control:
–http://charm.cs.uiuc.edu, go to publications, 95-05
Prioritization and first-solution search:
–http://charm.cs.uiuc.edu, go to publications, 93-06
Dynamic load balancing for tree-structured computations:
–Vipin Kumar's papers (link to be added shortly)
–http://charm.cs.uiuc.edu, go to publications, 93-13
A few more papers will be posted soon.
Adaptive grainsize control
Strategy 1: cut-off depth
–Requires an estimate of the size of the subtree.
Strategy 2: stack splitting
–Each PE maintains a stack of tree nodes.
–If my stack is empty, "steal" half the stack of some other PE.
–Which part of the stack? Top? Bottom?
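Strategy 2 can be sketched as follows; the names are illustrative. Stealing from the bottom of the stack takes the oldest nodes, which tend to sit higher in the tree and so represent larger chunks of work, reducing how often the thief must steal again.

```python
from collections import deque

def steal_half(victim):
    """Remove the bottom half of the victim's stack and return it.
    The victim pushes/pops at the right end; the thief takes from the left."""
    n = len(victim) // 2
    return deque(victim.popleft() for _ in range(n))

victim = deque(range(8))     # 8 pending tree nodes, oldest first
loot = steal_half(victim)
print(list(loot), list(victim))   # [0, 1, 2, 3] [4, 5, 6, 7]
```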
Adaptive grainsize control
Strategy 3:
–Objects (tree nodes) decide whether to make their children available to other processors by calling a function in the runtime.
–The runtime monitors the size of its queue (stack), and possibly the sizes of other processors' queues.
Adaptive grainsize control
Strategy 3 (continued): objects decide how big they want to grow.
–Monitor execution time (number of tree nodes evaluated).
–If the number is above a threshold, fire some of my nodes as independent objects to be mapped elsewhere.
–Problem: you sometimes get a "mother" object that just keeps firing lots of smaller objects.
–Solution: once above the threshold, split the rest of the work into two objects and fire them both off.
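The fix for the "mother" object problem can be sketched like this; the threshold value, function name, and frontier representation are all illustrative assumptions.

```python
THRESHOLD = 1000   # illustrative cap on nodes an object evaluates itself

def maybe_split(nodes_evaluated, frontier):
    """Return None to keep working locally, or two halves of the remaining
    frontier to fire off as independent objects. Splitting in two (rather
    than shedding one node at a time) prevents a long-lived object from
    endlessly spawning tiny children."""
    if nodes_evaluated <= THRESHOLD:
        return None
    mid = len(frontier) // 2
    return frontier[:mid], frontier[mid:]

print(maybe_split(500, [1, 2, 3, 4]))    # None: below threshold
print(maybe_split(2000, [1, 2, 3, 4]))   # ([1, 2], [3, 4])
```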
Dynamic load balancing
Centralized:
–Maintain the top levels of the tree on one processor.
–Serve requests for work on demand.
Variation: hierarchical.
Fully distributed strategies
–Keep track of neighbors
–Diffusion/gradient model
–Neighborhood averaging
What topology to use?
–The machine's own topology
–Hypercube
–Denser?
Gradient model
A misnomer: the name is too broad for what the strategy actually does.
Actual strategy:
–Processors are arranged in a topology (possibly virtual, though the original purpose was to use the real one).
–Each processor tries to maintain an estimate of how far it is from an idle processor.
–Idle processors have a distance of 0.
–Other processors periodically send their distances to their neighbors: my distance = 1 + min(neighbors' distances).
–If my distance is greater than a neighbor's, send some work to it.
Work thus "flows" toward idle processors.
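The distance update can be sketched on an illustrative ring topology (the topology choice and function name are assumptions, not from the slides). After enough exchange rounds, each processor's distance converges to the hop count to the nearest idle processor.

```python
def update_distances(loads, rounds):
    """loads[i] == 0 means processor i is idle; ring topology.
    Repeatedly apply: my distance = 1 + min(neighbors' distances)."""
    n = len(loads)
    INF = n   # cap: anything >= n means "no idle processor known yet"
    dist = [0 if load == 0 else INF for load in loads]
    for _ in range(rounds):
        dist = [0 if loads[i] == 0
                else min(INF, 1 + min(dist[(i - 1) % n], dist[(i + 1) % n]))
                for i in range(n)]
    return dist

# Processor 3 is idle; distances converge to hops-to-idle on the ring.
print(update_distances([5, 2, 7, 0, 4, 6], rounds=6))   # [3, 2, 1, 0, 1, 2]
```

A processor holding more work than a lower-distance neighbor would then push some of it toward that neighbor, and the work drains to distance 0.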
Neighborhood averaging
Assume a virtual topology; periodically send my own load (queue size) to my neighbors.
Each processor:
–Calculates the average load of its neighborhood.
–If I am above the average, sends pieces of work to underloaded neighbors so as to equalize the loads.
Estimating work:
–Assume the same cost for each unit.
–Use a better estimate if one is known.
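A minimal sketch of the averaging step, again on an assumed ring topology with the simple "every unit costs the same" estimate; the function name is illustrative.

```python
def surplus(loads, i):
    """Units of work processor i should shed: how far its queue size is
    above the average over itself and its two ring neighbors (0 if it is
    at or below that average)."""
    n = len(loads)
    nbhd = [loads[(i - 1) % n], loads[i], loads[(i + 1) % n]]
    avg = sum(nbhd) / len(nbhd)
    return max(0, loads[i] - avg)

loads = [9, 3, 3, 3]
print(surplus(loads, 0))   # processor 0 is 4 units above its neighborhood average
print(surplus(loads, 1))   # processor 1 is below average: sheds nothing
```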
Randomized strategies
Random initial assignment:
–As work is created, assign it to a randomly chosen PE.
–Problems: there is no way to correct errors, and every piece of work crosses processors, adding communication overhead.
Random demand:
–If I am idle, ask a randomly selected processor for work.
–If I get such a request, send half of my nodes to the requestor.
–Good theoretical properties, but somewhat high overhead in practice.
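The random-demand response ("send half of my nodes") can be sketched as below. For determinism the victim is passed in explicitly; in the actual strategy it would be chosen uniformly at random. Names are illustrative.

```python
def steal_from(queues, idle, victim):
    """Move the first half of the victim's nodes to the idle processor.
    In the random-demand strategy, `victim` would be picked at random
    by the idle processor."""
    half = len(queues[victim]) // 2
    queues[idle].extend(queues[victim][:half])
    del queues[victim][:half]

queues = [[], [1, 2, 3, 4]]   # processor 0 is idle, processor 1 has work
steal_from(queues, idle=0, victim=1)
print(queues)   # [[1, 2], [3, 4]]
```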
Using a global average
Carry out periodic global averaging to determine the average load across all processors.
If I am above the average:
–Send work "away."
–Alternatively, obtain a vector of overloads via the global averaging, and figure out whom to send what work.
Scalability
The program should scale up to use a large number of processors.
–But what does that mean? An individual simulation isn't truly scalable.
A better definition of scalability:
–If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size.
Isoefficiency
A way to quantify scalability: how much must the problem size increase to retain the same efficiency on a larger machine?
Efficiency = sequential time / (P · parallel time)
–Parallel time = computation + communication + idle time
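The definition above is just a ratio; a tiny sketch with made-up (illustrative) timings shows why isoefficiency matters: holding the problem fixed while doubling P typically lowers efficiency, and the isoefficiency function asks how much the problem must grow to restore it.

```python
def efficiency(t_seq, p, t_par):
    """E = T_seq / (P * T_par), where T_par includes computation,
    communication, and idle time."""
    return t_seq / (p * t_par)

# Illustrative numbers, not from the lecture:
print(efficiency(100.0, 4, 30.0))   # ~0.83 on 4 processors
print(efficiency(100.0, 8, 17.0))   # lower on 8: overheads grow relative to work
```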