Load Balancing How? –Partition the computation into units of work (tasks or jobs) –Assign tasks to different processors Load Balancing Categories –Static.


Load Balancing
How?
– Partition the computation into units of work (tasks or jobs)
– Assign tasks to different processors
Load Balancing Categories
– Static (load assigned before the application runs)
– Dynamic (load assigned as the application runs)
 o Centralized (tasks assigned by the master or root process)
 o Decentralized (tasks reassigned among slaves)
– Semi-dynamic (application periodically suspended and load rebalanced)
Load Balancing Algorithms are:
– Adaptive if they adapt to different system load levels
 o Thresholds control how they adapt
– Stable if load balancing traffic is independent of load levels
– Symmetric if both senders and receivers initiate action
– Effective if load balancing overhead is minimal
Definition: A load is balanced if no processes are idle

Improving the Load Balance By realigning processing work, we improve speed-up

Static Load Balancing
Done prior to executing the parallel application
Round Robin
– Tasks are given to processes in sequential order
– If there are more tasks than processors, the allocation wraps around to the first processor
Randomized
– Tasks are assigned randomly to processors
Partitioning (tasks represented by a graph)
– Recursive Bisection
– Simulated Annealing
– Genetic Algorithms
– Multi-level Contraction and Refinement
Advantages
– Simple to implement
– Minimal run time overhead
Disadvantages
– Execution times are often not knowable before execution
– The effect of communication dynamics is often not considered
– The number of iterations is often indeterminate
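The round robin and randomized schemes above can be sketched in a few lines of Python. This is only an illustration: the function names `round_robin` and `randomized`, and the use of integer indices for tasks and processors, are assumptions and do not come from the slides.

```python
import random


def round_robin(num_tasks, num_procs):
    """Assign task t to processor t mod num_procs (wraps around to the first)."""
    return [t % num_procs for t in range(num_tasks)]


def randomized(num_tasks, num_procs, seed=0):
    """Assign each task to a processor chosen uniformly at random.

    The seed parameter is an assumption, added so that runs are repeatable.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_procs) for _ in range(num_tasks)]


if __name__ == "__main__":
    print(round_robin(7, 3))
    print(randomized(7, 3))
```

Both mappings are fixed before the application starts, which is why neither can react to tasks that turn out to run longer than expected.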

Dynamic Load Balancing
Done as a parallel application executes
Centralized
– A single process hands out tasks
– Processes ask for more work when their processing completes
– Double buffering can be effective
Decentralized
– Processes detect that their work load is low
– Processes can sense an overload condition
 o This occurs when new tasks are spawned during execution
– Questions
 o Which neighbors are part of the rebalancing?
 o How should thresholds be set?
 o What communications are needed to balance?
 o How often should balancing occur?

Centralized Load Balancing
Work Pool, Processor Farm, or Replicated Worker Algorithm
Master Processor
  WHILE ((task = Remove()) != null)
    Receive(p_i, request_msg)
    Send(p_i, task)
  WHILE (more processes)
    Receive(p_i, request_msg)
    Send(p_i, termination_msg)
Slave Processor
  Send(p_master, request_msg)
  task = Receive(p_master, message)
  WHILE (task != terminate)
    Process task
    Send(p_master, request_msg)
    task = Receive(p_master, message)
Note: In this case, the slaves don't spawn new tasks
[Figure: a master process connected to its slaves]
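A minimal runnable sketch of this work pool follows, using Python threads and queues in place of message passing. The names `run_pool`, `master`, `worker`, and the `TERMINATE` sentinel are hypothetical, and squaring an integer stands in for "Process task". Each worker issues an initial request before waiting for its first task, which is an assumption made so that the master's first Receive has something to match.

```python
import queue
import threading

TERMINATE = None  # hypothetical sentinel standing in for termination_msg


def master(tasks, requests, replies, num_workers):
    pending = list(tasks)
    # Answer each request with a task until the pool is empty.
    while pending:
        worker_id = requests.get()
        replies[worker_id].put(pending.pop(0))
    # Answer one last request from every worker with the termination message.
    for _ in range(num_workers):
        worker_id = requests.get()
        replies[worker_id].put(TERMINATE)


def worker(worker_id, requests, replies, results):
    requests.put(worker_id)              # initial request (assumed)
    task = replies[worker_id].get()
    while task is not TERMINATE:
        results.put(task * task)         # stand-in for "Process task"
        requests.put(worker_id)          # ask for more work
        task = replies[worker_id].get()


def run_pool(tasks, num_workers=3):
    tasks = list(tasks)
    requests = queue.Queue()
    replies = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, requests, replies, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    master(tasks, requests, replies, num_workers)
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(len(tasks)))


if __name__ == "__main__":
    print(run_pool(range(6)))
```

Because a worker only asks for more work after finishing its current task, faster workers naturally receive more tasks, which is the load balancing effect of the scheme.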

Centralized Termination
How do we terminate when slave processes spawn new tasks?
Necessary Requirements
– The task queue is empty
– Every process has requested another task
Master Processor
  WHILE (true)
    Receive(p_i, msg)
    IF msg contains a new task
      Add the new task to the task queue
    ELSE
      Add p_i to the wait queue and waitCount++
    IF waitCount > 0 and task queue not empty
      Remove p_i and task from the wait and task queues, respectively
      Send(task, p_i) and waitCount--
    IF waitCount == P THEN send termination messages and exit
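The master loop above can be exercised with a small sequential simulation. This is a sketch under several assumptions: the function `run_master` and the `spawn` callback are hypothetical, workers are simulated inline rather than run in parallel, the master starts with the initial tasks already in its task queue, and messages arrive in FIFO order so that a worker's spawned tasks reach the master before its next request.

```python
from collections import deque


def run_master(initial_tasks, num_workers, spawn):
    """Simulate the master; return the (worker, task) pairs it dispatched."""
    task_queue = deque(initial_tasks)
    wait_queue = deque()
    # Every worker starts by requesting work.
    inbox = deque(("request", w) for w in range(num_workers))
    dispatched = []
    while True:
        kind, payload = inbox.popleft()
        if kind == "task":
            task_queue.append(payload)
        else:
            wait_queue.append(payload)
        # Pair waiting workers with queued tasks.
        while wait_queue and task_queue:
            w = wait_queue.popleft()
            task = task_queue.popleft()
            dispatched.append((w, task))
            # Simulated worker: report spawned tasks, then ask for more work.
            inbox.extend(("task", child) for child in spawn(task))
            inbox.append(("request", w))
        # Every worker is waiting, so the task queue must be empty: terminate.
        if len(wait_queue) == num_workers:
            return dispatched


if __name__ == "__main__":
    # Hypothetical workload: a task n > 0 spawns two tasks n - 1.
    done = run_master([3], 2, lambda n: [n - 1, n - 1] if n > 0 else [])
    print(len(done))
```

The termination test only fires once all P workers are in the wait queue, which matches the two necessary requirements listed on the slide.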

Decentralized Load Balancing There is no Master Processor Each Processor maintains a work queue Processors interact with neighbors to request and distribute tasks (Worker processes interact among themselves)

Decentralized Mechanisms
Balancing is among a subset of the total running processes
Receiver Initiated
– A process requests tasks when it is about to go idle
– Effective when the load is heavy
– Unstable when the load is light (a request frequency threshold is necessary)
Sender Initiated
– A process with a heavy load distributes the excess
– Effective when the load is light
– Can cause thrashing when loads are heavy (synchronizing system load with neighbors is necessary)
[Figure: application, balancing algorithm, task queue]

Process Selection
Global or Local?
– Global involves all of the processors of the network
 o May require expensive global synchronization
 o May be difficult if the load dynamic is rapidly changing
– Local involves only neighbor processes
 o Overall load may not be balanced
 o Easier to manage and less overhead than the global approach
Neighbor selection algorithms
– Random: randomly choose another process
 o Easy to implement and studies show reasonable results
– Round Robin: select among neighbors using modular arithmetic
 o Easy to implement. Results similar to random selection
– Adaptive Contracting: issue bids to neighbors; best bid wins
 o Handshake between neighbors needed
 o Possible to synchronize loads

Choosing Thresholds
How do we estimate system load?
– Synchronization averages the task queue lengths of processes
– Average number of tasks or projected execution time
When is the load low?
– When a process is about to go idle
– Goal: prevent idleness, not achieve perfect balance
– A low threshold constant is sufficient
When is the load high?
– When some processes have many tasks and others are idle
– Goal: prevent thrashing
– Synchronization among processors is necessary
– An exponentially growing threshold works well
What is the job request frequency?
– Goal: minimize load balancing overhead

Gradient Algorithm
Maintains a global pressure grid
Node Data Structures
– For each neighbor
 o Distance, in hops, to the nearest lightly-loaded process
– A load status flag indicating if the current processor is lightly-loaded or normal
Routing
– Spawned jobs go to the nearest lightly-loaded process
Local Synchronization
– Node status changes are multicast to its neighbors
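To make the hop distances concrete, the following sketch computes them for a whole network at once with a multi-source breadth-first search, then picks the neighbor to which a spawned job would be routed. It is a centralized illustration only: in the algorithm described above each node maintains its distances through local multicasts. The names `gradient_map` and `next_hop`, and the dictionary-based topology, are assumptions.

```python
from collections import deque


def gradient_map(neighbors, lightly_loaded):
    """Hop distance from every node to its nearest lightly-loaded node.

    neighbors: dict mapping node -> list of adjacent nodes.
    Nodes that cannot reach a lightly-loaded node keep distance None.
    """
    dist = {node: None for node in neighbors}
    frontier = deque()
    for node in lightly_loaded:
        dist[node] = 0
        frontier.append(node)
    while frontier:
        node = frontier.popleft()
        for nb in neighbors[node]:
            if dist[nb] is None:
                dist[nb] = dist[node] + 1
                frontier.append(nb)
    return dist


def next_hop(node, neighbors, dist):
    """Neighbor that lies closest to a lightly-loaded process, or None."""
    candidates = [nb for nb in neighbors[node] if dist[nb] is not None]
    return min(candidates, key=lambda nb: dist[nb]) if candidates else None


if __name__ == "__main__":
    # Hypothetical 5-node line network 0-1-2-3-4 with node 4 lightly loaded.
    line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    d = gradient_map(line, [4])
    print(d, next_hop(0, line, d))
```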

Symmetric Broadcast Networks (SBN)
Global Synchronization
Characteristics
– A unique SBN starts at each node
– Each SBN is lg P deep
– Simple operations algebraically compute successors
– Easily adapts to the hypercube
Algorithm
– Starts with a lightly loaded process
– Phase 1: SBN broadcast
– Phase 2: Gather task queue lengths
– Load is balanced during the broadcast and gather phases
Successors of process p at stage s
– Successor 1 = (p + 2^(s-1)) % P; 1 ≤ s ≤ 3
– Successor 2 = (p - 2^(s-1)); 1 ≤ s < 3
– Note: If Successor 2 < 0, Successor 2 += P
[Figure: SBN broadcast tree, stages 0 to 3]
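The two successor formulas can be evaluated directly. The sketch below is only an illustration of that arithmetic: the function name `sbn_successors` is hypothetical, and generalizing the stage bounds from 3 to lg P (with P a power of two) is an assumption based on the statement that each SBN is lg P deep.

```python
def sbn_successors(p, s, P=8):
    """Return (successor 1, successor 2) of process p at stage s.

    Successor 2 is None at the last stage, where only one successor exists.
    Assumes P is a power of two and 1 <= s <= lg P.
    """
    stages = P.bit_length() - 1          # lg P for a power of two
    if not 1 <= s <= stages:
        raise ValueError("stage out of range")
    step = 2 ** (s - 1)
    first = (p + step) % P
    second = None
    if s < stages:
        second = p - step
        if second < 0:
            second += P                  # wrap around, as in the note
    return first, second


if __name__ == "__main__":
    for stage in (1, 2, 3):
        print(stage, sbn_successors(0, stage))
```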

Line Balancing Algorithm
Uses a pipeline approach
– The master processor adds tasks to the pipeline
– Slave processors
 o Request and receive tasks if the queue is not full
 o Pass tasks on if a task request is posted
– Non-blocking receives are necessary to implement this algorithm
Steps at process p_i
– Request a task if the queue is not full
– Receive the task from the request
– Deliver a task to p_(i+1) when p_(i+1) requests a task
– Dequeue and process a task
Note: This algorithm easily extends to a tree topology

Semi-dynamic
Pseudo code
  Run algorithm
  Time to check balance?
    Suspend application
    IF load is balanced, resume application
    Re-partition the load
    Distribute data structures among processors
    Resume execution
Partitioning
– Model application execution by a partitioning graph
– Partitioning is an NP-Complete problem
– Goals: Balance processing and minimize communication
– Partitioning Heuristics
 o Recursive Bisection, Simulated Annealing, Multi-level, MinEx
– Data Redistribution
 o Goal: Minimize the data movement cost

Partitioning Graph
[Figure: partitioning graph divided between processors P1 and P2; vertices carry P and R values and edges carry communication costs c1 to c8]
P1 Load = ( ) + ( ) = 37
P2 Load = ( ) + ( ) = 40
Question: When can we move a task to improve load balance?

Distributed Termination
Insufficient condition for distributed termination
– Empty task queues at every process
Sufficient condition for distributed termination requires
– All local termination conditions satisfied
– No messages in transit that could restart an inactive process
Termination algorithms
– Acknowledgment
– Ring
– Tree
– Fixed energy distribution

Acknowledgement Termination
Definition: The parent is the process that sent the initial task to a process
A process receives a task
– Immediately acknowledge if the source is not the parent
– Acknowledge the parent as the process goes idle
A process goes idle after it
– Completes processing local tasks
– Sends all acknowledgments
– Receives all acknowledgments
Note
– A process always becomes inactive before its parent
– The application can terminate when the master goes idle
[Figure: P_i sends the first task to P_j, which changes from inactive to active; P_j later acknowledges the first task]

Single Pass Ring Termination
Pseudo code
  P_0 sends a token to P_1 when it goes idle
  P_i receives the token
    IF P_i is idle, it passes the token to P_(i+1)
    ELSE P_i sends the token to P_(i+1) when it goes idle
  P_0 receives the token
    Broadcast the final termination message
Assumptions
– Processes cannot reactivate after going idle
– Processes cannot pass new tasks to an idle process
[Figure: token passed around the ring P_0, P_1, P_2, ..., P_n]
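Under the two assumptions above, the time at which the token returns to P_0 depends only on when each process goes idle. The following sketch computes it; the function name `single_pass_ring`, the list of idle times, and the fixed per-hop delay are all assumptions made for illustration.

```python
def single_pass_ring(idle_times, hop_delay=0):
    """Return the time at which P_0 receives the token back.

    idle_times[i] is the time at which P_i goes idle (and stays idle).
    hop_delay is an assumed fixed cost for each token transfer.
    """
    # P_0 releases the token as soon as it goes idle.
    token_time = idle_times[0]
    for idle_at in idle_times[1:]:
        # The token arrives, then waits if this process is still busy.
        token_time = max(token_time + hop_delay, idle_at)
    # Final hop from the last process back to P_0.
    return token_time + hop_delay


if __name__ == "__main__":
    print(single_pass_ring([2, 5, 1, 4]))
    print(single_pass_ring([2, 5, 1, 4], hop_delay=1))
```

With no hop delay the result is simply the latest idle time, which shows why a single pass suffices only when idle processes can never be reactivated.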

Dual Pass Ring Termination
Handles a task sent to a process that has already passed the token on
Key Point: The token and the processes are each colored either White or Black
Pseudo code
  WHEN P_0 goes idle, it sends a White token to P_1
  WHEN P_i sends a task to P_j where j < i
    P_i becomes a Black process
  WHEN P_i (i > 0) has received the token and goes idle
    IF P_i is a Black process
      P_i colors the token Black and P_i becomes White
    ELSE
      the token color is unchanged
    P_i sends the token to P_((i+1)%n)
  WHEN P_0 receives the token and is idle
    IF the token is White, the application terminates
    ELSE P_0 sends a White token to P_1
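The coloring rules can be checked with a small simulation of token rounds. This sketch assumes every process is already idle and that no further tasks are sent while the token circulates; the names `token_round` and `detect_termination` are hypothetical.

```python
WHITE, BLACK = "white", "black"


def token_round(colors):
    """Pass a White token from P_0 around the ring once.

    colors[i] is the color of P_i and is updated in place.
    Returns the token color when it arrives back at P_0.
    """
    token = WHITE
    for i in range(1, len(colors)):
        if colors[i] == BLACK:
            token = BLACK       # P_i sent a task backwards since the last pass
            colors[i] = WHITE   # P_i becomes White again
    return token


def detect_termination(colors, max_rounds=10):
    """Return the number of rounds needed before a White token returns."""
    for rounds in range(1, max_rounds + 1):
        if token_round(colors) == WHITE:
            return rounds
    return None


if __name__ == "__main__":
    # P_2 sent a task to a lower-numbered process, so it starts Black.
    ring = [WHITE, WHITE, BLACK, WHITE]
    print(detect_termination(ring))
```

A Black process forces one extra pass, which is where the second pass in the name of the algorithm comes from.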

Tree Termination
– When a leaf process terminates, it sends a token to its parent process
– An internal node sends a token to its parent when all of its child processes have terminated
– When the root node receives the token, the application can terminate
– Either one-pass or two-pass algorithms can apply
[Figure: tree in which each internal node combines (AND) the tokens of its children; leaf nodes terminated]

Fixed Energy Termination
Energy is defined by an integer or long value
P_0 starts with full energy
– When P_i receives a task, it also receives an energy allocation
– When P_i spawns tasks, it assigns them to processors with additional energy allocations within its allocation
– When a process completes, it returns its energy allotment
– The application terminates when the master becomes idle
Implementation
– Problem: Integer division eventually becomes zero
– Solution:
 o Use a two-level energy allocation
 o The generation increases each time the energy value goes to zero
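The energy bookkeeping can be illustrated with exact rational arithmetic. Note that this sketch swaps in Python's `fractions.Fraction` to sidestep the integer division problem; it is not the two-level allocation scheme named on the slide. The `Process` class, the halving policy, and the choice to return energy directly to the master are assumptions.

```python
from fractions import Fraction


class Process:
    """Holds a share of the total energy (exact rational, assumed here)."""

    def __init__(self, energy=Fraction(0)):
        self.energy = energy

    def hand_out(self):
        """Give away half of the current energy along with a task."""
        share = self.energy / 2
        self.energy -= share
        return share

    def receive(self, amount):
        self.energy += amount

    def give_back(self, target):
        """Return the whole allotment when this process completes."""
        target.receive(self.energy)
        self.energy = Fraction(0)


master = Process(Fraction(1))
w1, w2 = Process(), Process()

w1.receive(master.hand_out())   # master sends a task to w1
w2.receive(w1.hand_out())       # w1 spawns a task for w2
w2.give_back(master)            # w2 completes
w1.give_back(master)            # w1 completes

terminated = master.energy == 1  # all energy is back at the master
print(master.energy, terminated)
```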

Example: Shortest Path Problem
Definitions
– Graph: Collection of nodes (vertices) and edges
– Directed Graph: Edges can be traversed in only one direction
– Weighted Graph: Edges have weights that define cost
– Shortest Path Problem: Find the path from one node to another in a weighted graph that has the smallest accumulated weight
Applications
1. Shortest distance between points on a map
2. Quickest travel route
3. Least expensive flight path
4. Network routing
5. Efficient manufacturing design

Climbing a Mountain
– Weights: expended effort
– Directed graph
 o Effort in one direction ≠ effort in the other direction
 o Ex: Downhill versus uphill
Adjacency List (vertex: successors with edge weights)
 A: B 10
 B: C 8, D 13, E 24, F 51
 C: D 14
 D: E 9
 E: F 17
 F: (none)
[Figure: graphic representation of the graph and its adjacency matrix over vertices A to F]

Moore's Algorithm
Less efficient than Dijkstra but more easily parallelized
Assume
– w[i][j] = weight of edge (i,j)
– dist[v] = distance to vertex v
– pred[v] = predecessor of vertex v
Pseudo code
  Insert the source vertex into a queue
  For each vertex v, dist[v] = ∞; dist[0] = 0
  WHILE (i = dequeue() exists)
    FOR (j = 0; j < n; j++)
      newdist = dist[i] + w[i][j]
      IF (newdist < dist[j])
        dist[j] = newdist
        pred[j] = i
        append(j)
Update rule for edge (i,j): d_j = min(d_j, d_i + w_i,j)
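A runnable version of the pseudo code is sketched below. The function name `moore` is hypothetical, missing edges are represented by infinity, and the check that avoids queueing a vertex twice is an addition to the pseudo code. The example weights are read from the mountain-climbing adjacency list (A to F mapped to indices 0 to 5) and should be treated as assumed values.

```python
import math
from collections import deque

INF = math.inf


def moore(w, source=0):
    """Single-source shortest paths by repeated edge relaxation.

    w[i][j] is the weight of edge (i, j), or INF when there is no edge.
    Returns (dist, pred).
    """
    n = len(w)
    dist = [INF] * n
    pred = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in range(n):
            newdist = dist[i] + w[i][j]
            if newdist < dist[j]:
                dist[j] = newdist
                pred[j] = i
                if j not in queue:   # added check: avoid duplicate entries
                    queue.append(j)
    return dist, pred


# Assumed example graph: vertices A..F are indices 0..5.
w = [[INF] * 6 for _ in range(6)]
w[0][1] = 10
w[1][2], w[1][3], w[1][4], w[1][5] = 8, 13, 24, 51
w[2][3] = 14
w[3][4] = 9
w[4][5] = 17

dist, pred = moore(w)
print(dist)
print(pred)
```

Each dequeued vertex can be relaxed independently of the others, which is what makes the algorithm easier to parallelize than Dijkstra's, at the cost of possibly examining a vertex more than once.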

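Graph Analysis Stages
Vertex Queue | Dist[j] for A B C D E F
A | 0 ∞ ∞ ∞ ∞ ∞
B | 0 10 ∞ ∞ ∞ ∞
[Figure: later stages of the table; the remaining vertex queue states shown are E, FEDC, DC, CE, and EDC]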

Centralized Work Pool Solution
The Master maintains
– The work pool queue of unchecked vertices
– The distance array
Every slave holds
– The graph weights, which are static
The Slaves
– Request a vertex
– Compute new minimums
– Send updated distance values and vertices to the master
The Master
– Appends received vertices to its work queue
– Sends a new vertex and the updated distance array

Distributed Work Pool Solution
Data held in each processor
– The graph weights
– The distances to vertices stored locally
– The processor assignments
When a process receives a distance:
– If its local value is reduced, it
 o Updates its local value of dist[v]
 o Sends distances to adjacent vertices to the appropriate processors
Notes
– Inefficient with one vertex per processor
 o Poor computation to communication ratio
 o Many processors can be inactive
– One of the termination algorithms is necessary