Download presentation
Presentation is loading. Please wait.
Published byVincent Higgins Modified over 9 years ago
1
1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et Technique CERIST Laboratoire d’Informatique Fondamentale de Lille A.Bendjoudi ahcene.bendjoudi@gmail.com N. Melab and E-G. Talbi {melab,talbi}@lifl.fr
2
2 Outline Motivations FTH-B&B: Fault Tolerant hierarchical Branch and Bound algorithm Performance Evaluation Conclusion and Perspectives
3
3 Motivations Combinatorial Optimization Problems are Np-Hard and CPU-Intensive Their exact resolution requires a huge amount of computing resources Branch and Bound B&B algorithms are the best known methods Implicit enumeration of the search space They use a pruning technique to reduce the search space however they remain not sufficient for very large instances High performance computing (Computational Grids) Computational Grids offer large amount of computing resources Issue: Scalability Traditional Master/Worker-based B&B (MW-B&B) algorithms are inefficient Master process becomes rapidly bottleneck Solution: Hierarchical Master/Worker-based B&B (HMW-B&B) 8
4
4 Branch and Bound Algorithm Branching Original problem is partitioned into smaller sub problems. Bounding Compute the estimated optimal solution (lower bound) of the considered problem and compare to the best known solution (upper bound). Elimination Identify nodes which don’t lead to The best solution and eliminate them Selection Exploration strategy (best first search, depth first search, best bound,…) Problem to be solved Sub-problem Solutions P0P0 8 Optimal solution Branching Elimination Selection
5
5 Motivations Resources of computational grids are highly unreliable and volatile Issue: Fault-tolerance Fault tolerance at middleware level: ProActive, XtremWeb, Condor, … They are costly in terms of execution time Solution: Application level fault-tolerant algorithm Fault detection Task recovery Minimization of Redundant work Maintenance of the hierarchy
6
6 Outline Motivations FTH-B&B: Fault Tolerant Hierarchical Branch and Bound algorithm Architecture Work management with task recovery Maintenance of the hierarchy Distributed checkpointing Performance Evaluation Conclusion et Perspectives
7
7 Architecture Based on HMW paradigm dealing with Scalability & Fault tolerance Several FT MW-B&Bs organized hierarchically into groups Each FT MW-B&B is composed of: Single Master (on the top and inner nodes of the hierarchy): parallel recursive branching Workers (Leaves): parallel exploration or Masters (inner nodes): parallel branching Each master owns a single work pool Multiple work pools Collegial strategy Workers Masters Fault Tolerant M/W-based B&Bs Multiple work pools Group of masters and workers
8
8 Fault detection A master heartbeats its workers Workers heartbeat their master Master Heartbeats Workers
9
9 Work management & task recovery Each master manages: A list of assigned sub-problems (mapping between the assigned sub-problem and the child process assigned to it) A list of branched sub-problems LBS (sub-problems being explored by its children) When a failure is detected only the unexplored sub-problems of the lost root problem in the LBS are rescheduled to another safe process R AB PAPA PBPB PCPC PDPD PAPA Work pool List of branched sub-problems LBS P A1 P A2
10
10 3-phase communication mechanism Phase 1 (between a master and its children): A master assigns a problem to its children It receives back the branched sub-problems Phase 2 (between a master, its children, and its parent): A master updates explored sub-problems in the LBS each time a child finishes the exploration of its sub-problem. the master knows at any time the unexplored parts of a given problem. Phase 3 (between a master and a new free process): When a process fails The parent of the failed process detects its failure and saves the unexplored part of its sub-problem When a new safe process connects, the parent reschedules it the unexplored part of the sub-problem R AB PAPA PBPB PCPC PDPD PAPA P A1 P A2 Connection of a new worker
11
11 Maintenance of the hierarchy Motivations Failures can cause a loss of data and/or computing power Master’s failures can isolate some parts of processes from the rest of the hierarchy and cause orphan branches Loss of processes can cause unsafe execution of the algorithm (non termination, redundant work,...) Failures can unbalance the hierarchy if only the parent of the failed processes is involved in the fault recovery process New bottlenecks can be created on parts of the hierarchy Proposed techniques Simple connection to Ascendants SCA Master Election ME Balanced Hierarchy BH
12
12 Simple connection to Ascendants (SCA) Children of the failed master connect to the closest safe ascendant Each process holds the list of its ascendants No process reassignment When a process receives a connection from an orphan process, it considers it as a new child Drawback: Risk of creation of new bottlenecks If f children fail, the safe master receives gxf connections from its grandchildren If l levels of masters fail, it receives g l+1 connections Convergence to MW-B&B
13
13 Master Election (ME) Orphan processes elect a new master among them using the bully algorithm The new selected master considers the other electors as its children It connects to its closest safe ascendant using SCA It is informed by its new parent about its neighbors
14
14 Balanced Hierarchy (BH) Orphan processes migrate to a safe sub-B&B The migration strategy respects the maximum group size threshold Assigns orphan processes to a non- full sub-B&B A master receiving orphan processes: Holds them only if the group it manages is non-full Otherwise, it dispatches them among its children Avoids the convergence to a MW
15
15 Distributed Checkpointing Several levels of checkpointing Each sub-MW-B&B performs distributed checkpointing independently There is no global backup file and each master handles its own local backup file A master saves unexplored sub-problems and updates its file each time a new sub-problem is explored. Clusters Backup Files Workers Masters A new considered master checks the buck-up file Reconstitutes the unexplored sub-problems Carries on the last consistent global state.
16
16 Performance Evaluation (Objectives) Evaluate the ability of FTH-B&B to deal with Fault Tolerance issue Measure the efficiency of FTH-B&B in terms of execution time Measure the benefit of task recovery on FTH-B&B in terms of minimizing the redundant work Measure the impact of failures on the hierarchy of FTH-B&B using (SCA, ME, and BH)
17
17 Performance Evaluation (Environment) FTH-B&B has been experimented on the Flowshop Scheduling Problem FSP (Taillard instances) Application developed on top of the ProActive middleware GRID’5000 : https://www.grid5000.frhttps://www.grid5000.fr Between 1900 and 8900 processes have been deployed To obtain deeper hierarchy, several processes have been deployed on a same host and the size of the sub-B&Bs is fixed to 10 Failures are obtained killing processes chosen uniformly during their lifetime
18
18 N jobs to be scheduled on M machines Each machine can not be simultaneously assigned to two jobs (colors) Jobs (colors) must be scheduled in the same order on all machines One objective must be minimized Cmax: Makespan (Total completion time) M1M1 M2M2 M3M3 The Flow-Shop Scheduling Problem 4 jobs on 3 machines
19
19 Efficiency of FTH-B&B Efficiency = ratio between the exploration time and the total execution time including Idle times. Idle time = communication time + 3-phase time + management time 3-phase mechanism is not costly in terms of execution time 3-phase consumes between 0,20% – 0,40% of the total execution time The workers of FTH-B&B spend on average 98,92% of their time solving sub-problems
20
20 Minimizing Redundant work The masters of FTH-B&B reschedule less sub-problems when using 3-phase mechanism less redundant work On average when using the 3-phase mechanism the algorithm is 6,36 times more efficient.
21
21 Maintenance of the hierarchy using SCA The figure presents the four most loaded masters in the hierarchy Degree of a process = number of children x-axis = Time y-axis = degree of masters Degrees increase with the number of failed processes The root-master process rapidly becomes a bottleneck (passes from 10 children at the beginning of the execution to 1368 children at the end) Convergence to MW paradigm
22
22 Maintenance of the hierarchy using ME The masters are less loaded than compared with SCA When a new master is designated, it receives the highest identifier it will not be designated in the next election session. Nevertheless, the average number of children has doubled after the failure of 25% of processes.
23
23 Maintenance of the hierarchy using BH The figure shows the average degree masters in each level of the hierarchy Up to 8 masters are killed every 1 second The whole masters resist to the failures and do not become bottlenecks The root masters maintains the same degree during all the lifetime The degrees decreases softly for the 3 first levels and more rapidly for the last level caused by the new converted masters that have only one worker as a children
24
24 Conclusion FTH-B&B: Application level Fault tolerant algorithm Several FT-MW-based B&Bs are launched hierarchically 3-phase communication mechanism (between a master, its children, and its grand-children) Distribute, store, and recover work units in case of failures Minimizing redundant work Efficiency = 98% 6,36 times more efficient than without using 3-Phase mechanism 3 strategies to maintain the hierarchy SCA, ME, and BH BH has proved its efficiency in terms of balancing the hierarchy Distributed checkpointing mechanism Several checkpointing levels
25
25 Perspectives Take into account best effort exploit the maximum number of computing resources Improve the realism of FTH-B&B use a realistic model of failures based on real statistics of grids
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.