1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et.

1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et Technique CERIST Laboratoire d’Informatique Fondamentale de Lille A.Bendjoudi ahcene.bendjoudi@gmail.com N. Melab and E-G. Talbi {melab,talbi}@lifl.fr

2 Outline  Motivations  FTH-B&B: Fault Tolerant hierarchical Branch and Bound algorithm  Performance Evaluation  Conclusion and Perspectives

3 Motivations  Combinatorial Optimization Problems are Np-Hard and CPU-Intensive  Their exact resolution requires a huge amount of computing resources  Branch and Bound B&B algorithms are the best known methods  Implicit enumeration of the search space  They use a pruning technique to reduce the search space  however they remain not sufficient for very large instances  High performance computing (Computational Grids)  Computational Grids offer large amount of computing resources  Issue: Scalability  Traditional Master/Worker-based B&B (MW-B&B) algorithms are inefficient  Master process becomes rapidly bottleneck Solution: Hierarchical Master/Worker-based B&B (HMW-B&B) 8

4 Branch and Bound Algorithm Branching Original problem is partitioned into smaller sub problems. Bounding Compute the estimated optimal solution (lower bound) of the considered problem and compare to the best known solution (upper bound). Elimination Identify nodes which don’t lead to The best solution and eliminate them Selection Exploration strategy (best first search, depth first search, best bound,…) Problem to be solved Sub-problem Solutions P0P0 8 Optimal solution Branching Elimination Selection

5 Motivations  Resources of computational grids are highly unreliable and volatile  Issue: Fault-tolerance  Fault tolerance at middleware level: ProActive, XtremWeb, Condor, …  They are costly in terms of execution time Solution: Application level  fault-tolerant algorithm  Fault detection  Task recovery  Minimization of Redundant work  Maintenance of the hierarchy

6 Outline  Motivations  FTH-B&B: Fault Tolerant Hierarchical Branch and Bound algorithm  Architecture  Work management with task recovery  Maintenance of the hierarchy  Distributed checkpointing  Performance Evaluation  Conclusion et Perspectives

7 Architecture  Based on HMW paradigm dealing with Scalability & Fault tolerance  Several FT MW-B&Bs organized hierarchically into groups  Each FT MW-B&B is composed of:  Single Master (on the top and inner nodes of the hierarchy): parallel recursive branching  Workers (Leaves): parallel exploration or Masters (inner nodes): parallel branching  Each master owns a single work pool  Multiple work pools  Collegial strategy Workers Masters Fault Tolerant M/W-based B&Bs Multiple work pools Group of masters and workers

8 Fault detection  A master heartbeats its workers  Workers heartbeat their master Master Heartbeats Workers

9 Work management & task recovery  Each master manages:  A list of assigned sub-problems (mapping between the assigned sub-problem and the child process assigned to it)  A list of branched sub-problems LBS (sub-problems being explored by its children)  When a failure is detected  only the unexplored sub-problems of the lost root problem in the LBS are rescheduled to another safe process R AB PAPA PBPB PCPC PDPD PAPA Work pool List of branched sub-problems LBS P A1 P A2

10 3-phase communication mechanism  Phase 1 (between a master and its children):  A master assigns a problem to its children  It receives back the branched sub-problems  Phase 2 (between a master, its children, and its parent):  A master updates explored sub-problems in the LBS each time a child finishes the exploration of its sub-problem.   the master knows at any time the unexplored parts of a given problem.  Phase 3 (between a master and a new free process):  When a process fails  The parent of the failed process detects its failure and saves the unexplored part of its sub-problem  When a new safe process connects, the parent reschedules it the unexplored part of the sub-problem R AB PAPA PBPB PCPC PDPD PAPA P A1 P A2 Connection of a new worker

11 Maintenance of the hierarchy  Motivations  Failures can cause a loss of data and/or computing power  Master’s failures can isolate some parts of processes from the rest of the hierarchy and cause orphan branches  Loss of processes can cause unsafe execution of the algorithm (non termination, redundant work,...)  Failures can unbalance the hierarchy if only the parent of the failed processes is involved in the fault recovery process  New bottlenecks can be created on parts of the hierarchy  Proposed techniques  Simple connection to Ascendants SCA  Master Election ME  Balanced Hierarchy BH

12 Simple connection to Ascendants (SCA)  Children of the failed master connect to the closest safe ascendant  Each process holds the list of its ascendants  No process reassignment  When a process receives a connection from an orphan process, it considers it as a new child  Drawback:  Risk of creation of new bottlenecks  If f children fail, the safe master receives gxf connections from its grandchildren  If l levels of masters fail, it receives g l+1 connections  Convergence to MW-B&B

13 Master Election (ME)  Orphan processes elect a new master among them using the bully algorithm  The new selected master considers the other electors as its children  It connects to its closest safe ascendant using SCA  It is informed by its new parent about its neighbors

14 Balanced Hierarchy (BH)  Orphan processes migrate to a safe sub-B&B  The migration strategy respects the maximum group size threshold  Assigns orphan processes to a non- full sub-B&B  A master receiving orphan processes:  Holds them only if the group it manages is non-full  Otherwise, it dispatches them among its children  Avoids the convergence to a MW

15 Distributed Checkpointing  Several levels of checkpointing  Each sub-MW-B&B performs distributed checkpointing independently  There is no global backup file and each master handles its own local backup file  A master saves unexplored sub-problems and updates its file each time a new sub-problem is explored. Clusters Backup Files Workers Masters  A new considered master checks the buck-up file  Reconstitutes the unexplored sub-problems  Carries on the last consistent global state.

16 Performance Evaluation (Objectives)  Evaluate the ability of FTH-B&B to deal with Fault Tolerance issue  Measure the efficiency of FTH-B&B in terms of execution time  Measure the benefit of task recovery on FTH-B&B in terms of minimizing the redundant work  Measure the impact of failures on the hierarchy of FTH-B&B using (SCA, ME, and BH)

17 Performance Evaluation (Environment)  FTH-B&B has been experimented on the Flowshop Scheduling Problem FSP (Taillard instances)  Application developed on top of the ProActive middleware  GRID’5000 : https://www.grid5000.frhttps://www.grid5000.fr  Between 1900 and 8900 processes have been deployed  To obtain deeper hierarchy, several processes have been deployed on a same host and the size of the sub-B&Bs is fixed to 10  Failures are obtained killing processes chosen uniformly during their lifetime

18  N jobs to be scheduled on M machines  Each machine can not be simultaneously assigned to two jobs (colors)  Jobs (colors) must be scheduled in the same order on all machines  One objective must be minimized  Cmax: Makespan (Total completion time) M1M1 M2M2 M3M3 The Flow-Shop Scheduling Problem 4 jobs on 3 machines

19 Efficiency of FTH-B&B  Efficiency = ratio between the exploration time and the total execution time including Idle times.  Idle time = communication time + 3-phase time + management time  3-phase mechanism is not costly in terms of execution time  3-phase consumes between 0,20% – 0,40% of the total execution time  The workers of FTH-B&B spend on average 98,92% of their time solving sub-problems

20 Minimizing Redundant work  The masters of FTH-B&B reschedule less sub-problems when using 3-phase mechanism  less redundant work  On average when using the 3-phase mechanism the algorithm is 6,36 times more efficient.

21 Maintenance of the hierarchy using SCA  The figure presents the four most loaded masters in the hierarchy  Degree of a process = number of children  x-axis = Time  y-axis = degree of masters  Degrees increase with the number of failed processes  The root-master process rapidly becomes a bottleneck (passes from 10 children at the beginning of the execution to 1368 children at the end)  Convergence to MW paradigm

22 Maintenance of the hierarchy using ME  The masters are less loaded than compared with SCA  When a new master is designated, it receives the highest identifier  it will not be designated in the next election session.  Nevertheless, the average number of children has doubled after the failure of 25% of processes.

23 Maintenance of the hierarchy using BH  The figure shows the average degree masters in each level of the hierarchy  Up to 8 masters are killed every 1 second  The whole masters resist to the failures and do not become bottlenecks  The root masters maintains the same degree during all the lifetime  The degrees decreases softly for the 3 first levels and more rapidly for the last level caused by the new converted masters that have only one worker as a children

24 Conclusion  FTH-B&B: Application level Fault tolerant algorithm  Several FT-MW-based B&Bs are launched hierarchically  3-phase communication mechanism (between a master, its children, and its grand-children)  Distribute, store, and recover work units in case of failures  Minimizing redundant work  Efficiency = 98%  6,36 times more efficient than without using 3-Phase mechanism  3 strategies to maintain the hierarchy  SCA, ME, and BH  BH has proved its efficiency in terms of balancing the hierarchy  Distributed checkpointing mechanism  Several checkpointing levels

25 Perspectives  Take into account best effort  exploit the maximum number of computing resources  Improve the realism of FTH-B&B  use a realistic model of failures based on real statistics of grids

1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et.

Similar presentations

Presentation on theme: "1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et.

Similar presentations

Presentation on theme: "1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et."— Presentation transcript:

Similar presentations

About project

Feedback