More on Adaptivity in Grids
Sathish S. Vadhiyar
Source/Credits: Figures from the referenced papers
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
Wrzesinska et al.
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
- A general class of divisible applications:
  - Master-worker paradigm – 1 level
  - Hierarchical master-worker grid system – 2 levels
  - Divide-and-conquer paradigm – allows the computation to be split up in a general way, e.g. search algorithms, ray tracing
- The work deals with mechanisms for dealing with processors that leave:
  - Handling partial results from leaving processors
  - Handling orphan work
- Two cases of processors leaving:
  - Processors leaving gracefully (e.g. when a processor reservation comes to an end)
  - Processors crashing
- Restructuring the computation tree
Introduction
- Divide-and-conquer: recursive subdivision; after the subproblems are solved, their results are recursively combined until the final solution is reached
- Work is distributed across processors by work stealing:
  - When a processor runs out of work, it picks another processor at random and steals a job from its work queue
  - After computing the job, it returns the result to the originating processor
- The system uses a work-stealing algorithm called CRS (Cluster-aware Random Stealing) that overlaps intra-cluster steals with inter-cluster steals
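This spawn-and-combine structure maps naturally onto Java's ForkJoinPool, whose scheduler also uses per-worker deques with random stealing. The sketch below only illustrates the programming model, not the paper's actual API; Fibonacci stands in for a real subproblem.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer with work stealing: each job either solves its problem
// directly (leaf) or spawns subjobs and combines their results. Idle workers
// steal forked subjobs from other workers' deques.
public class DivideAndConquerSketch extends RecursiveTask<Long> {
    private final int n;

    DivideAndConquerSketch(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;               // leaf job: solve directly
        DivideAndConquerSketch left = new DivideAndConquerSketch(n - 1);
        DivideAndConquerSketch right = new DivideAndConquerSketch(n - 2);
        left.fork();                              // may be stolen by an idle worker
        long r = right.compute();                 // compute the other child locally
        return r + left.join();                   // combine the sub-results
    }

    public static void main(String[] args) {
        long result = ForkJoinPool.commonPool().invoke(new DivideAndConquerSketch(25));
        System.out.println("fib(25) = " + result); // 75025
    }
}
```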
Malleability
- Adding a new machine to a divide-and-conquer computation is simple: the new machine starts stealing jobs from other machines
- When a processor leaves, the computation tree is restructured to reuse as many partial results as possible
- How leaves are detected:
  - The remaining processors are notified by the leaving processor (graceful leave)
  - The leave is detected by the communication layer (unexpected leave)
Recomputing Jobs Stolen by Leaving Processors
- Each processor maintains a list of the jobs stolen from it, together with the processor IDs of the thieves
- When processors leave:
  - Each remaining processor traverses its stolen-jobs list and searches for jobs stolen by the leaving processors
  - Such jobs are put back in the work queues of their owners and marked as "restarted"
  - Children of "restarted" jobs are also marked as "restarted" when they are spawned
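A minimal sketch of this bookkeeping, viewed from one processor. The names (Job, StolenEntry, onProcessorsLeft) are invented for illustration; the paper's runtime keeps equivalent state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// One processor's view of the stolen-jobs list and work queue.
public class StolenJobsSketch {
    static class Job {
        final long jobId;
        boolean restarted;                 // propagated to children on spawn
        Job(long id) { this.jobId = id; }
    }
    static class StolenEntry {
        final Job job;
        final int thiefId;                 // processor that stole the job
        StolenEntry(Job j, int thief) { job = j; thiefId = thief; }
    }

    final List<StolenEntry> stolenJobs = new ArrayList<>();
    final Deque<Job> workQueue = new ArrayDeque<>();

    // Called when the runtime learns that the processors in `leftIds` are gone.
    void onProcessorsLeft(Set<Integer> leftIds) {
        for (Iterator<StolenEntry> it = stolenJobs.iterator(); it.hasNext(); ) {
            StolenEntry e = it.next();
            if (leftIds.contains(e.thiefId)) {
                it.remove();               // the thief is gone; take the job back
                e.job.restarted = true;    // children will inherit this mark
                workQueue.addLast(e.job);  // re-enqueue for local recomputation
            }
        }
    }
}
```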
Example
Example (Contd…)
Orphan Jobs
- Orphan jobs: jobs stolen from leaving processors
- Existing approaches: a processor working on an orphan job must discard the result, since it does not know where to return it
- The processor would need to know the new address to which the result should be returned
- Salvaging orphan jobs requires creating a link between the orphan and its restarted parent
Orphan Jobs (Contd…)
- For each finished orphan job, a small message containing the jobID of the orphan and the processorID that computed it is broadcast
- Unfinished intermediate nodes of orphan subtrees are aborted
- Each processor stores the (jobID, processorID) tuples in a local orphan table
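A hedged sketch of how an orphan subtree might be processed: finished orphans are announced with one small (jobID, processorID) message, unfinished intermediate nodes are aborted, and every processor records the announced tuples locally. All names here are assumptions, not the paper's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Orphan announcement path on the processor that holds the orphan subtree,
// plus the table update performed by every processor on receipt.
public class OrphanHandlingSketch {
    record OrphanJob(long jobId, boolean finished) {}

    final Map<Long, Integer> orphanTable = new HashMap<>(); // jobID -> owner
    final int myId;

    OrphanHandlingSketch(int myId) { this.myId = myId; }

    void handleOrphanSubtree(List<OrphanJob> subtree) {
        for (OrphanJob job : subtree) {
            if (job.finished()) {
                broadcast(job.jobId(), myId);  // one small announcement message
            } else {
                abort(job);                    // discard unfinished intermediate node
            }
        }
    }

    // Every processor records announced tuples in its local orphan table.
    void onAnnouncement(long jobId, int processorId) {
        orphanTable.put(jobId, processorId);
    }

    void broadcast(long jobId, int processorId) { /* communication layer elided */ }
    void abort(OrphanJob job) { /* removal from local queues elided */ }
}
```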
Orphan Jobs (Contd…)
- When a processor is about to recompute a "restarted" job, it first performs a lookup in its orphan table
- If the jobIDs match:
  - The processor removes the job from its work queue and puts it in its list of stolen jobs
  - It sends a message to the orphan owner requesting the result of the job
  - The orphan owner marks the job as stolen by the sender of the request
  - The link between the restarted parent and the orphaned child is thereby restored
- Reusing orphans improves the performance of the system
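The restart-time lookup could look like the following sketch; tryReuseOrphan and requestResult are invented names standing in for the runtime's machinery.

```java
import java.util.HashMap;
import java.util.Map;

// Lookup performed before recomputing a job marked "restarted".
public class OrphanLookupSketch {
    final Map<Long, Integer> orphanTable = new HashMap<>(); // jobID -> owner

    // Returns true if the restarted job's result can be fetched from an
    // orphan owner instead of being recomputed from scratch.
    boolean tryReuseOrphan(long restartedJobId) {
        Integer owner = orphanTable.get(restartedJobId);
        if (owner == null) return false;   // no orphan result exists; recompute
        // In the real system the job is now moved from the work queue to the
        // stolen-jobs list, as if it had been stolen by `owner`, and a result
        // request is sent; the owner marks the job as stolen by us, which
        // restores the link between restarted parent and orphaned child.
        requestResult(owner, restartedJobId);
        return true;
    }

    void requestResult(int ownerId, long jobId) { /* communication layer elided */ }
}
```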
Example
Partial Results on Leaving Processors
- If a processor knows it has to leave:
  - It chooses another processor at random
  - It transfers all results of its finished jobs to that processor
- The transferred jobs are treated as orphan jobs:
  - The processor receiving the finished jobs broadcasts a (jobID, processorID) tuple for each of them
  - The partial results are thereby linked to their restarted parents
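A small sketch of this handoff; FinishedJob and both method names are illustrative stand-ins for the real communication layer, not the paper's code.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Graceful-leave handoff: the leaver ships finished results to a random
// survivor, which then treats them exactly like orphans.
public class GracefulLeaveSketch {
    record FinishedJob(long jobId, Object result) {}

    // Leaving processor: choose a random survivor to receive the results.
    static int chooseTarget(List<Integer> remainingIds, Random rng) {
        return remainingIds.get(rng.nextInt(remainingIds.size()));
    }

    // Receiving processor: the transferred results become orphans, so a
    // (jobID, processorID) tuple is announced for each of them.
    static void onResultsReceived(int myId, List<FinishedJob> jobs,
                                  Map<Long, Integer> orphanTable) {
        for (FinishedJob j : jobs) {
            orphanTable.put(j.jobId(), myId);  // record locally...
            // ...and broadcast (jobId, myId) so restarted parents can find it
        }
    }
}
```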
Special Cases
- Master leaving – a special case, since the master owns the root job, which was not stolen from anyone
  - The remaining processors elect a new master, which respawns the root job
  - The new run reuses partial results of orphan jobs from the previous run
- Adding processors
  - A new processor downloads an orphan table from one of the other processors
  - Orphan-table requests are piggybacked on steal requests
- Message combining
  - One small broadcast message has to be sent for each orphan and for each computed job on the leaving processor
  - These messages are combined into larger ones
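For the message combining, one plausible form is to pack all pending (jobID, processorID) tuples into a single buffer and broadcast it once; the wire format below is an assumption, not the paper's.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Batch many small (jobID, processorID) announcements into one broadcast.
public class MessageCombiningSketch {
    record Tuple(long jobId, int processorId) {}

    static ByteBuffer combine(List<Tuple> tuples) {
        ByteBuffer buf = ByteBuffer.allocate(4 + tuples.size() * 12);
        buf.putInt(tuples.size());          // header: number of tuples
        for (Tuple t : tuples) {
            buf.putLong(t.jobId());         // 8 bytes per jobID
            buf.putInt(t.processorId());    // 4 bytes per processorID
        }
        buf.flip();
        return buf;                          // broadcast this single message
    }
}
```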
Results
- Three types of experiments:
  - Overhead when no processors are leaving
  - Comparison with a traditional approach that does not save orphans
  - Demonstration that the mechanism can be used for efficient migration of the computation
- Testbeds:
  - DAS-2 system: 5 clusters at five Dutch universities
  - European GridLab testbed: 24 processors at 4 sites in Europe
    - 8 in Leiden and 8 in Delft (DAS-2)
    - 4 in Berlin
    - 4 in Brno
Overhead during Normal Execution
- 4 applications run on the system with and without the mechanisms: RayTracer, TSP, a SAT solver, and the knapsack problem
- The overhead is negligible
Impact of Salvaging Partial Results
- RayTracer application on 2 DAS-2 clusters with 16 processors each
- One cluster was removed in the middle of the computation, i.e. after half of the time the run would take on 2 clusters without processors leaving
- Comparison of:
  - The traditional approach (without saving partial results)
  - Recomputing trees when processors leave unexpectedly
  - Recomputing trees when processors leave gracefully
  - Runtime on 1.5 clusters (16 processors in one cluster and 8 processors in another)
- The difference between the last two gives the overhead of transferring the partial results from the leaving processors plus the work lost because of the leaving processors
Results
Migration
- One cluster was replaced with another: the RayTracer application ran on 3 clusters; in the middle of the computation, one cluster was gracefully removed and another, identical cluster was added
- Comparison with a run without migration
- Overhead of migration: 2%
References
- Jon B. Weissman. "Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters." Journal of Parallel and Distributed Computing, Vol. 62, No. 8 (August 2002), pp. 1248-1271.
- G. Wrzesinska et al. "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid." Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), pp. 13a, 4-8 April 2005.
Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman
- A library of adaptation techniques:
  - Migration: remote process creation followed by transmission of the old worker's data to the new worker
  - Dynamic load balancing: collecting load indices, determining the redistribution, and initiating data transmission
  - Addition or removal of processors: followed by data transmission to maintain load balance
- Library calls to detect and initiate adaptation actions within the applications
- An adaptation event is sent from an external detector to all workers
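A hedged sketch of how such a library's control flow might look; the event kinds mirror the slide, but the interface and method names are invented, not Weissman's actual API.

```java
// External detector sends an adaptation event to all workers; each worker
// calls into the library at a safe point to carry out the action.
public class AdaptationSketch {
    enum AdaptationEvent { MIGRATE, LOAD_BALANCE, ADD_OR_REMOVE_PROCESSORS }

    interface Worker {
        double loadIndex();                    // load index collected for balancing
        void transferData(int targetRank);     // data transmission step
    }

    static void onAdaptationEvent(AdaptationEvent ev, Worker w) {
        switch (ev) {
            case MIGRATE:
                // remote process creation elided; then ship the old worker's data
                w.transferData(0);             // 0 = example target rank
                break;
            case LOAD_BALANCE: {
                double load = w.loadIndex();   // 1. collect load indices
                System.out.println("load index = " + load);
                // 2. determine redistribution, 3. initiate transfers (elided)
                break;
            }
            case ADD_OR_REMOVE_PROCESSORS:
                w.transferData(0);             // rebalance data after membership change
                break;
        }
    }
}
```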