More on Adaptivity in Grids
Sathish S. Vadhiyar
Source/Credits: Figures from the referenced papers
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
Wrzesinska et al.
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
- A general class of divisible applications:
  - Master-worker paradigm – 1 level
  - Hierarchical master-worker grid system – 2 levels
  - Divide-and-conquer paradigm – allows the computation to be split up in a general way, e.g. search algorithms, ray tracing
- The work deals with mechanisms for dealing with processors that leave:
  - Handling partial results from leaving processors
  - Handling orphan work
- Two cases of processors leaving:
  - Processors leaving gracefully (e.g. when a processor reservation comes to an end)
  - Processors crashing
- Restructuring the computation tree
Introduction
- Divide-and-conquer: recursive subdivision; after the subproblems are solved, their results are recursively combined until the final solution is reached
- Work is distributed across processors by work stealing:
  - When a processor runs out of work, it picks another processor at random and steals a job from its work queue
  - After computing the job, it returns the result to the originating processor
- The system uses a work-stealing algorithm called CRS (Cluster-aware Random Stealing) that overlaps intra-cluster steals with inter-cluster steals
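This spawn-and-combine structure maps naturally onto Java's ForkJoinPool, whose scheduler also uses per-worker deques with random stealing. The sketch below only illustrates the programming model, not the paper's actual API; Fibonacci stands in for a real subproblem.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer with work stealing: each job either solves its problem
// directly (leaf) or spawns subjobs and combines their results. Idle workers
// steal forked subjobs from other workers' deques.
public class DivideAndConquerSketch extends RecursiveTask<Long> {
    private final int n;

    DivideAndConquerSketch(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;               // leaf job: solve directly
        DivideAndConquerSketch left = new DivideAndConquerSketch(n - 1);
        DivideAndConquerSketch right = new DivideAndConquerSketch(n - 2);
        left.fork();                              // may be stolen by an idle worker
        long r = right.compute();                 // compute the other child locally
        return r + left.join();                   // combine the sub-results
    }

    public static void main(String[] args) {
        long result = ForkJoinPool.commonPool().invoke(new DivideAndConquerSketch(25));
        System.out.println("fib(25) = " + result); // 75025
    }
}
```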
Malleability
- Adding a new machine to a divide-and-conquer computation is simple: the new machine starts stealing jobs from other machines
- When a processor leaves, the computation tree is restructured to reuse as many partial results as possible
- How leaves are detected:
  - The remaining processors are notified by the leaving processor (graceful leave)
  - The leave is detected by the communication layer (unexpected leave)
Recomputing Jobs Stolen by Leaving Processors
- Each processor maintains a list of the jobs stolen from it, together with the processor IDs of the thieves
- When processors leave:
  - Each remaining processor traverses its stolen-jobs list and searches for jobs stolen by the leaving processors
  - Such jobs are put back in the work queues of their owners and marked as "restarted"
  - Children of "restarted" jobs are also marked as "restarted" when they are spawned
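A minimal sketch of this bookkeeping, viewed from one processor. The names (Job, StolenEntry, onProcessorsLeft) are invented for illustration; the paper's runtime keeps equivalent state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// One processor's view of the stolen-jobs list and work queue.
public class StolenJobsSketch {
    static class Job {
        final long jobId;
        boolean restarted;                 // propagated to children on spawn
        Job(long id) { this.jobId = id; }
    }
    static class StolenEntry {
        final Job job;
        final int thiefId;                 // processor that stole the job
        StolenEntry(Job j, int thief) { job = j; thiefId = thief; }
    }

    final List<StolenEntry> stolenJobs = new ArrayList<>();
    final Deque<Job> workQueue = new ArrayDeque<>();

    // Called when the runtime learns that the processors in `leftIds` are gone.
    void onProcessorsLeft(Set<Integer> leftIds) {
        for (Iterator<StolenEntry> it = stolenJobs.iterator(); it.hasNext(); ) {
            StolenEntry e = it.next();
            if (leftIds.contains(e.thiefId)) {
                it.remove();               // the thief is gone; take the job back
                e.job.restarted = true;    // children will inherit this mark
                workQueue.addLast(e.job);  // re-enqueue for local recomputation
            }
        }
    }
}
```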
Example
Example (Contd…)
Orphan Jobs
- Orphan jobs: jobs stolen from leaving processors
- Existing approaches: a processor working on an orphan job must discard the result, since it does not know where to return it
- The processor would need to know the new address to which the result should be returned
- Salvaging orphan jobs requires creating a link between the orphan and its restarted parent
Orphan Jobs (Contd…)
- For each finished orphan job, a small message containing the jobID of the orphan and the processorID that computed it is broadcast
- Unfinished intermediate nodes of orphan subtrees are aborted
- Each processor stores the (jobID, processorID) tuples in a local orphan table
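A hedged sketch of how an orphan subtree might be processed: finished orphans are announced with one small (jobID, processorID) message, unfinished intermediate nodes are aborted, and every processor records the announced tuples locally. All names here are assumptions, not the paper's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Orphan announcement path on the processor that holds the orphan subtree,
// plus the table update performed by every processor on receipt.
public class OrphanHandlingSketch {
    record OrphanJob(long jobId, boolean finished) {}

    final Map<Long, Integer> orphanTable = new HashMap<>(); // jobID -> owner
    final int myId;

    OrphanHandlingSketch(int myId) { this.myId = myId; }

    void handleOrphanSubtree(List<OrphanJob> subtree) {
        for (OrphanJob job : subtree) {
            if (job.finished()) {
                broadcast(job.jobId(), myId);  // one small announcement message
            } else {
                abort(job);                    // discard unfinished intermediate node
            }
        }
    }

    // Every processor records announced tuples in its local orphan table.
    void onAnnouncement(long jobId, int processorId) {
        orphanTable.put(jobId, processorId);
    }

    void broadcast(long jobId, int processorId) { /* communication layer elided */ }
    void abort(OrphanJob job) { /* removal from local queues elided */ }
}
```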
Orphan Jobs (Contd…)
- When a processor is about to recompute a "restarted" job, it first performs a lookup in its orphan table
- If the jobIDs match:
  - The processor removes the job from its work queue and puts it in its list of stolen jobs
  - It sends a message to the orphan owner requesting the result of the job
  - The orphan owner marks the job as stolen by the sender of the request
  - The link between the restarted parent and the orphaned child is thereby restored
- Reusing orphans improves the performance of the system
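The restart-time lookup could look like the following sketch; tryReuseOrphan and requestResult are invented names standing in for the runtime's machinery.

```java
import java.util.HashMap;
import java.util.Map;

// Lookup performed before recomputing a job marked "restarted".
public class OrphanLookupSketch {
    final Map<Long, Integer> orphanTable = new HashMap<>(); // jobID -> owner

    // Returns true if the restarted job's result can be fetched from an
    // orphan owner instead of being recomputed from scratch.
    boolean tryReuseOrphan(long restartedJobId) {
        Integer owner = orphanTable.get(restartedJobId);
        if (owner == null) return false;   // no orphan result exists; recompute
        // In the real system the job is now moved from the work queue to the
        // stolen-jobs list, as if it had been stolen by `owner`, and a result
        // request is sent; the owner marks the job as stolen by us, which
        // restores the link between restarted parent and orphaned child.
        requestResult(owner, restartedJobId);
        return true;
    }

    void requestResult(int ownerId, long jobId) { /* communication layer elided */ }
}
```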
Example
Partial Results on Leaving Processors
- If a processor knows it has to leave:
  - It chooses another processor at random
  - It transfers all results of its finished jobs to that processor
- The transferred jobs are treated as orphan jobs:
  - The processor receiving the finished jobs broadcasts a (jobID, processorID) tuple for each of them
  - The partial results are thereby linked to their restarted parents
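A small sketch of this handoff; FinishedJob and both method names are illustrative stand-ins for the real communication layer, not the paper's code.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Graceful-leave handoff: the leaver ships finished results to a random
// survivor, which then treats them exactly like orphans.
public class GracefulLeaveSketch {
    record FinishedJob(long jobId, Object result) {}

    // Leaving processor: choose a random survivor to receive the results.
    static int chooseTarget(List<Integer> remainingIds, Random rng) {
        return remainingIds.get(rng.nextInt(remainingIds.size()));
    }

    // Receiving processor: the transferred results become orphans, so a
    // (jobID, processorID) tuple is announced for each of them.
    static void onResultsReceived(int myId, List<FinishedJob> jobs,
                                  Map<Long, Integer> orphanTable) {
        for (FinishedJob j : jobs) {
            orphanTable.put(j.jobId(), myId);  // record locally...
            // ...and broadcast (jobId, myId) so restarted parents can find it
        }
    }
}
```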
Special Cases
- Master leaving – a special case, since the master owns the root job, which was not stolen from anyone
  - The remaining processors elect a new master, which respawns the root job
  - The new run reuses partial results of orphan jobs from the previous run
- Adding processors
  - A new processor downloads an orphan table from one of the other processors
  - Orphan-table requests are piggybacked on steal requests
- Message combining
  - One small broadcast message has to be sent for each orphan and for each computed job on the leaving processor
  - These messages are combined into larger ones
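For the message combining, one plausible form is to pack all pending (jobID, processorID) tuples into a single buffer and broadcast it once; the wire format below is an assumption, not the paper's.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Batch many small (jobID, processorID) announcements into one broadcast.
public class MessageCombiningSketch {
    record Tuple(long jobId, int processorId) {}

    static ByteBuffer combine(List<Tuple> tuples) {
        ByteBuffer buf = ByteBuffer.allocate(4 + tuples.size() * 12);
        buf.putInt(tuples.size());          // header: number of tuples
        for (Tuple t : tuples) {
            buf.putLong(t.jobId());         // 8 bytes per jobID
            buf.putInt(t.processorId());    // 4 bytes per processorID
        }
        buf.flip();
        return buf;                          // broadcast this single message
    }
}
```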
Results
- Three types of experiments:
  - Overhead when no processors are leaving
  - Comparison with a traditional approach that does not save orphans
  - Demonstration that the mechanism can be used for efficient migration of the computation
- Testbeds:
  - DAS-2 system: 5 clusters at five Dutch universities
  - European GridLab testbed: 24 processors at 4 sites in Europe
    - 8 in Leiden and 8 in Delft (DAS-2)
    - 4 in Berlin
    - 4 in Brno
Overhead during Normal Execution
- 4 applications run on the system with and without the mechanisms: RayTracer, TSP, a SAT solver, and the knapsack problem
- The overhead is negligible
Impact of Salvaging Partial Results
- RayTracer application on 2 DAS-2 clusters with 16 processors each
- One cluster was removed in the middle of the computation, i.e. after half of the time the run would take on 2 clusters without processors leaving
- Comparison of:
  - The traditional approach (without saving partial results)
  - Recomputing trees when processors leave unexpectedly
  - Recomputing trees when processors leave gracefully
  - Runtime on 1.5 clusters (16 processors in one cluster and 8 processors in another)
- The difference between the last two gives the overhead of transferring the partial results from the leaving processors plus the work lost because of the leaving processors
Results
Migration
- One cluster was replaced with another: the RayTracer application ran on 3 clusters; in the middle of the computation, one cluster was gracefully removed and another, identical cluster was added
- Comparison with a run without migration
- Overhead of migration: 2%
References
- Jon B. Weissman. "Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters." Journal of Parallel and Distributed Computing, Vol. 62, No. 8 (August 2002), pp. 1248-1271.
- G. Wrzesinska et al. "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid." Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), pp. 13a, 4-8 April 2005.
Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman
- A library of adaptation techniques:
  - Migration: remote process creation followed by transmission of the old worker's data to the new worker
  - Dynamic load balancing: collecting load indices, determining the redistribution, and initiating data transmission
  - Addition or removal of processors: followed by data transmission to maintain load balance
- Library calls to detect and initiate adaptation actions within the applications
- An adaptation event is sent from an external detector to all workers
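A hedged sketch of how such a library's control flow might look; the event kinds mirror the slide, but the interface and method names are invented, not Weissman's actual API.

```java
// External detector sends an adaptation event to all workers; each worker
// calls into the library at a safe point to carry out the action.
public class AdaptationSketch {
    enum AdaptationEvent { MIGRATE, LOAD_BALANCE, ADD_OR_REMOVE_PROCESSORS }

    interface Worker {
        double loadIndex();                    // load index collected for balancing
        void transferData(int targetRank);     // data transmission step
    }

    static void onAdaptationEvent(AdaptationEvent ev, Worker w) {
        switch (ev) {
            case MIGRATE:
                // remote process creation elided; then ship the old worker's data
                w.transferData(0);             // 0 = example target rank
                break;
            case LOAD_BALANCE: {
                double load = w.loadIndex();   // 1. collect load indices
                System.out.println("load index = " + load);
                // 2. determine redistribution, 3. initiate transfers (elided)
                break;
            }
            case ADD_OR_REMOVE_PROCESSORS:
                w.transferData(0);             // rebalance data after membership change
                break;
        }
    }
}
```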