Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

1 Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

2 Assumptions
1. Fail-stop model: If a processor fails, it no longer transmits valid messages.
2. Reliable communication: Processor crashes are detected eventually by the communication layer.

3 (Master) 2 is a descendant of 1

4 Map of Thief to Set of Tasks
Each victim has a table of stolen tasks: Map<Computer, Set<Task>> thiefTaskSet = new …
When a task is stolen, a copy is put in the Set associated with that thief (Computer).
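The victim-side table can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the class name `StolenTaskTable`, the methods `recordSteal` and `reclaim`, and the generic type parameters standing in for Computer and Task are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: C stands in for Computer, T for Task.
class StolenTaskTable<C, T> {
    // One set of task copies per thief, as described on the slide.
    private final Map<C, Set<T>> thiefTaskSet = new HashMap<>();

    // Called by the victim when 'thief' steals 'task': keep a copy under that thief.
    void recordSteal(C thief, T task) {
        thiefTaskSet.computeIfAbsent(thief, k -> new HashSet<>()).add(task);
    }

    // Called when 'thief' is reported as crashed: remove and return its task
    // copies so the caller can put them back in its task queue.
    Set<T> reclaim(C thief) {
        Set<T> tasks = thiefTaskSet.remove(thief);
        return tasks == null ? new HashSet<>() : tasks;
    }
}
```

Keeping the copies keyed by thief means that recovering from one crashed thief is a single lookup rather than a scan over all stolen tasks.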

5 Global Result Table (GRT)
Each compute server has a GRT replica, held as a Map.
Entries are broadcast to all compute servers.
The Map key & value are potentially large.
It should be (more explanation later …) a Map whose value is the Computer where the Result is stored.

6 Crash recovery method
1. If ( master crashed ) Elect a new master;
2. For all ( tasks stolen by a crashed processor ) Put task in task queue;
3. For all ( descendants of tasks stolen from a crashed processor ) If ( descendant is finished ) Then store its result in Global Result Table; Else abort the task;
4. If ( old master crashed && I am the master ) Restart the application;
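Steps 2 and 3 of the recovery method might look like the sketch below when run on a single surviving server. Master election and application restart (steps 1 and 4) are left out. Everything named here (`CrashRecoverySketch`, `Task`, `recover`, the use of String keys and results) is a hypothetical stand-in, and the sketch only updates a local table rather than broadcasting entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of recovery steps 2 and 3 on one surviving server.
class CrashRecoverySketch {
    // Stand-in for a locally held task: a key, a finished flag, and a result.
    static class Task {
        final String key;
        final boolean finished;
        final String result;

        Task(String key, boolean finished, String result) {
            this.key = key;
            this.finished = finished;
            this.result = result;
        }
    }

    final Deque<String> taskQueue = new ArrayDeque<>();
    final Map<String, String> globalResultTable = new HashMap<>();
    final List<String> aborted = new ArrayList<>();

    void recover(List<String> stolenByCrashed, List<Task> orphanedDescendants) {
        // Step 2: tasks the crashed processor stole from us go back in our queue.
        for (String task : stolenByCrashed) {
            taskQueue.addLast(task);
        }
        // Step 3: descendants of tasks we stole from the crashed processor.
        for (Task descendant : orphanedDescendants) {
            if (descendant.finished) {
                // Finished work is preserved in the (local replica of the) GRT.
                globalResultTable.put(descendant.key, descendant.result);
            } else {
                // Unfinished orphans are aborted.
                aborted.add(descendant.key);
            }
        }
    }
}
```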

7

8 Notes
A task is an orphan if its parent task is on a crashed server.
The authors state their contribution as: some descendants of orphaned tasks are not recomputed.
Descendants of orphaned tasks are aborted if they are incomplete at the time they become orphans.
They do not use explicit continuation passing: there are no composition tasks.
Descendant decompositions that were complete must be recomputed!

9 Complete decomposition tasks 4, 8, & 14 are lost. In-progress task 21 is lost. Decompositions 2, 5, 10, & 16 are lost.

10 Notes
Their GRT key is the task parameters.
– The hash code is the sum of the hashes of the parameters. If a parameter is an array, they sum the hash of each element!
– It should be TaskId, but they do not have a processor-independent TaskId. This is claimed as future work.
They claim that only 1 in 1000-10,000 tasks is stolen, which is key to the efficiency of their scheme.
Their tests crash whole clusters, rather than individual compute servers within a cluster. Why?
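One property of a summed hash is that it ignores element order, so arrays holding the same values in a different order always collide. The sketch below illustrates this with a hypothetical `summedHash` method (an assumed rendering of the scheme described above, not the authors' code) and compares it with the standard `java.util.Arrays.hashCode`, which is order-sensitive.

```java
import java.util.Arrays;

// Hypothetical rendering of a sum-of-element-hashes scheme for an array parameter.
class ParameterHashSketch {
    static int summedHash(int[] parameter) {
        int sum = 0;
        for (int element : parameter) {
            sum += Integer.hashCode(element); // addition is order-insensitive
        }
        return sum;
    }

    // Order-sensitive alternative from the standard library, for comparison.
    static int orderedHash(int[] parameter) {
        return Arrays.hashCode(parameter);
    }
}
```

For example, `summedHash(new int[] {1, 2, 3})` and `summedHash(new int[] {3, 2, 1})` are equal, while `orderedHash` gives different values for the two arrays.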

