Hierarchical Coordinated Checkpointing Protocol
Himadri Sekhar Paul, Arobinda Gupta, R. Badrinath
Dept. of Computer Sc. & Engg., Indian Institute of Technology, Kharagpur, INDIA

Motivation
Long-running applications execute on distributed systems.
  – Example: a metacomputer running over a WAN.
Such systems are prone to failure, so fault tolerance is important.
  – Checkpoint and recovery technique.

Motivation
Coordinated checkpointing is a popular scheme.
A coordinated checkpointing protocol is bottlenecked by the slowest link in the network.
The Hierarchical Coordinated Checkpointing Protocol caters for heterogeneous link speeds, as in a WAN.
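
The bottleneck claim can be illustrated with a minimal sketch. It assumes a simple model in which the coordinator must collect an acknowledgement from every process before a phase ends, so each phase lasts as long as the slowest round trip. The process names, the delay values, and the functions `flat_phase_time` and `flat_round_time` are hypothetical and do not come from the slides.

```python
# Hypothetical round-trip delays (seconds) from the coordinator to each process.
# The values are illustrative assumptions, not measurements from the slides.
round_trip = {"p1": 0.002, "p2": 0.003, "p3": 0.050}  # p3 sits behind a slow link


def flat_phase_time(round_trip_delays):
    """One request/ack phase ends only when the slowest ack has arrived."""
    return max(round_trip_delays.values())


def flat_round_time(round_trip_delays, phases=2):
    """Assumed model: a two-phase round pays the slowest round trip once per phase."""
    return phases * flat_phase_time(round_trip_delays)


if __name__ == "__main__":
    print(flat_phase_time(round_trip))
    print(flat_round_time(round_trip))
```

Under this model the two fast processes contribute nothing to the round time: the single slow link to `p3` determines it.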

System Model
Nodes are fail-safe.
Network is immune to partitioning.
Links are unreliable.
All computing nodes are reachable from the others.
Network is hierarchically connected
  – Clusters of computing nodes realized by high-speed networks.
  – Clusters inter-connected by lower-speed networks.

System Model
[Figure: system model. Labels: Cluster, Computation Nodes.]

Flat Coordinated Checkpointing Protocol (2-phase commit)
[Figure: message timeline between Coordinator and Follower. Legend: Message, Checkpoint. Message labels: Ckpt Rqst, Ack Ckpt Rqst, Ckpt Estb, Ack Ckpt Estb. A marked interval is labelled "Process blocked".]
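
The message exchange named on this slide can be sketched as a small in-memory simulation. Only the four message names and the Coordinator and Follower roles come from the slide. The behaviour attached to each message (tentative checkpoint and blocking on Ckpt Rqst, commit and unblocking on Ckpt Estb) is an assumption based on the conventional blocking two-phase scheme. The class and method names are hypothetical, and message loss is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    blocked: bool = False
    tentative: bool = False
    checkpoints: int = 0

    def on_ckpt_rqst(self):
        # Assumed behaviour: take a tentative checkpoint, block, then acknowledge.
        self.tentative = True
        self.blocked = True
        return "Ack Ckpt Rqst"

    def on_ckpt_estb(self):
        # Assumed behaviour: make the checkpoint permanent, unblock, then acknowledge.
        if self.tentative:
            self.checkpoints += 1
            self.tentative = False
        self.blocked = False
        return "Ack Ckpt Estb"


@dataclass
class Coordinator:
    followers: list
    log: list = field(default_factory=list)

    def run_round(self):
        # Phase 1: request a checkpoint and wait for every acknowledgement.
        acks = [f.on_ckpt_rqst() for f in self.followers]
        self.log.append(("Ckpt Rqst", acks))
        if not all(a == "Ack Ckpt Rqst" for a in acks):
            return False
        # Phase 2: announce that the checkpoint is established.
        acks = [f.on_ckpt_estb() for f in self.followers]
        self.log.append(("Ckpt Estb", acks))
        return all(a == "Ack Ckpt Estb" for a in acks)


followers = [Follower("f1"), Follower("f2")]
coordinator = Coordinator(followers)
round_ok = coordinator.run_round()
```

In this sketch every follower stays blocked from the moment it handles Ckpt Rqst until it handles Ckpt Estb, which is why the duration of a round depends on how quickly the slowest acknowledgement returns.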

[Figure: message timeline among Initiator, Leader and Follower. Legend: Message, Checkpoint. Message labels: Ckpt_rqst, AckCkpt_rqst, Ckpt_estb, AckCkpt_estb, Ckpt_commit, AckCkpt_commit. Annotations: "Blocked", "Blocking at Extra-cluster msg".]

Simulation Result
Simulation Setup
  – Two-level network, with intra-cluster link speed of 10 Mbps and inter-cluster link speed of 1 Mbps.
  – Communication pattern of the application is random.
  – The fraction of extra-cluster application messages is varied.
(Flat = Flat Coordinated Checkpointing Protocol)
(Hier = Hierarchical Coordinated Checkpointing Protocol)
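
To give a feel for the two link speeds in the setup, here is a small worked calculation of the transmission time t = S/B for a single message. The message size S = 1000 bits is an assumption chosen for illustration, and propagation and queueing delays are ignored.

```latex
t = \frac{S}{B}, \qquad
t_{\text{intra}} = \frac{10^{3}\ \text{bit}}{10 \times 10^{6}\ \text{bit/s}} = 0.1\ \text{ms}, \qquad
t_{\text{inter}} = \frac{10^{3}\ \text{bit}}{1 \times 10^{6}\ \text{bit/s}} = 1\ \text{ms}
```

Whatever size is assumed, a message crossing an inter-cluster link takes 10 times as long to transmit as the same message on an intra-cluster link.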

Simulation Result
[Figure: plot not recoverable from the source.]

Conclusion & Future Work
In a two-level hierarchical network, the hierarchical checkpointing protocol incurs less latency than the flat checkpointing protocol, even for very high communication intensity.
The protocol can be extended to a generic hierarchical network.