Hierarchical Coordinated Checkpointing Protocol. Himadri Sekhar Paul, Arobinda Gupta, R. Badrinath. Dept. of Computer Sc. & Engg., Indian Institute of Technology, Kharagpur, INDIA

Presentation transcript:

Hierarchical Coordinated Checkpointing Protocol
Himadri Sekhar Paul, Arobinda Gupta, R. Badrinath
Dept. of Computer Sc. & Engg., Indian Institute of Technology, Kharagpur, INDIA

Dept. of Computer Sc. & Engg. IIT Kharagpur Hierarchical Coordinated Checkpointing Protocol
Motivation
Long-running applications executing on distributed systems.
– Metacomputer running over WAN.
Prone to failure, so fault tolerance is important.
– Checkpoint and recovery technique.

Motivation (continued)
Coordinated checkpointing is a popular scheme.
A coordinated checkpointing protocol is bottlenecked by the slowest link in the network.
The Hierarchical Coordinated Checkpointing Protocol accommodates heterogeneous link speeds, as in a WAN.

System Model
Nodes are fail-safe.
Network is immune to partitioning.
Links are unreliable.
All computing nodes are reachable from the others.
Network is hierarchically connected:
– Clusters of computing nodes realized by high-speed networks.
– Clusters interconnected by lower-speed networks.

System Model (figure; labels: Cluster, Computation Nodes)

Flat Coordinated Checkpointing Protocol (2-phase commit)
(Figure; participants: Coordinator, Follower. Legend: Message, Checkpoint, Process blocked. Messages: Ckpt Rqst, Ack Ckpt Rqst, Ckpt Estb, Ack Ckpt Estb.)
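The message names in the flat protocol diagram suggest a two-phase exchange between the coordinator and each follower. The sketch below is one possible reading of that exchange, not the authors' implementation. The class and function names are hypothetical. The blocking and tentative-checkpoint behaviour is an assumption based on how two-phase coordinated checkpointing is commonly described; the slide itself only shows the message labels and a "Process blocked" interval.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    """Hypothetical follower process; its state changes are assumptions."""
    name: str
    blocked: bool = False
    tentative: bool = False
    established: int = 0  # number of checkpoints made permanent

    def on_ckpt_rqst(self) -> str:
        # Assumed: block the application and take a tentative checkpoint.
        self.blocked = True
        self.tentative = True
        return "Ack Ckpt Rqst"

    def on_ckpt_estb(self) -> str:
        # Assumed: make the tentative checkpoint permanent and resume.
        if self.tentative:
            self.established += 1
        self.tentative = False
        self.blocked = False
        return "Ack Ckpt Estb"


def flat_checkpoint(followers):
    """Run one checkpoint round; return (sender, receiver, message) tuples."""
    trace = []
    # Phase 1: request a checkpoint from every follower, then collect acks.
    for f in followers:
        trace.append(("Coordinator", f.name, "Ckpt Rqst"))
    for f in followers:
        trace.append((f.name, "Coordinator", f.on_ckpt_rqst()))
    # Phase 2: starts only after every follower has acknowledged phase 1.
    for f in followers:
        trace.append(("Coordinator", f.name, "Ckpt Estb"))
    for f in followers:
        trace.append((f.name, "Coordinator", f.on_ckpt_estb()))
    return trace


if __name__ == "__main__":
    for sender, receiver, message in flat_checkpoint([Follower("P1"), Follower("P2")]):
        print(f"{sender} -> {receiver}: {message}")
```

Because phase 2 cannot begin until the last acknowledgement arrives, a single follower behind a slow link delays every other process in this model, which is the bottleneck named on the Motivation slide.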

(Figure; participants: Initiator, Leader, Follower. Legend: Message, Checkpoint, Blocked, Blocking at extra-cluster msg. Messages: Ckpt_rqst, AckCkpt_rqst, Ckpt_estb, AckCkpt_estb, Ckpt_commit, AckCkpt_commit.)
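One way to see why the Initiator, Leader and Follower roles matter is to count protocol messages that must cross the slow inter-cluster links. The sketch below is a rough illustration under stated assumptions, not a description of the authors' protocol. It assumes that the flat coordinator exchanges its four messages directly with every node, and that the hierarchical initiator exchanges the three request/acknowledgement pairs (Ckpt_rqst, Ckpt_estb, Ckpt_commit) only with the leader of each remote cluster, with leaders handling their followers over fast intra-cluster links. The transcript does not confirm this message routing, and the function names are hypothetical.

```python
def flat_inter_cluster_msgs(cluster_sizes, coordinator_cluster=0):
    """Messages crossing cluster boundaries in one flat checkpoint round.

    Assumes 4 messages (Ckpt Rqst, Ack Ckpt Rqst, Ckpt Estb, Ack Ckpt Estb)
    between the coordinator and every node outside its own cluster.
    """
    remote_nodes = sum(
        size for i, size in enumerate(cluster_sizes) if i != coordinator_cluster
    )
    return 4 * remote_nodes


def hier_inter_cluster_msgs(cluster_sizes):
    """Messages crossing cluster boundaries in one hierarchical round.

    Assumes 6 messages (Ckpt_rqst, Ckpt_estb, Ckpt_commit and their acks)
    between the initiator and the leader of every other cluster.
    """
    remote_leaders = len(cluster_sizes) - 1
    return 6 * remote_leaders


if __name__ == "__main__":
    sizes = [8, 8, 8, 8]  # four clusters of eight nodes, chosen for illustration
    print("flat:", flat_inter_cluster_msgs(sizes))
    print("hier:", hier_inter_cluster_msgs(sizes))
```

Under these assumptions the flat count grows with the number of remote nodes, while the hierarchical count grows only with the number of clusters.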

Simulation Result
Simulation Setup
– Two-level network, with intra-cluster link speed of 10 Mbps and inter-cluster link speed of 1 Mbps.
– Communication pattern of the application is random.
– Varying fraction of extra-cluster application messages.
(Flat = Flat Coordinated Checkpointing Protocol)
(Hier = Hierarchical Coordinated Checkpointing Protocol)
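The setup parameters can be made concrete with a small sketch. It is not the authors' simulator. It only shows how the two link speeds translate into per-message transmission time and how a random traffic pattern with a chosen extra-cluster fraction might be generated. The message size, the function names, and the reading of 1 Mbps as 10^6 bits per second are assumptions; propagation delay and queuing are ignored.

```python
import random

INTRA_CLUSTER_BPS = 10_000_000  # 10 Mbps, from the simulation setup
INTER_CLUSTER_BPS = 1_000_000   # 1 Mbps, from the simulation setup


def transfer_time(size_bytes, extra_cluster):
    """Transmission time in seconds for one message (link speed only)."""
    bps = INTER_CLUSTER_BPS if extra_cluster else INTRA_CLUSTER_BPS
    return size_bytes * 8 / bps


def random_messages(count, extra_fraction, rng):
    """Return `count` flags; True marks an extra-cluster message.

    Each message is independently extra-cluster with probability
    `extra_fraction`, a simple stand-in for a random communication pattern.
    """
    return [rng.random() < extra_fraction for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    flags = random_messages(1000, 0.3, rng)
    total = sum(transfer_time(1250, flag) for flag in flags)  # 1250-byte messages (assumed)
    print(f"extra-cluster messages: {sum(flags)} of {len(flags)}")
    print(f"total transmission time: {total:.3f} s")
```

With these link speeds a message of a given size takes ten times longer to transmit between clusters than within one, so the extra-cluster fraction directly controls how much traffic sees the slow link.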

Simulation Result (figure)

Conclusion & Future Work
In a two-level hierarchical network, the hierarchical checkpointing protocol incurs less latency than the flat checkpointing protocol, even for very high communication intensity.
The protocol can be extended to a generic hierarchical network.