Yavor Todorov. Contents: Introduction, How it works, OS level checkpointing, Application level checkpointing, CPR for parallel programming, CPR functionality, References.


Checkpoint/restart (CPR) mechanisms allow a machine that crashes and is subsequently restarted to continue from the most recent checkpoint rather than from the beginning, so that only the work done since that checkpoint has to be repeated. At some firms and supercomputing centers, it's common practice to break up long-running computational programs into several batches. Programs such as gene-sequencing applications search through enormous databases and execute complex algorithms that can take several weeks to complete. While the concept is easy to understand, the technical mechanism to checkpoint and restart an operating system or application is quite complex. Introduction

Checkpointing can occur either within the operating system or at the application level. Most high-end mainframes have automated CPR utilities. CPR at the operating system level saves the state of everything that is being done within a given application at periodic checkpoints and allows the system to restart from the last checkpoint. On very large computers with hundreds or thousands of processes running, saving the entire state of an operating system can take a long time. How it works

It also takes a long time to later restart the machine from that state; on large jobs, it could take several hours. Recovery is delayed because a large amount of data was stored and must be read back, whether or not the application requires that information to fully restart. When a process is checkpointed, its whole process image is saved, including the register set and open file handles. How it works (cont’d)

Text, data, stack, heap, register set values, status of open files, sockets, signals. Process image

"Checkpointing at the operating system is useful but very costly, in that the operating system does not know what data the application really needs to restore it later, so it blindly saves everything," according to James Kasdorf, director of a supercomputing center. At the OS level, CPR saves the system state. That includes unneeded copies of data, program code and system libraries. Most supercomputing centers try to avoid CPR at the OS level. Checkpoint at OS level

Because the whole process image is saved, checkpointing is an expensive operation. It takes more time. It usually needs kernel modifications, since most operating systems, such as Linux, were not built with CPR functionality. Checkpointing at OS level drawbacks
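The gap between "everything in memory" and "what the application really needs" can be illustrated with a toy Python sketch. The `Simulation` class, its field names and the buffer sizes are hypothetical, and pickling an object's attributes only stands in for dumping a real process image; it is not how an OS-level checkpointer works.

```python
import pickle


class Simulation:
    """Toy application; all field names and sizes are hypothetical."""

    def __init__(self, n):
        self.step = 0
        self.positions = [float(i) for i in range(n)]      # essential state
        self.scratch = [0.0] * (50 * n)                    # reusable work buffer
        self.squares = {i: i * i for i in range(10 * n)}   # derived lookup table

    def full_image(self):
        """Everything held in memory, as a blind dump would capture it."""
        return pickle.dumps(self.__dict__)

    def needed_state(self):
        """Only the data the application cannot recompute on restart."""
        return pickle.dumps({"step": self.step, "positions": self.positions})


sim = Simulation(1000)
full, needed = len(sim.full_image()), len(sim.needed_state())
print(f"full image: {full} bytes, needed state: {needed} bytes")
```

The work buffer and the lookup table can be rebuilt after a restart, so saving them only adds checkpoint and recovery time.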

The application uses OS hooks to save the information needed for restart. This is more efficient: it takes less time to checkpoint, and restarting the application is faster. It lets you choose the optimal checkpoint location, which is typically at the end of a loop. Only the needed data gets saved. Checkpoint at application level
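A minimal Python sketch of application-level checkpointing at the end of a loop iteration follows. The function names, the checkpoint interval, the file format (`pickle`) and the simulated crash are all assumptions made for illustration, not part of any particular CPR library.

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # hypothetical interval, in loop iterations


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n, crash_at=None):
    """Sum the squares of 0..n-1, checkpointing at the end of loop iterations."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i * i
        state["next_i"] = i + 1
        # End of the iteration: the state is consistent, so it is safe to save.
        if state["next_i"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a crash resumes from the last saved iteration. Only the loop counter and the running total are saved, which is the "only needed data" idea in miniature.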

Difficult in some cases, e.g. when the application has an open communications channel to an external device or runs on a clustered computer. A distributed application's state is hard to save because program state is changing across multiple nodes. CPR for applications with large memory buffers takes longer. CPR at app level drawbacks

Each process is responsible for taking its own checkpoint. Checkpoint timing is the responsibility of a coordinating process. CPR data includes: in-transit message data, data section, file offsets, signal state, executable information, stack contents and register contents, CPU state, information about open files, and pending signals. The checkpoint file can be stored on either local or global storage. When the program is restarted, each process initiates its own restart. CPR for parallel programs
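The division of labour described here (a coordinator decides when, each process saves and restores its own state) can be sketched in Python with threads standing in for processes. This is a simplified simulation under stated assumptions: the `Coordinator` class, the every-N-rounds policy, the JSON file layout and the barrier-based stop are all hypothetical, and in-transit messages are not modelled.

```python
import json
import os
import threading


class Coordinator:
    """Decides checkpoint timing; invoked once per barrier round."""

    def __init__(self, interval):
        self.interval = interval      # hypothetical policy: every N rounds
        self.round = 0
        self.checkpoint_now = False

    def on_all_stopped(self):
        # Runs while every worker is blocked at the barrier, at the same step.
        self.round += 1
        self.checkpoint_now = (self.round % self.interval == 0)


def worker(rank, steps, barrier, coord, ckpt_dir, results):
    path = os.path.join(ckpt_dir, f"rank{rank}.json")
    state = {"step": 0, "value": 0}
    if os.path.exists(path):          # each worker initiates its own restart
        with open(path) as f:
            state = json.load(f)
    while state["step"] < steps:
        state["value"] += rank + 1    # stand-in for real computation
        state["step"] += 1
        barrier.wait()                # stop all workers at the same point
        if coord.checkpoint_now:      # each worker saves its own checkpoint
            with open(path, "w") as f:
                json.dump(state, f)
    results[rank] = state


def run_job(n_workers, steps, interval, ckpt_dir):
    coord = Coordinator(interval)
    barrier = threading.Barrier(n_workers, action=coord.on_all_stopped)
    results = {}
    threads = [
        threading.Thread(target=worker,
                         args=(r, steps, barrier, coord, ckpt_dir, results))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Running `run_job` a second time on the same directory with a larger `steps` value makes every worker resume from its own file. The sketch assumes all checkpoint files come from the same round; a real system would also have to detect and discard a partially written set.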

All migrating processes have to be stopped at the same time to avoid losing a signal. The socket IP address space has to be taken into consideration (it can be virtualized). I/O speed is pivotal for any CPR process (the faster the better). CPR for parallel programs
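Why I/O speed matters so much can be seen from a back-of-the-envelope estimate. The Python sketch below uses a deliberately simple model (checkpoint time equals total state size divided by aggregate write bandwidth); the job size, bandwidths and hourly interval are hypothetical numbers chosen only for illustration.

```python
def checkpoint_seconds(n_processes, bytes_per_process, bandwidth_bytes_per_s):
    """Time to write every process image at a given aggregate bandwidth."""
    return n_processes * bytes_per_process / bandwidth_bytes_per_s


def overhead_fraction(checkpoint_s, interval_s):
    """Share of wall-clock time spent checkpointing rather than computing."""
    return checkpoint_s / (interval_s + checkpoint_s)


# Hypothetical job: 1000 processes, 2 GB of state each, one checkpoint per hour.
for bandwidth in (10**9, 10**10):  # 1 GB/s and 10 GB/s, both assumed
    t = checkpoint_seconds(1000, 2 * 10**9, bandwidth)
    print(f"{bandwidth / 10**9:.0f} GB/s: {t:.0f} s per checkpoint, "
          f"{overhead_fraction(t, 3600):.1%} overhead")
```

Under these assumptions a tenfold increase in bandwidth cuts each checkpoint from 2000 s to 200 s, and the share of time spent checkpointing falls from roughly a third to about 5%.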

Process migration, load balancing, crash recovery, transaction rollback, job control. CPR functionality

Duell, J. (2005). The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Berkeley: Lawrence Berkeley National Lab.
Litzkow, M., Tannenbaum, T., Basney, J., & Livny, M. (1997). Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System.
Zhong, H., & Nieh, J. (2001). CRAK: Linux Checkpoint/Restart As a Kernel Module. Department of CS, Columbia University.
Sources