Process Introspection: A Checkpoint Mechanism for High Performance Heterogeneous Distributed Systems. University of Virginia. Author: Adam J. Ferrari.
Some Basic Terminology. What is a Process? A process is an instance of a program that is currently executing under an operating system. What does Introspection mean? Introspection means understanding one’s inner self. (Merriam-Webster Online)
Goals of the Process Introspection Project. To construct a checkpoint/restart mechanism for a heterogeneous environment. This mechanism should be: 1. Efficient, 2. Flexible, 3. Most importantly, platform independent.
Heterogeneous Environment. Became popular mainly because of its better price/performance ratio. Some characteristics: 1. A collection of workstations with different operating systems and architectures, connected by a network. 2. Generally used for compute-intensive applications, where idle or lightly loaded workstations cooperate to finish a task, making efficient use of idle time. 3. User-dedicated machines.
Ex: Our Own Department.
Efficient Utilization. To take advantage of these heterogeneous workstations, the following facilities should be available to processes: 1. Process Migration. 2. Load Balancing. 3. Fault Tolerance.
Checkpoint/Restart Mechanism. Two main phases: 1. Save the current running state of a process. 2. Reconstruct the running process from the saved image and resume execution from exactly the point of interruption.
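The two phases can be illustrated with a minimal hand-coded sketch. Everything here is hypothetical and not part of the Process Introspection project: the `SumState` type, the function names, and the choice of decimal text as a platform-neutral image format are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical application state: the data computed so far plus a
   logical point of execution (the next loop iteration to run). */
typedef struct {
    long next_i;
    long sum;
} SumState;

/* Phase 1: save the state as decimal text, which does not depend on
   the host's byte order or integer layout. Returns 0 on success. */
int save_checkpoint(char *buf, size_t cap, const SumState *s) {
    int n = snprintf(buf, cap, "%ld %ld", s->next_i, s->sum);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}

/* Phase 2: rebuild the state from the saved image. Returns 0 on success. */
int restore_checkpoint(const char *buf, SumState *s) {
    return sscanf(buf, "%ld %ld", &s->next_i, &s->sum) == 2 ? 0 : -1;
}

/* Runs iterations [s->next_i, limit), adding each index to the sum. */
void run_until(SumState *s, long limit) {
    for (; s->next_i < limit; s->next_i++)
        s->sum += s->next_i;
}
```

A run interrupted at iteration 50, saved, restored into a fresh `SumState`, and continued to 100 produces the same sum as an uninterrupted run.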
Advantages of using the Checkpoint/Restart Mechanism. Process migration: 1. Distributed load balancing. 2. Efficient resource utilization. Crash recovery and transaction rollback. Useful in system administration. Lowers the programming burden. Running complex simulations or models.
Implementation Challenges/Complexity. Because the computing environment is heterogeneous, the checkpoint/restart mechanism must be platform independent. 1. Capture the state of a running process. 2. Reinstantiate it on a completely different architecture or OS platform, which may have a different instruction set, data format, and address space layout.
Existing Implementations. V migration mechanism: compiler support is used to generate meta information about a process, describing the locations and types of data items to be modified at migration time to mask data representation differences. Disadvantages: 1. Requires kernel support (other examples: MOSIX, Sprite). 2. Requires data to be stored at the same address in all migrated versions. Theimer and Hayes: construct an intermediate source code representation of the running process at the point of migration, and recompile this source at the migration target. Never implemented.
Process Introspection Design. Process + Introspection: the ability of a process to examine and describe its own internal state in a logical, platform-independent format. Extends the technique of hand-coding checkpoint/restart mechanisms into an automated approach.
Components Involved. The Process Introspection Design Pattern. Process Introspection Library (PIL). Automatic Process Introspection Compiler (APrIL). Standard Checkpoint Interface. Central Checkpoint Coordinator.
Process Introspection Design Pattern. A design template for writing checkpointable code. Based on a Process Model.
Adding functionality to the modules. Ability to save/restore threads of control. 1. Poll points (checks for checkpoint requests) are inserted so that call stacks can be saved. Poll point placement is a key performance trade-off. 2. Serving a checkpoint request: each subroutine saves its data and logical point of execution, then returns to its calling subroutine. 3. Restarting a process from a checkpoint: starting from the outermost subroutine that was active at the checkpoint, restore the variables from the checkpoint and use control flow to reach the point of execution recorded in the checkpoint, then call the next subroutine on the checkpointed stack.
Adding functionality to the process (contd.). Ability to save/restore memory blocks. Must handle different data representations and address space layouts on different platforms. For pointers: 1. A raw memory address can’t be saved. 2. A logical description has to be saved instead. High-level descriptors are needed.
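The save and restart steps above can be tried out in a small self-contained simulation. This is a sketch under stated assumptions, not PIL: the in-memory `ckpt` stack, the `mode` flag, the `polls_until_request` countdown that stands in for an asynchronous checkpoint request, and the `inner`/`outer` routines are all hypothetical names invented for this example.

```c
#include <assert.h>

enum { STACK_MAX = 32, LPC_POLL = 1, LPC_CALL = 2 };
enum { RUNNING, SAVING, RESTORING };

/* Hypothetical in-memory checkpoint image: one stack holding saved
   locals and logical program counter (LPC) values. */
typedef struct { long v[STACK_MAX]; int top; } CkptStack;

CkptStack ckpt;
int mode = RUNNING;
long polls_until_request = -1;   /* -1: no checkpoint request pending */

static void push(long x) { assert(ckpt.top < STACK_MAX); ckpt.v[ckpt.top++] = x; }
static long pop(void)    { assert(ckpt.top > 0); return ckpt.v[--ckpt.top]; }

/* Poll point: reports a request once the countdown reaches zero. */
static int poll_requested(void) {
    if (polls_until_request < 0) return 0;
    return polls_until_request-- == 0;
}

/* Innermost routine: sums 0..n-1, polling once per iteration. */
long inner(long n) {
    long i = 0, sum = 0;
    if (mode == RESTORING) {             /* prologue: restore this frame */
        long lpc = pop(); assert(lpc == LPC_POLL); (void)lpc;
        sum = pop(); i = pop();
        mode = RUNNING;                  /* innermost frame: restart done */
    }
    for (; i < n; i++) {
        if (poll_requested()) {          /* save locals and LPC, then unwind */
            push(i); push(sum); push(LPC_POLL);
            mode = SAVING;
            return 0;
        }
        sum += i;
    }
    return sum;
}

/* Calling routine: its only restore point is the call to inner(). */
long outer(long rounds, long n) {
    long r = 0, total = 0;
    if (mode == RESTORING) {
        long lpc = pop(); assert(lpc == LPC_CALL); (void)lpc;
        total = pop(); r = pop();
    }
    for (; r < rounds; r++) {
        long part = inner(n);
        if (mode == SAVING) {            /* callee unwound for a checkpoint */
            push(r); push(total); push(LPC_CALL);
            return -1;
        }
        total += part;
    }
    return total;
}
```

Because the innermost frame saves first and the outermost frame restores first, a single last-in, first-out stack is enough: setting `polls_until_request`, calling `outer` until it unwinds, then setting `mode = RESTORING` and calling `outer` again resumes the computation in the middle of `inner`'s loop.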
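One way to picture a logical pointer description is as a pair (block identifier, element index) that is resolved against whatever addresses the blocks have after restart. The sketch below is illustrative only: the block registry, the `LogicalPtr` type, and all function names are assumptions, and it handles only blocks of `double`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registry of checkpointable blocks of doubles. */
typedef struct { double *base; size_t count; } Block;

/* Logical description: which block, and which element within it.
   An element index is used instead of a byte offset because element
   sizes may differ between platforms. */
typedef struct { int block_id; size_t index; } LogicalPtr;

enum { MAX_BLOCKS = 8 };
static Block blocks[MAX_BLOCKS];
static int n_blocks = 0;

void reset_blocks(void) { n_blocks = 0; }

/* Returns the new block's id, or -1 if the registry is full. */
int register_block(double *base, size_t count) {
    if (n_blocks == MAX_BLOCKS) return -1;
    blocks[n_blocks].base = base;
    blocks[n_blocks].count = count;
    return n_blocks++;
}

/* Save side: turn a raw address into a logical description.
   Returns 0 on success, -1 if p is not inside a registered block. */
int describe_pointer(const double *p, LogicalPtr *out) {
    for (int b = 0; b < n_blocks; b++) {
        /* Compare as integers: relational comparison of pointers into
           different objects is undefined in C. */
        uintptr_t lo = (uintptr_t)blocks[b].base;
        uintptr_t hi = (uintptr_t)(blocks[b].base + blocks[b].count);
        uintptr_t q  = (uintptr_t)p;
        if (q >= lo && q < hi) {
            out->block_id = b;
            out->index = (size_t)(p - blocks[b].base);
            return 0;
        }
    }
    return -1;
}

/* Restart side: blocks are re-registered in the same order, possibly
   at different addresses, and the description is resolved again. */
double *resolve_pointer(LogicalPtr lp) {
    if (lp.block_id < 0 || lp.block_id >= n_blocks) return NULL;
    if (lp.index >= blocks[lp.block_id].count) return NULL;
    return blocks[lp.block_id].base + lp.index;
}
```

The sketch assumes that a restarted process registers its blocks in the same order as the original, so that block identifiers keep their meaning.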
[Figure: component diagram. Labels: APrIL Compiler, Transformed Code, Hand-coded Checkpointable Modules, Process Introspection Library (PIL), Checkpoint.]
Process Introspection Library (PIL). A consistent API for manipulating the elements of a process. Automates and integrates: Thread management. Logical Program Counter Stack. Data format conversion. Checkpoint/restart of statically allocated data. Checkpoint/restart of dynamically allocated data. Pointer analysis/description.
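Data format conversion can be sketched with a single type. The example below writes a 32-bit integer in a fixed byte order so that the checkpoint bytes are the same on little-endian and big-endian hosts. The function names and the choice of big-endian as the canonical order are assumptions for illustration; the slides do not specify the format PIL uses.

```c
#include <stdint.h>

/* Write v into out[0..3] in big-endian order, independent of the
   host's native byte order. */
void put_i32(unsigned char out[4], int32_t v) {
    uint32_t u = (uint32_t)v;
    out[0] = (unsigned char)(u >> 24);
    out[1] = (unsigned char)(u >> 16);
    out[2] = (unsigned char)(u >> 8);
    out[3] = (unsigned char)u;
}

/* Read a big-endian 32-bit value back into the host representation. */
int32_t get_i32(const unsigned char in[4]) {
    uint32_t u = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    /* Map the upper half of the unsigned range to negative values
       without relying on implementation-defined conversion. */
    if (u <= (uint32_t)INT32_MAX) return (int32_t)u;
    return (int32_t)(u - 0x80000000u) + INT32_MIN;
}
```

Shifts and masks operate on values rather than on memory layout, which is why the same source produces the same bytes on every host.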
APrIL: Automatic Process Introspection Compiler. A source code translator. Inserts code to keep the PIL tables updated at run time. Places poll points in the module code, so that the thread executing the module periodically polls for checkpoint requests. During restart, the process must restore all threads of execution.
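Example - Function Prologues

    void example(double *A) {
        int i;
        double temp[100];

        /* Register the stack-allocated array with PIL. */
        PIL_RegisterStackPointer(temp, PIL_Double, 100);

        if (PIL_CheckpointStatus & PIL_StatusRestoreNow) {
            /* Restarting: recover this frame's logical program counter and locals. */
            int PIL_restore_point = PIL_PopLPCValue();
            A = PIL_RestoreStackPointer();
            i = PIL_RestoreStackInt();
            PIL_RestoreStackDoubles(temp, 100);

            /* Jump to the point of execution recorded in the checkpoint. */
            switch (PIL_restore_point) {
            case 1: PIL_DoneRestart(); goto _PIL_PollPt_1;
            case 2: goto _PIL_PollPt_2;   /* call site; no PIL_DoneRestart() here */
            case 3: PIL_DoneRestart(); goto _PIL_PollPt_3;
            }
        }
        /* ... function body, with poll points, follows ... */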
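Example - Poll Points

    _PIL_PollPt_2:
        i = function(A, temp, 100);
    _PIL_PollPt_3:
        if (PIL_CheckpointStatus & PIL_StatusCheckpointNow) {
            if (PIL_CheckpointStatus & PIL_StatusCheckpointInProgress) {
                /* The callee is already unwinding: resume at the call site. */
                PIL_PushLPCValue(2);
            } else {
                /* This frame is the first to see the request: resume at this poll point. */
                PIL_PushLPCValue(3);
                PIL_CheckpointStatus |= PIL_StatusCheckpointInProgress;
            }
            goto _PIL_save_frame_;
        }

    _PIL_save_frame_:
        /* Save this frame's locals, then return to the caller. */
        PIL_SaveStackPointer(A);
        PIL_SaveStackInt(i);
        PIL_SaveStackDoubles(temp, 100);
        return;
    }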
APrIL: Automatic Process Introspection Compiler. [Figure: compilation pipeline. Labels: High Level Language, APrIL, Transformed Code, Back End Compilers, PIL, Binary 1, Binary 2, ..., Binary N.]
Checkpoint Coordination and Module Interfaces. Helps modules interoperate to produce a checkpoint or restart a process. SCI (Standard Checkpoint Interface) events: 1. Process startup: registers any global or data type definitions. 2. Checkpoint start/end: information about the module. 3. Restart: restores the state from the checkpoint.
Judging an implementation. Little or no programmer effort. Convenient programmer interface. Low checkpoint request service latency. Low runtime overhead. Control over the number of checkpoints. Should integrate with the existing environment.
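The event-driven coordination described above can be sketched as a small callback registry. This is a hypothetical illustration, not the actual Standard Checkpoint Interface: the `SciEvent` enumeration, the handler signature, and the `coordinator_*` functions are invented names, and `counting_handler` is a stand-in module that only counts the events it receives.

```c
#include <stddef.h>

/* Hypothetical event set mirroring the events listed on the slide. */
typedef enum {
    SCI_STARTUP,
    SCI_CHECKPOINT_START,
    SCI_CHECKPOINT_END,
    SCI_RESTART,
    SCI_EVENT_COUNT
} SciEvent;

/* Each module supplies one handler plus a pointer to its own state. */
typedef void (*SciHandler)(SciEvent ev, void *module_state);

enum { MAX_MODULES = 8 };
static struct { SciHandler handler; void *state; } modules[MAX_MODULES];
static size_t n_modules = 0;

/* Returns 0 on success, -1 if the handler is NULL or the table is full. */
int coordinator_register(SciHandler h, void *state) {
    if (h == NULL || n_modules == MAX_MODULES) return -1;
    modules[n_modules].handler = h;
    modules[n_modules].state = state;
    n_modules++;
    return 0;
}

/* Deliver one event to every registered module, in registration order. */
void coordinator_dispatch(SciEvent ev) {
    for (size_t m = 0; m < n_modules; m++)
        modules[m].handler(ev, modules[m].state);
}

/* Stand-in module: its state is an int array indexed by event. */
void counting_handler(SciEvent ev, void *module_state) {
    int *counts = module_state;
    counts[ev]++;
}
```

A real module would register its data type definitions on startup and save or restore its state on the checkpoint and restart events; the counter only shows the dispatch path.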
Example Overhead Measurements. [Table: run times in seconds, latencies in milliseconds.]
Project Status. Prototype PIL implemented. Tested across multiple platforms: Solaris, IRIX, AIX, OSF1, Linux, Win95/NT. Example applications demonstrated (e.g. matrix multiply, SOR, sort): hand-coded to use PIL, then checkpointed and restarted across the above platforms. APrIL is under design and construction.
References. 1. The Process Introspection Project. 2. J.S. Plank, M. Beck, G. Kingsley, and K. Li. Transparent Checkpointing under Unix. 3. Hua Zhong and Jason Nieh. CRAK: Linux Checkpoint/Restart As a Kernel Module. (Linux taken as an example to explain the design concepts.)
Thank you. Questions?