CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA
Checkpoints
- Many fault-tolerant systems need to create and recover from some form of checkpoint
- Many systems use "transactions", our main topic next week; these systems can be understood as periodically entering a checkpoint state
- A common method for dealing with failure is simply to restart the program from a checkpoint
- For long-running scientific computing, checkpoint creation often uses compiler techniques; an important issue if your program will run for weeks!
Common approach
- Periodically make a "big" checkpoint
- Then, more frequently, make an incremental addition to it
- For example: the checkpoint could be copies of some files or of a database
- Looking ahead, the incremental data could be "operations run on the database since the last transaction finished (committed)"
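A minimal sketch of this pattern, assuming a toy in-memory key/value store (the `Store` class and its method names are invented for illustration, not taken from any real system); in practice the checkpoint and the operation log would live on stable storage so they survive a crash:

```python
# Sketch of "big checkpoint + incremental log" recovery.
import copy

class Store:
    def __init__(self):
        self.data = {}          # the live "database"
        self.checkpoint = {}    # last full ("big") checkpoint
        self.log = []           # operations applied since that checkpoint

    def apply(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # the incremental record

    def make_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.data)
        self.log = []                   # log restarts from the new checkpoint

    def recover(self):
        # Restart from the big checkpoint, then replay the increments.
        self.data = copy.deepcopy(self.checkpoint)
        for key, value in self.log:
            self.data[key] = value
```

The point of the split is that `make_checkpoint` is expensive and rare, while appending to `log` is cheap and frequent; `recover` pays one replay pass.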
What needs to be in a checkpoint?
- A scientific computing system might have massive data structures while it runs
- But perhaps not all of them need to be checkpointed
- For example, if it uses big but unchanging tables, why write them out?
- In general, a checkpoint only needs to include data that can't be "regenerated" in some simple, quick way
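One way to express "checkpoint only the non-regenerable data", sketched with Python's standard pickle hooks; the `Solver` class and its square table are invented for illustration:

```python
# Sketch: exclude regenerable data from the checkpoint using Python's
# pickle hooks. The lookup table below is big but derivable from n,
# so we rebuild it on restore instead of writing it out.
import pickle

class Solver:
    def __init__(self, n):
        self.n = n
        self.table = self._build_table()   # large but unchanging

    def _build_table(self):
        return [i * i for i in range(self.n)]   # cheap to regenerate

    def __getstate__(self):
        # The checkpoint contains only what can't be regenerated.
        return {"n": self.n}

    def __setstate__(self, state):
        self.n = state["n"]
        self.table = self._build_table()   # regenerate on recovery

blob = pickle.dumps(Solver(1000))   # small: the table is omitted
restored = pickle.loads(blob)
```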
Checkpoints have other uses
- We discussed two styles of air traffic control systems
- Recall that in the French system, normal programs can join "groups" within which data is replicated
- One group per pattern of replication
- Data might be, e.g., "information about ATC sector D-9"
Checkpoints
- Programs r, s, t join the group for ATC sector "D-9"; group members replicate the associated data
- [Timeline diagram: group views D-9_0 = {p,q}, D-9_1 = {p,q,r,s}, D-9_2 = {q,r,s}, D-9_3 = {q,r,s,t}. p makes a checkpoint; r and s request to join, are added, and initialize from it (state transfer); p fails; q makes a checkpoint; t requests to join, is added, and initializes from it (state transfer)]
What needs to be in a state transfer?
- Depends on the situation
- We use state transfer in a group replicating some variable, set of variables, or data structure
- The checkpoint should include the data in that group
- Just like printing the data, or writing it to a file, except that we use an in-memory structure
- In C#, this is called an "in-memory serialization method": the interface is ISerializable, and you can write the code to produce the serialized version yourself
State transfer
So to transfer state:
- Pick an appropriate point in the "timeline" (basically, a cut with respect to incoming multicasts)
- Ask some member of the group to checkpoint; picking one is also called the "leader election" problem (easy solution: the "oldest" current member)
- It writes this checkpoint to a byte stream
- The data is sent to the new member(s)
- They rebuild the data structure from the data as they read it in
We'll see how these subproblems can be solved in lectures over the coming weeks
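The write-to-a-byte-stream / rebuild-on-join steps can be sketched as follows, using Python's pickle in place of C#'s ISerializable (the `Member` class and the sector data are hypothetical):

```python
# Sketch of a state transfer: the chosen member serializes the replicated
# data to a byte stream, and the joiner rebuilds its own copy from it.
import pickle

class Member:
    def __init__(self, state=None):
        self.state = state if state is not None else {}

    def checkpoint_bytes(self):
        return pickle.dumps(self.state)     # "write to a byte stream"

    @classmethod
    def join_from(cls, blob):
        # The new member rebuilds the data structure as it reads it in.
        return cls(pickle.loads(blob))

oldest = Member({"sector": "D-9", "tracks": [101, 102]})
joiner = Member.join_from(oldest.checkpoint_bytes())
```

The joiner ends up with an independent copy of the group's data, which is what replication requires: subsequent multicasts update both replicas separately.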
Still more uses of checkpoints
- In primary-backup systems
- We can stream data to the backup and keep it "warm"
- Or we can create checkpoints and have the backup restart from them
- Sometimes this is done using some form of shared media, like a dual-ported disk; the backup reads the checkpoint when appropriate
Extreme checkpointing
- At the extreme, a checkpoint could include the entire state of a process: its memory "layout", the contents of all pages, and the contents of registers
- Now we can restart the process by simply reloading its entire state
- Windows XP does this for its "hibernate" feature; Linux has a similar feature
- Potentially very fast
Extreme checkpointing
- The worry here is that if a program is "temporarily deterministic" it may:
  - Crash due to a corrupt data structure
  - Roll back and reload that same corrupt structure
  - Crash again
- The advantage of "rebuilding" data structures is that we avoid this risk
Checkpointing limitations
- Coping with input channels: at a minimum we must reopen them and restore the seek position
- Dealing with non-determinism: sources include multi-threading, and applications that receive user input, timer interrupts, I/O from devices, or messages on multiple connections
- Basic concern: what if, after rolling back to a checkpoint, the application doesn't repeat the actions that occurred "last time" the process was in that same state?
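The "reopen channels and restore the seek position" step might look like this minimal sketch (the file name and contents are made up; a real checkpoint would persist the `ckpt` record alongside the process image):

```python
# Sketch: a checkpoint records only a file's name and current offset;
# on recovery we reopen the file and seek back to that offset.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "input.dat")
with open(path, "wb") as f:
    f.write(b"abcdefgh")

f = open(path, "rb")
f.read(4)                               # the process consumes some input
ckpt = {"path": path, "pos": f.tell()}  # what the checkpoint stores
f.close()                               # crash: the descriptor is gone

g = open(ckpt["path"], "rb")            # recovery: reopen the channel...
g.seek(ckpt["pos"])                     # ...and restore the seek position
rest = g.read()                         # reading resumes where we left off
g.close()
```

Note this only works for seekable inputs; a network connection has no seek position, which is exactly why channel logging comes up later in the lecture.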
Problems with checkpoints
- P and Q are interacting; each makes checkpoints now and then
- [Diagram: timelines for p and q, with a request from P to Q and a reply back]
Problems with checkpoints
- Q crashes and rolls back to its checkpoint
Problems with checkpoints
- Q crashes and rolls back to its checkpoint
- It will have "forgotten" the message from P
Problems with checkpoints
- … yet Q may even have replied
- Who would care? Suppose the reply was "OK to release the cash. Account has been debited"
Two related concerns
- First, Q needs to see that request again, so that it will reenter the state in which it sent the reply; we need to regenerate the input request
- But if Q is non-deterministic, it might not repeat those actions even with identical input, so that might not be "enough"
Rollback can leave inconsistency!
- In this example, we see that checkpoints must somehow be coordinated with communication
- If we allow programs to communicate and don't coordinate checkpoints with message passing, the system state becomes inconsistent even if the individual processes are otherwise healthy
More problems with checkpoints
- P crashes and rolls back
More problems with checkpoints
- P crashes and rolls back
- Will P "reissue" the same request? Recall our non-determinism assumption: it might not!
Solution?
- One idea: if a process rolls back, roll others back to a consistent state
- If a message was sent after the checkpoint, roll the receiver back to a state before that message was received
- If a message was received after the checkpoint, roll the sender back to a state prior to sending it
- This assumes channels will be "empty" after doing so
Problems with checkpoints
- Q crashes and rolls back
Problems with checkpoints
- Q crashes and rolls back
- Q is now in a state before the request was received, or the reply was sent
Problems with checkpoints
- P must also roll back
- Now it won't upset us if P happens not to resend the same request
Problems with checkpoints
- But now we can get a cascade effect
Problems with checkpoints
- Q crashes, restarts from its checkpoint…
Problems with checkpoints
- … forcing P to roll back for consistency…
Problems with checkpoints
- … and the new inconsistency forces Q to roll back even further
Problems with checkpoints
- … and the new inconsistency forces P to roll back even further
This is a "cascaded" rollback
- It arises when the creation of checkpoints is uncoordinated with respect to communication
- It can force a system to roll back to its initial state
- Clearly undesirable in the extreme case…
- It could have been avoided in our example if we had a log for the channel from P to Q
Sometimes action is "external" to the system
- Suppose that P is an ATM machine
- It asks: can I give Ken $100?
- Q debits the account and says "OK"
- P gives out the money
- We can't roll P back in this case, since the money is already gone
External actions
- In fact, dealing with external actions is a bit like Sam and Jill's lunch date
- At best we can checkpoint right before issuing cash from the ATM
- We can't get any stronger certainty… so we may have to audit the ATM machine after a nasty crash and rollback
- We won't discuss this more, but keep in mind that the world is full of limits… sigh…
Bigger issue is non-determinism
- P's actions could be tied to something random
- For example, perhaps a timeout caused P to send this message
- After rollback these non-deterministic events might occur in some other order
- This results in different behavior, like not sending that same request… yet Q saw it, acted on it, and even replied!
Issue has two sides
- One involves reconstructing P's message to Q in our examples
  - We don't want P to roll back, since it might not send the same message
  - But if we had a log with P's message in it we would be fine; we could just replay it
- The other is that Q might not send the same response (non-determinism)
  - If Q did send a response and doesn't send the identical one again, we must roll P back
Options?
- One idea is to coordinate the creation of checkpoints and the logging of messages
- In effect, find a point at which we can pause the system
- All processes make a checkpoint in a coordinated way (a "consistent snapshot"), then resume
- Protocols for doing this are well known and isomorphic to consistent cuts
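A deliberately naive version of "pause the system, checkpoint everyone, resume" might look like the sketch below. Real consistent-snapshot protocols (marker-based ones, covered with consistent cuts) avoid the global pause, but the pause-snapshot-resume structure is the simplest form of the idea (the class and function names are invented):

```python
# Naive coordinated checkpointing: stop every process, let each record
# its state, then resume. The pause guarantees no message is in flight
# "across" the snapshot, so the set of checkpoints is consistent.
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.paused = False

    def step(self):
        if not self.paused:
            self.state += 1     # stand-in for real computation

    def checkpoint(self):
        return (self.name, self.state)

def coordinated_checkpoint(procs):
    for p in procs:                           # 1. pause everyone
        p.paused = True
    snap = [p.checkpoint() for p in procs]    # 2. consistent snapshot
    for p in procs:                           # 3. resume
        p.paused = False
    return snap
```

The cost of the pause is why the lecture then asks whether black-box components can be expected to cooperate at all.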
Why isn't this common?
- Often we can't control processes we didn't code ourselves
- Most systems have many black-box components
- We can't expect them to implement the checkpoint/rollback policy
- Hence it isn't really practical to do coordinated checkpointing if it includes system components
Why isn't this common?
- A further concern: not every process can make a checkpoint "on request"
- It might be in the middle of a costly computation that left big data structures around
- Or it might adopt the policy that "I won't do checkpoints while I'm waiting for responses from black-box components"
- This interferes with coordination protocols
Implications?
- Some researchers have studied ensuring that devices, timers, etc., behave identically if we roll a process back and then restart it
- This approach was common in the 1980s; for example, the "Swallow" operating system
- Knowing that programs will re-do identical actions eliminates the need to cascade rollbacks
Implications?
- We must also cope with thread preemption
- It occurs when we use lightweight threads, as in Java or C#
- The thread scheduler might context-switch at times determined by when an interrupt happens
- We must force the same behavior again later, when restarting, or the program could behave differently
- Schneider and Bressoud showed how to do this with a special "microcycle" timer register in hardware, but such a register is not common on modern CPUs
Determinism
- Despite these issues, we often see mechanisms that assume determinism
- Basically they are saying:
  - Either don't use threads, timers, I/O from multiple incoming channels, shared memory, etc.
  - Or use a "determinism-forcing mechanism" like the Schneider/Bressoud idea
With determinism…
- We can revisit the checkpoint/rollback problem and do much better
- It eliminates the need for cascaded rollbacks
- But we do need a way to replay the identical inputs that were received after the checkpoint was made
- This forces us to think about keeping logs of the channels between processes
Three popular options
- Receiver-based logging: log received messages; like an "extension" of the checkpoint
- Sender-based logging: log messages when you send them; this ensures you can resend them if needed
- Mixed mode (Alvisi): does both, optimizing to log where doing so is most efficient (results in the smallest log/overhead)
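Sender-based logging can be sketched as follows: the sender retains each message in its own log so it can replay the ones a rolled-back receiver has forgotten (the `Sender` and `Receiver` classes are invented for illustration; a real implementation also tracks receive sequence numbers so replay starts at the right point):

```python
# Sketch of sender-based logging. The sender's log survives the
# receiver's crash, so the receiver's lost inputs can be regenerated.
class Sender:
    def __init__(self):
        self.log = []               # retained copies of sent messages

    def send(self, receiver, msg):
        self.log.append(msg)        # log before (or while) sending
        receiver.deliver(msg)

    def replay(self, receiver, from_index=0):
        # Re-deliver everything the receiver lost in its rollback.
        for msg in self.log[from_index:]:
            receiver.deliver(msg)

class Receiver:
    def __init__(self):
        self.received = []

    def deliver(self, msg):
        self.received.append(msg)

    def crash_and_rollback(self):
        self.received = []          # roll back to an (empty) checkpoint
```

Receiver-based logging moves `log` to the receiver's side of the channel, so recovery needs no help from the sender; Alvisi's mixed mode picks per-channel whichever is cheaper.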
Why do these work?
- Recall the reason for cascaded rollback: a cascade occurs when Q received a message, then rolls back to "before" that happened
- With a log, Q can regenerate the input and re-read the message
- For the messages Q itself sent, this only works if we have deterministic processes, but often some processes are deterministic even if others aren't
With these varied options
- When Q rolls back, we are safe if:
  - We can re-run Q with identical inputs and Q is deterministic, or
  - Nobody saw messages from Q after the checkpoint state was recorded, or
  - We roll back the receivers of those messages
- An issue: deterministic programs often crash in the identical way if we force identical execution
- But here we have the flexibility to either force identical executions or do a coordinated rollback
Alvisi developed a general theory
- Imagine a set of "dials" for each program:
  - One shows the estimated cost of sender logging
  - Another shows the estimated cost of receiver logging
  - One is a switch: deterministic/non-deterministic
  - A meter gives the "current cost of doing a checkpoint"
- Alvisi's scheme can collect this sort of input and offer choices to the system
- The idea is to mix and match, picking cheap solutions, then do coordinated rollback selectively
Connection to consistent cuts
- A system-wide checkpoint is just a consistent snapshot:
  - Checkpoints for each process, plus
  - A log of the contents of each channel
- The key insight is that we can get this behavior in ways that also exploit optimizations where possible
- In fact the algorithm is extremely similar, and we won't cover it in detail today
Take-aways?
- Fault-tolerant systems often use forms of replication to gain availability:
  - Replicating process state by keeping a checkpoint
  - Replicating messages, by logging at the sender or receiver
  - Perhaps even recording unpredictable things like scheduling decisions, the PC when a thread was preempted, or what a system call returned
- Checkpoint/rollback is best seen as an instance of this broader approach
Take-aways?
- What makes it hard?
  - It can be slow/expensive to checkpoint in some situations
  - Coordinating the actions of processes is often needed, because programs don't live in isolation
- Consistency is an underlying theme
- An outside user shouldn't be able to tell that a restart occurred; it should be "hidden"
Open questions?
- We haven't discussed failure detection
- Discovery of a fault triggers recovery
- But what if our detector makes mistakes? It could, for example, mistake a timeout for evidence of a crash
- In upcoming lectures we'll look at mechanisms a system can use to track its own state and maintain consistency