Recovery-Oriented Computing User Study Training Materials October 2003
Slide 2 Overview Informed consent & Introduction User study scenario & your role Training (20 minutes) Two study sessions (30 minutes each) Wrap-up and questionnaire
Slide 3 Informed Consent Please read the overview of the study and the informed consent form – please feel free to ask any questions you have about the experiment, its goals, its procedures, etc. If you agree to participate in the experiment, please sign the informed consent form
Slide 4 Introduction This study is evaluating new recovery tools – the tools are designed to help system administrators recover from problems affecting server systems You will be playing the role of a system administrator – in each of two sessions, you will be trying to recover an email server system from a pre-existing problem
Slide 5 Introduction (2) In each session, you may (or may not) be given an experimental recovery tool to use We are trying to understand when the tool is useful for you and when it is not – so if you are given the tool, please think carefully about whether or not to use it when you are attempting to recover from a problem » at the end of the session, you will be asked to explain why you chose to use (or not use) the tool
Slide 6 The Scenario
Slide 7 User Study Scenario You are one of several system administrators of an electronic mail (email) service – the administrators work in shifts – the study starts when you arrive for your shift You arrive to find users complaining that the email service is not working – you will be provided with details of the complaint – the failure may be caused by: » failure of the email software, or » an error made by the administrator on the previous shift
Slide 8 User Study Scenario: Your Role Your responsibilities and goals: – restore the email service to normal operation as quickly as possible – minimize the amount of lost email and user work Note: – you should prioritize restoring service over preserving changes made by other administrators
Slide 9 User Study Scenario: Resources Resources you will have: – a log of all actions performed by administrators in previous shifts – a day-old backup of the server’s file systems – the Internet – a test account – a guru » during each session, you may make up to one request for help to the guru Plus any experimental recovery tool that we provide (described later)
Slide 10 Training: Email Server
Slide 11 Overview This study concerns email store servers – email stores receive and store email for their users » users’ mailboxes live on the email store – they do not handle sending or routing of outgoing mail Email stores use two protocols – SMTP: used to deliver incoming email to a mailbox » SMTP is spoken between a remote server that sends the message and the local recipient email store server – IMAP: used to retrieve & manipulate mail in a mailbox » IMAP is spoken between a user’s email client and their local email store server
Slide 12 Server Configuration Mailboxes are text files in /var/mail, e.g. /var/mail/user173 sendmail: process that receives and delivers incoming email imapd: process that provides remote access to mailboxes Mail store configuration files can be found in /etc/mail [Diagram: incoming email arrives from the Internet over SMTP at the sendmail process on the Linux server undovmN.cs.berkeley.edu (N={1,2,3}), which delivers it to the mailboxes in /var/mail/userNNN; users read their email over IMAP through the imapd process]
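Because mailboxes are plain text files, a quick way to see whether a mailbox still holds mail is to count the messages in it. The sketch below assumes the mailboxes use the traditional mbox layout, in which every message begins with a line starting with "From " followed by a space; the slides only say that mailboxes are text files, so treat this as an assumption. The function name count_messages is hypothetical.

```shell
# Count the messages in an mbox-style mailbox file.
# Assumption: each message starts with a line beginning "From " (with a
# trailing space), as in the traditional mbox format.
count_messages() {
    mailbox="$1"
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function's exit status clean.
    grep -c '^From ' "$mailbox" || true
}
```

For example, count_messages /var/mail/user173 would print the number of messages in that user's mailbox if the assumption about the format holds.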
Slide 13 Simple Familiarization Task Take some time to get familiar with the console and the system – by performing a basic task as described below Goals: – ensure sendmail is running – reconfigure server to recognize mail sent to – restart sendmail to activate reconfiguration First step: – connect to undovm3.cs.berkeley.edu with ssh continues...
Slide 14 Simple Familiarization Task (2) Next, check if sendmail is running: – execute the command: ps ax | grep sendmail Reconfigure server to accept new host name: – edit /etc/mail/local-host-names to add the line: roc.cs.berkeley.edu Finally, restart sendmail: – run /etc/init.d/sendmail restart Try this task now!
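The reconfiguration step above can also be done without an interactive editor. The following is a minimal sketch, not part of the study materials: the helper name add_local_host_name is hypothetical, and it assumes the file ends with a newline. It appends the host name only if it is not already listed, so running it twice is harmless.

```shell
# Add HOST to a local-host-names file unless it is already listed.
# Hypothetical helper; assumes the existing file ends with a newline.
add_local_host_name() {
    file="$1"
    host="$2"
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet
    if ! grep -qxF "$host" "$file" 2>/dev/null; then
        printf '%s\n' "$host" >> "$file"
    fi
}

# Usage on the server, following the steps of the task:
#   ps ax | grep sendmail
#   add_local_host_name /etc/mail/local-host-names roc.cs.berkeley.edu
#   /etc/init.d/sendmail restart
```

Editing the file by hand, as the slide describes, achieves the same result; the helper only guards against adding a duplicate line.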
Slide 15 Training: Experimental Recovery Tool
Slide 16 Recovery Tool: an Undo System The undo system can undo administrative changes to the email store, including: – changes to configuration files – software upgrades – deleted or altered files It can be used to restore the server to a previously known-good state – by “rewinding” to a date when the system worked OK The undo system preserves incoming email and user mailbox changes
Slide 17 When Can the Undo System Help? The undo system is useful: – when you cannot tell what is causing a problem » but you know that the system was working at some point in the past – when a problem affects system state » typically, the same cases where restoring a backup would fix the problem It does not help when the problem does not affect state – for example, if a server process (e.g., sendmail) has crashed cleanly without corrupting state
Slide 18 Why Use the Undo System? Unlike using a backup, the undo system also repairs the side effects of problems – example: if a problem caused email to be lost, using undo to fix the problem will restore the lost email » the undo system does this by recording incoming email and users’ mailbox edits, then restoring them during recovery Undo is also useful when you cannot diagnose a problem – simply undo the system to a point in time when it was known to be working
Slide 19 Undo System Operation An undo cycle has two stages: – rewind: the system’s state is reverted to the way it appeared at a past time (the “rewind point”) » all changes to the system made since the rewind point are undone, including changes made by administrators, changes due to software bugs, and incoming email delivery and user mailbox edits – commit: makes the rewind permanent but restores incoming email & user mailbox edits to present time Net effect: undo cycle undoes all changes except incoming email and mailbox edits
Slide 20 Illustration of Undo Cycle [Figure: three timelines (before undo, after rewind, after commit), each showing admin changes and user events (incoming email, mailbox edits) relative to the rewind point. After rewind, all changes later than the rewind point are undone. After commit, the user events are restored; note that admin changes remain undone.]
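The two stages of the undo cycle can be modeled as filters over an event log. This is only an illustrative sketch of the semantics described above, not how the undo system is implemented. It assumes a hypothetical log format of one event per line, "TIME KIND DESCRIPTION", where KIND is either admin or user (changes caused by software bugs would be treated like admin changes). The function names rewind and commit are likewise illustrative.

```shell
# rewind POINT: keep only the events at or before the rewind point.
# Everything later (admin changes and user events alike) is undone.
rewind() {
    awk -v point="$1" '$1 <= point'
}

# commit POINT: keep everything up to the rewind point, and restore the
# user events (incoming email, mailbox edits) that happened after it.
# Admin changes made after the rewind point remain undone.
commit() {
    awk -v point="$1" '$1 <= point || $2 == "user"'
}
```

Both functions read the log on standard input. Given a log with an admin change at time 3 and user events at times 4 and 5, rewinding to time 2 drops all three, and committing brings back only the two user events.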
Slide 21 Controls for the Undo System Rewind: begins an undo cycle – defines a rewind point and undoes all later changes – may cause server to automatically reboot – takes 4 to 5 minutes to execute Commit: completes the undo cycle – makes the rewind permanent » restores incoming email & mailbox edits to present time – takes about 5 minutes to execute Cancel: aborts the undo cycle – restores server to the state it was in before rewinding
Slide 22 Undo System Interface Main window: normal state » time is divided into 5-minute intervals » each interval contains user events like incoming mail » it’s fastest to rewind to a checkpoint [Screenshot labels: timeline (color indicates relative load); current time; checkpoints; current undo status; intervals; intervals containing checkpoints]
Slide 23 Undo System Interface (2) Main window: rewound state [Screenshot labels: current undo status; Commit and Cancel buttons; current time (in the past) indicates undo point; history of undo operations]
Slide 24 Undo System Interface (3) Event window – used to initiate rewind – to view, double-click on an interval in main window [Screenshot labels: selected event (rewind point); current time; click to invoke undo cycle; description of event (here, user170 is examining their mailbox); event sequence #]
Slide 25 Familiarization, Part II Try out the undo system interface – note: actually performing an undo cycle may take 10 or more minutes to complete Familiarize yourself with the various resources available to you during the study – Outlook Express client – the test account: N={1,2,3} – the system backup: /backup – books, documentation, the Internet – guru advice: at most one question per session
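One way to use the system backup during diagnosis is to compare a live file with its day-old copy. The sketch below assumes that the backup mirrors the server's directory layout, so that the backup copy of a file lives at the same path underneath the backup root; the slides give only the location /backup, so this layout is an assumption. The helper name diff_against_backup is hypothetical.

```shell
# Show how a live file differs from its copy in the backup tree.
# Assumption: the backup mirrors the server's directory layout, so the
# backup copy of /etc/mail/foo would be BACKUP_ROOT/etc/mail/foo.
diff_against_backup() {
    backup_root="$1"
    path="$2"   # absolute path of the live file
    # diff exits 0 if the files are identical and 1 if they differ.
    diff -u "$backup_root$path" "$path"
}
```

Under that assumption, diff_against_backup /backup /etc/mail/local-host-names would show any changes made to that file since the backup was taken.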
Slide 26 Resources for More Information in general – About Internet protocols – references: Sendmail – O’Reilly Sendmail book (next to your workstation) – Sendmail home page: – SMTP RFC: IMAPd – IMAP general info: – UW-IMAP home page: – IMAP RFC: