Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates
Ali Ebnenasir and Jean Mayo {aebnenas,
Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
Problem
Teaching Parallel and Distributed Computing (PDC) is hard
– It is even more challenging to teach how PDC systems should be designed in the presence of faults
Objective
Educate undergraduate students on the design, verification, and implementation of Fault-Tolerant Parallel and Distributed Computing (FTPDC).
Outline
Three courses
– Two core courses (geared towards design and verification)
   Model-Driven Software Development (CS4710)
   Software Quality Assurance (CS4712)
– One elective (focused on implementation)
   Operating Systems (CS4411)
Contributions
– Three course modules including lectures, demonstrations, and assignments
DESIGN AND VERIFICATION MODULES
Courses in This Module
Model-Driven Software Development (CS4710)
– Lectures
– Demos
– Assignments
Software Quality Assurance (CS4712)
– Lectures
– Demos
– Assignments
CS4710 Lectures
Models: basic concepts of parallel and distributed architectures (e.g., physical shared-memory multiprocessors and computer clusters)
– Inter-process communication mechanisms
– Non-determinism
– Fairness
Formal Representation: an automata-theoretic formalization of the basic concepts of PDC
Formal Modeling in Promela
– The Process Meta Language (Promela) is the modeling language of the SPIN model checker
– Refines the formalized concepts into Promela models
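To make the non-determinism bullet concrete, here is a minimal sketch in Python (not Promela, and not taken from the course material). It assumes two hypothetical processes that each increment a shared counter as a separate read step and write step, enumerates every interleaving of those steps, and collects the final counter values. The function names are illustrative only.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule of n_a steps of process 0 and n_b steps of process 1."""
    total = n_a + n_b
    for slots in combinations(range(total), n_a):
        chosen = set(slots)
        yield tuple(0 if i in chosen else 1 for i in range(total))


def run(schedule):
    """Execute a non-atomic increment (read, then write) per process."""
    shared = 0
    local = [0, 0]   # each process's private copy of the counter
    pc = [0, 0]      # program counter: 0 = about to read, 1 = about to write
    for pid in schedule:
        if pc[pid] == 0:
            local[pid] = shared        # read step
        else:
            shared = local[pid] + 1    # write step
        pc[pid] += 1
    return shared


def final_values():
    """Set of final counter values over all interleavings."""
    return {run(s) for s in interleavings(2, 2)}


if __name__ == "__main__":
    print(sorted(final_values()))
```

A model checker such as SPIN explores interleavings of this kind exhaustively; the sketch only imitates that for one tiny, hand-coded system.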
CS4710 Lectures - Continued
Properties and Specifications: specify system requirements
– In Hoare logic (e.g., pre-/postconditions, loop invariants)
– In temporal logic (e.g., the temporal operators “always” and “eventually” for safety and liveness specification)
Concurrency in Promela: necessary skills for modeling and analyzing concurrency in
– The shared memory model, and
– The message-passing model (synchronous and asynchronous)
CS4710 Lectures - Continued
Fault Modeling: introduces different types of faults (e.g., transient, permanent)
– Model and analyze the impact of faults on system executions
– Notions of Fault, Error, and Failure (FEF) and their interdependencies
Levels of Fault Tolerance: define three levels depending on the extent to which safety and liveness are met
– Failsafe: always satisfy safety, even when faults occur
– Nonmasking: guarantee recovery; safety is not a priority
– Masking: both failsafe and nonmasking, i.e., safety is always satisfied and recovery is guaranteed
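The three levels can be sketched on a toy transition system. This Python sketch rests on several simplifying assumptions that are not from the lectures: the program is deterministic (one successor per state), safety means "no bad state is reachable in the presence of faults", and recovery means "program steps alone return the system to its invariant". The state names and the label "intolerant" for a system meeting neither requirement are illustrative.

```python
def fault_span(init, program, faults):
    """Return every state reachable from init via program or fault steps."""
    seen = set(init)
    frontier = list(init)
    while frontier:
        s = frontier.pop()
        successors = {program[s]} | {t for (u, t) in faults if u == s}
        for t in successors:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen


def recovers(state, program, invariant):
    """True if program steps alone bring `state` back into the invariant."""
    # A deterministic program over len(program) states either reaches the
    # invariant within that many steps or never does.
    for _ in range(len(program)):
        if state in invariant:
            return True
        state = program[state]
    return state in invariant


def classify(init, program, faults, invariant, bad):
    """Classify the tolerance level under the assumptions stated above."""
    span = fault_span(init, program, faults)
    safe = span.isdisjoint(bad)
    live = all(recovers(s, program, invariant) for s in span)
    if safe and live:
        return "masking"
    if safe:
        return "failsafe"
    if live:
        return "nonmasking"
    return "intolerant"


INVARIANT = {"ok"}
BAD = {"corrupt"}
EXAMPLES = {
    # Fault degrades the system; the program repairs it without harm.
    "masking": ({"ok": "ok", "degraded": "ok"}, {("ok", "degraded")}),
    # Fault halts the system safely; it never resumes.
    "failsafe": ({"ok": "ok", "halted": "halted"}, {("ok", "halted")}),
    # Fault corrupts the state (unsafe), but the program recovers.
    "nonmasking": ({"ok": "ok", "corrupt": "ok"}, {("ok", "corrupt")}),
}

if __name__ == "__main__":
    for name, (program, faults) in EXAMPLES.items():
        print(name, "->", classify({"ok"}, program, faults, INVARIANT, BAD))
```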
CS4712 Lectures
Self-Stabilization: basics of designing and verifying self-stabilization
Design and Verification of Self-Stabilization: history of existing methods for the design and verification of self-stabilizing systems
Algorithmic Design of Self-Stabilization: preliminary background on the automated design of stabilization, using examples and tools developed by the Software Design Group at Michigan Tech
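As one illustration of self-stabilization, the Python sketch below brute-forces Dijkstra's classic K-state token ring for a small instance. This is a textbook example chosen for illustration, not necessarily one used in the lectures, and it is not one of the Software Design Group's tools. The sizes `N = 3` and `K = 3` and all function names are assumptions. The sketch checks closure (legitimate states only lead to legitimate states) and convergence (from every configuration, every schedule reaches a legitimate state).

```python
from itertools import product

N, K = 3, 3  # assumed ring size and number of values per process


def privileged(x):
    """Indices of the processes that hold a privilege (token) in state x."""
    privs = [0] if x[0] == x[-1] else []
    privs += [i for i in range(1, N) if x[i] != x[i - 1]]
    return privs


def step(x, i):
    """State after privileged process i makes its move."""
    y = list(x)
    y[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
    return tuple(y)


def legitimate(x):
    """Legitimate states have exactly one privilege in the ring."""
    return len(privileged(x)) == 1


def successors(x):
    """All states reachable in one move, one per scheduling choice."""
    return [step(x, i) for i in privileged(x)]


def closure_holds():
    """Every move from a legitimate state yields a legitimate state."""
    states = product(range(K), repeat=N)
    return all(legitimate(y) for x in states if legitimate(x) for y in successors(x))


def steps_to_legit(x, memo, on_stack):
    """Longest number of moves before x is forced into a legitimate state."""
    if legitimate(x):
        return 0
    if x in memo:
        return memo[x]
    if x in on_stack:
        raise ValueError("cycle among illegitimate states: no convergence")
    on_stack.add(x)
    worst = 1 + max(steps_to_legit(y, memo, on_stack) for y in successors(x))
    on_stack.discard(x)
    memo[x] = worst
    return worst


def worst_case_steps():
    """Maximum convergence time over all initial states and schedules."""
    memo = {}
    return max(steps_to_legit(x, memo, set()) for x in product(range(K), repeat=N))


if __name__ == "__main__":
    print("closure:", closure_holds())
    print("worst-case moves to stabilize:", worst_case_steps())
```

Exhaustive enumeration like this only scales to tiny instances, which is one motivation for model checking and for automated design methods.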
Assignments
CS4710
– Non-determinism: how non-determinism could result in different paths of execution
– Fairness: how fairness can simplify design
– Safety and liveness: how to specify safety/liveness requirements in LTL and how to verify them in SPIN
– Fault modeling: how to model faults in Promela
– Levels of fault tolerance: how to verify fault tolerance
CS4712
– How to test that a program is correct from any initial state
   A self-stabilizing game
   A maximal matching network protocol that provides a matching from any arbitrary configuration
Demonstrations
Used the model checker SPIN to give demos on the basic concepts
– Non-determinism
– Fairness
– Inter-process communication in shared memory and message-passing models
– Impact of faults
– Fault tolerance
IMPLEMENTATION MODULE
Operating Systems Lectures
Introduction to Distributed Systems
– Models, faults, consistent global states
– Consensus in a synchronous system with fail-stop failures
Common Approaches
– Isolation: approaches, VM example
– Restart: checkpointing, rollback in distributed systems
– Replication: approaches, Triple Modular Redundancy, stateless service and idempotent operations
Reliability Block Diagrams
– Constant component reliability
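For reliability block diagrams with constant component reliability, the standard series, parallel, and k-out-of-n formulas can be sketched in a few lines of Python. The sketch assumes independent component failures and, for Triple Modular Redundancy, a perfectly reliable voter. It is unrelated to the RBDTool mentioned in the assignments, and the function names are illustrative.

```python
from math import comb, prod


def series(*rs):
    """Series block: works only if every component works."""
    return prod(rs)


def parallel(*rs):
    """Parallel block: fails only if every component fails."""
    return 1 - prod(1 - r for r in rs)


def k_of_n(k, n, r):
    """At least k of n identical components (reliability r each) work."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def tmr(r):
    """Triple Modular Redundancy: 2-out-of-3 majority, perfect voter assumed."""
    return k_of_n(2, 3, r)


if __name__ == "__main__":
    r = 0.9
    print("series of two:  ", series(r, r))
    print("parallel of two:", parallel(r, r))
    print("TMR:            ", tmr(r))
```

With identical components, `tmr(r)` equals the closed form 3r² − 2r³, so TMR only improves on a single component when r is above 0.5.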
OS Assignments
Checkpoint
– Students already complete Nachos projects in the course
– Added the implementation and testing of a checkpoint system call
Reliability Block Diagrams
– Traditional problems; developed RBDTool for demonstration and exploration
– Source (and binaries for Windows, MacOS, Linux) at:
Shared Memory Server
– Support operations against a shared memory
– Server must survive one crash failure
OS Experience
Nachos checkpoint assignment was very effective
– Understanding the return into the checkpoint required a “breakthrough”
– Enhanced understanding of binary format, process state, and the difficulty of process migration
Difficult to fit the topics into the course
– Insufficient time for other meaningful programming assignments
– Reliability calculations: familiarity only, limited to simple systems
Otherwise, OS was a good fit
– Enhanced coverage of RAID, virtual machines, and the implementation of processes
Conclusions
We consider this pilot project a success
All curriculum revisions will be retained
Considering how to extend and reorganize the material
– Design and implementation of modules for other courses: Team Software Project, Parallel Programming, and Concurrent Computing
– New course on modeling and analyzing fault-tolerant parallel/distributed computing systems