Published by Beryl Daisy Fleming. Modified over 9 years ago.
1
Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates
Ali Ebnenasir and Jean Mayo
{aebnenas, jmayo}@mtu.edu
Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
2
Problem
Teaching Parallel and Distributed Computing (PDC) is hard.
It is even more challenging to teach how PDC systems should be designed in the presence of faults.
3
Objective
Educate undergraduate students on the design, verification, and implementation of Fault-Tolerant Parallel and Distributed Computing (FTPDC).
4
Outline
Three courses
  – Two core courses (geared towards design and verification)
      Model-Driven Software Development (CS4710)
      Software Quality Assurance (CS4712)
  – One elective (focused on implementation)
      Operating Systems (CS4411)
Contributions
  – Three course modules including lectures, demonstrations and assignments
5
DESIGN AND VERIFICATION MODULES
6
Courses in This Module
Model-Driven Software Development (CS4710)
  – Lectures
  – Demos
  – Assignments
Software Quality Assurance (CS4712)
  – Lectures
  – Demos
  – Assignments
7
CS4710 Lectures
Models: basic concepts of parallel and distributed architectures (e.g., physical shared-memory multiprocessors and computer clusters)
  – Inter-process communication mechanisms
  – Non-determinism
  – Fairness
Formal Representation: an automata-theoretic formalization of the basic concepts of PDC
Formal Modeling in Promela
  – The Process Meta Language (Promela) is the modeling language of the SPIN model checker
  – Refines the basic concepts in Promela
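The non-determinism topic above can be illustrated outside of Promela. The following is a minimal Python sketch, not taken from the course material: two hypothetical processes, P and Q, each increment a shared variable x with a separate read step and write step, and the code enumerates every interleaving that preserves each process's own step order. The process names, the step encoding and the helper functions are all assumptions made for illustration.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)          # schedule slots given to process a
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(n)]

def run(schedule):
    """Execute one schedule and return the final value of the shared variable x."""
    shared = {"x": 0}
    local = {}                          # per-process private copy of x
    for pid, op in schedule:
        if op == "read":
            local[pid] = shared["x"]
        else:                           # "write": store the private copy plus one
            shared["x"] = local[pid] + 1
    return shared["x"]

# Hypothetical processes: each performs x = x + 1 as two separate steps.
P = [("P", "read"), ("P", "write")]
Q = [("Q", "read"), ("Q", "write")]

outcomes = sorted({run(s) for s in interleavings(P, Q)})
print(outcomes)
```

Only the two schedules that run one process to completion before the other yield x = 2; every schedule in which both reads precede both writes loses an update. A model checker such as SPIN explores the same kind of schedule space exhaustively.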
8
CS4710 Lectures - Continued
Properties and Specifications: specify system requirements
  – In Hoare logic (e.g., pre- and postconditions, loop invariants)
  – In temporal logic (e.g., the temporal operators “always” and “eventually” for safety and liveness specifications)
Concurrency in Promela: necessary skills for modeling and analyzing concurrency in
  – The shared memory model
  – The message-passing model (synchronous and asynchronous)
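As a rough illustration of the “always” and “eventually” operators, here is a Python sketch that evaluates them over finite traces. This is a simplification and an assumption of this example: LTL is defined over infinite executions, and SPIN checks all executions of a model rather than a single recorded trace. The state layout (sets named "waiting" and "cs") and the two-process mutual exclusion scenario are hypothetical.

```python
def always(pred, trace):
    """Safety-style check: pred holds in every state of the finite trace."""
    return all(pred(s) for s in trace)

def eventually(pred, trace):
    """Liveness-style check: pred holds in at least one state of the finite trace."""
    return any(pred(s) for s in trace)

def leads_to(p, q, trace):
    """Whenever p holds, q holds at that point or later in the trace."""
    return all(eventually(q, trace[i:]) for i, s in enumerate(trace) if p(s))

# Hypothetical predicates over states of a two-process mutual exclusion protocol.
def mutex(state):
    return len(state["cs"]) <= 1

def waiting(pid):
    return lambda state: pid in state["waiting"]

def in_cs(pid):
    return lambda state: pid in state["cs"]

trace = [
    {"waiting": {"P"}, "cs": set()},
    {"waiting": {"Q"}, "cs": {"P"}},
    {"waiting": {"Q"}, "cs": set()},
    {"waiting": set(), "cs": {"Q"}},
]
bad_trace = trace[:2] + [{"waiting": set(), "cs": {"P", "Q"}}]

print(always(mutex, trace), leads_to(waiting("Q"), in_cs("Q"), trace))
print(always(mutex, bad_trace))
```

The safety requirement corresponds to "always mutex", and the liveness requirement to "waiting leads to entering the critical section".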
9
CS4710 Lectures - Continued
Fault Modeling: introduces different types of faults (e.g., transient, permanent)
  – Model and analyze the impact of faults on system executions
  – Notions of Fault, Error and Failure (FEF) and their interdependencies
Levels of Fault Tolerance: three levels, defined by the extent to which safety and liveness are met in the presence of faults
  – Failsafe: always satisfies safety, even when faults occur
  – Nonmasking: guarantees recovery, but safety may be violated until recovery completes
  – Masking: both failsafe and nonmasking
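The three levels above can be sketched as checks on a small explicit-state model. The Python code below is a simplified illustration and not the course's formal definitions: it assumes safety is given as a set of bad states (rather than a general safety specification), that the invariant is a set of legitimate states, and that program and fault transitions are dictionaries from a state to its successor list. All names are hypothetical.

```python
def reachable(start, *relations):
    """States reachable from `start` using any of the given transition relations."""
    seen, frontier = set(start), list(start)
    while frontier:
        s = frontier.pop()
        for rel in relations:
            for t in rel.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen

def converges(states, program, invariant):
    """True if every program-only path from each state in `states` reaches the invariant."""
    good = set(invariant)
    changed = True
    while changed:
        changed = False
        for s in states - good:
            succ = program.get(s, ())
            # s recovers if it can move and every move leads to a recovering state
            if succ and all(t in good for t in succ):
                good.add(s)
                changed = True
    return states <= good

def classify(program, faults, invariant, bad):
    span = reachable(invariant, program, faults)   # fault span
    failsafe = span.isdisjoint(bad)                # no unsafe state is reachable
    nonmasking = converges(span, program, invariant)
    if failsafe and nonmasking:
        return "masking"
    if failsafe:
        return "failsafe"
    if nonmasking:
        return "nonmasking"
    return "intolerant"

# Hypothetical model: state 0 is legitimate, 1 is perturbed but safe, 2 is unsafe.
recovering = {0: [0], 1: [0], 2: [0]}
stuck = {0: [0], 1: [1], 2: [0]}
print(classify(recovering, {0: [1]}, {0}, {2}))
print(classify(recovering, {0: [2]}, {0}, {2}))
print(classify(stuck, {0: [1]}, {0}, {2}))
```

The first model never reaches the unsafe state and always recovers, the second recovers after passing through the unsafe state, and the third stays safe but never recovers.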
10
CS4712 Lectures
Self-Stabilization: basics of designing and verifying self-stabilization
Design and Verification of Self-Stabilization: history of existing methods for the design and verification of self-stabilizing systems
Algorithmic Design of Self-Stabilization: preliminary background on the automated design of stabilization, using examples and tools developed by the Software Design Group (http://asd.cs.mtu.edu) at Michigan Tech
11
Assignments
CS4710
  – Non-determinism: how non-determinism can result in different paths of execution
  – Fairness: how fairness can simplify design
  – Safety and liveness: how to specify safety/liveness requirements in LTL and how to verify them in SPIN
  – Fault modeling: how to model faults in Promela
  – Levels of fault tolerance: how to verify fault tolerance
CS4712
  – How to test that a program is correct from any initial state?
      A self-stabilizing game
      A maximal matching network protocol that provides a matching from any arbitrary configuration
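To make "correct from any initial state" concrete, the sketch below exhaustively tests a self-stabilizing maximal matching protocol on a tiny network. The slides do not say which protocol the assignment uses; this example assumes the classic Hsu-Huang rules (accept, propose, withdraw), a hypothetical four-node path topology, and a central scheduler that always runs the lowest-numbered enabled node.

```python
from itertools import product

# Hypothetical topology: a path of four nodes, 0 - 1 - 2 - 3.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def next_config(p):
    """Fire one enabled rule at the lowest-numbered enabled node, or return None at a fixpoint."""
    for i in sorted(NEIGHBORS):
        if p[i] is None:
            suitors = [j for j in NEIGHBORS[i] if p[j] == i]
            if suitors:                              # accept a proposal
                return {**p, i: suitors[0]}
            free = [j for j in NEIGHBORS[i] if p[j] is None]
            if free:                                 # propose to a free neighbor
                return {**p, i: free[0]}
        else:
            j = p[i]
            if p[j] is not None and p[j] != i:       # withdraw: j points elsewhere
                return {**p, i: None}
    return None

def stabilize(p, max_steps=1000):
    for _ in range(max_steps):
        q = next_config(p)
        if q is None:
            return p
        p = q
    raise RuntimeError("did not stabilize within max_steps")

def is_maximal_matching(p):
    matched = {i for i, j in p.items() if j is not None and p[j] == i}
    # Every non-null pointer must be reciprocated.
    if any(j is not None and i not in matched for i, j in p.items()):
        return False
    # No edge may join two unmatched nodes.
    return all(i in matched or j in matched for i in NEIGHBORS for j in NEIGHBORS[i])

def all_configs():
    """Every assignment of a pointer (a neighbor or None) to every node."""
    nodes = sorted(NEIGHBORS)
    domains = [[None] + NEIGHBORS[i] for i in nodes]
    for choice in product(*domains):
        yield dict(zip(nodes, choice))

results = [is_maximal_matching(stabilize(c)) for c in all_configs()]
print(all(results), len(results))
```

Exhaustive enumeration only scales to very small networks, which is one motivation for the model checking and algorithmic design techniques covered in the lectures.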
12
Demonstrations
Used the model checker SPIN to give demos on the basic concepts
  – Non-determinism
  – Fairness
  – Inter-process communication in shared memory and message-passing models
  – Impact of faults
  – Fault tolerance
13
IMPLEMENTATION MODULE
14
Operating Systems Lectures
Introduction to Distributed Systems
  – Models, faults, consistent global states
  – Consensus in a synchronous system with fail-stop failures
Common Approaches
  – Isolation: approaches, VM example
  – Restart: checkpointing, rollback in distributed systems
  – Replication: approaches, Triple Modular Redundancy, stateless service and idempotent operations
Reliability Block Diagrams
  – Constant component reliability
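For reliability block diagrams with constant component reliability, the standard series, parallel and Triple Modular Redundancy formulas can be worked through in a few lines of Python. This sketch assumes independent component failures; the function names are illustrative and are not taken from RBDTool or the course material.

```python
from math import prod

def series(*rs):
    """Series block: the system works only if every component works."""
    return prod(rs)

def parallel(*rs):
    """Parallel block: the system works if at least one component works."""
    return 1 - prod(1 - r for r in rs)

def tmr(r, voter=1.0):
    """Triple Modular Redundancy: at least 2 of 3 identical modules work, and so does the voter.

    P(at least 2 of 3) = 3 r^2 (1 - r) + r^3 = 3 r^2 - 2 r^3.
    """
    return voter * (3 * r**2 - 2 * r**3)

r = 0.9
print(round(series(r, r), 4))     # 0.81
print(round(parallel(r, r), 4))   # 0.99
print(round(tmr(r), 4))           # 0.972
```

With a perfect voter, TMR improves on a single module only when the module reliability exceeds 0.5; at r = 0.5 the two are equal, and below that the redundant system is worse.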
15
OS Assignments
Checkpoint
  – Students already complete Nachos projects
  – Added implementation of checkpoint system call and test
Reliability Block Diagrams
  – Traditional problems; developed RBDTool for demonstration and exploration
  – Source (and binaries for Windows, MacOS, Linux) at: http://www.cs.mtu.edu/~pjbonamy/rbdtool.html
Shared Memory Server
  – Support operations against a shared memory
  – Server must survive one crash failure
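The idea behind the checkpoint assignment, saving enough state to resume after a failure, can be sketched at user level in Python. This is not the Nachos system call: the Counter task, the snapshot format and the simulated crash are all hypothetical, and a real checkpoint must also capture registers, the address space and open files.

```python
import copy

class Counter:
    """Hypothetical long-running task: sums the integers 1..n one step at a time."""

    def __init__(self):
        self.state = {"next": 1, "total": 0}

    def step(self):
        self.state["total"] += self.state["next"]
        self.state["next"] += 1

    def checkpoint(self):
        # Deep copy so later steps cannot modify the saved snapshot.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)

def run_with_crash(n, checkpoint_every, crash_at):
    """Sum 1..n, crashing once just before step `crash_at` and rolling back."""
    task = Counter()
    saved = task.checkpoint()
    crashed = False
    while task.state["next"] <= n:
        if not crashed and task.state["next"] == crash_at:
            crashed = True
            task = Counter()        # simulated crash: volatile state is lost
            task.restore(saved)     # roll back to the last checkpoint
            continue
        task.step()
        if (task.state["next"] - 1) % checkpoint_every == 0:
            saved = task.checkpoint()
    return task.state["total"]

print(run_with_crash(10, 3, 8))     # 55
```

The work done between the last checkpoint and the crash is repeated after rollback, which is the usual trade-off between checkpoint frequency and recovery cost.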
16
OS Experience
Nachos checkpoint assignment very effective
  – Understanding return into ckpt required “breakthrough”
  – Enhanced understanding of binary format, process state, difficulty of process migration
Difficult to fit topics into the course
  – Insufficient time for other meaningful programming assignments
  – Familiarity with reliability calculations; only simple systems
Otherwise OS was a good fit
  – Enhanced coverage of RAID, virtual machines and implementation of processes
17
Conclusions
Considered this pilot project a success
All curriculum revisions will be retained
Considering how to extend, reorganize the material
  – Design and implementation of modules for other courses: Team Software Project, Parallel Programming and Concurrent Computing
  – New course on modeling and analyzing fault-tolerant parallel/distributed computing systems