Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates
Ali Ebnenasir and Jean Mayo


1 Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates Ali Ebnenasir and Jean Mayo {aebnenas, jmayo}@mtu.edu Department of Computer Science Michigan Technological University Houghton MI 49931, USA

2 Problem
Teaching Parallel and Distributed Computing (PDC) is hard
Even more challenging when teaching how PDC should be designed in the presence of faults

3 Objective Educate undergraduate students on the design, verification, and implementation of Fault-Tolerant Parallel and Distributed Computing (FTPDC).

4 Outline
Three courses
– Two core courses (geared towards design and verification)
  Model-Driven Software Development (CS4710)
  Software Quality Assurance (CS4712)
– One elective (focused on implementation)
  Operating Systems (CS4411)
Contributions:
– Three course modules including lectures, demonstrations and assignments

5 DESIGN AND VERIFICATION MODULES

6 Courses in This Module
Model-Driven Software Development (CS4710)
– Lectures
– Demos
– Assignments
Software Quality Assurance (CS4712)
– Lectures
– Demos
– Assignments

7 CS4710 Lectures
Models: basic concepts of parallel and distributed architectures (e.g., physical shared memory multiprocessors and computer clusters)
– Inter-process communication mechanisms
– Non-determinism
– Fairness
Formal Representation: an automata-theoretic formalization of the basic concepts of PDC
Formal Modeling in Promela:
– The Process Meta Language (Promela) is the modeling language of the SPIN model checker
– Refines the basic concepts in Promela
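
The non-determinism topic above can be illustrated outside SPIN with a small sketch. It is written in Python rather than Promela, and the two-process read/write program, the names `P`, `Q`, `interleavings`, and `run` are all assumptions made for illustration, not material from the course. The sketch enumerates every order-preserving interleaving of two processes that each perform a non-atomic increment of a shared variable and collects the possible final values.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (process, operation) steps on shared variable x."""
    x = 0
    local = {}  # per-process private copy of x
    for pid, op in schedule:
        if op == "read":
            local[pid] = x          # tmp = x
        else:
            x = local[pid] + 1      # x = tmp + 1
    return x


# Hypothetical processes: each does a non-atomic increment of x.
P = [("P", "read"), ("P", "write")]
Q = [("Q", "read"), ("Q", "write")]

schedules = list(interleavings(P, Q))
outcomes = {run(s) for s in schedules}
print(len(schedules), sorted(outcomes))
```

Different schedules produce different final values of `x`, which is the kind of behavior a model checker explores exhaustively.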

8 CS4710 Lectures - Continued
Properties and Specifications: specify system requirements
– In Hoare logic (e.g., pre-post conditions, loop invariants)
– In temporal logic (e.g., temporal operators “always” and “eventually” for safety and liveness specification)
Concurrency in Promela: necessary skills for modeling and analyzing concurrency in
– The shared memory model, and
– The message-passing model (synchronous and asynchronous)

9 CS4710 Lectures - Continued
Fault Modeling: introduces different types of faults (e.g., transient, permanent)
– Model and analyze the impact of faults on system executions
– Notions of Fault, Error and Failure (FEF) and their interdependencies
Levels of Fault Tolerance: define three levels depending on the extent to which safety and liveness are met
– Failsafe: always satisfy safety, even when faults occur
– Nonmasking: guarantee recovery; safety is not a priority
– Masking: both failsafe and nonmasking
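
The three levels above can be made concrete with a small explicit-state sketch. It is a simplification under stated assumptions, not the course's formalization: states are integers, safety is reduced to "never enter a state in `bad`", recovery means every program-only path reaches the invariant, and all names (`classify`, `reachable`, `converges`, the example systems) are hypothetical.

```python
def reachable(start, edges):
    """States reachable from `start` by following `edges` (state -> successors)."""
    seen, stack = set(start), list(start)
    while stack:
        s = stack.pop()
        for t in edges.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def converges(span, invariant, program):
    """True if every program-only path from `span` is guaranteed to reach `invariant`."""
    good = set(invariant)
    changed = True
    while changed:
        changed = False
        for s in span - good:
            succ = program.get(s, ())
            # s recovers if it can move and all its moves lead to recovering states
            if succ and all(t in good for t in succ):
                good.add(s)
                changed = True
    return span <= good


def classify(invariant, program, faults, bad):
    """Classify a system as masking, failsafe, nonmasking, or intolerant."""
    combined = {s: set(program.get(s, ())) | set(faults.get(s, ()))
                for s in set(program) | set(faults)}
    span = reachable(invariant, combined)       # fault span
    failsafe = not (span & bad)                 # safety holds despite faults
    nonmasking = converges(span, invariant, program)  # recovery is guaranteed
    if failsafe and nonmasking:
        return "masking"
    if failsafe:
        return "failsafe"
    if nonmasking:
        return "nonmasking"
    return "intolerant"


# Hypothetical systems: state 0 is legitimate, a fault moves 0 -> 1, state 2 is unsafe.
faults = {0: [1]}
bad = {2}
level_a = classify({0}, {0: [0], 1: [0]}, faults, bad)          # recovers directly
level_b = classify({0}, {0: [0], 1: [2], 2: [0]}, faults, bad)  # recovers via unsafe state
level_c = classify({0}, {0: [0]}, faults, bad)                  # halts safely in state 1
print(level_a, level_b, level_c)
```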

10 CS4712 Lectures
Self-Stabilization: basics of designing and verifying self-stabilization
Design and Verification of Self-Stabilization: history of existing methods for design and verification of self-stabilizing systems
Algorithmic Design of Self-Stabilization: a preliminary background on automated design of stabilization using examples and tools developed by the Software Design Group (http://asd.cs.mtu.edu) at Michigan Tech
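
As one well-known instance of self-stabilization, the sketch below exhaustively checks Dijkstra's K-state token ring for small sizes. The slides do not say which examples the lectures use, so the choice of this protocol is an assumption, as are the function names. The check assumes a central daemon (one privileged process moves at a time) and verifies two things: the legitimate states (exactly one privilege) are closed, and every execution from every state reaches a legitimate state.

```python
from itertools import product


def privileged(x):
    """Indices of processes that hold a privilege in ring state x."""
    return [i for i in range(len(x))
            if (x[0] == x[-1] if i == 0 else x[i] != x[i - 1])]


def move(x, i, K):
    """State after privileged process i makes its move."""
    y = list(x)
    y[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
    return tuple(y)


def stabilizes(n, K):
    """Check closure and convergence over all K**n states of an n-process ring."""
    states = list(product(range(K), repeat=n))
    succ = {s: {move(s, i, K) for i in privileged(s)} for s in states}
    legit = {s for s in states if len(privileged(s)) == 1}

    # Closure: a legitimate state only moves to legitimate states.
    closed = all(t in legit for s in legit for t in succ[s])

    # Convergence: grow the set of states from which every path reaches `legit`.
    good = set(legit)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s not in good and succ[s] and succ[s] <= good:
                good.add(s)
                changed = True
    return closed and len(good) == len(states)


print(stabilizes(3, 3), stabilizes(4, 4))
```

An exhaustive check like this only scales to tiny rings; it is meant to show what "correct from any initial state" requires, not to replace a model checker.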

11 Assignments
CS4710
– Non-determinism: how non-determinism could result in different paths of execution
– Fairness: how fairness can simplify design
– Safety and liveness: how to specify safety/liveness requirements in LTL and how to verify them in SPIN
– Fault modeling: how to model faults in Promela
– Levels of fault tolerance: how to verify fault tolerance
CS4712
– How to test that a program is correct from any initial state?
  A self-stabilizing game
  Maximal matching network protocol that provides a matching from any arbitrary configuration

12 Demonstrations
Used the model checker SPIN to give demos on the basic concepts
– Non-determinism
– Fairness
– Inter-process communication in shared memory and message-passing models
– Impact of faults
– Fault tolerance

13 IMPLEMENTATION MODULE

14 Operating Systems Lectures
Introduction to Distributed Systems
– Models, faults, consistent global states
– Consensus in a synchronous system with fail-stop failures
Common Approaches
– Isolation: approaches, VM example
– Restart: checkpointing, rollback in distributed systems
– Replication: approaches, Triple Modular Redundancy, stateless service and idempotent operations
Reliability Block Diagrams
– Constant component reliability
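
The reliability block diagram calculations with constant component reliability follow standard formulas, sketched below. The function names are illustrative and unrelated to RBDTool; components are assumed to fail independently, and the Triple Modular Redundancy figure assumes a perfect voter.

```python
from math import comb, prod


def series(*rs):
    """All components must work: R = product of R_i."""
    return prod(rs)


def parallel(*rs):
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    return 1 - prod(1 - r for r in rs)


def k_of_n(k, n, r):
    """At least k of n identical components (reliability r each) must work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))


def tmr(r):
    """Triple Modular Redundancy with a perfect voter: 2-of-3, i.e. 3r^2 - 2r^3."""
    return k_of_n(2, 3, r)


r = 0.9
print(series(r, r), parallel(r, r), tmr(r))
```

With these formulas, TMR improves on a single component only when the component reliability exceeds 0.5; at exactly 0.5 the two are equal.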

15 OS Assignments
Checkpoint
– Students already complete Nachos projects
– Added implementation of checkpoint system call and test
Reliability Block Diagrams
– Traditional problems; developed RBDTool for demonstration and exploration
– Source (and binaries for Windows, MacOS, Linux) at: http://www.cs.mtu.edu/~pjbonamy/rbdtool.html
Shared Memory Server
– Support operations against a shared memory
– Server must survive one crash failure
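
The idea behind checkpoint and restart can be sketched at application level. This is not the Nachos checkpoint system call from the assignment, which saves process state inside the kernel; it is a hypothetical Python illustration in which a computation periodically serializes its own state, a simulated crash interrupts it, and a second run resumes from the last checkpoint. The names `compute`, `Crash`, and `store` are assumptions.

```python
import pickle


class Crash(Exception):
    """Simulated crash failure."""


def compute(n, store, crash_at=None, every=3):
    """Sum 1..n, saving a checkpoint into `store` every `every` steps."""
    # Resume from the last checkpoint if one exists, else start fresh.
    if "ckpt" in store:
        state = pickle.loads(store["ckpt"])
    else:
        state = {"i": 0, "acc": 0}
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise Crash(f"crashed after step {crash_at}")
        state["i"] += 1
        state["acc"] += state["i"]
        if state["i"] % every == 0:
            store["ckpt"] = pickle.dumps(state)  # stands in for stable storage
    return state["acc"]


store = {}
try:
    compute(10, store, crash_at=7)   # work after the last checkpoint is lost
except Crash:
    pass

resumed_from = pickle.loads(store["ckpt"])["i"]
result = compute(10, store)          # restart: roll back, then finish
print(resumed_from, result)
```

The restarted run repeats the step that was completed after the last checkpoint, which is why rollback is safe only when the repeated work has no external side effects or is idempotent.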

16 OS Experience
Nachos checkpoint assignment very effective
– Understanding return into ckpt required “breakthrough”
– Enhanced understanding of binary format, process state, difficulty of process migration
Difficult to fit topics into the course
– Insufficient time for other meaningful programming assignments
– Familiarity with reliability calculations; only simple systems
Otherwise OS was a good fit
– Enhanced coverage of RAID, virtual machines and implementation of processes

17 Conclusions
Considered this pilot project a success
All curriculum revisions will be retained
Considering how to extend, reorganize the material
– Design and implementation of modules for other courses: Team Software Project, Parallel Programming and Concurrent Computing
– New course on modeling and analyzing fault-tolerant parallel/distributed computing systems

