

Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates
Ali Ebnenasir and Jean Mayo
Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA

Problem
– Teaching Parallel and Distributed Computing (PDC) is hard
– It is even more challenging to teach how PDC systems should be designed in the presence of faults

Objective
Educate undergraduate students on the design, verification, and implementation of Fault-Tolerant Parallel and Distributed Computing (FTPDC).

Outline
Three courses
– Two core courses (geared towards design and verification)
   Model-Driven Software Development (CS4710)
   Software Quality Assurance (CS4712)
– One elective (focused on implementation)
   Operating Systems (CS4411)
Contributions
– Three course modules including lectures, demonstrations and assignments

DESIGN AND VERIFICATION MODULES

Courses in This Module
Model-Driven Software Development (CS4710)
– Lectures
– Demos
– Assignments
Software Quality Assurance (CS4712)
– Lectures
– Demos
– Assignments

CS4710 Lectures
Models: basic concepts of parallel and distributed architectures (e.g., physical shared-memory multiprocessors and computer clusters)
– Inter-process communication mechanisms
– Non-determinism
– Fairness
Formal Representation: an automata-theoretic formalization of the basic concepts of PDC
Formal Modeling in Promela:
– The Process Meta Language (Promela) is the modeling language of the SPIN model checker
– The formalized concepts are refined into Promela models
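
The effect of non-determinism on a shared-memory program can be sketched outside SPIN as well. The Python sketch below is an illustration only, not course material: it assumes two hypothetical processes P and Q that each increment a shared counter in two non-atomic steps (read, then write), enumerates every interleaving, and collects the possible final values.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)          # slots of the merged run taken by a
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving of non-atomic increments on a shared counter."""
    shared = 0
    local = {}                           # each process's private copy
    for pid, op in schedule:
        if op == "read":
            local[pid] = shared
        else:                            # "write"
            shared = local[pid] + 1
    return shared

# Hypothetical processes: each performs read-then-write on the same counter.
P = [("P", "read"), ("P", "write")]
Q = [("Q", "read"), ("Q", "write")]

outcomes = {run(s) for s in interleavings(P, Q)}
print(sorted(outcomes))                  # prints [1, 2]
```

Only the two interleavings in which one process finishes before the other starts yield 2; the other four lose an update. A model checker explores the same interleavings automatically.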

CS4710 Lectures - Continued
Properties and Specifications: specify system requirements
– In Hoare logic (e.g., pre-/postconditions, loop invariants)
– In temporal logic (e.g., the temporal operators “always” and “eventually” for safety and liveness specifications)
Concurrency in Promela: necessary skills for modeling and analyzing concurrency in
– The shared memory model, and
– The message-passing model (synchronous and asynchronous)
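
To make “always” and “eventually” concrete, here is a minimal explicit-state sketch in Python. It is not SPIN and not taken from the course; the model (two processes that take turns entering a critical section) and all names are assumptions chosen for illustration. `always` checks a safety property by visiting every reachable state; `eventually` checks a liveness property by looking for a deadlock or a cycle that avoids the goal.

```python
def always(init, succ, p):
    """Safety, "always p": p holds in every state reachable from init."""
    seen, stack = {init}, [init]
    while stack:
        s = stack.pop()
        if not p(s):
            return False
        for t in succ(s):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return True

def eventually(init, succ, p):
    """Liveness, "eventually p": every maximal path from init hits a p-state."""
    verified = set()                 # states from which p is known to be unavoidable
    on_path = set()                  # states on the current depth-first path

    def ok(s):
        if p(s) or s in verified:
            return True
        if s in on_path:             # a cycle that avoids p forever
            return False
        on_path.add(s)
        nxt = succ(s)
        good = bool(nxt) and all(ok(t) for t in nxt)   # a deadlock also fails
        on_path.discard(s)
        if good:
            verified.add(s)
        return good

    return ok(init)

# Hypothetical model: state = (turn, (pc0, pc1)); only the process whose turn
# it is may move. With may_stutter=True it may also idle instead of entering.
INIT = (0, ("idle", "idle"))

def step(state, may_stutter):
    turn, pcs = state
    pcs = list(pcs)
    if pcs[turn] == "idle":
        pcs[turn] = "crit"
        nxt = [(turn, tuple(pcs))]
        if may_stutter:
            nxt.append(state)        # stay idle for one more step
        return nxt
    pcs[turn] = "idle"
    return [(1 - turn, tuple(pcs))]  # leave and hand over the turn

def eager(s):
    return step(s, False)

def lazy(s):
    return step(s, True)

def mutex(s):
    return s[1] != ("crit", "crit")

def p1_critical(s):
    return s[1][1] == "crit"

print(always(INIT, lazy, mutex))            # True
print(eventually(INIT, eager, p1_critical)) # True
print(eventually(INIT, lazy, p1_critical))  # False
```

Mutual exclusion holds in both variants, but the liveness property fails once a process may idle forever. Ruling out such endless stuttering is exactly what a fairness assumption does.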

CS4710 Lectures - Continued
Fault Modeling: introduces different types of faults (e.g., transient, permanent)
– Model and analyze the impact of faults on system executions
– Notions of Fault, Error and Failure (FEF) and their interdependencies
Levels of Fault Tolerance: define three levels depending on the extent to which safety and liveness are met
– Failsafe: always satisfy safety, even when faults occur
– Nonmasking: guarantee recovery; safety is not a priority
– Masking: both failsafe and nonmasking
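
The three levels can be told apart mechanically on a small finite model. The sketch below is a simplified illustration under stated assumptions, not the course's definition: safety is reduced to "no bad state is ever reached", recovery means that once faults stop the program alone always returns to the legitimate states, and the four-state example programs are invented.

```python
def reach(starts, edges):
    """States reachable from starts along edges (a dict: state -> successors)."""
    seen, stack = set(starts), list(starts)
    while stack:
        s = stack.pop()
        for t in edges.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def merge(a, b):
    """Union of two transition relations."""
    out = {}
    for edges in (a, b):
        for s, ts in edges.items():
            out.setdefault(s, set()).update(ts)
    return out

def converges(starts, prog, legit):
    """True if every program-only path from starts is forced into legit."""
    good = set(legit)
    todo = reach(starts, prog) - good
    changed = True
    while changed:
        changed = False
        for s in list(todo):
            succs = prog.get(s, ())
            # s is good once it can move and all of its moves lead to good states
            if succs and all(t in good for t in succs):
                good.add(s)
                todo.discard(s)
                changed = True
    return not todo

def classify(prog, faults, legit, bad):
    span = reach(legit, merge(prog, faults))   # states reachable when faults occur
    failsafe = not (span & bad)                # safety holds despite faults
    nonmasking = converges(span, prog, legit)  # recovery after faults stop
    if failsafe and nonmasking:
        return "masking"
    if failsafe:
        return "failsafe"
    if nonmasking:
        return "nonmasking"
    return "intolerant"

# Hypothetical example: states 0..3, legitimate states {0, 1}, bad state 3,
# and a fault that can knock the program from a legitimate state into state 2.
LEGIT, BAD = {0, 1}, {3}
FAULTS = {0: {2}, 1: {2}}

print(classify({0: {1}, 1: {0}, 2: {0}}, FAULTS, LEGIT, BAD))          # masking
print(classify({0: {1}, 1: {0}}, FAULTS, LEGIT, BAD))                  # failsafe
print(classify({0: {1}, 1: {0}, 2: {3}, 3: {0}}, FAULTS, LEGIT, BAD))  # nonmasking
```

The second program simply halts in state 2, which is safe but never recovers. The third recovers, but only by passing through the bad state.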

CS4712 Lectures
Self-Stabilization: basics of designing and verifying self-stabilization
Design and Verification of Self-Stabilization: history of existing methods for the design and verification of self-stabilizing systems
Algorithmic Design of Self-Stabilization: preliminary background on automated design of stabilization, using examples and tools developed by the Software Design Group at Michigan Tech

Assignments
CS4710
– Non-determinism: how non-determinism could result in different paths of execution
– Fairness: how fairness can simplify design
– Safety and liveness: how to specify safety/liveness requirements in LTL and how to verify them in SPIN
– Fault modeling: how to model faults in Promela
– Levels of fault tolerance: how to verify fault tolerance
CS4712
– How to test that a program is correct from any initial state?
   A self-stabilizing game
   Maximal matching network protocol that provides a matching from any arbitrary configuration
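
"Correct from any initial state" can be checked exhaustively on a small network. The slides do not say which matching protocol the assignment uses, so the sketch below assumes the classic three-rule pointer protocol (accept, propose, withdraw) usually attributed to Hsu and Huang, run one node at a time on a hypothetical 4-node path. It enumerates all 36 configurations and checks that no execution can loop forever and that every final configuration is a maximal matching.

```python
from itertools import product

# Assumed topology: a path 0 - 1 - 2 - 3. Each node holds a pointer to one
# neighbour or None; two nodes pointing at each other are matched.
EDGES = [(0, 1), (1, 2), (2, 3)]
N = 4
NBRS = {i: [b if a == i else a for a, b in EDGES if i in (a, b)] for i in range(N)}

def moves(p):
    """Configurations reachable from p by one move of one enabled node."""
    out = []
    for i in range(N):
        suitors = [j for j in NBRS[i] if p[j] == i]
        if p[i] is None:
            if suitors:
                targets = suitors                                # accept a suitor
            else:
                targets = [j for j in NBRS[i] if p[j] is None]   # propose
        else:
            k = p[p[i]]
            # withdraw if the chosen neighbour points at somebody else
            targets = [None] if k is not None and k != i else []
        for t in targets:
            out.append(p[:i] + (t,) + p[i + 1:])
    return out

def is_maximal_matching(p):
    matched = {i for i in range(N) if p[i] is not None and p[p[i]] == i}
    # maximal: no edge is left with both endpoints unmatched
    return all(a in matched or b in matched for a, b in EDGES)

def all_configs():
    """Every possible (arbitrary, possibly corrupted) initial configuration."""
    return product(*[[None] + NBRS[i] for i in range(N)])

def stabilizes():
    """True if every execution from every configuration ends in a maximal matching."""
    verified, on_path = set(), set()

    def visit(p):
        if p in verified:
            return True
        if p in on_path:                 # a cycle: some execution never converges
            return False
        on_path.add(p)
        nxt = moves(p)
        good = all(visit(q) for q in nxt) if nxt else is_maximal_matching(p)
        on_path.discard(p)
        if good:
            verified.add(p)
        return good

    return all(visit(p) for p in all_configs())

print(stabilizes())
```

Testing from a single "clean" start state would miss exactly the configurations that transient faults produce, which is why the check starts from every configuration.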

Demonstrations
Used the model checker SPIN to give demos on the basic concepts
– Non-determinism
– Fairness
– Inter-process communication in shared memory and message-passing models
– Impact of faults
– Fault tolerance

IMPLEMENTATION MODULE

Operating Systems Lectures
Introduction to Distributed Systems
– Models, faults, consistent global states
– Consensus in a synchronous system with fail-stop failures
Common Approaches
– Isolation: approaches, VM example
– Restart: checkpointing, rollback in distributed systems
– Replication: approaches, Triple Modular Redundancy, stateless service and idempotent operations
Reliability Block Diagrams
– Constant component reliability
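
With constant component reliabilities, reliability block diagrams reduce to a few textbook formulas. The sketch below is illustrative and is not the RBDTool mentioned in the assignments. It assumes independent component failures and, for Triple Modular Redundancy, a voter whose reliability is a parameter (perfect by default); the function names are made up.

```python
from math import comb, prod

def series(*rs):
    """All blocks must work: R = r1 * r2 * ... * rn."""
    return prod(rs)

def parallel(*rs):
    """At least one block must work: R = 1 - (1 - r1)(1 - r2)...(1 - rn)."""
    return 1 - prod(1 - r for r in rs)

def k_out_of_n(k, n, r):
    """At least k of n identical, independent blocks of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))

def tmr(r, voter=1.0):
    """Triple Modular Redundancy: a 2-out-of-3 majority in series with the voter."""
    return series(voter, k_out_of_n(2, 3, r))

print(round(series(0.9, 0.9), 4))     # 0.81
print(round(parallel(0.9, 0.9), 4))   # 0.99
print(round(tmr(0.9), 4))             # 0.972
print(round(tmr(0.4), 4))             # 0.352
```

The last line shows a common exam point: when a single module is less than 50% reliable, triplicating it makes the system worse, not better.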

OS Assignments
Checkpoint
– Students already complete Nachos projects
– Added implementation of checkpoint system call and test
Reliability Block Diagrams
– Traditional problems; developed RBDTool for demonstration and exploration
– Source (and binaries for Windows, MacOS, Linux) at:
Shared Memory Server
– Support operations against a shared memory
– Server must survive one crash failure
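
One way to meet the "survive one crash failure" requirement is to keep two replicas and make operations idempotent, so a client can safely retry. The sketch below is not the assignment's solution: it assumes fail-stop crashes, replaces networked servers with in-process objects, and uses invented class and method names.

```python
class Replica:
    """One copy of the shared memory (hypothetical, in-process stand-in)."""

    def __init__(self):
        self.memory = {}
        self.seen = set()            # request ids that were already applied
        self.alive = True

    def add(self, request_id, address, delta):
        if not self.alive:
            raise ConnectionError("replica has crashed")
        if request_id in self.seen:  # retried request: applying it again would
            return                   # double-count, so ignore the duplicate
        self.memory[address] = self.memory.get(address, 0) + delta
        self.seen.add(request_id)


class ReplicatedMemory:
    """Applies every update to all live replicas; reads from any live one."""

    def __init__(self):
        self.replicas = [Replica(), Replica()]

    def _live(self):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("more than one crash: service lost")
        return live

    def add(self, request_id, address, delta):
        for replica in self._live():
            replica.add(request_id, address, delta)

    def read(self, address):
        return self._live()[0].memory.get(address, 0)


shared = ReplicatedMemory()
shared.add("req-1", "x", 5)
shared.add("req-1", "x", 5)          # client retry of the same request
shared.replicas[0].alive = False     # simulate one crash failure
shared.add("req-2", "x", 1)
print(shared.read("x"))              # prints 6
```

Tagging each request with an id is what turns the non-idempotent "add" into an idempotent operation. The sketch does not cover bringing a crashed replica back, which would need state transfer or a checkpoint.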

OS Experience
Nachos checkpoint assignment very effective
– Understanding return into ckpt required “breakthrough”
– Enhanced understanding of binary format, process state, difficulty of process migration
Difficult to fit topics into the course
– Insufficient time for other meaningful programming assignments
– Familiarity with reliability calculations; only simple systems
Otherwise OS was a good fit
– Enhanced coverage of RAID, virtual machines and implementation of processes

Conclusions
Considered this pilot project a success
All curriculum revisions will be retained
Considering how to extend and reorganize the material
– Design and implementation of modules for other courses: Team Software Project, Parallel Programming and Concurrent Computing
– New course on modeling and analyzing fault-tolerant parallel/distributed computing systems