Lessons from The File Copy Assignment

COMP 150-IDS: Internet Scale Distributed Systems (Fall 2012)
Noah Mendelsohn, Tufts University
Email: noah@cs.tufts.edu
Web: http://www.cs.tufts.edu/~noah

General
- Many students found it harder than they guessed it would be
- Early returns: some students did very well
- So far, nobody handled read nastiness at all
  - Kicks in for filenastiness >= 4
- There are some interesting principles illustrated by this assignment…we'll discuss them briefly here

Non-determinism

- Most of the programs you've written do the same thing for any given input
- The file copy assignment is non-deterministic because of:
  - Timing dependencies (races) between client and server
  - Vagaries of network delivery
  - Nastiness induced by the framework
- Non-deterministic programming is a skill you can practice…but it's inherently more difficult
  - Transient failures are hard to debug
  - Demonstrating correctness is hard
- The solution: clean abstractions with strong invariants you can reason about!
- Even with those, it's still hard!
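A toy illustration of the point above, not the course framework: the sketch below simulates a read that occasionally flips one byte, so the same input yields different results on different calls. The names `nasty_read` and `count_distinct_results`, and the corruption model itself, are assumptions made up for this example.

```python
import random


def nasty_read(data: bytes, corruption_rate: float, rng: random.Random) -> bytes:
    """Return a copy of data, sometimes with one byte flipped.

    Hypothetical stand-in for a misbehaving read; not the framework's API.
    """
    block = bytearray(data)
    if block and rng.random() < corruption_rate:
        block[rng.randrange(len(block))] ^= 0xFF  # flip every bit of one byte
    return bytes(block)


def count_distinct_results(data: bytes, reads: int,
                           corruption_rate: float, rng: random.Random) -> int:
    """Read the same data repeatedly and count how many different answers came back."""
    return len({nasty_read(data, corruption_rate, rng) for _ in range(reads)})
```

With a corruption rate of zero, every read returns the same bytes and the count is 1. With a nonzero rate, repeated reads of identical input start to disagree, which is why a bug may show up in one run and vanish in the next.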

Fail Stop -- A Good Abstraction for Dealing with Hardware and Network Errors

What makes File Copy hard? You don't know when errors have happened
- Wouldn't it be nice to have a system that never did anything wrong?
- You actually count on your computer to do this:
  - How bad would it be if your computer calculated 2+2 = 5 and quietly kept going?
  - Your computer has Error Correcting Codes (ECC) and other cross-checks to give you this property
- You can achieve similar effects in your software designs

Fail Stop – an important abstraction for error handling
- Fail stop: never quietly produce an incorrect result…
  - …when you can't report errors explicitly, guarantee to stop in case of trouble
- Fail stop is a terrific abstraction for building reliable systems
  - Build from parts that stop in case of error
  - Recover at a higher level, by retrying or using redundant parts
- In many cases, forcing fail-stop (or reporting errors explicitly) can make your code cleaner. E.g.:
  - Do checksums and fail the operation in case of a bad checksum
  - Do an operation repeatedly or in parallel, and fail if not all results agree (more complex systems do voting)
- Could you handle read nastiness better if you simulated fail-stop?
  - Never return a block to the rest of your program unless you have high confidence it's correct
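The ideas on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the assignment's solution or the course framework's API: `FailStopError`, `verify_checksum`, `read_block_fail_stop`, `with_retry`, and `make_nasty_reader` are all hypothetical names, SHA-256 is an arbitrary checksum choice, and the "nasty" reader is a simulation invented for the example.

```python
import hashlib
import random


class FailStopError(Exception):
    """Raised instead of handing back data that might be wrong."""


def verify_checksum(data: bytes, expected_hex: str) -> bytes:
    """Return data only if its SHA-256 digest matches; otherwise stop with an error."""
    if hashlib.sha256(data).hexdigest() != expected_hex:
        raise FailStopError("checksum mismatch")
    return data


def read_block_fail_stop(read_block, offset: int, size: int, copies: int = 3) -> bytes:
    """Read the same block several times and fail unless every result agrees."""
    results = [read_block(offset, size) for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise FailStopError(f"reads of block at offset {offset} disagree")
    return results[0]


def with_retry(operation, attempts: int = 10):
    """Higher-level recovery: retry a fail-stop operation, and still fail if it never succeeds."""
    for _ in range(attempts):
        try:
            return operation()
        except FailStopError:
            continue
    raise FailStopError(f"gave up after {attempts} attempts")


def make_nasty_reader(data: bytes, corruption_rate: float, rng: random.Random):
    """Build a simulated reader that sometimes flips one byte (assumption, for demonstration)."""
    def read_block(offset: int, size: int) -> bytes:
        block = bytearray(data[offset:offset + size])
        if block and rng.random() < corruption_rate:
            block[rng.randrange(len(block))] ^= 0xFF
        return bytes(block)
    return read_block
```

A caller might wrap each read as `with_retry(lambda: read_block_fail_stop(reader, 0, 64))`. The low-level part only ever returns a block or stops, so the retry logic lives in one place. Agreement among repeated reads gives high confidence rather than certainty, since identical corruption could in principle occur on every read. An end-to-end checksum is a useful second check.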