Lessons from The File Copy Assignment
COMP 150-IDS: Internet Scale Distributed Systems (Fall 2012)
Noah Mendelsohn
Tufts University
Email: noah@cs.tufts.edu
Web: http://www.cs.tufts.edu/~noah
General
  Many students found the assignment harder than they expected
  Early returns: some students did very well
  So far, nobody has handled read nastiness at all
    It kicks in for filenastiness >= 4
  This assignment illustrates some interesting principles; we’ll discuss them briefly here
Non-determinism
  Most of the programs you’ve written do the same thing every time for any given input
  The file copy assignment is non-deterministic because of:
    Timing dependencies (races) between client and server
    Vagaries of network delivery
    Nastiness induced by the framework
  Non-deterministic programming is a skill you can practice, but it’s inherently more difficult:
    Transient failures are hard to debug
    Demonstrating correctness is hard
  The solution: clean abstractions with strong invariants you can reason about!
    Even with those, it’s still hard!
Fail Stop -- A Good Abstraction for Dealing with Hardware and Network Errors
What makes File Copy hard?
  You don’t know when errors have happened
  Wouldn’t it be nice to have a system that never did anything wrong?
  You actually count on your computer to do this:
    How bad would it be if your computer computed 2+2 = 5 and quietly kept going?
    Your computer has Error Correcting Codes (ECC) and other cross-checks to give you this property
  You can achieve similar effects in your software designs
Fail Stop – an important abstraction for error handling
  Fail stop: never quietly produce an incorrect result…
    …when you can’t report errors explicitly, guarantee to stop in case of trouble
  Fail stop is a terrific abstraction for building reliable systems
    Build from parts that stop in case of error
    Recover at a higher level, by retrying or using redundant parts
  In many cases, forcing fail-stop (or reporting errors explicitly) can make your code cleaner. E.g.:
    Do checksums, and fail the operation in case of a bad checksum
    Do an operation repeatedly or in parallel, and fail if not all results agree (more complex systems do voting)
  Could you handle read nastiness better if you simulated fail-stop?
    Never return a block to the rest of your program unless you have high confidence it’s correct
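The fail-stop ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the assignment framework's API: the names FailStopError, checked_block, read_with_agreement and retry are hypothetical, SHA-1 stands in for whatever checksum you choose, and read_block is assumed to be any zero-argument function that returns the bytes of one block (and may occasionally return garbage).

```python
import hashlib


class FailStopError(Exception):
    """Raised instead of returning data that might be wrong (hypothetical name)."""


def checked_block(data: bytes, expected_sha1: str) -> bytes:
    """Return data only if its SHA-1 digest matches; otherwise stop."""
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        raise FailStopError("checksum mismatch")
    return data


def read_with_agreement(read_block, copies: int = 3) -> bytes:
    """Call read_block() several times and fail unless every result agrees."""
    results = [read_block() for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise FailStopError("reads disagree")
    return results[0]


def retry(operation, max_tries: int = 5):
    """Higher-level recovery: rerun a fail-stop operation a bounded number of times."""
    for attempt in range(max_tries):
        try:
            return operation()
        except FailStopError:
            if attempt == max_tries - 1:
                raise  # still failing: stop loudly rather than guess
```

A caller might write retry(lambda: read_with_agreement(my_read)) so that the rest of the program only ever sees a block on which all reads agreed. Agreement alone cannot catch corruption that repeats identically on every read, which is one reason to combine it with a checksum computed from a trusted source.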