Predictable Computer Systems Remzi Arpaci-Dusseau University of Wisconsin, Madison
Trends
Complexity, cheap components, everything interconnected
Problems
Nothing works as expected: performance, fault tolerance, security
What Would Be Ideal
Ideal: assemble a large-scale system from cheap, complex components, and have the system work in a predictable manner
Key: How Components Interact
State of the art: APIs, protocols
Beyond APIs and Protocols: Understanding “Behavior”
A Small Example: Understanding the Failure Behavior of Local File Systems
Understanding FS Failure. Type-aware fault injection: make the fault-injection layer aware of file-system structures (e.g., make an inode block fail). Why useful: we can infer how the file system reacts to failures at different points in its code.
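The slide does not show how type-aware injection is built, so the following is only a minimal Python sketch under stated assumptions: a toy in-memory block device, a hypothetical fixed layout (block 0 is the superblock, blocks 1 to 15 the inode table, the rest data), and a hypothetical `toy_create` operation standing in for real file-system code. The injector classifies each request by the structure it touches, fails only the chosen type, and logs every request so the file system's reaction can be inferred afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical structure types a type-aware injector can target.
SUPERBLOCK, INODE, DATA = "superblock", "inode", "data"


class DiskIOError(Exception):
    """Raised by the injection layer to emulate a failed block I/O."""


@dataclass
class Disk:
    """A toy in-memory block device."""
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def read(self, n: int) -> bytes:
        return self.blocks.get(n, b"\0")

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = data


@dataclass
class TypeAwareInjector:
    """Sits between the FS and the disk; fails requests for one structure type."""
    disk: Disk
    classify: Callable[[int], str]      # block number -> structure type
    fail_type: Optional[str] = None     # e.g. INODE; None disables injection
    fail_op: str = "write"              # "read" or "write"
    log: List[Tuple[str, int, str]] = field(default_factory=list)

    def _check(self, op: str, n: int) -> None:
        btype = self.classify(n)
        self.log.append((op, n, btype))  # record every request for later analysis
        if op == self.fail_op and btype == self.fail_type:
            raise DiskIOError(f"injected {op} failure on {btype} block {n}")

    def read(self, n: int) -> bytes:
        self._check("read", n)
        return self.disk.read(n)

    def write(self, n: int, data: bytes) -> None:
        self._check("write", n)
        self.disk.write(n, data)


def classify(n: int) -> str:
    # Hypothetical static layout: block 0 superblock, 1-15 inode table, rest data.
    if n == 0:
        return SUPERBLOCK
    return INODE if n < 16 else DATA


def toy_create(dev, ino: int, data_block: int, payload: bytes) -> str:
    """Hypothetical FS operation: write a data block, then the inode pointing at it."""
    dev.write(data_block, payload)
    dev.write(1 + ino, f"inode {ino} -> {data_block}".encode())
    return "ok"


def run_experiment(fail_type: Optional[str]):
    inj = TypeAwareInjector(Disk(), classify, fail_type=fail_type)
    try:
        result = toy_create(inj, ino=3, data_block=20, payload=b"hello")
    except DiskIOError as e:
        result = f"error propagated: {e}"
    return result, inj.log
```

Calling `run_experiment(INODE)` lets the data write succeed and fails the following inode write. The toy code simply lets the error propagate; a real file system might instead ignore it, retry, or halt, and the logged trace of requests is what helps tell those reactions apart.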
Write Errors: Recovery Techniques. Ext3 and JFS don't react to write failures; ReiserFS (almost) always calls panic(). [Figure: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) used by Ext3, ReiserFS, and JFS]
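The five recovery techniques named on the slide can be illustrated as policies applied when a block write fails. This Python sketch is hypothetical throughout: `FlakyDisk`, `handle_write`, and the few lines given to each technique are simplifications for illustration, not how Ext3, ReiserFS, or JFS are implemented. In these terms, the slide's finding is roughly that Ext3 and JFS behave like `Policy.ZERO` on write failures, while ReiserFS mostly behaves like `Policy.STOP`.

```python
import errno
from enum import Enum


class Policy(Enum):
    ZERO = "zero"              # ignore the failure and report success
    STOP = "stop"              # halt the system, as a panic() would
    PROPAGATE = "propagate"    # return an error code to the caller
    RETRY = "retry"            # reissue the failed write
    REDUNDANCY = "redundancy"  # write a redundant copy elsewhere


class WriteFailed(Exception):
    pass


class Panic(Exception):
    pass


class FlakyDisk:
    """Hypothetical disk: writes to blocks in `bad` fail (only once if transient)."""

    def __init__(self, bad=(), transient=False):
        self.bad = set(bad)
        self.transient = transient
        self.blocks = {}

    def write(self, n, data):
        if n in self.bad:
            if self.transient:
                self.bad.discard(n)  # only the first attempt fails
            raise WriteFailed(n)
        self.blocks[n] = data


def handle_write(disk, policy, n, data, replica=None, retries=3):
    """Issue a write and apply one recovery policy if it fails."""
    try:
        disk.write(n, data)
        return 0
    except WriteFailed:
        pass
    if policy is Policy.ZERO:
        return 0  # the failure is silently dropped
    if policy is Policy.STOP:
        raise Panic(f"write to block {n} failed")
    if policy is Policy.RETRY:
        for _ in range(retries):
            try:
                disk.write(n, data)
                return 0
            except WriteFailed:
                continue
    elif policy is Policy.REDUNDANCY and replica is not None:
        replica.write(n, data)
        return 0
    return -errno.EIO  # PROPAGATE, or RETRY/REDUNDANCY that could not recover
```

The sketch makes the trade-off visible: `ZERO` reports success even though the block was never written, `STOP` turns one bad block into a whole-system outage, and only `RETRY` (for transient faults) or `REDUNDANCY` actually gets the data onto stable storage.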
What We Need
Vocabulary + techniques + tools = methods to understand behavior -> predictable computer systems
CSI: Computer Systems Investigation
ADvanced Systems Lab (ADSL): gray-box operating systems and storage systems. Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
ADvanced Systems Lab (ADSL). Who does the real work: Nitin Agrawal, Lakshmi Bairavasundaram, John Bent, Nathan Burnett, Tim Denehy, Camille Fournier, Haryadi Gunawi, Todd Jones, James Nugent, Ina Popovici, Vijayan Prabhakaran, Muthian Sivathanu
Goal: Building Distributed Systems
Large-Scale Distributed Systems [Figure: components include Internet clients, front ends, a DBMS, the network, online storage, and archival storage]
Ideal: Legos. Top side: what you see is what you get