Increasing Intrusion Tolerance Via Scalable Redundancy
Carnegie Mellon University
December 2005 SRS Principal Investigator Meeting
Mike Reiter, Natassa Ailamaki, Greg Ganger, Priya Narasimhan, Chuck Cranor

Technical Objective
To design, prototype, and evaluate new protocols for implementing intrusion-tolerant services that scale better.
- Here, "scale" refers to efficiency as the number of servers and the number of failures tolerated grows.
Targeting three types of services:
- Read-write data objects
- Custom "flat" object types for particular applications, notably directories for implementing an intrusion-tolerant file system
- Arbitrary objects that support object nesting

The Problem Space
Distributed services manage redundant state across servers to tolerate faults.
- We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client: a faulty server or client may behave arbitrarily.
- We also make no timing assumptions in this work, i.e., an "asynchronous" system.
Primary existing practice: replicated state machines.
- Offers no load dispersion, requires data replication, and its message count degrades as the system scales.
- When appropriate, we use Castro & Liskov's BFT system as the point of comparison.

This Talk in Context
- January 2005 PI meeting: focused on the basic read/write (R/W) protocol.
- July 2005 PI meeting: focused on the Q/U protocol for implementing arbitrary "flat" objects.
- This meeting:
  - Discuss "lazy verification" extensions to the R/W protocol
  - Discuss the nested objects protocol

Highlights: Read/Write Response Time
Response time under load is fault-scalable.
- 10 clients and up to 26 storage-nodes
- 2.8 GHz Pentium IV machines used as storage-nodes and clients
- 10 clients, each with 2 requests outstanding
- Mixed workload: equal parts reads and writes
- 4 KB data-item size

Highlights: The Q/U Protocol
- Working set of the experiments fits in server memory
- Tests run for 30 seconds; measurements taken in the middle 10 seconds
- Cluster of Pentium 4 machines, 1 GB RAM
- 1 Gb switched Ethernet, no background traffic

Read/Write Failure Scenarios
Two types of failures:
- Incomplete writes: the client writes data to only a subset of the servers.
- Poisonous writes: the client writes data inconsistently to the servers.
  - A subsequent reader observes different values depending on which subset of servers it interacts with.
  - Replicated data: easy to handle (via hashes). Erasure-coded data: more difficult to handle.
Protocols must verify writes to protect against incomplete and poisonous writes.

The Nature of Write Operations
Insights for protocol design:
1) A single data version forces write-time verification.
   - Versioning servers remove the destructive nature of writes.
2) Obsolescent writes are common in storage systems.
   - Read-time verification avoids unnecessary verifications.
3) Concurrency is low in most workloads.
   - This motivates optimistic concurrency control.

Original Read/Write Protocol
Use versioning servers:
- Frees servers from verifying every write at write time.
- Read-time verification is performed by clients, giving better scalability.
- Avoids verification for obsolescent writes.
- Clients read earlier versions in case of incomplete/poisonous writes.
Optimism is premised on low fault rates and low concurrency.
Supports erasure codes, tolerates Byzantine faults, and assumes asynchrony.
Provides linearizable read/write operations on blocks.
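
To make the read-time verification described above concrete, here is a minimal client-side sketch, assuming hypothetical QuorumTransport and ErasureCodec interfaces and an m-of-n encoding. It illustrates only the fall-back-to-earlier-versions behavior; it is not the actual PASIS read/write implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical client read path with read-time verification (sketch only).
public class ReadClient {

    // A fragment returned by a server, tagged with its logical timestamp.
    public record VersionedFragment(long timestamp, byte[] fragment, byte[] crossChecksum) {}

    public interface QuorumTransport {   // hypothetical messaging layer
        List<VersionedFragment> fetchLatestAtOrBefore(long blockId, long maxTimestamp);
    }

    public interface ErasureCodec {      // hypothetical m-of-n decoder
        byte[] decode(List<byte[]> fragments);
        int m();                         // fragments needed to reconstruct
    }

    private final QuorumTransport servers;
    private final ErasureCodec codec;

    public ReadClient(QuorumTransport servers, ErasureCodec codec) {
        this.servers = servers;
        this.codec = codec;
    }

    public byte[] read(long blockId) {
        long candidate = Long.MAX_VALUE;                 // start from the newest version
        while (true) {
            List<VersionedFragment> replies = servers.fetchLatestAtOrBefore(blockId, candidate);
            long latest = replies.stream()
                    .mapToLong(VersionedFragment::timestamp).max().orElseThrow();
            List<VersionedFragment> atLatest = replies.stream()
                    .filter(r -> r.timestamp() == latest).toList();

            // Read-time verification: enough fragments (not an incomplete write)
            // and all servers agree on the cross checksum (not a poisonous write).
            // A fuller check would also hash each fragment against the checksum.
            boolean complete = atLatest.size() >= codec.m();
            boolean consistent = atLatest.stream()
                    .allMatch(r -> Arrays.equals(r.crossChecksum(), atLatest.get(0).crossChecksum()));
            if (complete && consistent) {
                return codec.decode(atLatest.stream()
                        .map(VersionedFragment::fragment).toList());
            }
            candidate = latest - 1;                      // fall back to an earlier version
        }
    }
}
```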

Example Write and Read
[Timeline: Write 1 completes; Write 2 completes; a subsequent Read returns version 2.]

Tolerating Client Crashes
[Timeline: the client crashes partway through Write 2, leaving servers with different versions; a later Read detects this and repairs it by writing version 2 to the rest of the servers.]

Erasure Coding
Reed-Solomon coding / information dispersal [Rabin 89]
- Each fragment is 1/m of the object size; the total amount of data written is (n/m) x the object size.
[Figure: example of 2-of-5 erasure coding. Write: a 64 KB object is encoded into n = 5 fragments (160 KB written in total). Read: any m = 2 fragments (64 KB) are decoded back into the object.]
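
The space accounting behind the 2-of-5 example is simple enough to show directly. The sketch below is just that arithmetic (fragment size and total bytes written for m-of-n coding), not a Reed-Solomon codec; the class and method names are made up.

```java
// m-of-n erasure-coding space accounting from the slide above:
// each fragment is 1/m of the object, and a write sends n fragments.
public class ErasureCodingMath {

    static long fragmentSize(long objectBytes, int m) {
        // Round up so that m fragments always cover the whole object.
        return (objectBytes + m - 1) / m;
    }

    static long totalBytesWritten(long objectBytes, int m, int n) {
        return n * fragmentSize(objectBytes, m);
    }

    public static void main(String[] args) {
        long object = 64 * 1024;            // 64 KB object
        int m = 2, n = 5;                   // 2-of-5 coding (slide example)
        System.out.println("fragment size = " + fragmentSize(object, m) / 1024 + " KB");          // 32 KB
        System.out.println("bytes written = " + totalBytesWritten(object, m, n) / 1024 + " KB");  // 160 KB
        System.out.println("bytes to read = " + m * fragmentSize(object, m) / 1024 + " KB");      // 64 KB
    }
}
```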

Tolerating Byzantine Servers: Cross Checksum
[Figure: cross checksum for 2-of-3 erasure coding.]
- Generate the fragments.
- Hash each fragment.
- Concatenate the hashes to form the cross checksum.
- Append the cross checksum to each fragment.
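
A minimal sketch of this construction in Java, with SHA-256 as an assumed choice of hash; the fragment generation itself is taken as given, and the class and method names are illustrative.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the cross-checksum construction: hash every fragment,
// concatenate the hashes, and ship the concatenation with each fragment.
public class CrossChecksum {

    public static byte[] build(byte[][] fragments) throws NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] crossChecksum = new byte[fragments.length * 32];
        for (int i = 0; i < fragments.length; i++) {
            byte[] h = sha.digest(fragments[i]);                 // hash fragment i
            System.arraycopy(h, 0, crossChecksum, i * 32, 32);   // concatenate
        }
        return crossChecksum;                                    // appended to every fragment
    }

    // A reader (or a lazily verifying server) can check that fragment i
    // matches its entry in the cross checksum it was shipped with.
    public static boolean verifyFragment(byte[] fragment, int i, byte[] crossChecksum)
            throws NoSuchAlgorithmException {
        byte[] h = MessageDigest.getInstance("SHA-256").digest(fragment);
        for (int j = 0; j < 32; j++) {
            if (crossChecksum[i * 32 + j] != h[j]) return false;
        }
        return true;
    }
}
```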

Tolerating Byzantine Clients
[Figure: a "poisonous" 2-of-3 erasure coding of the value {1,0}. A Byzantine client generates an inconsistent parity fragment, so the value read depends on the set of fragments decoded: {1,0}, {1,1}, or {0,0}.]

Validating Timestamps
- Embed the cross checksum in the logical timestamp.
- Each server validates its own write fragment.
- The client validates the cross checksum on read.
[Figure: generate the fragments and cross checksum, embed the checksum in the logical timestamp; on read, the fragments are validated against the cross checksum.]
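
The sketch below shows one way the "cross checksum embedded in the timestamp" idea could look, reusing the CrossChecksum helper sketched above. The field layout (time, clientId, hash of the cross checksum) is an assumption for illustration, not the protocol's actual wire format.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch: the logical timestamp commits to one specific cross checksum,
// so any server storing a fragment under this timestamp implicitly agrees
// on which fragments belong to the write.
public class LogicalTimestamp {
    final long time;       // ordering component
    final int clientId;    // tie-breaker between concurrent writers
    final byte[] ccHash;   // hash of the cross checksum

    LogicalTimestamp(long time, int clientId, byte[] crossChecksum)
            throws NoSuchAlgorithmException {
        this.time = time;
        this.clientId = clientId;
        this.ccHash = MessageDigest.getInstance("SHA-256").digest(crossChecksum);
    }

    // Server-side check at write time: the shipped cross checksum must match
    // the hash embedded in the timestamp, and my own fragment must match its
    // entry in the cross checksum. The client applies the same test to every
    // fragment it collects at read time.
    boolean serverValidates(byte[] fragment, int myIndex, byte[] crossChecksum)
            throws NoSuchAlgorithmException {
        byte[] h = MessageDigest.getInstance("SHA-256").digest(crossChecksum);
        return Arrays.equals(h, ccHash)
                && CrossChecksum.verifyFragment(fragment, myIndex, crossChecksum);
    }
}
```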

How Do You Get Rid of Old Versions?
Two more pieces complete the picture:
- Garbage collection (GC)
- Handling a potentially unbounded number of incomplete/poisonous writes

Lazy Verification Overview
Servers can perform verification lazily, in idle time:
- Shifts verification cost out of the read/write critical path.
- Allows servers to perform GC.
Per-client, per-block limits on unverified writes:
- Limits the number of incomplete/poisonous writes.
Maintains the good R/W protocol properties:
- Optimism
- Verification elimination for obsolescent writes

Basic Garbage Collection / Verification
Periodically, every server:
- Scans through all blocks.
- Performs a read (acting like a normal client) to:
  - Discover the latest complete write timestamp (LCWTS)
  - Reconstruct the block to check for poisonous writes
- Deletes all versions prior to the LCWTS.
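
A rough sketch of that periodic pass, with BlockStore and VerifyingReader as hypothetical stand-ins for the server's internals:

```java
import java.util.List;

// Periodic GC/verification pass (sketch, not the server implementation).
public class LazyVerifier implements Runnable {

    interface BlockStore {
        List<Long> allBlockIds();
        void deleteVersionsBefore(long blockId, long timestamp);
    }

    interface VerifyingReader {
        // Acts like a normal client: returns the latest complete write
        // timestamp, reconstructing the block to detect poisonous writes.
        long findLatestCompleteWriteTimestamp(long blockId);
    }

    private final BlockStore store;
    private final VerifyingReader reader;

    LazyVerifier(BlockStore store, VerifyingReader reader) {
        this.store = store;
        this.reader = reader;
    }

    @Override
    public void run() {
        for (long blockId : store.allBlockIds()) {
            long lcwts = reader.findLatestCompleteWriteTimestamp(blockId);
            // Every version older than the LCWTS is obsolete and can be
            // garbage-collected without affecting linearizability.
            store.deleteVersionsBefore(blockId, lcwts);
        }
    }
}
```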

Limiting Unverified Writes
The administrator can set limits on the number of unverified writes:
- Per-client, per-block, and per-client-per-block.
- Limit = 1 corresponds to write-time verification; limit = infinity corresponds to read-time verification.
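
One plausible shape for the per-client-per-block accounting; the key scheme and API are assumptions, not the server's actual bookkeeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: count unverified writes per (client, block); once the configured
// limit is reached, the server must verify on demand before accepting more.
public class UnverifiedWriteLimiter {
    private final int perClientPerBlockLimit;                 // admin-configured
    private final Map<String, Integer> unverified = new ConcurrentHashMap<>();

    public UnverifiedWriteLimiter(int perClientPerBlockLimit) {
        this.perClientPerBlockLimit = perClientPerBlockLimit;
    }

    // Returns true if the write may proceed without verification;
    // false means on-demand verification is required first.
    public boolean admitUnverifiedWrite(int clientId, long blockId) {
        String key = clientId + "/" + blockId;
        int count = unverified.merge(key, 1, Integer::sum);
        return count <= perClientPerBlockLimit;
    }

    // Called after the block has been verified / garbage-collected.
    public void clear(int clientId, long blockId) {
        unverified.remove(clientId + "/" + blockId);
    }
}
```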

Scheduling
Background:
- In idle time (hence the "lazy" in lazy verification).
On-demand:
- Verification limits reached.
- Low free space in the history pool (cache or disk).

Block Selection
If verification is invoked because a limit was exceeded:
- No choice; verify that client's block.
Else:
- Verify the block with the most versions, to maximize amortization of the verification cost.
- Prefer to verify blocks in cache:
  - No unnecessary disk write
  - No read to start verification
  - No cleaning of on-disk version structures
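
A small sketch of this selection policy; the ordering (cached blocks first, then most unverified versions) is one plausible reading of the slide, and BlockInfo is an illustrative type.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Pick the next block to verify when verification is not forced by a limit.
public class BlockSelector {

    public record BlockInfo(long blockId, boolean inCache, int unverifiedVersions) {}

    public static Optional<BlockInfo> pickNext(Collection<BlockInfo> candidates) {
        return candidates.stream().max(
                Comparator.comparing(BlockInfo::inCache)                      // cached first
                          .thenComparingInt(BlockInfo::unverifiedVersions));  // then most versions
    }
}
```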

Server Cooperation
Simple approach: every server independently verifies every block, costing ~n^2 messages.
[Figure: each of the n servers issues its own read requests and collects read replies.]

Server Cooperation (cont'd)
Cooperation: b+1 servers perform the verification and share the result, costing ~b*n messages.
[Figure: example with b = 1; the verifying servers issue read requests, collect read replies, and send verification hints to the other servers.]
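
The message-count comparison is simple arithmetic; the sketch below just evaluates the two expressions from these slides for a given n and b.

```java
// ~n^2 messages when every server verifies independently, versus
// ~(b+1)*n when b+1 verifiers share their result with the rest.
public class VerificationMessageCost {

    static long independent(int n) { return (long) n * n; }                 // n servers each read from n
    static long cooperative(int n, int b) { return (long) (b + 1) * n; }    // b+1 verifiers read from n

    public static void main(String[] args) {
        int b = 1, n = 4 * b + 1;                                           // n = 4b + 1, as in the experiments
        System.out.println("independent: ~" + independent(n) + " messages");     // ~25
        System.out.println("cooperative: ~" + cooperative(n, b) + " messages");  // ~10
    }
}
```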

Experimental Setup
- 2.8 GHz Pentium 4 machines used as servers and clients
- 1 Gb switched Ethernet, no background traffic
- In-cache only (to evaluate protocol cost)
- 16 KB blocks
- Vary the number of server Byzantine failures (b): n = 4b + 1, with (b+1)-of-n encoding for maximal storage and network efficiency

Response Time Experiment
- 1 client, 1 outstanding request
- Vary b from 1 to 5 to investigate how response times change as more server failures are tolerated
- Alternate between reads and writes
- Idle time: 10 ms between operations, allowing verification to occur in the background

Write Response Time
[Graphs: write response time as b varies.]

Read Response Time
[Graphs: read response time as b varies.]

Write Throughput
- b = 1, so n = 5
- 4 clients, 8 outstanding requests each; no idle time
- Server working set: 4096 blocks (64 MB)
- 100% writes, in-cache only; a full history pool triggers lazy verification
- Vary the server history pool size to see the effect of delaying verification

Write Throughput (cont'd)
[Graphs: write throughput as the history pool size varies.]

Nested Objects
Goal: support nested method invocations among Byzantine fault-tolerant, replicated objects that are accessed via quorum systems.
- Semantics and programmer interface are modeled after Java Remote Method Invocation (RMI).
- Distributed objects can be:
  - Passed as parameters to method calls on other distributed objects
  - Returned from method calls on other distributed objects

Java Remote Method Invocation (RMI)
- Standard Java mechanism to invoke methods on objects in other JVMs.
- Local interactions are with a handle (stub) that implements the interfaces of the remote object.
[Figure: a local client calls through a handle; the invocation travels to the remote object on the remote server, and the response returns.]
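
Since the nested-object semantics are modeled on RMI, a plain (non-replicated) RMI example helps fix the vocabulary. The RMI calls below (UnicastRemoteObject.exportObject, LocateRegistry.createRegistry, Registry.rebind) are the standard API; the Directory and FileObject interfaces are made-up examples.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Handles to remote objects can be returned from, or passed into,
// remote method invocations.
interface FileObject extends Remote {
    byte[] read() throws RemoteException;
}

interface Directory extends Remote {
    FileObject lookup(String name) throws RemoteException;        // returns a handle
    void link(String name, FileObject file) throws RemoteException; // takes a handle
}

class InMemoryDirectory implements Directory {
    private final java.util.Map<String, FileObject> entries = new java.util.HashMap<>();
    public FileObject lookup(String name) { return entries.get(name); }
    public void link(String name, FileObject file) { entries.put(name, file); }
}

class DirectoryServer {
    public static void main(String[] args) throws Exception {
        Directory impl = new InMemoryDirectory();
        // Export the object and obtain its stub (the "handle").
        Directory stub = (Directory) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("rootDir", stub);                          // clients look this name up
    }
}
```

A client would obtain the handle via LocateRegistry.getRegistry(host) and registry.lookup("rootDir"), then call lookup and link on it as if it were local; the FileObject it gets back is itself a handle, which is exactly the nesting case the following slides generalize to replicated objects.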

RMI: Nested Method Invocations
- Handles can be passed as parameters into method invocations on other remote objects.
- A method invocation on one remote object can therefore result in a method invocation on other remote objects.

RMI: Handle Returned
Handles can be returned from method invocations on other remote objects.

Replicated Objects
- Replicas behave as a single logical object.
- Can withstand the Byzantine (arbitrary) failure of up to b servers.
- Scales linearly with the number of servers.
[Figure: a handle fans an invocation out to replicas A, B, C, D.]

Quorum Systems
- Given a universe of n servers, a quorum system is a set of subsets (quorums) of the universe, every pair of which intersect.
- Scales well as a function of n, since quorum size can be significantly smaller than n.
- Example: a grid with n = 144, where one quorum = 1 row + 1 column.
[Figure: two grid quorums q1 and q2, each one row plus one column, intersecting.]
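
For the grid construction, the quorum size can be computed directly: arranging n = k x k servers in a k-by-k grid, one row plus one column contains 2k - 1 servers (the cell where they cross is counted once), and any two such quorums share at least one server. A tiny sketch:

```java
// Grid quorum size for n = k*k servers: one row plus one column = 2k - 1.
public class GridQuorum {

    static int quorumSize(int n) {
        int k = (int) Math.round(Math.sqrt(n));
        if (k * k != n) throw new IllegalArgumentException("n must be a perfect square");
        return 2 * k - 1;
    }

    public static void main(String[] args) {
        System.out.println(quorumSize(144));   // 23 servers out of 144
    }
}
```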

Byzantine Quorum Systems
- Extend quorum systems to withstand the Byzantine failure of up to b servers.
- Every pair of quorums intersects in >= 2b+1 servers (i.e., >= b+1 correct servers).
- A new quorum must be selected if a response is not received from every server in a quorum.
- Example: a grid with n = 144 and b = 3, where one quorum = 2 rows + 2 columns.
[Figure: two such quorums q1 and q2 intersecting in the grid.]
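
A quick sanity check of the intersection bound for the grid example; the worst case assumed here is two quorums that share no rows or columns, so they meet only where one quorum's rows cross the other's columns.

```java
// Byzantine grid quorums built from r rows + c columns of a k-by-k grid.
public class ByzantineGridQuorum {

    // Minimum pairwise intersection, assuming the two quorums share no rows
    // or columns: r*c cells where q1's rows cross q2's columns, plus r*c
    // cells where q2's rows cross q1's columns.
    static int minIntersection(int r, int c) {
        return 2 * r * c;
    }

    static boolean masksFaults(int r, int c, int b) {
        return minIntersection(r, c) >= 2 * b + 1;
    }

    public static void main(String[] args) {
        // n = 144 (12x12 grid), quorum = 2 rows + 2 columns, b = 3:
        System.out.println(masksFaults(2, 2, 3));   // true: 8 >= 7
        // The plain 1-row + 1-column quorum is not enough for even b = 1:
        System.out.println(masksFaults(1, 1, 1));   // false: 2 < 3
    }
}
```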

Byzantine Quorum-Replicated Objects
- Method invocations are sent to a quorum of replicas.
- A response returned identically by >= b+1 servers must be correct, since at least one of them is non-faulty.
[Figure: a handle invoking a quorum of replicas A, B, C, D and collecting responses.]
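
Client-side acceptance of a reply can then be a simple vote: a value is trusted once b+1 replicas have returned it identically. A generic sketch, where the response type V is assumed to have a meaningful equals/hashCode (e.g., a canonical serialized form):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Accept a response only after b+1 identical copies have arrived; with at
// most b faulty replicas, at least one of those b+1 senders is correct.
public class ResponseVoter<V> {
    private final int b;                              // max Byzantine replicas
    private final Map<V, Integer> tally = new HashMap<>();

    public ResponseVoter(int b) { this.b = b; }

    // Feed in each replica's response; returns the accepted value once the
    // b+1 threshold is reached, empty otherwise.
    public Optional<V> offer(V response) {
        int count = tally.merge(response, 1, Integer::sum);
        return count >= b + 1 ? Optional.of(response) : Optional.empty();
    }
}
```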

Nested Method Invocations
Handles can be passed as parameters into method invocations on other distributed objects.
[Figure: the replicas of one object invoking a second replicated object via a passed-in handle.]

Handle Returned
Handles can be returned from method invocations on other distributed objects.

Necessity of Authorization
- Faulty replicas can invoke unauthorized methods.
- Correct replicas might perform duplicate invocations.

Authorization Framework Requirements
Method invocation authority can be delegated:
- Explicitly, to other clients
- Implicitly, to other distributed objects:
  - A handle passed as a parameter to a method invocation on a second object
  - A handle returned from a method invocation on a second object
- Must support arbitrary nesting depths

Authorization Framework
[Figure: each replica i holds its own private/public key pair, and a certificate states that b+1 of the replicas' keys collectively speak for the replicated object.]

Operation Ordering Protocol
- Worst-case 4-round protocol: Get, Suggest, Propose, Commit.
- Extends the protocol previously used in Fleet [Chockler et al. 2001].
- Operations are applied in batches, increasing throughput.

Operation Ordering Protocol: Client Side
The fundamental challenge is the absence of a single trusted client (a trusted client could simply order all operations). Instead, a single untrusted client replica drives the protocol.
The driving client:
- Acts as a point of centralization to distribute authenticated server messages
- Makes no protocol decisions
- Is unable to cause correct servers to take conflicting actions
- Can be unilaterally replaced by another client replica when necessary

Experimental Setup
- Implemented object nesting as an extension of Fleet
- Pentium 4 2.8 GHz processors
- 1000 Mbps Ethernet (TCP, not multicast)
- Linux, Java HotSpot Server VM
- Native Crypto++ library for key generation, signing, and verification

Latency for Non-Nested Invocation

A Real Byzantine Fault

Impediments to Dramatic Increases
Impossibility results:
- Load dispersion across quorums
- Round complexity of protocols
Strong consistency conditions:
- Weakening consistency is one place to look for big improvements.