Low-Overhead Byzantine Fault-Tolerant Storage
James Hendricks, Gregory R. Ganger (Carnegie Mellon University), Michael K. Reiter (University of North Carolina at Chapel Hill)

Slide 2: Motivation
- As systems grow in size and complexity, they must tolerate more faults, and more types of faults
- Modern storage systems take an ad-hoc approach: it is not clear which faults to tolerate
- Instead: tolerate arbitrary (Byzantine) faults
- But isn't Byzantine fault tolerance expensive? (Fast reads, slow large writes)

Slide 3: Write bandwidth
[Chart: bandwidth (MB/s) vs. number of faults tolerated (f), comparing crash fault-tolerant (non-Byzantine) erasure-coded storage, replicated Byzantine fault-tolerant storage, and low-overhead erasure-coded Byzantine fault-tolerant storage; annotations: f+1-of-2f+1 erasure-coded, f+1 replicas, 3f+1 replicas.]

Slide 4: Summary of Results
We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol
- Write overhead: 2 rounds plus a cryptographic checksum
- Read overhead: cryptographic checksum
Performance of our Byzantine-tolerant protocol nearly matches that of a protocol that tolerates only crashes
- Within 10% for large enough requests

Slide 5: Erasure codes
An m-of-n erasure code encodes a block B into n fragments, each of size |B|/m, such that any m fragments can be used to reconstruct B.
Example (3-of-5 erasure code):
- d1, …, d5 ← encode(B)
- Read d1, d2, d3: B ← decode(d1, d2, d3)
- Read d1, d3, d5: B ← decode(d1, d3, d5)
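
To make the m-of-n interface concrete, here is a toy 2-of-3 code built from a single XOR parity (the RAID-5 special case). It is purely illustrative: real m-of-n deployments use codes such as Reed-Solomon, and the function names here are mine, not the paper's.

```python
# Toy 2-of-3 erasure code (single XOR parity) illustrating the m-of-n interface.
def encode(block: bytes):
    """Split a block into 2 data fragments plus 1 XOR parity fragment."""
    assert len(block) % 2 == 0
    half = len(block) // 2
    d1, d2 = block[:half], block[half:]
    d3 = bytes(a ^ b for a, b in zip(d1, d2))      # parity fragment
    return [d1, d2, d3]

def decode(frag_a, frag_b):
    """Reconstruct the block from any 2 of the 3 fragments.
    Each argument is an (index, data) pair with index in {0, 1, 2}."""
    (i, x), (j, y) = sorted([frag_a, frag_b])
    if (i, j) == (0, 1):                           # both data fragments survived
        return x + y
    lost = bytes(a ^ b for a, b in zip(x, y))      # XOR against parity recovers the lost half
    return x + lost if (i, j) == (0, 2) else lost + x

B = b"hello world!"
d1, d2, d3 = encode(B)
assert decode((0, d1), (2, d3)) == B               # any 2 of the 3 fragments suffice
assert decode((1, d2), (2, d3)) == B
```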

Slide 6: Design of Our Protocol

Slide 7: Parameters and interface
Parameters:
- f: number of faulty servers tolerated
- m ≥ f + 1: fragments needed to decode a block
- n = m + 2f ≥ 3f + 1: number of servers
Interface: read and write fixed-size blocks
- Not a filesystem: no metadata, no locking, no access control, no support for variable-sized reads/writes
- A building block for a filesystem
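
As a rough illustration of these parameter relationships and the narrow interface, here is a minimal sketch; the class and method names are my assumptions, not the prototype's API.

```python
# Minimal sketch of the parameters and the fixed-size block interface.
class BlockStoreParams:
    def __init__(self, f: int, m: int):
        assert m >= f + 1, "need at least f+1 fragments to decode a block"
        self.f = f               # number of faulty servers tolerated
        self.m = m               # fragments needed to decode a block
        self.n = m + 2 * f       # number of servers; n >= 3f+1 when m >= f+1

class BlockStore:
    """Read/write fixed-size blocks; no metadata, locking, or access control."""
    def write(self, block_id: int, block: bytes) -> None: ...
    def read(self, block_id: int) -> bytes: ...

print(BlockStoreParams(f=1, m=2).n)   # 4 servers for f=1, m=f+1
```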

Slide 8: Write protocol: prepare & commit
Example with a 2-of-4 erasure code:
- Prepare: send one fragment to each server; each server responds with a cryptographic token
- Commit: forward the tokens to the servers; servers return status

Slide 9: Read protocol: find & read
- Find timestamps: request timestamps; each server proposes a timestamp
- Read at timestamp: servers return fragments

Slide 10: Read protocol, common case
- Request timestamps and optimistically read fragments
- Servers return fragments and the timestamp

Slide 11: Issue 1 of 3: Wasteful encoding
- Erasure code m+2f fragments, hash m+2f fragments, send one to each server
- But only wait for m+f responses ⇒ this is wasteful!
(m: fragments needed to decode; f: number of faults tolerated)

Slide 12: Solution 1: Partial encoding
- Instead: erasure code m+f fragments, hash m+f, hear from m+f servers
- Pro: compute f fewer fragments
- Con: the client may need to send the entire block on failure (should happen rarely)
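
A quick back-of-the-envelope check of the savings, assuming m = f + 1 as in the evaluation later in the deck:

```python
# Fragments the client computes: full encoding (m+2f) vs. partial encoding (m+f).
for f in (1, 4, 10):
    m = f + 1                               # as in the evaluation setup
    full, partial = m + 2 * f, m + f
    print(f"f={f:2d}: encode {partial} fragments instead of {full} (skip f={f})")
```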

Slide 13: Issue 2: Block must be unique
- The fragments must comprise a unique block; if not, different readers read different blocks
- Challenge: servers don't see the entire block
  - Servers can't verify the hash of the block
  - Servers can't verify the encoding of the block given hashes of the fragments

Slide 14: Solution 2: Homomorphic fingerprinting
- A fragment is consistent with the checksum if its hash and homomorphic fingerprint [PODC07] match
- Key property: a block decoded from consistent fragments is unique
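
The sketch below illustrates the key algebraic idea with a toy evaluation-based fingerprint over a prime field and a toy linear "code" (one fragment that is the elementwise sum of two others). It is not the construction from the PODC07 paper, just a demonstration of why a linear (homomorphic) fingerprint lets a server check a fragment it holds against fingerprints of fragments it never sees.

```python
# Toy homomorphic fingerprint: evaluate the fragment, viewed as a polynomial
# over Z_P, at a point r. Illustrative only; not the paper's construction.
import secrets

P = (1 << 61) - 1                      # a Mersenne prime, so Z_P is a field

def fingerprint(fragment, r, p=P):
    """fp(d) = sum(d[i] * r^i) mod p; linear in the fragment d."""
    return sum(x * pow(r, i, p) for i, x in enumerate(fragment)) % p

# Toy linear code: two data fragments and one fragment that is their
# elementwise sum (every erasure-coded fragment is some such linear combination).
d1 = [secrets.randbelow(P) for _ in range(8)]
d2 = [secrets.randbelow(P) for _ in range(8)]
d3 = [(a + b) % P for a, b in zip(d1, d2)]

r = secrets.randbelow(P)
fp1, fp2, fp3 = (fingerprint(d, r) for d in (d1, d2, d3))

# Key property: fingerprinting commutes with the linear encoding, so a server
# holding only d3 plus fp1 and fp2 can check that d3 is consistent with the
# unique block those fingerprints describe, without ever seeing d1 or d2.
assert fp3 == (fp1 + fp2) % P
```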

Slide 15: Issue 3: Write ordering
- Reads must return the most recently written block (required for linearizability / atomicity)
- A faulty server may propose an uncommitted write; this must be prevented
- Prior approaches: 4f+1 servers, signatures, or 3+ round writes
- Our approach: 3f+1 servers, MACs, 2-round writes

Slide 16: Solution 3: Hashes of nonces
Write:
- Prepare: each server stores hash(nonce) and returns its nonce
- The client collects the nonces
- Commit: servers store the nonces
Read:
- Find timestamps: servers return the timestamp and the nonces
- Read at timestamp: servers return the nonce_hash with the fragment
- The client compares hash(nonce) with nonce_hash
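
A minimal sketch of just the hash-of-nonce check follows; the timestamps and MACs that bind the nonces to a particular write are omitted, and names such as stored_nonce_hash are illustrative, not the prototype's.

```python
# Minimal sketch of the nonce check described above.
import hashlib, os

# Prepare: a server picks a fresh nonce, keeps hash(nonce) with the prepared
# fragment, and returns the nonce to the writing client.
nonce = os.urandom(16)
stored_nonce_hash = hashlib.sha256(nonce).digest()

# Commit: the client collects the nonces from all prepared servers and sends
# the set back, so servers store the nonces alongside the committed write.
committed_nonces = {"server-3": nonce}

# Read: one server returns its stored nonce_hash with the fragment; the
# committed nonce set comes back with the timestamp. The reader accepts only
# if hash(nonce) matches nonce_hash -- a faulty server proposing an
# uncommitted write cannot supply the other servers' nonces.
assert hashlib.sha256(committed_nonces["server-3"]).digest() == stored_nonce_hash
```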

Slide 17: Bringing it all together: Write
Client:
- Erasure code m+f fragments
- Hash & fingerprint the fragments
- Send to the first m+f servers
Servers:
- Verify the hash and fingerprint
- Choose a nonce, generate a MAC, store the fragment
Client:
- Forward the MACs to the servers
Servers:
- Verify the MACs, free older fragments
- Write completed
(The hashing/fingerprinting, verification, nonce, and MAC steps are overhead not present in the crash-only protocol.)
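
To tie the write steps together, here is a toy in-process simulation of the two rounds. Erasure coding and fingerprinting are stubbed out, partial encoding (sending to only the first m+f servers) and commit-side MAC verification are elided, a single shared key stands in for pairwise MAC keys, and every name (ToyServer, prepare, commit) is illustrative rather than taken from the prototype.

```python
# Toy end-to-end sketch of the two-round write against in-process "servers".
import hashlib, hmac, os

class ToyServer:
    def __init__(self, key):
        self.key = key                 # MAC key (shared here, for simplicity)
        self.pending = {}              # timestamp -> (fragment, nonce)
        self.committed = {}            # timestamp -> fragment

    def prepare(self, ts, fragment, checksum):
        # Round 1: check the fragment against its checksum, pick a nonce, and
        # return a MAC vouching "I hold a consistent fragment for ts".
        assert hashlib.sha256(fragment).digest() == checksum
        nonce = os.urandom(16)
        self.pending[ts] = (fragment, nonce)
        mac = hmac.new(self.key, str(ts).encode() + checksum, "sha256").digest()
        return nonce, mac

    def commit(self, ts, macs):
        # Round 2: the real protocol verifies the other servers' MACs here and
        # frees older fragments; this toy just makes the write visible.
        fragment, _nonce = self.pending.pop(ts)
        self.committed[ts] = fragment
        return "ok"

def client_write(servers, ts, fragments):
    tokens = []
    for srv, frag in zip(servers, fragments):         # send fragment i to server i
        checksum = hashlib.sha256(frag).digest()
        tokens.append(srv.prepare(ts, frag, checksum))
    macs = [mac for _nonce, mac in tokens]
    return [srv.commit(ts, macs) for srv in servers]  # forward the MACs

key = os.urandom(32)
servers = [ToyServer(key) for _ in range(4)]          # n = 3f+1 = 4 for f=1
frags = [b"frag-%d" % i for i in range(4)]            # stand-ins for coded fragments
print(client_write(servers, ts=1, fragments=frags))   # ['ok', 'ok', 'ok', 'ok']
```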

Slide 18: Bringing it all together: Read
Client:
- Request fragments from the first m servers
- Request the latest nonce, timestamp, and checksum
Servers:
- Return the fragment (if requested)
- Return the latest nonce, timestamp, and checksum
Client:
- Verify the provided checksum matches the fragment's hash and fingerprint
- Verify the timestamps match
- Verify the nonces
- Read complete
(The checksum, timestamp, and nonce verification steps are overhead not present in the crash-only protocol.)
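
Correspondingly, a minimal sketch of the client-side checks on the read path, assuming each response carries (timestamp, fragment, cross-checksum). The response shape and all names are my assumptions; the fingerprint and nonce checks sketched earlier are reduced to a comment here.

```python
# Minimal sketch of the read-path checks. The real protocol also checks
# homomorphic fingerprints and nonces before accepting a timestamp.
import hashlib
from collections import namedtuple

Response = namedtuple("Response", "server_id timestamp fragment cross_checksum")

def verify_read(responses, m):
    """Return m fragments ready to decode, or raise if responses are inconsistent."""
    if len(responses) < m:
        raise ValueError("not enough responses to decode")
    ts, checksum = responses[0].timestamp, responses[0].cross_checksum
    for r in responses:
        # All servers must agree on the timestamp and cross-checksum...
        if r.timestamp != ts or r.cross_checksum != checksum:
            raise ValueError("servers disagree; fall back to the slow-path read")
        # ...and each fragment must match its entry in the cross-checksum.
        if hashlib.sha256(r.fragment).digest() != checksum[r.server_id]:
            raise ValueError(f"server {r.server_id} returned a bad fragment")
    return [r.fragment for r in responses[:m]]

# Toy usage with fabricated fragments (m = 2):
frags = [b"frag-%d" % i for i in range(4)]
cc = tuple(hashlib.sha256(f).digest() for f in frags)
resps = [Response(i, 7, frags[i], cc) for i in range(2)]
print(len(verify_read(resps, m=2)), "fragments ready to decode")
```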

Slide 19: Evaluation

Slide 20: Experimental setup
- m = f + 1 (fragments needed to decode = number of faults tolerated + 1)
- Single client, NVRAM at servers
- Write or read 64 kB blocks; fragment size decreases as f increases
- 3 GHz Pentium D, Intel PRO/1000

Slide 21: Prototype implementation
Four protocols implemented (all use the same hashing and erasure coding libraries):
- Our protocol (Byzantine fault-tolerant, erasure-coded)
- Crash-only erasure-coded
- Crash-only replication-based
- PASIS [Goodson04] emulation (Byzantine fault-tolerant, erasure-coded): read validation decodes, re-encodes, and hashes 4f+1 fragments; 4f+1 servers, versioning, garbage collection

Slide 22: Write throughput
[Chart: bandwidth (MB/s) vs. number of faults tolerated (f), comparing crash-only replication-based (f+1 = m replicas), PASIS-emulation, our protocol, and crash-only erasure-coded.]

Slide 23: Write throughput (continued)
[Chart: bandwidth (MB/s) vs. number of faults tolerated (f) for our protocol and crash-only erasure-coded; 64 kB blocks, with fragment size per server annotated.]

Slide 24: Write response time
[Chart: latency (ms) vs. number of faults tolerated (f) for (A) crash-only erasure-coded, (B) our protocol, (C) PASIS-emulation, (D) crash-only replication-based; inset: overhead breakdown at f = 10 into RPC, hash, encode, and fingerprint.]

Slide 25: Read throughput
[Chart: bandwidth (MB/s) vs. number of faults tolerated (f), comparing crash-only replication-based, PASIS-emulation (compute bound), our protocol, and crash-only erasure-coded.]

Slide 26: Read response time
[Chart: latency (ms) vs. number of faults tolerated (f) for (A) erasure-coded, (B) our protocol, (C) PASIS-emulation, (D) replicated; inset: overhead breakdown at f = 10 into RPC, hash, encode, and fingerprint.]

Slide 27: Conclusions
- Byzantine fault-tolerant storage can rival the performance of crash-only storage
- We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol and prototype
  - Write overhead: 2 rounds, plus hash and fingerprint
  - Read overhead: hash and fingerprint
  - Close to the performance of systems that tolerate only crashes, for reads and large writes

Slide 28: Backup slides

Slide 29: Why not multicast?
- May be unstable or unavailable (UDP)
- More data means more work at the server (network, disk, hash)
- Doesn't scale with the number of clients

Slide 30: Cryptographic hash overhead
Byzantine storage requires cryptographic hashing. Does this matter?
- Systems must tolerate non-crash faults, e.g., a "misdirected write"
- Many modern systems checksum data, e.g., the Google File System; ZFS supports the SHA-256 cryptographic hash function
- Systems may hash data for authentication anyway
Conclusion: BFT may not introduce new hashing

Slide 31: Is 3f+1 servers expensive?
Consider a typical storage cluster (f = number of faults tolerated):
- Usually more primary drives than parity drives
- Usually several hot spares
Conclusion: such a cluster may already use 3f+1 servers (primary drives + parity drives + hot spares)