Failure Resilience in the Peer-to-Peer-System OceanStore Speaker: Corinna Richter
19/09/2015 Corinna Richter: Failure Resilience
Outline
Introduction
An Overview of OceanStore
Failure Resilience in OceanStore
Byzantine Fault Protocol
Proactive Threshold Signatures
Erasure Coding
Summary
Questions
Introduction
Failure resilience: “A system responds according to the specification in spite of a limited number of faults”
Availability
Reliability
How does this work in open peer-to-peer systems?
Specific problems
Solutions in OceanStore
OceanStore: Basics
[Figure: clients, inner ring, replicas and archival storage. Source: John Kubiatowicz]
“Internet-scale, persistent data store designed for incremental scalability, secure sharing and long-term durability”
The infrastructure is constantly changing and untrusted except in aggregate.
OceanStore: Inner Ring
Primary replica for one data object
Serializes the update actions for this object
Checks the correctness of each update
“Knows” the current version of the object
Implemented by a group of servers: distributed load, no “single point of failure”
How are correct decisions reached if some hosts are faulty?
Byzantine Fault Protocol - Problem
Byzantine faults vs. fail-stop processes
Fail-stop (omission, crash): no reaction
Byzantine faults: the reaction might be faulty
How many faulty processes are tolerable?
How can all correct processes (of the primary replica) reach the same decision?
Illustration: the Byzantine Generals Problem
Byzantine Fault Problem - Model
The Byzantine fault problem has a solution only if fewer than one third of the processes are faulty!
[Figure: two scenarios in which a commander and the processes P1 and P2 exchange conflicting “Go!” and “Stop!” messages. Who is the traitor?]
Byzantine Fault Problem - A “proof” by intuition
[Figure: a client sends update X to the primary replica]
Setting: f = 3 faulty hosts, n = ?
3 answers may be delayed or faulty, so the client cannot wait for more than n-3 messages.
3 of these n-3 messages may still be faulty.
The correct answers must outnumber the faulty ones: (n-3)-3 > 3, hence n > 9.
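The same counting argument can be written for a general number of faults f. This is only a sketch of the intuition on the slide, under the assumption that the client proceeds after n-f replies. It is not a formal proof.

```latex
\begin{align*}
\text{replies the client can wait for} &= n - f \\
\text{replies among them that are surely correct} &\ge (n - f) - f = n - 2f \\
\text{correct replies must outnumber faulty ones:}\quad n - 2f &> f \\
\Longrightarrow\quad n &> 3f, \quad\text{i.e. } n \ge 3f + 1
\end{align*}
% Slide example: f = 3 gives (n - 3) - 3 > 3, so n > 9 and n >= 10.
```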
Byzantine Fault Protocol
Example: order of updates. At which position is update X applied?
[Figure: processes P1 to P4 exchanging the values i and k]
Round 1: P1 sends its decision to the other n-1 processes.
Round 2: each of the n-1 processes sends the value it received to the other n-2 processes.
Each later round: use the majority of the values from the previous round.
After round f+1: P2: (i, i, k) => i; P3: (i, i, k) => i
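The rounds above can be sketched with the classic recursive "oral messages" algorithm for the Byzantine Generals Problem. This is a toy simulation, not the protocol used in OceanStore. The function names, the `lie` callback and the process names P1 to P4 are assumptions made for illustration.

```python
from collections import Counter

def majority(values):
    """Most frequent value; ties are broken deterministically by sorting."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

def om(commander, lieutenants, value, m, faulty, lie):
    """Simulate m+1 rounds; return {lieutenant: decided value}.

    faulty is the set of Byzantine processes; lie(sender, receiver, value)
    is a hypothetical callback giving what a faulty sender transmits.
    Entries for faulty lieutenants are computed too but carry no meaning.
    """
    # The commander sends its value to every lieutenant.
    received = {
        p: lie(commander, p, value) if commander in faulty else value
        for p in lieutenants
    }
    if m == 0:
        return received
    # Every lieutenant relays what it received to all the others.
    relayed = {
        p: om(p, [q for q in lieutenants if q != p], received[p], m - 1, faulty, lie)
        for p in lieutenants
    }
    # Each lieutenant takes the majority of its own and the relayed values.
    return {
        p: majority([received[p]] + [relayed[q][p] for q in lieutenants if q != p])
        for p in lieutenants
    }

# n = 4, f = 1: P4 is faulty and always relays "k" instead of "i".
loyal_commander = om("P1", ["P2", "P3", "P4"], "i", 1, {"P4"},
                     lambda sender, receiver, value: "k")

# The commander P1 is faulty and tells P2 "k" but the others "i".
faulty_commander = om("P1", ["P2", "P3", "P4"], "i", 1, {"P1"},
                      lambda sender, receiver, value: "k" if receiver == "P2" else "i")
```

In the first run P2 and P3 each see (i, i, k) and decide i, as on the slide. In the second run all three lieutenants still agree on one value although the commander lied.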
Byzantine Fault Problem - Solution in OceanStore
How can a system guarantee that fewer than one third of the hosts are faulty?
Other systems: reboot of a secure partition at regular intervals
OceanStore: dynamically exchange the servers of the inner ring
The Responsible Party chooses the hosts for the inner ring and analyses the stability of the hosts
A system can have several Responsible Parties
BFP with signed messages: OceanStore
Symmetric keys vs. asymmetric keys:
MACs for the internal communication of the inner ring
Public-key cryptography for the communication with others
Proactive threshold signatures:
One public key for all n hosts of the inner ring
Generate n = 3f+1 private key shares
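A MAC between two inner-ring hosts can be sketched with Python's standard `hmac` module. The choice of HMAC-SHA256, the key length and the message text are assumptions for illustration. The slides do not say which MAC construction OceanStore uses.

```python
import hashlib
import hmac
import secrets

def make_mac(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message with a shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_mac(key, message), tag)

# Hypothetical key shared by one pair of inner-ring hosts.
pair_key = secrets.token_bytes(32)
message = b"update X at position i"
tag = make_mac(pair_key, message)
```

A MAC is cheap to compute but only convinces a holder of the shared key, which is why it suits the internal traffic while outsiders need a public-key signature.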
Proactive Threshold Signatures - BFP in OceanStore
f+1 private key shares are combined into a full signature
At least one of these f+1 signature shares comes from a correct host
All correct hosts work deterministically
Exchange of the servers:
No interruption: the public key stays unchanged
Generate a new set of private key shares and delete the old set
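The idea of re-sharing a key without changing it can be illustrated with Shamir secret sharing. This is a stand-in chosen for brevity, not the threshold signature scheme of OceanStore. A real threshold signature never reconstructs the private key in one place, whereas `combine` below does so only to show that the shared value is unchanged. The modulus and all function names are assumptions.

```python
import secrets

P = 2**127 - 1  # prime modulus of the toy field

def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest coefficient first) at x, modulo P."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result

def make_shares(secret, f, n):
    """Split secret into n shares; any f+1 of them recover it."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(f)]
    return {x: poly_eval(coeffs, x) for x in range(1, n + 1)}

def refresh(shares, f):
    """Proactive step: add shares of a random polynomial with constant term 0."""
    zero_poly = [0] + [secrets.randbelow(P) for _ in range(f)]
    return {x: (y + poly_eval(zero_poly, x)) % P for x, y in shares.items()}

def combine(subset):
    """Lagrange interpolation at x = 0 over the given {x: share} subset."""
    secret = 0
    for xi, yi in subset.items():
        num, den = 1, 1
        for xj in subset:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

f = 1
n = 3 * f + 1
key = secrets.randbelow(P)
old_shares = make_shares(key, f, n)
new_shares = refresh(old_shares, f)
```

Any f+1 shares from the same generation yield the same key, so the matching public key could stay fixed. An old share mixed with a new one yields, with overwhelming probability, a different value, which is why the old set can be deleted safely.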
OceanStore: Update
[Figure: path of the update “Write object x”, with the labels Primary Replica, Other users, Archival storage, Secondary and Dissemination Tree. Source: S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatowicz]
Erasure Coding - Motivation
Data availability must be guaranteed despite omission of hosts, crashes, etc.
Redundancy of the data: replicated, distributed data storage on several hosts
Problem of naive replication: inefficient with respect to the total storage consumed
Solution: erasure coding
Erasure Coding
Idea: divide one block of data into m fragments and encode these into n fragments (n > m).
Distribute these n fragments arbitrarily over the hosts.
Rate of encoding: r = m/n
Storage costs are multiplied by n/m
Example: m = 16, n = 32, r = 1/2, storage costs x 2
[Figure: m = 16 fragments are encoded into 32 fragments on distributed servers]
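A toy erasure code with the slide's parameters m = 16 and n = 32 can be built from polynomial interpolation over a small prime field. This is a Reed-Solomon-style sketch for illustration only, not the Cauchy Reed-Solomon code of the prototype. Each "fragment" here is a single symbol, and all names are assumptions.

```python
import random

P = 257  # smallest prime above 255, so every byte is a field element

def interpolate(points, x):
    """Value at x of the unique polynomial of degree < len(points) through points."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data: bytes, n: int):
    """Turn m = len(data) symbols into n fragments (index, value); needs n <= P."""
    points = list(enumerate(data))
    m = len(data)
    # The first m fragments are the data itself, the rest are extra evaluations.
    return [(x, data[x] if x < m else interpolate(points, x)) for x in range(n)]

def decode(fragments, m: int) -> bytes:
    """Rebuild the m data symbols from any m distinct fragments."""
    points = fragments[:m]
    return bytes(interpolate(points, x) for x in range(m))

block = b"OceanStore block"            # m = 16 symbols
fragments = encode(block, 32)          # n = 32, rate r = 1/2
survivors = random.sample(fragments, 16)
restored = decode(survivors, 16)
```

Whichever 16 of the 32 fragments survive, `restored` equals `block`, at a storage cost of n/m = 2.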
Erasure Coding: Efficiency
Pond (the OceanStore prototype): Cauchy Reed-Solomon code with m = 16 and n = 32
The data can be reconstructed from any m fragments
Complex algorithm for encoding and decoding
Data availability is determined by the possible permutations of the fragments
Availability increased by a factor of 4000 for n = 32
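The availability gain can be explored with a small binomial model. It assumes that hosts fail independently and that each is reachable with the same probability p. The value p = 0.9 is a hypothetical input, and the sketch does not attempt to reproduce the factor of 4000 quoted on the slide.

```python
from math import comb

def availability_erasure(p: float, m: int, n: int) -> float:
    """Probability that at least m of n fragments are reachable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def availability_replication(p: float, copies: int) -> float:
    """Probability that at least one of the full copies is reachable."""
    return 1 - (1 - p)**copies

p = 0.9  # hypothetical per-host availability
coded = availability_erasure(p, 16, 32)     # storage costs x 2
mirrored = availability_replication(p, 2)   # storage costs x 2 as well
```

Both variants consume twice the original storage, so comparing `coded` with `mirrored` shows what the coding buys at equal cost under these assumptions.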
Erasure Coding: Disadvantages
The primary replica has to compute the encoding and decoding of the fragments
Very expensive operation!
Decode only if there is no secondary replica for this object
Whole-block caching
OceanStore: Dissemination Tree
Tree structure for one data object
Root: primary replica
Nodes: secondary replicas in cache
Updates are published down the tree
Self-organising structure
[Figure: the primary replica at the root with secondary replicas below it]
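Pushing an update from the primary replica down to the cached secondary replicas can be sketched with a minimal tree. The class, its fields and the version counter are assumptions for illustration. The self-organising part of the real structure is not modelled.

```python
class Replica:
    """One node of a hypothetical dissemination tree for a single object."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.version = 0
        self.data = None

    def attach(self, child):
        """Add a secondary replica below this node and return it."""
        self.children.append(child)
        return child

    def push(self, version, data):
        """Apply an update and forward it down; return the names reached."""
        if version <= self.version:
            return []  # already seen, do not forward again
        self.version, self.data = version, data
        reached = [self.name]
        for child in self.children:
            reached += child.push(version, data)
        return reached

primary = Replica("primary")
sec_a = primary.attach(Replica("secondary-a"))
sec_b = primary.attach(Replica("secondary-b"))
sec_c = sec_a.attach(Replica("secondary-c"))
reached = primary.push(1, b"object x, version 1")
```

Each node only talks to its direct children, so the primary replica sends two messages here although three secondary replicas receive the update.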
Summary
OceanStore: Internet-scale, global, persistent data store
Interesting solutions for failure resilience in peer-to-peer systems:
Proactive Threshold Signatures
Byzantine Fault Protocol
Erasure Coding
Results of a prototype implementation: threshold signatures are not efficient to compute
Further research based on the OceanStore API
Failure Resilience
Questions?