Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin
Salus’ goals
– Usage: provide remote disks to users (as in Amazon EBS)
– Scalability: thousands of machines
– Robustness: tolerate disk/memory corruption, CPU errors, …
– Good performance
Scalability and robustness
[Figure: corruption can arise at any layer — operating system, distributed protocol]
– BigTable: 1 corruption per 5 TB of data?
Challenge: parallelism vs. consistency
[Figure: clients, metadata server, storage servers — infrequent metadata transfer, parallel data transfer]
– Data is replicated for durability and availability.
– State-of-the-art architecture: GFS/Bigtable, HDFS/HBase, WAS, …
Challenges
– Write in parallel and in order
– Eliminate single points of failure
  – Prevent a single node from corrupting data
  – Read safely from one node
– Do not increase replication cost
Write in parallel and in order
[Figure: clients, metadata server, data servers]
Write in parallel and in order
[Figure: write 1 and write 2 issued in parallel]
– Write 2 is committed but write 1 is not: not allowed for a block store.
Prevent a single node from corrupting data
[Figure: clients, metadata server, data servers]
Prevent a single node from corrupting data
– Tasks of computation nodes: data forwarding, garbage collection, etc.
– Examples of computation nodes: tablet server (Bigtable), region server (HBase), … (WAS)
Read safely from one node
– A read is executed on one node: maximizes parallelism, minimizes latency.
– But if that node experiences corruption, …
Do not increase replication cost
– Industrial systems: write to f+1 nodes and read from one node
– BFT systems: write to 3f+1 nodes and read from 2f+1 nodes
Salus’ approach
– Start from a scalable architecture (Bigtable/HBase).
– Ensure the robustness techniques do not hurt scalability.
Salus’ key ideas
– Pipelined commit: guarantee ordering despite parallel writes
– Active storage: prevent a computation node from corrupting data
– End-to-end verification: read safely from one node
Salus’ key ideas
[Figure: clients and metadata server, with pipelined commit, active storage, and end-to-end verification placed in the architecture]
Pipelined commit
– Goal: barrier semantics
  – A request can be marked as a barrier.
  – All previous requests must be executed before it.
– Naïve solution: the client blocks at each barrier, losing parallelism.
– This is a weaker version of distributed transactions; the well-known solution is two-phase commit (2PC).
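To make the barrier requirement concrete, here is a minimal Python sketch (the `BarrierWindow` class and its methods are illustrative, not part of Salus or HBase, and sequence numbers are assumed to start at 0): requests may be issued and finish in any order, but a barrier request may only run once every earlier request has executed.

```python
class BarrierWindow:
    """Illustrative sketch of barrier semantics (hypothetical names, not Salus' API)."""

    def __init__(self):
        self.executed = set()      # sequence numbers that have finished execution
        self.next_unexecuted = 0   # smallest sequence number not yet executed

    def mark_executed(self, seq):
        self.executed.add(seq)
        # Advance the contiguous prefix of executed requests.
        while self.next_unexecuted in self.executed:
            self.next_unexecuted += 1

    def may_execute(self, seq, is_barrier):
        # A barrier may run only after all earlier requests have executed;
        # a non-barrier request may run as soon as it arrives.
        return (not is_barrier) or (self.next_unexecuted >= seq)


w = BarrierWindow()
w.mark_executed(1)
assert not w.may_execute(2, is_barrier=True)   # request 0 has not executed yet
w.mark_executed(0)
assert w.may_execute(2, is_barrier=True)       # all earlier requests have executed
```

The naïve client enforces this by stalling at every barrier; pipelined commit instead keeps issuing batches and pushes the ordering decision to the servers, as the following slides show.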
Pipelined commit – 2PC
[Figure, prepare phase: the client sends batches i and i+1 to their servers in parallel; each batch's servers report "prepared" to that batch's leader.]
[Figure, commit phase: once batch i-1 is committed, the previous leader notifies batch i's leader, which sends "commit" to its servers; when batch i commits, batch i+1 can follow.]
Pipelined commit – challenge
– Is 2PC slow?
  – Additional network messages? Disk is the bottleneck.
  – Additional disk write? Let's eliminate that.
– Challenge: after recovery, a write is found prepared. Should it be committed? Both cases are possible.
– Salus' solution: ask other nodes. Has anyone committed write 3 or larger? If not, is write 1 committed?
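A rough Python sketch of that recovery-time decision, phrased for a write with sequence number n found prepared on disk (for n = 2 this is exactly the slide's two questions). The `peers` list and its `highest_committed()` method are hypothetical, not a real Salus/HBase interface; since Salus writes no separate commit record to disk, the commit status is reconstructed by querying other nodes.

```python
def recover_prepared_write(n, peers):
    """Decide the fate of write n, found prepared but not committed after a crash.

    Sketch only. Assumes commits happen in sequence order (the barrier
    guarantee), so a peer's highest committed sequence number describes a
    committed prefix, and assumes `peers` is non-empty.
    """
    highest = max(p.highest_committed() for p in peers)
    if highest >= n + 1:
        # Some later write committed; since commits are ordered,
        # write n must have committed as well.
        return "commit"
    if highest >= n - 1:
        # The predecessor committed and nothing after n did. The prepared data
        # is already durable, so committing n here is safe and cannot lose an
        # acknowledged write.
        return "commit"
    # The predecessor never committed, so committing n would violate the
    # ordering guarantee: roll the prepared write back.
    return "abort"
```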
Active storage
– Goal: a single node cannot corrupt data.
– Well-known solution: replication
  – Problem: replication cost vs. availability
– Salus' solution: use f+1 replicas
  – Require unanimous consent of the whole quorum.
  – If one replica fails, replace the whole quorum.
Active storage
[Figure: a single computation node and the storage nodes]
Active storage
[Figure: computation nodes and storage nodes]
– Unanimous consent: all updates must be agreed to by f+1 computation nodes.
– Additional benefit: reduced network bandwidth usage
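A minimal sketch of the unanimous-consent write path in Python. The `process` and `write` interfaces are illustrative, not Salus' actual RPCs, and `computation_replicas` is assumed to hold the f+1 computation nodes of one quorum.

```python
def apply_update(update, computation_replicas, storage_node):
    """Apply an update only if every computation replica agrees on it (sketch).

    With unanimous consent, a single corrupted computation node cannot get bad
    data written to storage; the worst it can do is block progress, which is
    handled by replacing the whole quorum.
    """
    results = [replica.process(update) for replica in computation_replicas]
    if len(set(results)) != 1:
        # Disagreement: we may not know which replica is faulty,
        # so the whole quorum is replaced rather than guessing.
        raise RuntimeError("no unanimous consent: replace the computation quorum")
    storage_node.write(results[0])
```

Because consent must be unanimous rather than a majority, f+1 replicas suffice to mask a corrupt computation node, at the cost of replacing the whole quorum whenever any replica fails or disagrees.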
Active storage
[Figure: computation nodes and storage nodes]
– What if one computation node fails?
  – Problem: we may not know which one is faulty.
– Replace the whole quorum.
  – The new quorum must agree on the states.
Active storage
– Does it provide BFT with f+1 replication? No.
– During recovery, it may accept stale state if:
  – the client fails;
  – at least one storage node provides stale state; and
  – all other storage nodes are unavailable.
– 2f+1 replicas can eliminate this case: is it worth adding f replicas to do so?
End-to-end verification
– Goal: read safely from one node
  – The client should be able to verify the reply.
  – If the reply is corrupted, the client retries on another node.
– Well-known solution: Merkle tree
  – Problem: scalability
– Salus' solution:
  – Single writer
  – Distribute the tree among the servers.
End-to-end verification
[Figure: the Merkle tree is partitioned across servers 1–4; the client maintains the top of the tree.]
– The client does not need to store anything persistently: it can rebuild the top tree from the servers.
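A minimal Python sketch of how a single-server read can be verified against such a distributed Merkle tree. The data layout and parameter names are illustrative, not Salus' wire format: each server is assumed to return, along with the block, the sibling hashes on the path to its own subtree root, while the client keeps only one expected subtree-root hash per server.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Hash helper used for Merkle nodes in this sketch."""
    return hashlib.sha256(b"".join(parts)).digest()

def verify_read(block_data, proof, top_tree, server_id):
    """Verify a block read from one server (sketch).

    proof:    list of (sibling_hash, sibling_is_left) pairs from the block's
              leaf up to the server's subtree root.
    top_tree: client-side map from server_id to that server's expected
              subtree-root hash (the "top of the tree" the client maintains).
    """
    digest = h(block_data)
    for sibling, sibling_is_left in proof:
        digest = h(sibling, digest) if sibling_is_left else h(digest, sibling)
    if digest != top_tree[server_id]:
        # Corrupt or stale reply: the client retries the read on another replica.
        return False
    return True
```

Since the client is the single writer, it can update its top-tree entry for a server on every write, and after a crash it can rebuild the top tree by asking each server for its current subtree root.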
Recovery
– Pipelined commit: how to ensure write order after recovery?
– Active storage: how to agree on the current state?
– End-to-end verification: how to rebuild the Merkle tree if the client recovers?
Discussion – why HBase?
– It's a popular architecture
  – Bigtable: Google
  – HBase: Facebook, Yahoo, …
  – Windows Azure Storage: Microsoft
– It's open source.
Why two layers?
– Necessary if the storage layer is append-only
Why an append-only storage layer?
– Better random-write performance
– Easy to scale
Discussion – multiple writers?
Lessons
– Strong checking makes debugging easier.
Evaluation
Challenge: combining robustness and scalability
– Scalable systems: GFS/Bigtable, HDFS/HBase, WAS, Spanner, FDS, …
– Strong protections: end-to-end checks, BFT, Depot, …
– Combining them is challenging.