Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin University of Texas at Austin 1
Storage: do not lose data Fsck, IRON, ZFS, … Past Now & Future Are they robust enough? 2
Salus: provide robustness to scalable systems Scalable systems: crash failure + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, FDS, …) Strong protections: arbitrary failure (BFT, End-to-end verification, …) Salus 3
Start from “Scalable” 4 File system Key value Database Block store Scalable Robust
Scalable architecture Metadata server Data servers Clients Infrequent metadata transfer Parallel data transfer Data is replicated for durability and availability Computation nodes: No persistent storage Data forwarding, garbage collection, etc Tablet server (Bigtable), Region server (HBase) , Stream Manager (WAS), … 5
Challenges ? Parallelism vs consistency Prevent computation nodes from corrupting data Read safely from one node 6 Do NOT increase replication cost
Parallelism vs Consistency Metadata server Data servers Clients Write 1 Write 2 Barrier Block store requires barrier semantic. 7
Prevent computation nodes from corrupting data Metadata server Data servers Clients Single point of failure 8
Read safely from one node Metadata server Data servers Clients Single point of failure Does simple checksum work? Can not prevent errors in control flow. 9
Outline Challenges Salus – Pipelined commit – Active storage – Scalable end-to-end checks Evaluation 10
Salus’ overview Disk-like interface (Amazon EBS): – Volume: a fixed number of fixed-size blocks – Single client per volume: barrier semantic Failure model: arbitrary but no identity forge Key ideas: – Pipelined commit – write in parallel and in order – Active storage – safe write – Scalable end-to-end checks – safe read 11
Salus’ key ideas Metadata server Clients Pipelined commit Active storage End-to-end verification 12
Outline Challenges Salus – Pipelined commit – Active storage – Scalable end-to-end checks Evaluation 13
Pipelined commit Goal: barrier semantic – A request can be marked as a barrier. – All previous ones must be executed before it. Naïve solution: – The client blocks at a barrier: lose parallelism A weaker version of distributed transaction – Well-known solution: two phase commit (2PC) 14
Pipelined commit – 2PC Previous leader Prepared Batch i-1 committed Client Servers Leader Prepared Leader Batch i Batch i
Pipelined commit – 2PC Previous leader Batch i-1 committed Client Servers Leader Commit Batch i committed Commit Leader Batch i Batch i+1 Parallel data transfer Sequential commit 16
Key difference from 2PC Previous leader Batch i-1 committed Client Servers Leader Commit Batch i committed Commit Leader Batch i Batch i+1 17
Outline Challenges Salus – Pipelined commit – Active storage – Scalable end-to-end checks Evaluation 18
Salus: active storage Goal: a single node cannot corrupt data Well-known solution: BFT replication – Problem: higher replication cost Salus’ solution: use f+1 computation nodes – Require unanimous consent – How about availability/liveness? 19
Active storage: unanimous consent Computation nodes Storage nodes 20
Active storage: unanimous consent Computation nodes Storage nodes Unanimous consent: – All updates must be agreed by f+1 computation nodes. Additional benefit: – Collocate computation and storage: save network bw 21
Active storage: restore availability Computation nodes Storage nodes What if something goes wrong? – Problem: we may not know which one is faulty. Replace all computation nodes 22
Active storage: restore availability Computation nodes Storage nodes What if something goes wrong? – Problem: we may not know which one is faulty. Replace all computation nodes – Computation nodes do not have persistent states. 23
Discussion Does Salus achieve BFT with f+1 replication? – No. Fork can happen in a combination of failures. Is it worth using 2f+1 replicas to eliminate this case? – No. Fork can still happen as a result of geo-replication. How to support multiple writers? – We do not know a perfect solution. – Give up either linearizability or scalability. 24
Outline Challenges Salus – Pipelined commit – Active storage – Scalable end-to-end checks Evaluation 25
Evaluation 26
Evaluation 27
Evaluation 28
Outline Salus’ overview Pipelined commit – Write in order and in parallel Active storage: – Prevent a single node from corrupting data Scalable end-to-end verification: – Read safely from one node Evaluation 29
Salus: scalable end-to-end verification Goal: read safely from one node – The client should be able to verify the reply. – If corrupted, the client retries another node. Well-known solution: Merkle tree – Problem: scalability Salus’ solution: – Single writer – Distribute the tree among servers 30
Scalable end-to-end verification Server 1 Server 2 Server 3 Server 4 Client maintains the top tree. Client does not need to store anything persistently. It can rebuild the top tree from the servers. 31
Scalable and robust storage More hardwareMore complex softwareMore failures 32
Start from “Scalable” Scalable Robust 33
Pipelined commit - challenge Is 2PC slow? – Locking? Single writer, no locking – Phase 2 network overhead? Small compare to Phase 1. – Phase 2 disk overhead? Eliminate that, push complexity to recovery Order across transactions – Easy when failure-free. – Complicate recovery. Key challenge: Recovery 34