Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

Storage: do not lose data
– Past: Fsck, IRON, ZFS, …
– Now & future?

Salus: provide robustness to scalable systems
– Scalable systems: crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, …)
– Strong protections: arbitrary failures (BFT, end-to-end verification, …)

Start from “Scalable”
[Diagram: file system, key-value store, database, and block store plotted on “scalable” vs. “robust” axes; borrow scalable designs from the key-value store]

Scalable architecture
– Clients, a metadata server, and data servers
– Parallel data transfer
– Data is replicated for durability and availability
– Computing nodes: no persistent storage; the storage nodes keep the data

Problems?
– No ordering guarantees for writes
– Single points of failure:
  – Computing nodes can corrupt data.
  – Clients can accept corrupted data.

1. No ordering guarantees for writes
– A block store requires barrier semantics.

2. Computing nodes can corrupt data
– A faulty computing node is a single point of failure.

3. Clients can accept corrupted data
– Again, a single point of failure.
– Does a simple disk checksum work? No: it cannot prevent errors in memory, etc.

Salus: overview
– Metadata server and block drivers (a single driver per virtual disk)
– Pipelined commit
– Active storage
– End-to-end verification
– Failure model: almost arbitrary, but no impersonation

Pipelined commit
– Goal: barrier semantics
  – A request can be marked as a barrier.
  – All previous requests must be executed before it.
– Naïve solution: wait at each barrier, which loses parallelism.
– This looks similar to a distributed transaction, so a well-known solution applies: two-phase commit (2PC).

Problem of 2PC
[Diagram: the driver sends requests 1–5 to the servers as batch i and batch i+1, separated by barriers; each batch has its own leader, is prepared, and then commits]
– The problem can still happen: we need an ordering guarantee between different batches.

Pipelined commit
[Diagram: the driver transfers the data of batches i-1, i, and i+1 to the servers in parallel; the leader of batch i commits only after hearing “batch i-1 committed”, and the leader of batch i+1 only after “batch i committed”]
– Parallel data transfer, sequential commit.

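To make the pipeline concrete, here is a minimal Python sketch of the idea. It is an illustration under simplifying assumptions, not the Salus protocol: Batch, prepare, and commit_chain are hypothetical names, and prepare stands in for replicating a batch's writes to the storage nodes.

```python
# Sketch only: data transfer (prepare) runs in parallel for all batches,
# while commits are chained so batch i commits only after batch i-1.
from concurrent.futures import ThreadPoolExecutor
import threading

class Batch:
    def __init__(self, batch_id, requests):
        self.batch_id = batch_id
        self.requests = requests
        self.prepared = threading.Event()   # replicas hold the batch's data
        self.committed = threading.Event()  # batch is durable and in order

def prepare(batch):
    # Phase 1 stand-in: ship the batch's writes to the storage nodes.
    batch.prepared.set()

def commit_chain(batches):
    # Phase 2: commit strictly in batch order, pipelined behind the prepares.
    prev = None
    for cur in batches:
        cur.prepared.wait()
        if prev is not None:
            prev.committed.wait()           # the "batch i-1 committed" message
        cur.committed.set()
        prev = cur

batches = [Batch(i, [f"write-{i}-{j}" for j in range(3)]) for i in range(4)]
with ThreadPoolExecutor() as pool:
    for b in batches:
        pool.submit(prepare, b)             # parallel data transfer
    pool.submit(commit_chain, batches)      # sequential commit
for b in batches:
    b.committed.wait()
    print(f"batch {b.batch_id} committed")
```

The shape is the point: all batches' data can be in flight at once, but the commit points form a chain, so a later batch can never become visible before an earlier one.
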
Outline
– Challenges
– Salus
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
– Evaluation

Active storage
– Goal: a single node cannot corrupt data.
– Well-known solution: BFT replication. Problem: it requires at least 2f+1 replicas.
– Salus’ goal: use only f+1 computing nodes.
  – f+1 replicas are enough to guarantee safety.
  – What about availability/liveness?

Active storage: unanimous consent
[Diagram: a single computing node in front of the storage nodes]

Active storage: unanimous consent (continued)
– Unanimous consent: every update must be agreed on by all f+1 computing nodes (a sketch follows).
– Additional benefit: colocating computing and storage saves network bandwidth.

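The toy sketch below illustrates unanimous consent with f+1 computing nodes. It is an illustration only, not the Salus protocol: the certificate is a plain hash here (a real system would use MACs or signatures), and all names are hypothetical.

```python
import hashlib

F = 1
COMPUTING_NODES = [f"cn{i}" for i in range(F + 1)]

def certificate(node_id, volume, seq, data):
    # Each computing node certifies the update it executed.
    return hashlib.sha256(f"{node_id}|{volume}|{seq}|{data}".encode()).hexdigest()

def storage_node_accepts(volume, seq, data, certs):
    # A storage node applies an update only if all f+1 computing nodes
    # certified exactly this update; a single faulty node that computed
    # different data cannot get its version accepted.
    expected = {n: certificate(n, volume, seq, data) for n in COMPUTING_NODES}
    return all(certs.get(n) == expected[n] for n in COMPUTING_NODES)

update = ("vol-7", 42, "0xdeadbeef")
good = {n: certificate(n, *update) for n in COMPUTING_NODES}
bad = dict(good, cn0=certificate("cn0", "vol-7", 42, "0xc0ffee"))
print(storage_node_accepts(*update, good))  # True: unanimous consent
print(storage_node_accepts(*update, bad))   # False: cn0 disagrees
```
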
Active storage: restore availability
– What if something goes wrong? Problem: we may not know which computing node is faulty.
– Solution: replace all computing nodes. This is cheap because computing nodes hold no persistent state.

Discussion
– Does Salus achieve BFT with f+1 replication? No: a fork can happen under a combination of failures.
– Is it worth using 2f+1 replicas to eliminate this case? No: a fork can still happen as a result of other events (e.g., geo-replication).

Outline
– Challenges
– Salus
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
– Evaluation

Evaluation
– Is Salus robust against failures?
– Do the additional mechanisms hurt scalability?

Is Salus robust against failures?
– Always ensures barrier semantics
  – Experiment: crash the block driver
  – Result: requests after “holes” are discarded
– A single computing node cannot corrupt data
  – Experiment: inject errors into a computing node
  – Result: the storage nodes reject it and trigger replacement
– Block drivers never accept corrupted data
  – Experiment: inject errors into a computing/storage node
  – Result: the block driver detects the errors and retries other nodes

Do the additional mechanisms hurt scalability?

Conclusion
– Pipelined commit
  – Writes in parallel
  – Provides ordering guarantees
– Active storage
  – Eliminates single points of failure
  – Consumes less network bandwidth
– High robustness ≠ low performance

Salus: scalable end-to-end verification
– Goal: read safely from one node
  – The client should be able to verify the reply.
  – If the reply is corrupted, the client retries another node.
– Well-known solution: Merkle tree. Problem: scalability.
– Salus’ solution: exploit the single writer and distribute the tree among the servers.

Scalable end-to-end verification (continued)
[Diagram: the Merkle tree is partitioned across servers 1–4; the client holds only the top of the tree]
– The client maintains the top tree.
– The client does not need to store anything persistently: it can rebuild the top tree from the servers.

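The sketch below shows one way the distributed-tree idea could look, heavily simplified: each server keeps a (here flat) hash tree over its own blocks, and the client keeps only one subtree root per server as the "top tree". The class and method names are hypothetical, not the Salus API.

```python
import hashlib

def h(*parts):
    # Hash a sequence of byte strings.
    return hashlib.sha256(b"|".join(parts)).hexdigest().encode()

class Server:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.leaves = [h(b) for b in self.blocks]

    def subtree_root(self):
        # Flat subtree for brevity; a real system would use a full tree
        # so proofs stay logarithmic in the number of blocks.
        return h(*self.leaves)

    def read(self, i):
        # Return the block plus the sibling hashes needed to recompute
        # the subtree root (the "proof"); None marks the requested slot.
        proof = [leaf if j != i else None for j, leaf in enumerate(self.leaves)]
        return self.blocks[i], proof

class Client:
    def __init__(self, servers):
        # The top tree: one subtree root per server. It is small, and a
        # restarted client can rebuild it by asking the servers again.
        self.top = [s.subtree_root() for s in servers]

    def verified_read(self, servers, server_id, i):
        block, proof = servers[server_id].read(i)
        leaves = [h(block) if p is None else p for p in proof]
        if h(*leaves) != self.top[server_id]:
            raise ValueError("corrupted reply; retry another replica")
        return block

servers = [Server([b"a", b"b"]), Server([b"c", b"d"])]
client = Client(servers)
print(client.verified_read(servers, 1, 0))  # b'c'
servers[1].blocks[0] = b"X"                 # server now returns corrupted data
try:
    client.verified_read(servers, 1, 0)
except ValueError as e:
    print(e)
```
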
Scalable and robust storage
– More hardware, more complex software, more failures

Start from “Scalable”
[Diagram: the “scalable” vs. “robust” axes again]

Pipelined commit: challenge
– Is 2PC slow?
  – Locking? There is a single writer, so no locking.
  – Phase 2 network overhead? Small compared to phase 1.
  – Phase 2 disk overhead? Eliminated, pushing the complexity to recovery.
– Ordering across transactions
  – Easy when failure-free.
  – Complicates recovery.
– Key challenge: recovery

Scalable architecture
– Computation nodes: no persistent storage
  – Data forwarding, garbage collection, etc.
  – Examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), …
– Storage nodes hold the persistent data

Salus: overview
– Block store interface (Amazon EBS):
  – Volume: a fixed number of fixed-size blocks
  – A single block driver per volume: barrier semantics (a sketch of the interface follows)
– Failure model: arbitrary failures, but no identity forging
– Key ideas:
  – Pipelined commit: write in parallel and in order
  – Active storage: replicate the computation nodes
  – Scalable end-to-end checks: the client verifies replies

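A minimal sketch of that kind of interface, with hypothetical names (the real driver API may differ): a volume of fixed-size blocks whose writes carry a sequence number and an optional barrier flag.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed block size for the sketch

@dataclass
class WriteRequest:
    seq: int           # position in the single driver's write stream
    block_no: int
    data: bytes
    barrier: bool = False

@dataclass
class Volume:
    num_blocks: int
    blocks: dict = field(default_factory=dict)
    applied: set = field(default_factory=set)  # sequence numbers already applied

    def read(self, block_no):
        return self.blocks.get(block_no, b"\0" * BLOCK_SIZE)

    def apply(self, req: WriteRequest):
        assert len(req.data) == BLOCK_SIZE and 0 <= req.block_no < self.num_blocks
        if req.barrier and any(s not in self.applied for s in range(1, req.seq)):
            # Barrier semantics: every earlier write must already be applied.
            raise RuntimeError("barrier write applied before earlier writes")
        self.blocks[req.block_no] = req.data
        self.applied.add(req.seq)

vol = Volume(num_blocks=1024)
vol.apply(WriteRequest(seq=1, block_no=10, data=b"\1" * BLOCK_SIZE))
vol.apply(WriteRequest(seq=2, block_no=11, data=b"\2" * BLOCK_SIZE, barrier=True))
```
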
Complex recovery
– Still need to guarantee barrier semantics.
– What if servers fail concurrently?
– What if different replicas diverge?

Supporting concurrent users
– No perfect solution yet.
– Give up linearizability: causal or fork-join-causal consistency (Depot).
– Give up scalability: SUNDR.

How to handle storage node failures?
– Computing nodes generate certificates when writing to storage nodes.
– If a storage node fails, the computing nodes agree on the current state.
– Approach: look for the longest valid prefix (see the sketch below), then re-replicate the data.
– As long as one correct storage node is available, no data is lost.

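An illustrative sketch of the "longest valid prefix" idea (not the actual recovery protocol; is_certified stands in for checking the computing nodes' certificates):

```python
def longest_valid_prefix(logs, is_certified):
    # logs: one dict per surviving storage node, {seq: entry}; seqs start at 1.
    merged = {}
    for log in logs:
        for seq, entry in log.items():
            if is_certified(entry):
                merged.setdefault(seq, entry)
    prefix = []
    seq = 1
    while seq in merged:        # stop at the first gap or uncertified entry
        prefix.append(merged[seq])
        seq += 1
    return prefix

# Example: node A has certified writes 1-3, node B has 1-2 plus an
# uncertified 4; the recoverable prefix is writes 1-3.
node_a = {1: "w1", 2: "w2", 3: "w3"}
node_b = {1: "w1", 2: "w2", 4: "bad"}
print(longest_valid_prefix([node_a, node_b], lambda e: e != "bad"))
# ['w1', 'w2', 'w3']
```
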
Why barrier semantics?
– Goal: provide atomicity.
– Example (file system): appending to a file modifies a data block and the corresponding pointer to that block. It is bad if the pointer is modified but the data is not.
– Solution: modify the data first and then the pointer, with the pointer write marked as a barrier (illustrated below).

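A toy illustration of that append example, assuming a block-driver API with a barrier flag (all names are hypothetical):

```python
class Driver:
    def __init__(self):
        self.log = []  # (block_no, data, barrier) in issue order

    def write(self, block_no, data, barrier=False):
        self.log.append((block_no, data, barrier))

def append_to_file(driver, data_block_no, inode_block_no, payload):
    # Write the data first (no ordering constraint among plain writes) ...
    driver.write(data_block_no, payload)
    # ... then update the pointer, marked as a barrier: the store must not
    # apply it until every earlier write (the data) has been applied, so a
    # crash can never leave a pointer to garbage.
    driver.write(inode_block_no, b"ptr->%d" % data_block_no, barrier=True)

d = Driver()
append_to_file(d, data_block_no=100, inode_block_no=1, payload=b"new bytes")
print(d.log)
```
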
Cost of robustness
– Increasing robustness and cost: crash tolerance → Salus → eliminating forks → fully BFT
  – Eliminating forks: use 2f+1 replication
  – Fully BFT: use signatures and N-way programming
