StarFish: highly-available block storage Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, Elizabeth Shriver 2003 USENIX Annual Technical Conference Presenter: D 林敬棋
Introduction Important data needs to be protected. ◦ By making replicas. Replication on remote sites ◦ Reduces the amount of data lost in a failure. ◦ Decreases the time required to recover from a catastrophic site failure.
StarFish A highly-available, geographically-dispersed block storage system. ◦ Does not require expensive dedicated communication lines to all replicas to achieve high availability. ◦ Achieves good performance even during recovery from a replica failure. ◦ Provides single-owner access semantics.
Architecture StarFish consists of ◦ One Host Element (HE) Provides storage virtualization and a read cache. ◦ N Storage Elements (SEs) Q: write quorum size. Writes go synchronously to a quorum of Q SEs, and asynchronously to the rest.
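The write path above can be sketched as follows. This is a minimal illustrative sketch, not StarFish's actual implementation: `StorageElement`, `HostElement`, the in-memory `store`, and the `pending` queue are hypothetical names.

```python
# Sketch of StarFish-style quorum writes: synchronous to Q SEs,
# asynchronous to the remaining N - Q (hypothetical, simplified).
from collections import deque

class StorageElement:
    """A replica that stores blocks by address."""
    def __init__(self):
        self.store = {}

    def write(self, addr, data):
        self.store[addr] = data

class HostElement:
    """Virtualizes N SEs: a write completes once Q SEs hold it."""
    def __init__(self, ses, q):
        assert 1 <= q <= len(ses)
        self.ses, self.q = ses, q
        self.pending = deque()          # async updates not yet applied

    def write(self, addr, data):
        # Synchronous part: the write is acknowledged after Q SEs apply it.
        for se in self.ses[:self.q]:
            se.write(addr, data)
        # Asynchronous part: queue the update for the remaining SEs.
        for se in self.ses[self.q:]:
            self.pending.append((se, addr, data))

    def drain(self):
        # Propagate queued updates (would run in the background).
        while self.pending:
            se, addr, data = self.pending.popleft()
            se.write(addr, data)
```

After `write` returns, exactly Q replicas hold the new block; `drain` brings the rest up to date, mirroring the recommended N = 3, Q = 2 setup.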
Recommended Setup N = 3, Q = 2 MAN: Metropolitan Area Network WAN: Wide Area Network
Another Deployment
SE Recovery Write log ◦ HE keeps a circular buffer of recent writes. ◦ Each SE maintains a circular buffer of recent writes on a log disk. Three types of recovery ◦ Quick recovery ◦ Replay recovery ◦ Full recovery
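The circular write log can be sketched as a bounded buffer of (sequence number, write) pairs: replay recovery re-sends the suffix a lagging SE missed, and if the needed entries have already been overwritten, a full recovery is required instead. A hypothetical sketch, not StarFish's on-disk log format:

```python
# Sketch of a circular write log supporting replay recovery
# (hypothetical, not StarFish's actual on-disk format).
from collections import deque

class WriteLog:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest entries drop off the front
        self.next_seq = 0

    def append(self, write):
        self.buf.append((self.next_seq, write))
        self.next_seq += 1

    def replay_from(self, seq):
        """Entries with sequence >= seq, or None if they were overwritten
        (the recovering SE then needs a full recovery, not a replay)."""
        if self.buf and self.buf[0][0] > seq:
            return None                     # log no longer reaches that far back
        return [w for s, w in self.buf if s >= seq]
```

A recovering SE reports the last sequence number it applied; `replay_from` returning `None` corresponds to falling back from replay recovery to full recovery.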
Availability and Reliability Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second. Similarly, the HE fails and recovers with Poisson rates λ_he and μ_he.
Availability The steady-state probability that at least Q SEs are available. Derived from the standard machine repairman model.
Machine Repairman Model
Availability(cont.)
Availability (cont.) "Number of 9s": the count of 9s in an availability measure (e.g., 0.999 has three 9s). Much higher availability is achieved when N = 2Q + 1. For fixed N, availability decreases with larger quorum size. ◦ Increasing the quorum size trades off availability for reliability.
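The "at least Q of N SEs available" probability can be checked numerically. A simplified sketch of the machine-repairman result, assuming SEs fail and recover independently so each SE is up with probability p = μ/(λ + μ), and ignoring HE failures and correlated network outages:

```python
# P(at least q of n independent SEs are up), each up with probability p.
# Simplified i.i.d. model; the paper's full machine-repairman analysis
# also accounts for the HE and shared repair dynamics.
from math import comb

def quorum_availability(n, q, p):
    """Steady-state probability that a write quorum of size q is reachable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(q, n + 1))
```

With per-SE availability p = 0.99 and N = 3, this gives about 0.999999 for Q = 1, 0.999702 for Q = 2, and 0.9703 for Q = 3, illustrating the slide's point that availability decreases as the quorum grows.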
Reliability The probability of no data loss. Reliability increases with larger Q. Two approaches ◦ Make Q > floor(N/2) and require at least Q SEs to be available. Reduces availability and performance. ◦ Read-only consistency
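The reason Q > floor(N/2) protects committed data is that any two quorums of that size must share at least one SE, so every committed write survives on some member of any future quorum. An illustrative exhaustive check (not from the paper):

```python
# Verify that any two write quorums of size q among n SEs intersect,
# which is the property that keeps a committed write reachable.
from itertools import combinations

def quorums_intersect(n, q):
    ses = range(n)
    return all(set(a) & set(b)
               for a, b in combinations(combinations(ses, q), 2))
```

For N = 3, quorums of size 2 always overlap, while quorums of size 1 do not; this is the counting argument behind the majority-quorum requirement.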
Read-only Consistency Available in read-only mode during a failure. ◦ Read-only mode obviates the need for Q SEs to be available to handle updates. ◦ Increases availability.
Availability with Read-only Consistency
Observations If ρ_he = 0, availability is independent of Q. ◦ The system can always recover from the HE. If ρ_he increases, availability increases with Q. The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1. ◦ Diminishing gains after Q = 2. ◦ Suggests Q = 2 in practical systems.
Availability with Read-only Consistency (cont.) N < 2Q
Implementation
Performance Measurements Compares with a direct-attached RAID unit.
Settings Different network delays ◦ 1, 2, 4, 8, 23, 36, 65 ms Different bandwidth limitations ◦ 31, 51, 62, 93, 124 Mb/s. Benchmark: ◦ Micro-benchmark Read hit Read miss Write ◦ PostMark
Effects of network delays and HE cache size Near SE delay: 4 ms; Far SE delay: 8 ms No cache misses when the HE cache size is 400 MB
Observation A large HE cache improves performance. ◦ The HE can respond to more read requests without communicating with an SE. Does not change write requests. ◦ Especially beneficial when the local SE has significant delays. With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE. ◦ It depends on the near SE instead.
Normal Operation and placement of the far SE Far SE delay groups: 1–8 ms (1, 2, 4, 8 ms); 4–12 ms (4, 8, 12 ms); 23–65 ms (23, 36, 65 ms). Bandwidths: 31, 51, 62, 93, 124 Mbps Local SE delay: 0 ms N = 3
Normal Operation and placement of the far SE(Cont.) N = 3 8 threads
Normal Operation and placement of the far SE(Cont.)
Observation Performance is influenced mostly by two parameters ◦ Write quorum size ◦ Delay to the SEs StarFish can provide adequate performance when one of the SEs is placed in a remote location. ◦ At least 85% of the performance of a direct-attached RAID.
Recovery Performance degrades more during full recovery.
Conclusion The StarFish system reveals significant benefits from a third copy of the data at an intermediate distance. A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than % availability assuming individual Storage Element availability of 99%.