
1 The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments
Mingzhe Hao, Andrew A. Chien, Haryadi S. Gunawi, Gokul Soundararajan, Deepak Kenchammana-Hosekote

2 Fast storage devices
15K RPM to ~20K RPM disks
(Image source: 3D XPoint, a new memory architecture that claims to outclass both DDR4 and NAND)

3 Disk/SSD performance failures?
Fast? Not always. Yes, disks and SSDs suffer performance failures. Why?
Disks: weak disk head, bad packaging, missing screws, broken/old fans, too many disks per box, firmware bugs, bad sector remapping. Bandwidth drops by 80%, and seconds of delay are introduced.
SSD/Flash: firmware bugs, garbage collection (GC), etc. 4-100x slowdown.
Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems [SoCC '13, HotCloud '13]

4 Mingzhe, these anecdotes are great, but…
Really serious problems? How often? Transient or permanent?
Great questions!... Hmmm...

5 The Tail at Store
Study of over 450,000 disks and 4,000 SSDs
In production (customer deployments), deployed as RAID groups, observed for 87 days on average
Total: 857 million disk hours and 7 million SSD hours
The largest study of storage performance variability

6 Mingzhe, these anecdotes are great, but…
Really serious problems? How often? Transient or permanent?
Great questions!... We HAVE the data!

7 Outline
Methodology: dataset, major metrics
Slowdown characterizations
Temporal & spatial analysis
Workload analysis
Conclusion

8 Dataset
RAID group layout: D D … D P Q (data drives plus parity drives P and Q)

                         Disk          SSD
#RAID groups             38,029        572
#Data drives per group   3-26          3-22
#Data drives             458,482       4,069
Total drive hours        857,183,442   7,481,055
Total RAID hours         72,046,373    1,072,690

9 Metrics
Per RAID group with drives i = 1…N, per hour:
Li = hourly average I/O latency of drive i
Lmedian = median(L1…LN)
Slowdown: Si = Li / Lmedian
Slow drive hour: Si ≥ 2x
Example: hourly latencies of 10, 9, 22, 8, 25 ms give Lmedian = 10 ms, so the slowdowns are 1.0x, 0.9x, 2.2x, 0.8x, 2.5x; the last two drives count as slow.
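The slide's metric can be sketched in a few lines of Python (function names are illustrative, not from the paper's tooling):

```python
import statistics

def slowdowns(hourly_latencies_ms):
    """Per-drive slowdowns S_i = L_i / L_median for one RAID group
    in one hour, following the slide's definition."""
    l_median = statistics.median(hourly_latencies_ms)
    return [li / l_median for li in hourly_latencies_ms]

def slow_drive_hours(s, threshold=2.0):
    """Indices of drives whose hour counts as slow (S_i >= 2x)."""
    return [i for i, si in enumerate(s) if si >= threshold]

# The slide's example: five drives with hourly average latencies in ms
s = slowdowns([10, 9, 22, 8, 25])          # median is 10 ms
print([round(x, 1) for x in s])            # [1.0, 0.9, 2.2, 0.8, 2.5]
print(slow_drive_hours(s))                 # [2, 4] -> two slow drives
```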

10 Longest tail
Full-stripe workload: a full-stripe I/O must wait for the slowest drive, so its latency follows the longest tail.
Longest tail: T = max(S1…SN) per hour
Tail hour: T ≥ 2x, i.e., at least one drive is ≥ 2x slower
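Under the same per-hour setup, the longest-tail metric is just the maximum per-drive slowdown; a minimal self-contained sketch:

```python
import statistics

def longest_tail(hourly_latencies_ms):
    """T = max(S_1..S_N): the worst per-drive slowdown in a RAID hour.
    A full-stripe I/O waits for the slowest drive, so stripe latency
    follows T."""
    l_median = statistics.median(hourly_latencies_ms)
    return max(li / l_median for li in hourly_latencies_ms)

def is_tail_hour(hourly_latencies_ms, threshold=2.0):
    """A 'tail hour' means at least one drive is >= 2x slower."""
    return longest_tail(hourly_latencies_ms) >= threshold

print(longest_tail([10, 9, 22, 8, 25]))   # 2.5
print(is_tail_hour([10, 9, 22, 8, 25]))   # True
print(is_tail_hour([10, 9, 11, 8, 12]))   # False (T = 1.2)
```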

11 Outline
Methodology
Slowdown characterizations
Temporal & spatial analysis
Workload analysis
Conclusion

12 Slowdown distribution (disk)
Slowdown values Si = Li / Lmedian, computed per drive per hour
800+ million slowdown values
(Figure: hourly slowdown samples, e.g., 1pm-4pm; a higher slowdown is bad.)

13 Slowdown distribution (disk)
Si = 2x at the 99.8th percentile: 0.2% of drive hours are ≥ 2x slower (in 1000 drive hours, 2 hours are ≥ 2x slower)
Si = 1.5x at the 99.3th percentile: 0.7% of drive hours are ≥ 1.5x slower (in 1000 drive hours, 7 hours are ≥ 1.5x slower)
Si = 1.1x, 1.2x, … (more analysis possible)
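Reading the distribution this way amounts to measuring the empirical tail fraction. A toy sketch (the data below is made up; only the arithmetic mirrors the slide's 0.2% and 0.7% figures):

```python
def fraction_at_least(slowdown_values, threshold):
    """Share of drive hours with S_i >= threshold. 'S_i = 2x at the
    99.8th percentile' means this fraction is 0.002 at threshold 2.0."""
    return sum(s >= threshold for s in slowdown_values) / len(slowdown_values)

# Toy data (not the paper's 800+ million values): 1000 drive hours,
# 2 of them >= 2x slower and 5 more between 1.5x and 2x.
toy = [1.0] * 993 + [1.6, 1.7, 1.7, 1.8, 1.9, 2.2, 2.5]
print(fraction_at_least(toy, 2.0))   # 0.002 -> "2 hours per 1000"
print(fraction_at_least(toy, 1.5))   # 0.007 -> "7 hours per 1000"
```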

14 Slowdown distribution (SSD)
Si = 2x at the 99.4th percentile: 0.6% of drive hours are ≥ 2x slower (in 1000 drive hours, 6 hours are ≥ 2x slower)
Si = 1.5x at the 98.7th percentile: 1.3% of drive hours are ≥ 1.5x slower (in 1000 drive hours, 13 hours are ≥ 1.5x slower)
The SSD slowdown distribution is worse than the disk distribution.

15 Longest tail: T = max(S1…SN) per hour
One T value per RAID hour: 72 million T values for disk RAIDs, and 1 million T values for SSD RAIDs (from the total RAID hours in the dataset).

16 Tail distribution (disk)
T = 2x at the 98.5th percentile: 1.5% of RAID hours (in 1000 RAID hours, 15 hours have at least one disk ≥ 2x slower)
T = 1.5x at the 95.4th percentile: 4.6% of RAID hours (in 1000 RAID hours, 46 hours have at least one disk ≥ 1.5x slower)
T is much worse than Si.

17 Tail distribution (SSD)
T = 2x at the 97.8th percentile: 2.2% of RAID hours (in 1000 RAID hours, 22 hours have at least one SSD ≥ 2x slower)
T = 1.5x at the 95.2th percentile: 4.8% of RAID hours (in 1000 RAID hours, 48 hours have at least one SSD ≥ 1.5x slower)

18 Slowdown is not uncommon!

                      DISK     SSD
2x slow hours (%)     0.22%    0.58%
1.5x slow hours (%)   0.69%    1.27%
2x tail hours (%)     1.54%    2.23%
1.5x tail hours (%)   4.56%    4.83%

Millions of slow disk hours and tens of thousands of slow SSD hours.
Hard to get performance stability at the 95th to 99.9th percentile!
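The headline counts follow from these percentages and the drive-hour totals on the dataset slide; a quick arithmetic check:

```python
disk_hours = 857_183_442   # total disk drive hours (dataset slide)
ssd_hours = 7_481_055      # total SSD drive hours (dataset slide)

slow_disk_hours = disk_hours * 0.0022   # 0.22% of disk hours >= 2x slow
slow_ssd_hours = ssd_hours * 0.0058     # 0.58% of SSD hours >= 2x slow

print(f"{slow_disk_hours:,.0f} slow disk hours")  # roughly 1.9 million
print(f"{slow_ssd_hours:,.0f} slow SSD hours")    # roughly 43 thousand
```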

19 Outline
Methodology
Slowdown characterizations
Temporal & spatial analysis
Workload analysis
Conclusion

20 Slowdown interval
Q: How long can a slowdown last (the slowdown interval)?
40% of disk slowdown intervals last ≥ 2 hours
12% of disk slowdown intervals last ≥ 10 hours
Many slowdowns happen in consecutive hours.
(Timeline figure: ≥ 2x slowdown episodes lasting 1 and 3 hours.)

21 Slowdown inter-arrival
Q: What are the inter-arrival times between slow drive hours?
90% of disk slowdown inter-arrivals are within 24 hours
85% of SSD slowdown inter-arrivals are within 24 hours
A slow drive hour is a good indicator of further slowdowns.

22 Slow drive population
Q: How many drives have experienced at least one slowdown within the dataset's time range (87 days)?
25% of disks have seen a ≥ 2x slowdown
29% of SSDs have seen a ≥ 2x slowdown
A large portion of drives have experienced slowdowns, so replacing slow drives is not a solution.

23 Outline
Methodology
Slowdown characterizations
Temporal & spatial analysis
Workload analysis: I/O rate and size imbalance
Conclusion

24 Q: Why is a drive slow?
I/O rate imbalance? (e.g., RI = 4x: a drive receiving 4x the median I/O count while being ≥ 2x slower)
I/O size imbalance? (e.g., ZI = 10x: a drive receiving 10x the median I/O size while being ≥ 2x slower)

25 I/O rate imbalance is not a primary cause!
Only 5% of slow drive hours coincide with ≥ 2x more I/Os (RI ≥ 2x); the rest show balanced rates (RI ≈ 1x).

26 I/O size imbalance is not a primary cause!
Only 2% of slow drive hours coincide with ≥ 2x larger I/Os (ZI ≥ 2x); the rest show balanced sizes (ZI ≈ 1x).
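Both imbalance metrics have the same shape as the slowdown metric: a drive's hourly value divided by the RAID median. That definition is an assumption inferred from how Si is defined; the names and numbers below are purely illustrative:

```python
import statistics

def imbalance(values):
    """Per-drive imbalance relative to the RAID-group median for one hour.
    Applied to I/O counts this gives RI; applied to I/O sizes, ZI."""
    med = statistics.median(values)
    return [v / med for v in values]

# Toy RAID hour (illustrative, not from the dataset):
ios_per_drive = [100, 100, 100, 100, 400]   # drive 4 gets 4x the I/Os
sizes_kb = [64, 64, 64, 64, 64]             # equal request sizes

print(imbalance(ios_per_drive)[4])  # 4.0 -> RI = 4x on drive 4
print(imbalance(sizes_kb)[4])       # 1.0 -> no size imbalance
```

The study's point is that slow hours mostly occur with RI and ZI near 1x, so workload imbalance cannot explain the slowdowns.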

27 Other correlations: disk age
Older disks are more unstable; performance instability grows worse with age.

28 Other correlations: flash cells and vendors
Performance instability: SLC < MLC (SLC is more stable than MLC).
Vendor matters: instability varies across vendors.

29 Other findings
No correlation to time of day (0:00-24:00); nightly background events are not a factor.
No explicit drive events around slow hours; slowdown is a "silent" fault.
The slow drive replacement rate is low: only 4-8% of slow drives are unplugged within 24 hours after a slowdown, and …% of the unplugged drives are replugged.

30 Conclusion
Drive performance variability is real: stable performance is hard to achieve, rate and size imbalance are not a factor, and slowdowns are "silent" events (internal drive complexities?).
Tail drives in RAID: a single slow drive affects the entire RAID's performance (20 drives per RAID is common).
We need tail tolerance at the low-level RAID layer.

31 Questions? Thank you! http://ucare.cs.uchicago.edu

