Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs. Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew Chien, and Haryadi S. Gunawi.
Presentation transcript:

Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs. Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman*, Andrew Chien, and Haryadi S. Gunawi (*ceres.cs.uchicago.edu)

"If your read is stuck behind an erase you may have to wait 10s of milliseconds. That's a 100x increase in latency variance." http://www.zdnet.com/article/why-ssds-dont-perform/ https://storagemojo.com/2015/06/03/why-its-hard-to-meet-slas-with-ssds/ See also The Tail at Scale [CACM'13].

[Figure: reads + writes on a clean/empty SSD (NoGC). Left: read latency over time, flat at ~0.3 ms. Right: the same data converted to a CDF (percentile vs. read latency), rising sharply at 0.3 ms.]

Long tail! [Figure: the same reads + writes on an aged/full SSD (with GC). Read latency over time is mostly ~0.3 ms but spikes up to 80 ms; the CDF shows ~3% of reads taking ≥5 ms.] Objective: cut the tail.

How does GC delay read I/Os? A read to an idle chip is fast, but a read stuck behind a GC is delayed: GC moves tens of valid pages, which keeps the channel and chip busy for tens of ms. [Figure: reads A and B issued over a shared channel to chips; B is delayed behind GC.]

How to cut tail latencies? Tail-tolerant techniques in distributed/storage systems leverage redundancy to cut the tail. RAID example: if the disk holding C is slow/busy, a full-stripe read of A, B, and parity P reconstructs it in parallel, C = XOR(A, B, P), so the would-be tail read is served fast.
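A minimal sketch of this reconstruction idea (buffer layout and names are illustrative, not from the talk): rebuilding the missing unit is a byte-wise XOR over the surviving data units and the parity.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096  /* illustrative page size */

/* Rebuild a missing/slow unit: out = units[0] ^ units[1] ^ ... ^ parity.
 * `units` holds the surviving data units plus the parity unit. */
void reconstruct_page(uint8_t *out, const uint8_t *units[], size_t n_units)
{
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        uint8_t x = 0;
        for (size_t u = 0; u < n_units; u++)
            x ^= units[u][i];   /* XOR across surviving units */
        out[i] = x;
    }
}
```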

How to cut tails in an SSD? Because flash error rates keep increasing, modern SSDs already employ RAIN (Redundant Array of Independent NAND). We similarly leverage RAIN to cut "tails": if the plane holding A is busy with GC, a full-stripe read reconstructs it as A = XOR(B, C, P), fast instead of slow.

Contribution. New techniques: Plane-Blocking GC, GC-Tolerant Read, Rotating GC, and GC-Tolerant Flush, built on current SSD technology: RAIN (parity-based redundancy).

Results. [Figure: latency CDFs (percentile vs. latency, 0.3 ms to 80 ms) for Base, +Plane-Blocking, +GC-Tolerant Read, +Rotating GC, and NoGC; each added technique pulls the curve closer to NoGC.]

Tiny tail! Overall results achieved, between the 99th and 99.99th percentiles: ttFlash is 1-3x slower than NoGC, while Base is 5-138x slower than NoGC.

Outline: Introduction; Background; Tiny-Tail Flash Design; Evaluation, limitations, conclusion.

SSD Internals. [Figure: channels C0..CN each connect multiple chips; a chip contains dies (Die[0], Die[1]), and each die contains planes (Plane[0]..Plane[N]).]

SSD Internals (continued). [Figure: inside a plane: each plane contains blocks (Block[0]..Block[N]), and each block contains pages, some of which hold valid data.]

GCed pages block the channel. Base GC in the SSD controller: for (1 … # of valid pages): (1) read the page to the controller (check with ECC), (2) write it to another block. Both steps cross the channel, so the channel is blocked while GC copies pages from the old block to an empty block.
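A hedged C sketch of this Base copy loop; all types and helpers (flash_read_to_ctrl and friends) are hypothetical stand-ins, not a real controller API.

```c
struct page;                                    /* opaque flash page */
struct block { int n_valid; struct page *valid[256]; };

extern void flash_read_to_ctrl(struct page *p);               /* over channel; ECC check */
extern void flash_write_from_ctrl(struct block *b, struct page *p); /* over channel */
extern void flash_erase(struct block *b);                     /* blocks the plane */

void base_gc(struct block *old_blk, struct block *empty_blk)
{
    /* Steps 1+2: each valid page crosses the channel twice, blocking it. */
    for (int i = 0; i < old_blk->n_valid; i++) {
        flash_read_to_ctrl(old_blk->valid[i]);
        flash_write_from_ctrl(empty_blk, old_blk->valid[i]);
    }
    /* Step 3 (next slide): erase the old block, keeping the plane busy. */
    flash_erase(old_blk);
}
```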

The erase operation blocks the plane. Step 3: erase the old block. While the erase runs, the plane is busy.

Channel-blocking GC (the "Base" approach). [Figure: reads A, B, and C share the channel with the GCing plane, so all of them are blocked.]

[Figure: latency CDF (percentile vs. latency, 0.3 ms to 80 ms): Base (channel-blocking) vs. NoGC; Base's tail stretches to 80 ms.]

Outline: Introduction; Background; Tiny-Tail Flash Design (Plane-Blocking GC, GC-Tolerant Read, Rotating GC, GC-Tolerant Flush); Evaluation, limitations, conclusion.

Plane-blocking GC: leverage the chips' intra-plane copyback support to unblock the channel. [Figure: under Base (channel blocking), reads A, B, C are blocked; with plane blocking, copyback keeps the copies inside the GCing plane and A, B, C proceed over the free channel.]

Plane Blocking vs. Base, inside the SSD controller. Base GC logic: for (every valid page): (1) flash read+write over the channel, (2) wait. Plane-Blocking GC logic: for (every valid page): (1) flash read+write inside the plane ("intra-plane copyback" from old block to empty block), (2) serve other user I/Os. The intra-plane copyback overlaps with channel usage for other, non-GCing planes.
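A sketch of the Plane-Blocking variant, reusing the hypothetical types of the Base sketch above; flash_copyback() and serve_pending_channel_io() are assumed helpers, not a documented chip command set.

```c
struct page;
struct block { int n_valid; struct page *valid[256]; };

extern void flash_copyback(struct block *dst, struct page *p); /* intra-plane move, no channel use */
extern void flash_erase(struct block *b);
extern void serve_pending_channel_io(void);  /* hypothetical scheduler hook */

void plane_blocking_gc(struct block *old_blk, struct block *empty_blk)
{
    for (int i = 0; i < old_blk->n_valid; i++) {
        /* The page never leaves its plane, so the channel stays free
         * and user I/Os to other (non-GCing) planes overlap with GC. */
        flash_copyback(empty_blk, old_blk->valid[i]);
        serve_pending_channel_io();
    }
    flash_erase(old_blk);  /* only this plane is blocked */
}
```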

[Figure: latency CDF adding +Plane-Blocking to Base (channel-blocking) and NoGC: only 1.5 - 3% of I/Os are now blocked by GC.]

Plane-blocking issues. Issue 1: no ECC check for garbage-collected pages (discussed later). Issue 2: the GCing plane itself still blocks; reads X, Y, and Z targeting that plane are delayed.

Outline: Introduction; Background; Tiny-Tail Flash Design (Plane-Blocking GC, RAIN + GC-Tolerant Read, Rotating GC, GC-Tolerant Flush); Evaluation, limitations, conclusion.

RAIN. Static mapping of LPNs (logical page numbers): LPN0 -> C[0]PG[0], LPN1 -> C[1]PG[0], and so on. Add parity: LPN 0, 1, 2 -> P0,1,2, with parity rotating across channels as in RAID-5:

C0      C1      C2      C3
0       1       2       P0,1,2   (PG0)
3       4       P3,4,5  5        (PG1)
6       P6,7,8  7       8        (PG2)
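A minimal sketch of this RAID-5-style static layout; the stripe width and the exact rotation arithmetic are illustrative assumptions, not specified by the talk.

```c
#include <stdint.h>

#define N_CHANNELS 4   /* illustrative stripe width: 3 data + 1 parity */

struct ppn { int channel; int plane_group; };

/* Map a logical page number to its channel and plane-group row,
 * skipping the rotating parity slot (RAID-5 style). */
struct ppn lpn_to_ppn(uint32_t lpn)
{
    struct ppn loc;
    uint32_t stripe = lpn / (N_CHANNELS - 1);   /* plane-group row      */
    uint32_t offset = lpn % (N_CHANNELS - 1);   /* data slot in the row */
    int parity_ch   = N_CHANNELS - 1 - (int)(stripe % N_CHANNELS); /* rotates */
    loc.plane_group = (int)stripe;
    /* Data slots at or past the parity channel shift right by one. */
    loc.channel = (offset >= (uint32_t)parity_ch) ? (int)offset + 1 : (int)offset;
    return loc;
}
```

With these constants, LPNs 0, 1, 2 land on C0, C1, C2 with parity on C3; LPN 5 lands on C3 because PG1's parity occupies C2, matching the table above.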

RAIN enables GC-Tolerant Read. Waiting for GC to release page 2 costs tens of ms. Instead, issue a full-stripe read and reconstruct: 2 = XOR(0, 1, P0,1,2). The pages are read in parallel and the XOR costs ~0.01 ms, so the would-be tail read is fast.

GC-Tolerant Read issue: partial-stripe reads. A partial-stripe read of page 2 must still reconstruct 2 = XOR(0, 1, P0,1,2), generating N-1 extra reads and adding contention on the other N-1 channels and planes, which can be slow. Convert to a full-stripe read only if Textra-reads < TGC.
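A sketch of that conversion rule; both timing estimators are hypothetical helpers standing in for whatever contention model the controller keeps.

```c
extern double estimate_extra_read_cost(int n_extra_reads); /* contention-aware estimate */
extern double remaining_gc_time(int gcing_plane);          /* time left in current GC   */

/* Serve a partial-stripe read via reconstruction only when the
 * N-1 extra reads cost less than simply waiting out the GC. */
int should_reconstruct(int gcing_plane, int n_extra_reads)
{
    double t_extra = estimate_extra_read_cost(n_extra_reads);
    double t_gc    = remaining_gc_time(gcing_plane);
    return t_extra < t_gc;   /* Textra-reads < TGC */
}
```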

[Figure: latency CDF adding +GC-Tolerant Read: the curve nearly matches NoGC, with ~0.5% of I/Os still affected.]

Issue: what if more than one GC runs in a plane group? One parity can cut only one tail; it cannot cut two. With two planes of PG0 GCing, a full-stripe read DOES NOT HELP: two tails remain.

Outline: Introduction; Background; Tiny-Tail Flash Design (Plane-Blocking GC, GC-Tolerant Read, Rotating GC, GC-Tolerant Flush); Evaluation, limitations, conclusion.

Rotating GC: postpone! At any time, at most one plane per plane group (e.g. PG0) may perform GC; the other planes must postpone their GC.

Rotating GC: rotating! The single GC slot rotates among the planes of the group; still, at most one plane per plane group performs GC at any time.

Rotating GC: at any time, at most one plane per plane group (PG0, PG1, PG2) may perform GC; concurrent GCs in different plane groups are permitted.
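A sketch of the Rotating-GC admission rule; the data layout is a hypothetical simplification (one "GC slot" per plane group), not the paper's data structures.

```c
#define PLANES_PER_GROUP 8

struct plane_group { int gc_active_plane; /* -1 if no plane is GCing */ };

/* At most one plane per plane group may GC; others must postpone. */
int may_start_gc(struct plane_group *pg, int plane_idx)
{
    if (pg->gc_active_plane != -1)
        return 0;                     /* group slot taken: postpone */
    pg->gc_active_plane = plane_idx;  /* claim the single GC slot   */
    return 1;
}

void finish_gc(struct plane_group *pg)
{
    pg->gc_active_plane = -1;  /* release; the slot rotates to the next plane */
}
```

Because each plane group has its own slot, GCs in PG0, PG1, and PG2 can run concurrently without ever breaking a full-stripe read within any single group.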

[Figure: latency CDF adding +Rotating GC: a tiny tail remains, ~0.5% of I/Os.] Why still tiny tails? Small/partial-stripe reads: sometimes it is better to wait for GC than to add extra reads and contention.

Outline: Tiny-Tail Flash Design (Plane-Blocking GC, GC-Tolerant Read, Rotating GC, GC-Tolerant Flush (in paper)); Evaluation; Limitations; Conclusion.

Implementation: SSDsim (~2,500 LOC), a device simulator; VSSIM (~900 LOC), QEMU/KVM-based, which runs Linux and applications; and OpenSSD, which has many limitations in its simple programming model. Future: ttFlash on OpenChannel SSD.

Evaluation. Simulator: SSDsim (verified against hardware). Workloads: 6 real-world Microsoft Windows traces. Settings and SSD parameters: SSD size 256 GB; plane group width = 8 planes (1 parity, 7 data).

Developer Tools Release Server trace. [Figure: latency CDFs for Base, +Plane-Blocking, +GC-Tolerant Read, +Rotating GC (= ttFlash), and NoGC.] Result at the 99.99th percentile: ttFlash is 3x slower than NoGC, while Base is 138x slower.

Evaluated on 6 Windows workload traces with various characteristics. Total blocked I/Os drop from 2 - 7% to 0.003 - 0.05%. Between the 99th and 99.99th percentiles, ttFlash is 1.0 - 2.6x slower than NoGC, versus 5.6 - 138.2x for Base.

Other evaluations. Filebench on VSSIM+ttFlash: ttFlash achieves better average latency than the base case. Vs. preemptive GC: ttFlash is more stable than semi-preemptive GC (with no idle time, preemptive GC builds up GC backlogs, creating latency spikes). [Figure: latency (s, up to 6) vs. elapsed time (s, 4386 - 4522): ttFlash remains stable while semi-preemptive GC spikes.]

Tradeoffs/Limitations. ttFlash depends on RAIN: one parity per N parallel pages/channels; we set N = 8, so we lose one channel out of 8, and average latencies are 1.09 - 1.33x slower than the NoGC, no-RAIN case. RAID also means more writes (P/E cycles): ttFlash increases P/E cycles by 15 - 18% for most workloads, and by more than 53% for TPCC and MSN (random writes). Finally, ECC is not checked during GC; we suggest background scrubbing (reads are fast, and scrubbing is not as urgent as GC). Important note: in ttFlash, foreground/user reads are still ECC-checked.

Tails under write bursts. [Figure: latency CDF with write bursts (latency up to 80 ms) for ttFlash at 55 MB/s, ttFlash at 64 MB/s, and Base at 64 MB/s.] Under write bursts at the high watermark, ttFlash must dynamically disable Rotating GC to ensure there are always enough free pages.
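A sketch of that safeguard; the watermark threshold and the free_pages() helper are illustrative assumptions, not values from the paper.

```c
#define HIGH_WATERMARK_FREE_PAGES 1024   /* hypothetical threshold */

extern int free_pages(int plane);        /* free pages left in a plane */

/* Under sustained write bursts, having free pages trumps tail latency:
 * once a plane drops below the watermark, suspend Rotating GC so every
 * plane may reclaim space immediately. */
int rotating_gc_enabled(int plane)
{
    return free_pages(plane) >= HIGH_WATERMARK_FREE_PAGES;
}
```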

Conclusion. ttFlash eliminates the GC-induced long tail. New techniques: Plane-Blocking GC, GC-Tolerant Read, Rotating GC, GC-Tolerant Flush, built on current SSD technology: a powerful controller, RAIN (parity-based redundancy), and capacitor-backed RAM. Overall results achieved, between the 99th and 99.99th percentiles: ttFlash is 1-3x slower than NoGC, while Base is 5-138x slower.

Thank you! Questions? http://ucare.cs.uchicago.edu https://ceres.uchicago.edu