Parallel Garbage Collection in Solid State Drives (SSDs)


Parallel Garbage Collection in Solid State Drives (SSDs) Narges Shahidi PhD candidate @ Pennsylvania State University

Outline
Solid State Drives
- NAND Flash Chips
- Update process in SSDs
- Flash Translation Layer (FTL)
- Garbage Collection process
Proposed Parallel Garbage Collection
- Presented at Supercomputing (SC'16)

Storage Systems
SSDs are replacing HDDs in enterprise and client applications:
- Increased performance (~400 IOPS in HDDs vs. >6K IOPS in SSDs)
- Lower power (6-15 W in HDDs vs. 2-5 W in SSDs)
- Smaller form factor
- Variety of device interfaces
- No acoustic noise
But:
- Higher price (~4x HDD)
- Limited endurance

What is different in NAND Flash SSDs?
- Read/write/erase latency: 100 us / 1000 us / 3~5 ms; read is ~10x faster than write
- Erase-before-write: a cell value can change from 1→0, but not from 0→1; the erase unit is a block, while the write unit is a page
- Endurance: flash cells wear out after a limited number of P/E cycles (1,000 - 100,000), resulting in flash capacity reduction
[Figure: a NAND flash chip is organized into blocks, each containing pages.]

Updating a page in SSD
Updating a page in place is very expensive: read the whole block, erase the whole block, change the single page, and write back the whole block.
Log-based update: write the new data to any other location → needs a mapping table to map logical to physical addresses.
Step 1: Translate the logical page address (LPA) to a physical page address via the mapping table.
Step 2: Invalidate the old page and program a new page.
Step 3: Update the mapping table with the new physical address.
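A minimal sketch of this log-based update path; the mapping table, page-state table, and free-page list are simplified stand-ins for what a real FTL maintains:

```python
mapping_table = {}              # logical page address (LPA) -> physical page (PPA)
page_state = {}                 # PPA -> "valid" / "invalid"
free_pages = list(range(1024))  # pre-erased physical pages
flash = {}                      # PPA -> data

def update_page(lpa, data):
    # Step 1: translate the LPA via the mapping table.
    old_ppa = mapping_table.get(lpa)
    # Step 2: invalidate the old page and program a new one.
    if old_ppa is not None:
        page_state[old_ppa] = "invalid"
    new_ppa = free_pages.pop(0)
    flash[new_ppa] = data
    page_state[new_ppa] = "valid"
    # Step 3: update the mapping table with the new physical address.
    mapping_table[lpa] = new_ppa
```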

Flash Firmware (Flash Translation Layer)
An SSD needs extra space for updates, called over-provisioning: capacity that is invisible to the user. SSDs typically have 7%-28% over-provisioned space. E.g., a 1GB SSD exposes 10^9 bytes to the user while its physical flash capacity is 2^30 bytes, i.e., ~7% over-provisioning.
Needs a mapping of logical to physical addresses (mapping table).
Stale data needs to be erased to free up space for future updates → needs Garbage Collection.
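The decimal-vs-binary gap in that 1GB example, worked out:

```python
# A "1 GB" drive advertises 10^9 bytes but ships 2^30 bytes of flash.
advertised = 10**9
physical = 2**30
op = (physical - advertised) / advertised
print(f"over-provisioning: {op:.1%}")  # -> over-provisioning: 7.4%
```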

SSD Layout
Levels of parallelism:
- System-level parallelism: channels and chips
- Flash-level parallelism: dies and planes
Flash-level parallelism is not studied as much; it needs hardware support. Flash vendors provide multi-plane (two-plane) operations.

Performance metrics in SSD
Three basic metrics:
- IOPS (I/O Operations Per Second)
- Throughput or bandwidth (MB/s)
- Response time or latency (ms): average and maximum response time
Access pattern of the workload:
- Random/sequential: the random or sequential nature of the data address requests
- Block size: the data transfer length
- Read/write ratio: the mix of read and write operations
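The first two metrics are linked through the block size; a quick illustration with made-up numbers:

```python
# Throughput = IOPS x transfer size. Numbers are illustrative only.
iops = 6000            # e.g., the SSD random-access figure quoted earlier
block_size = 4 * 1024  # 4 KB requests
throughput = iops * block_size / 10**6
print(f"{throughput:.1f} MB/s")  # -> 24.6 MB/s
```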

Garbage Collection: Why?
Update-in-place is not possible in flash memories; updates mark pages invalid, and Garbage Collection reclaims the invalid pages.
Garbage Collection causes high tail latency: even a small amount of updates can trigger garbage collection and violate the SLA. A flash chip cannot respond to IO requests during Garbage Collection (GC), which leads to high tail latency.
Background GC is a solution, but enterprise SSDs run 24x7.

Garbage Collection: How?
Step 1: Select a block to erase (the victim block) using the GC algorithm.
Step 2: Move the valid pages out of the block to another location in the SSD.
Step 3: Erase the block.
Moving valid pages requires reading them and writing them to a new location. This increases the number of writes to the SSD (write amplification). More writes mean more erases → reduced flash cell lifetime. Moving pages also occupies channels and flash chips, delaying service of normal requests → tail latency.
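A minimal sketch of the three steps, assuming simplified data structures (`blocks` maps a block id to the {physical page: LPA} pairs of its valid data; `flash` and `free_pages` are as in the earlier sketch):

```python
def garbage_collect(blocks, mapping_table, free_pages, flash):
    # Step 1: select a victim block (greedy: fewest valid pages to move).
    victim = min(blocks, key=lambda b: len(blocks[b]))
    # Step 2: move each valid page out. Every move costs one flash read
    # plus one flash write; these extra writes are the write amplification
    # that wears cells and delays normal requests.
    for ppa, lpa in blocks[victim].items():
        new_ppa = free_pages.pop(0)
        flash[new_ppa] = flash.pop(ppa)   # read old page, program new page
        mapping_table[lpa] = new_ppa      # keep the mapping table consistent
    # Step 3: erase the victim block; its pages become free again
    # (per-block free-page accounting is elided for brevity).
    blocks[victim] = {}
```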

Garbage Collection: When?
Based on the amount of free space in the SSD:
- Free space < background GC threshold → start background GC
- Free space < GC threshold → start on-demand GC and continue until free space climbs back to the background GC threshold
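The two-threshold trigger as a sketch; the threshold values below are illustrative, not taken from the paper:

```python
BG_GC_THRESHOLD = 0.25   # start background GC below 25% free space (assumed)
GC_THRESHOLD = 0.10      # start on-demand GC below 10% free space (assumed)

def gc_decision(free_fraction):
    if free_fraction < GC_THRESHOLD:
        # On-demand GC competes with host IO until free space
        # climbs back above the background threshold.
        return "on-demand GC until free space > BG_GC_THRESHOLD"
    if free_fraction < BG_GC_THRESHOLD:
        return "background GC during idle periods"
    return "no GC"
```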

Performance effect of GC
Consistent and predictable performance is one of the most important metrics for storage workloads, especially for enterprise SSDs; the tail latency penalty is harmful because it violates performance consistency.
Update-in-place is not possible in flash memories: overwrites mark old pages invalid, and the Garbage Collection that reclaims invalid blocks results in large tail latencies.
Client SSDs run at a 20/80 duty cycle (20% active / 80% idle), so a larger delta between minimum and maximum response time is tolerable. Enterprise SSDs use a larger over-provisioned area and offer higher steady-state sustained performance.

Steady State

“Exploiting the Potential of Parallel Garbage Collection in SSDs for Enterprise Storage Systems” Presented at: Supercomputing (SC) 2016, Salt Lake City, Utah

High level of parallelism in SSDs
Levels of parallelism:
- System-level parallelism: channels and chips
- Flash-level parallelism: dies and planes
Flash-level parallelism is not studied as much; it needs hardware support. Flash vendors provide multi-plane (two-plane) operations.
Multi-plane operations launch multiple reads/writes/erases on the planes of the same die, enabling simultaneous operations on two pages in parallel, one in each plane, at the latency of a single read/write/erase operation. Multi-plane operation can improve throughput by 100% using cache mode.
[Speaker notes] Solid state drives have four levels of parallelism. First come channels; channels are a shared medium and are connected to flash chips (flash packages). A flash package contains 1, 2, or 4 dies, and within a die there are several planes. The die is the smallest independent unit; however, within a die, some flash vendors provide multi-plane operations. The two planes of a single die cannot operate independently, but they can work synchronously using multi-plane (two-plane) operations, which launch multiple reads/writes/erases on the planes of a single die at the latency of a single read/write/erase operation; the channel remains the shared medium for transferring data. Using cache mode, a multi-plane operation can double the throughput: simultaneous read and write operations to two pages, and simultaneous erase operations to two blocks, one in each plane of a single die.

Multi-Plane command
Restrictions on multi-plane commands:
- Same physical die
- Restrictions on the physical address: identical page address bits
These restrictions reduce the opportunity to leverage plane-level parallelism, causing idle time in planes and low plane-level utilization.
Plane-level parallelism can be improved by:
- Plane-first allocation, which improves the chance to leverage multi-plane operations
- Super-pages: stitching pages of different planes together into one large page
Although these approaches can improve flash-level parallelism, their benefit still depends heavily on the workload.
[Speaker notes] Plane-first allocation strategies allocate flash-level resources first, rather than chip or channel, in an attempt to leverage flash-level parallelism.
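A sketch of the pairing rule above, with an assumed address layout (real chips differ in exactly which address bits must match):

```python
from collections import namedtuple

# Hypothetical flash operation descriptor.
Op = namedtuple("Op", "die plane block page")

def can_fuse(a, b):
    return (a.die == b.die            # same physical die
            and a.plane != b.plane    # one operation per plane
            and a.page == b.page)     # identical page address bits

# Example: fusable pair on die 0, planes 0 and 1, same page offset.
print(can_fuse(Op(0, 0, 12, 7), Op(0, 1, 40, 7)))  # -> True
```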

Response Time
The response time of an IO request includes waiting time and service time:
- Service time: command and data transfer + operation latency
- Waiting time (resource conflict: the die/channel is busy servicing another IO request):
  - Conflict with non-GC operations (CnGC)
  - Conflict with GC on the target plane (CsGC)
  - Conflict with GC on the other plane (CoGC)
  - Conflict with non-GC requests caused by GC, or late conflict (LC)
[Speaker notes] Which latency components contribute to the response time?
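The decomposition written out as a formula, using the component names above:

```python
def response_time(cngc, csgc, cogc, lc, transfer, operation):
    waiting = cngc + csgc + cogc + lc   # the four conflict components
    service = transfer + operation      # command/data transfer + flash op
    return waiting + service
```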

IO Request Response Time
We considered several types of applications and examined how the response time breaks down into these waiting components.

Plane-level Parallel Garbage Collection (PaGC)
[Figure: timelines of IO requests on Plane 1 and Plane 2. When one plane starts on-demand GC, the other plane, instead of idling, launches an early GC so the two run in parallel.]

Plane-level Parallel Garbage Collection
Why? Leverage the idle-time opportunity and improve plane-level parallelism.
When? When a plane starts on-demand GC, make it a parallel GC.
How? Leverage multi-plane operations. Garbage Collection is: 1) selecting a victim block, 2) moving valid pages, 3) erasing the block.
Challenges: a multi-plane erase is straightforward, but moving valid pages is more challenging.

Parallel GC: How?
- Parallel Read - Parallel Write (PR-PW): try to find the maximum number of PR-PW moves possible. Can be made even faster with copy-back operations → multi-plane copy-back read/write. Latency = 1 read + 1 write.
- Serial Read - Parallel Write (SR-PW): read is ~10x faster than write, so read the two pages serially and write them in parallel. Latency = 2 reads + 1 write.
- Serial Read - Serial Write (SR-SW): used when one of the victim blocks has more valid pages than the other.
[Figure: victim and active blocks on Plane 1 and Plane 2, showing pages paired across planes for each move mode.]
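A small latency model for the three modes, per pair of pages moved, using the read/write timings from the evaluation setup later in the talk (75/1500 us); the SR-SW cost is simply two independent single-plane moves:

```python
T_READ, T_WRITE = 75, 1500  # microseconds, from the experimental setup

def move_latency(mode):
    if mode == "PR-PW":   # multi-plane read + multi-plane write
        return 1 * T_READ + 1 * T_WRITE
    if mode == "SR-PW":   # two serial reads, one parallel write
        return 2 * T_READ + 1 * T_WRITE
    if mode == "SR-SW":   # fully serial fallback: two single-plane moves
        return 2 * T_READ + 2 * T_WRITE
    raise ValueError(mode)

# PR-PW moves a pair of pages in 1575 us, SR-PW in 1650 us, SR-SW in
# 3150 us, so the planner prefers PR-PW whenever the address
# restrictions allow it.
```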

Parallel GC: How?
[Figure: timeline comparison. Traditional GC: one plane moves pages and erases while the other plane sits idle. Parallel GC: both planes move pages together (PR-PW / SR-PW / SR-SW) and then erase together with a multi-plane erase.]

Parallel GC: When?
Launching blind parallel GC can cause very inefficient GC and increase plane busy time. We instead use a second threshold, PaGC Th, to decide when to launch PaGC: around 5% above the traditional GC threshold in our experiments. This is Threshold-based Parallel Garbage Collection (T-PaGC).
[Figure: timelines with the GC Th and PaGC Th marks; the sibling plane joins with PR-PW / SR-PW / SR-SW moves followed by a parallel erase.]
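The T-PaGC trigger as a sketch; the exact form of "~5% more" is an assumption here (relative margin), and the base threshold is illustrative:

```python
GC_TH = 0.10             # on-demand GC threshold, fraction free (assumed)
PAGC_TH = GC_TH * 1.05   # "~5% more" than the traditional threshold (assumed form)

def should_join_parallel_gc(sibling_in_gc, my_free_fraction):
    # The idle plane joins its sibling's on-demand GC only if it is
    # itself close enough to needing GC, avoiding blind early GC.
    return sibling_in_gc and my_free_fraction < PAGC_TH
```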

Parallel GC: When?
Threshold-based Cache-aware Parallel Garbage Collection.
[Figure: timelines with PaGC Th and GC Th, comparing the threshold-based variant with the cache-aware variant of parallel GC.]

Garbage Collection Efficiency
[Figure: GC efficiency when a collection reclaims 1 block vs. 2 blocks.]

Experimental Results
Simulation platform: trace-driven simulation with SSDSim.
Workloads: UMass Trace Repository and Microsoft Research Cambridge traces.
SSD organization: 800GB (25% over-provisioning); host interface: PCIe 2.0; 16 channels, 8 chips/channel; channel rate: 333 MT/s.
Flash organization: capacity 8GB (per chip), MLC; 1 die/chip, 2 planes/die; block size: 2MB; page size: 8KB; read/write/erase latency: 75/1500/3800 us.
[Speaker notes] References for the traces can be found in the paper.

[Figures: page move breakdown across PR-PW, SR-PW, SR-SW (self & other); GC efficiency of PaGC normalized to the baseline (max = 2).]

[Figures: IO request response time; plane GC active time (%). GC-Self: GC on the same plane; GC-Other: GC on the other plane.]

Q & A Thanks!

GC Selection Algorithm
- Baseline: greedy algorithm: select the block with the minimum number of valid pages.
- RGA (Randomized Greedy Algorithm), or d-choices: select a random window of d blocks, then use the greedy algorithm within the window to select the victim block.
- Random: select the victim block randomly.
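The three policies as minimal sketches; `valid_count` is an assumed helper returning a block's number of valid pages:

```python
import random

def greedy(blocks, valid_count):
    # Baseline: the block with the fewest valid pages.
    return min(blocks, key=valid_count)

def rga(blocks, valid_count, d):
    # Randomized greedy (d-choices): sample a window of d blocks,
    # then run greedy within the window.
    window = random.sample(blocks, d)
    return min(window, key=valid_count)

def random_select(blocks):
    # Pure random victim selection.
    return random.choice(blocks)
```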

Sensitivity Analysis and Comparison
- Change the allocation strategy: plane-first allocation
- Compare with super-page