Internal Parallelism of Flash Memory-Based Solid-State Drives

Internal Parallelism of Flash Memory-Based Solid-State Drives
Presented by: Nafiseh Mahmoudi, Spring 2017

Paper Information
Authors:
Publication: ACM Transactions on Storage (TOS), 2016
Type: Research Paper

High-speed data processing demands high storage I/O performance. Flash-memory-based SSDs, which have no moving parts, have received strong interest. Internal parallelism is a unique and rich resource embedded inside SSDs.

Internal parallelism in SSDs
SSDs have several inherent architectural limitations:
A single flash memory package can only provide limited bandwidth (e.g., 32-40MB/sec), which is unsatisfactory for use as high-performance storage.
Writes in flash memory are often much slower than reads.
The FTL, running at the device firmware level, needs to perform several critical internal operations, which can incur high overhead.
A highly parallelized design yields two benefits:
Transferring data in parallel can provide high bandwidth.
High-latency operations can be hidden behind other concurrent operations.

CONTENTS
Critical issues for investigation
Background of SSD architecture
Experimental system and methodology
Detecting the SSD internals
Experimental and case studies on SSDs
The interaction with external parallelism

Critical issues for investigation
We strive to answer several questions in order to:
Fully understand the internal parallelism of SSDs
Gain insight into this unique resource of SSDs
Reveal some untold facts and unexpected dynamics in SSDs

BACKGROUND OF SSD ARCHITECTURE
NAND Flash Memory
Flash Translation Layer (FTL)
SSD Architecture and Internal Parallelism
Command Queuing

1. NAND Flash Memory
NAND flash is classified into:
SLC: stores only one bit per cell
MLC: stores 2 bits or even more per cell
TLC: stores 3 bits per cell
A NAND flash memory package is organized hierarchically into dies, planes, blocks, and pages. Each page has:
A data part (4-8KB)
An associated metadata area (128 bytes), used for storing ECC
Flash memory has three major operations:
Read and write, performed in units of pages
Erase, performed in units of blocks

2. Flash Translation Layer (FTL)
The FTL is implemented to emulate a hard disk.
Three key roles of the FTL:
Logical block mapping: maps logical block addresses (LBAs) to physical block addresses (PBAs)
Garbage collection: handles the no-in-place-write issue by reclaiming pages holding stale data
Wear leveling: shuffles cold blocks with hot blocks to even out wear
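To make the mapping and no-in-place-write roles concrete, here is a minimal sketch of a page-level mapping table with out-of-place updates. All names and policies in it are assumptions made for illustration; real FTLs (and the drives studied in the paper) are far more sophisticated.

```python
# Minimal page-level FTL sketch: every write goes to a fresh physical page
# (flash pages cannot be overwritten in place), and the old copy is marked
# invalid so garbage collection can later reclaim it.

class SimpleFTL:
    def __init__(self, num_pages):
        self.mapping = {}                          # LBA -> physical page number
        self.free_pages = list(range(num_pages))   # pool of erased pages
        self.invalid = set()                       # pages holding stale data

    def write(self, lba):
        """Out-of-place update: remap the LBA to a new page."""
        if lba in self.mapping:
            self.invalid.add(self.mapping[lba])    # old copy becomes garbage
        if not self.free_pages:
            raise RuntimeError("no free pages: garbage collection needed")
        ppn = self.free_pages.pop(0)
        self.mapping[lba] = ppn
        return ppn

    def read(self, lba):
        return self.mapping.get(lba)               # physical location of the data

ftl = SimpleFTL(num_pages=8)
ftl.write(lba=3)          # first write of LBA 3
ftl.write(lba=3)          # update: remapped to a new page, old page invalidated
print(ftl.read(3), ftl.invalid)
```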

3. SSD Architecture and Internal Parallelism
An SSD includes four major components:
1. Host interface logic
2. Controller
3. RAM
4. Flash memory controller (connected to the flash memory packages)
Different levels of parallelism:
Channel-level parallelism
Package-level parallelism
Die-level parallelism
Plane-level parallelism

4. Command Queuing
Native Command Queuing (NCQ): a feature introduced by SATA II.
The device can accept multiple incoming commands and schedule the jobs internally.
NCQ enables SSDs to achieve truly parallel data access internally.
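The sketch below illustrates the host-side idea: keeping several requests outstanding so the NCQ-capable drive has something to schedule internally. The device path, request size, and thread-pool approach are placeholder assumptions for illustration; the paper's measurements use dedicated tools issuing direct I/O on the raw device rather than buffered reads.

```python
# Keep QUEUE_DEPTH reads in flight at once so the drive's internal scheduler
# can exploit its channel/package/die/plane parallelism.
import os, random
from concurrent.futures import ThreadPoolExecutor

DEVICE = "/path/to/raw/block/device"   # placeholder path
REQ_SIZE = 4096                        # 4KB requests
RANGE_BYTES = 1024 * 1024 * 1024       # confine offsets to the first 1GB
QUEUE_DEPTH = 8                        # number of concurrent outstanding reads

def read_at(fd, offset):
    return os.pread(fd, REQ_SIZE, offset)

fd = os.open(DEVICE, os.O_RDONLY)
offsets = [random.randrange(0, RANGE_BYTES, REQ_SIZE) for _ in range(1000)]

# With QUEUE_DEPTH worker threads, several commands are queued at the device
# simultaneously instead of being issued strictly one after another.
with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
    list(pool.map(lambda off: read_at(fd, off), offsets))
os.close(fd)
```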

Critical issues for investigation
Background of SSD architecture
Experimental system and methodology
Detecting the SSD internals
Experimental and case studies on SSDs
The interaction with external parallelism

MEASUREMENT ENVIRONMENT
Experimental Systems
Experimental Tools and Workloads
Trace Collection

1. Experimental Systems
The experimental studies were conducted on a Dell Precision T3400, equipped with an Intel Core 2 Duo E7300 2.66GHz processor and 4GB of main memory. A 250GB Seagate 7200RPM hard drive holds the OS and home directories. The system runs Fedora Core 9 with Linux kernel 2.6.27 and the Ext3 file system.
Two representative SSDs were selected, targeting different markets:
One is built on MLC flash memory and designed for the mass market.
The other is a high-end product built on faster and more durable SLC flash memory.

2. Experimental Tools and Workloads
To study the internal parallelism of SSDs, two tools (the Intel Open Storage Toolkit and a custom-built tool) are used to generate workloads with the following three access patterns:
Sequential: requests of a specified size issued back to back, starting from sector 0.
Random: blocks are randomly selected from the first 1,024MB of the storage space.
Stride: starting from sector 0, with a stride distance between two consecutive accesses.
Each workload runs for 30 seconds by default, to limit trace size while still collecting sufficient data.
All workloads access the SSDs directly as raw block devices.
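For clarity, the three access patterns can be expressed as byte-offset generators, as in the sketch below. This is only an illustration, not the Intel Open Storage Toolkit or the custom tool; the sector size, region size, and the interpretation of the stride as the gap between consecutive requests are assumptions.

```python
# Offset generators for the sequential, random, and stride patterns.
import random

SECTOR = 512                          # assumed bytes per sector
REGION = 1024 * 1024 * 1024           # random offsets drawn from the first 1,024MB

def sequential(req_size, count):
    """Back-to-back requests starting from sector 0."""
    return [i * req_size for i in range(count)]

def rand_pattern(req_size, count):
    """Requests at random, request-size-aligned offsets within REGION."""
    return [random.randrange(0, REGION, req_size) for _ in range(count)]

def stride(req_size, stride_sectors, count):
    """Starting from sector 0, leave a fixed gap between consecutive requests."""
    step = req_size + stride_sectors * SECTOR
    return [i * step for i in range(count)]

print(sequential(4096, 4))            # [0, 4096, 8192, 12288]
print(stride(4096, 8, 4))             # [0, 8192, 16384, 24576]
```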

3. Trace Collection
To analyze the I/O traffic, blktrace is used to trace the I/O activity. The tool intercepts I/O events in the OS kernel, such as:
Queuing
Dispatching
Completion
The completion events report the latency of each individual request.
The trace data are first collected in memory and then copied to the HDD, to minimize the interference caused by tracing.
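A per-request latency can be recovered from such a trace by pairing the queue and completion events of the same request, as in the sketch below. The field layout assumed here (device, cpu, sequence, timestamp, pid, action, rwbs, starting sector) follows blkparse's default text output, but that layout is an assumption for illustration; adjust the indices to match the actual trace format.

```python
# Pair queue (Q) and completion (C) events on the same starting sector to
# compute the latency of each request from a blkparse text dump.

def request_latencies(trace_path):
    queued = {}          # starting sector -> queue timestamp
    latencies = []       # (sector, latency in seconds)
    with open(trace_path) as trace:
        for line in trace:
            fields = line.split()
            if len(fields) < 8:
                continue                 # skip blank/summary lines
            time, action, sector = fields[3], fields[5], fields[7]
            if action == "Q":
                queued[sector] = float(time)
            elif action == "C" and sector in queued:
                latencies.append((sector, float(time) - queued.pop(sector)))
    return latencies

# Example (hypothetical file name): latencies = request_latencies("ssd_run.txt")
```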

Detecting the SSD internals
Critical issues for investigation
Background of SSD architecture
Experimental system and methodology
Detecting the SSD internals
Experimental and case studies on SSDs
The interaction with external parallelism

UNCOVERING SSD INTERNALS
1. Internal parallelism is an architecture-dependent resource.
2. We present a set of experimental approaches to expose the SSD internals:
A Generalized Model
Chunk Size
Interleaving Degree
The Mapping Policy

A Generalized Model
Domain: a set of flash memories that share a specific set of resources (e.g., channels). A domain can be further partitioned into sub-domains (e.g., packages).
Chunk: a unit of data that is continuously allocated within one domain. Chunks are placed in an interleaved manner over a set of N domains, following a mapping policy.
Chunk size: the size of the largest unit of data that is continuously mapped within an individual domain.
Stripe: a set of chunks, one in each of the N domains.

A general goal is to minimize resource sharing and maximize resource utilization. Three key factors directly determine the internal parallelism:
Chunk size: the size of the largest unit of data that is continuously mapped within an individual domain.
Interleaving degree: the number of domains at the same level. The interleaving degree is essentially determined by the redundancy of the resources (e.g., the number of channels).
Mapping policy: the method that determines the domain to which a chunk of logical data is mapped. This policy determines the physical data layout.
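The sketch below makes the chunk/domain model concrete: logical data is split into chunks of a fixed size, and a mapping policy assigns each chunk to one of N domains. The particular policy used here (round-robin striping by chunk number) and the numeric parameters are assumptions for the example, not the layout of any real SSD.

```python
# Chunk/domain model: which domain (e.g., channel) a logical byte offset maps to.

CHUNK_SIZE = 4096        # assumed chunk size in bytes
NUM_DOMAINS = 10         # interleaving degree, e.g., number of channels

def chunk_number(byte_offset):
    """Which chunk a logical byte offset falls into."""
    return byte_offset // CHUNK_SIZE

def domain_of(byte_offset):
    """Mapping policy: stripe consecutive chunks round-robin over the domains."""
    return chunk_number(byte_offset) % NUM_DOMAINS

# Consecutive chunks land on different domains, so a large sequential request
# already spreads across channels, whereas two small requests that map to the
# same domain must be served one after the other.
for offset in range(0, 12 * CHUNK_SIZE, CHUNK_SIZE):
    print(offset, "-> domain", domain_of(offset))
```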

Experimental and case studies on SSDs
Critical issues for investigation
Background of SSD architecture
Experimental system and methodology
Detecting the SSD internals
Experimental and case studies on SSDs
The interaction with external parallelism

PERFORMANCE STUDIES
What are the benefits of parallelism?
How do parallel reads and writes interfere with each other and cause performance degradation?
How does I/O parallelism impact the effectiveness of readahead?
How does an ill-mapped data layout impact I/O parallelism?
Readahead on an ill-mapped data layout

1. Benefits of I/O Parallelism in SSDs
Run four workloads with different access patterns:
Sequential read
Sequential write
Random read
Random write

Small and random reads yield the most significant performance gains. A large request benefits less from added parallelism, because its continuous logical blocks are already striped across domains and thus already exploit internal parallelism.
To exploit internal parallelism effectively, we can either increase request sizes or parallelize small requests.
Highly parallelized small random accesses can achieve performance comparable to large sequential accesses without parallelism.
Writes on SSD-M quickly reach the peak bandwidth.
There is less difference between the two SSDs for reads.

The performance potential of increasing I/O parallelism is physically limited by the redundancy of the available resources: when the queue depth reaches about 8-10 jobs, further increasing parallelism yields diminishing benefits. When the queue depth exceeds 10, a channel has to be shared by more than one job. However, over-parallelizing does not cause undesirable negative effects.
The performance benefit of parallelizing I/O also depends on the flash memory medium: MLC and SLC flash memories show a performance difference, especially for writes. The MLC-based lower-end SSD quickly reaches its peak bandwidth (only about 80MB/sec), while the SLC-based higher-end SSD shows a much higher peak bandwidth (around 200MB/sec) and more headroom for parallelizing small writes.

2. Parallel Reads and Writes Interfere with Each Other
Both operations share many critical resources (e.g., the ECC engine).
Parallel jobs accessing such shared resources need to be serialized.
Reads and writes therefore interfere strongly with each other.
The significance of this interference depends highly on the read access patterns.

3. Impact of I/O Parallelism on the Effectiveness of Readahead
Sequential reads on an SSD can outperform random reads even without parallelism, because modern SSDs implement a readahead mechanism at the device level that detects sequential access patterns and prefetches data into the on-device buffer.
Parallelizing multiple sequential read streams can produce a sequence of mingled reads, which disturbs the detection of sequential patterns and can invalidate readahead.
There is therefore a tradeoff between increasing parallelism and retaining effective readahead.
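The toy sketch below shows why mingled streams defeat readahead. The detector it assumes (prefetch only after a few back-to-back sequential requests) is a deliberately simplified assumption, not the actual firmware logic of any SSD, but it reproduces the qualitative effect: each stream alone is recognized as sequential, while the interleaving of two streams is not.

```python
# Compare how many requests a simple sequential-pattern detector would serve
# from prefetched data, for isolated streams versus two interleaved streams.

def readahead_hits(offsets, req_size, window=2):
    """Count requests recognized as sequential by a detector that requires
    `window` consecutive sequential requests before prefetching."""
    hits, run, last_end = 0, 0, None
    for off in offsets:
        run = run + 1 if off == last_end else 0   # extend or reset the sequential run
        if run >= window:
            hits += 1                             # this request would hit prefetched data
        last_end = off + req_size
    return hits

req = 4096
stream_a = [i * req for i in range(8)]                  # one sequential stream
stream_b = [10_000_000 + i * req for i in range(8)]     # a second sequential stream

interleaved = [x for pair in zip(stream_a, stream_b) for x in pair]
print(readahead_hits(stream_a + stream_b, req))   # many sequential hits
print(readahead_hits(interleaved, req))           # almost none: the patterns are mingled
```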

Thank you