1 uFLIP: Understanding Flash IO Patterns
Luc Bouganim, INRIA Rocquencourt, France
Philippe Bonnet, DIKU Copenhagen, Denmark
Björn Þór Jónsson, RU Reykjavík, Iceland

2 Why should we consider flash devices?
NAND flash chip typical timings (SLC chip):
- Read a 2 KB page: read page (25 µs), transfer (60 µs)
- Write a 2 KB page: transfer (60 µs), write page (200 µs)
- Erase before rewrite! (2 ms for a 128 KB block)
A single flash chip could potentially deliver a read throughput of 23 MB/s and a write throughput of 6 MB/s.
And… random access is potentially as fast as sequential access!
An SSD contains many (e.g., 8 or 16) flash chips: potential parallelism!
Flash devices have a high potential.
(Figure: flash cells and RAM buffer inside the NAND chip.)
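As a rough check (my own back-of-the-envelope derivation from the timings above, not numbers reported in the talk), the per-chip figures follow from the page timings; the write bound drops toward the quoted 6 MB/s once the erase cost is amortized over the 64 pages of a 128 KB block:

\[ \text{Read: } \frac{2\,\mathrm{KB}}{(25 + 60)\,\mu\mathrm{s}} \approx 23\text{--}24\ \mathrm{MB/s} \]
\[ \text{Write (no erase): } \frac{2\,\mathrm{KB}}{(60 + 200)\,\mu\mathrm{s}} \approx 7.7\ \mathrm{MB/s} \]
\[ \text{Write (erase amortized): } \frac{2\,\mathrm{KB}}{(260 + 2000/64)\,\mu\mathrm{s}} \approx 7\ \mathrm{MB/s} \]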

3 … but flash chips have many constraints
- I/O granularity: a flash page (2 KB)
- No in-place update: erase before write; erase granularity: a block (64 pages)
- Writes must be sequential within a flash block
- Limited lifetime: max 10^5 – 10^6 erase cycles per block
Usually, a software layer (the Flash Translation Layer) handles these constraints.
Flash devices are not flash chips:
- They do not behave like the flash chips they contain
- No access to the flash chip API, only to the device API
- Complex architecture and software, proprietary and undocumented
Flash devices are black boxes!
How can we model flash devices? First step: understand their performance.
Need for a benchmark.

4 The Flash Translation Layer (FTL)
The FTL emulates a normal block device, handling the flash constraints. Device API: Read(LBA, &data) / Write(LBA, data).
- Maps logical addresses (LBA) to physical locations (mapping information kept in the device RAM)
- When possible, redirects writes to previously erased locations (update blocks)
- Distributes erases across the device (wear levelling)
- Maintains other FTL data structures
(Figure: flash device internals — device RAM holding the FTL map and other FTL structures; flash holding regular blocks and update blocks; blocks in free, unfilled, and filled states; read, write, and erase operations.)
IO cost thus depends on:
- The mode of the IO (i.e., read or write)
- Recent IOs (caching in the device RAM)
- The device state (i.e., flash state and FTL data structures)
Device state depends on the entire history of previous IO requests.
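To make the out-of-place update idea concrete, here is a minimal page-mapping FTL sketch in Python (my own illustrative model, not the firmware of any real device, which is proprietary and far more complex):

```python
# Toy page-mapping FTL: illustrative sketch only. Real FTLs add hybrid
# block/page mappings, RAM caching, garbage collection and wear levelling.
PAGES_PER_BLOCK = 64  # erase granularity from slide 3: 64 pages per block

class ToyFTL:
    def __init__(self, num_blocks):
        self.flash = {}        # physical page number -> page data
        self.mapping = {}      # LBA -> physical page number (held in device RAM)
        self.free_pages = list(range(num_blocks * PAGES_PER_BLOCK))

    def write(self, lba, data):
        # Out-of-place update: never overwrite a page in place,
        # always take a previously erased one.
        if not self.free_pages:
            self._garbage_collect()
        phys = self.free_pages.pop(0)
        self.flash[phys] = data
        self.mapping[lba] = phys   # the old physical page (if any) becomes stale

    def read(self, lba):
        phys = self.mapping.get(lba)
        return self.flash.get(phys)

    def _garbage_collect(self):
        # Placeholder: a real FTL would copy valid pages out of a victim
        # block, erase it (tracking erase counts for wear levelling), and
        # return its pages to the free list.
        raise NotImplementedError("erase / victim selection not modelled")
```

Because every update goes to a fresh location, the cost the host sees for a given write depends on how many pre-erased pages remain and on the bookkeeping the FTL must do — which is exactly why IO cost depends on the device state and its history.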

5 Benchmarking flash devices: goal and difficulties
Why do we need to benchmark flash devices?
- DB technology relies on HD characteristics …
- … flash devices will replace or complement HDs …
- … and we have a poor knowledge of flash devices:
  - Flash devices are black boxes (complex and undocumented FTLs)
  - Large range, from USB flash drives to high-performance flash boards
Benchmarking flash devices is difficult:
- Need to design a sound benchmarking methodology
  - IO cost is highly variable and depends on the whole device history!
- Need to define a broad benchmark
  - No safe assumption can be made about the device behavior (black box)
  - Moreover, we do not want to restrict the benchmark usage!

6 Methodology (1): Device state
Measuring Samsung SSD RW performance, out of the box …
(Figure: Random Writes – Samsung SSD, out of the box.)

7 Methodology (1): Device state
Measuring Samsung SSD RW performance, out of the box … and after filling the device!!! (Similar behavior on the Intel SSD.)
(Figures: Random Writes – Samsung SSD, out of the box; Random Writes – Samsung SSD, after filling the device.)

8 Methodology (2): Startup and running phases
When do we reach a steady state? How long should each test run?
(Figures: startup and running phases for the Mtron SSD (RW); running phase for the Kingston DTI flash drive (SW).)
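One simple way to answer these questions empirically is sketched below (a heuristic of mine, not the exact procedure used by uFLIP): compare a sliding window of per-IO response times against the tail of a long run and declare the state steady once they agree, which gives a candidate value for the number of startup IOs to ignore.

```python
def find_steady_state(response_times, window=128, tolerance=0.05):
    """Return the index after which the windowed mean response time stays
    within `tolerance` (relative) of the tail mean of the run: a heuristic
    for choosing the startup length (IOIgnore); the remaining measurements
    suggest how large IOCount can be."""
    tail_mean = sum(response_times[-window:]) / window
    for i in range(0, len(response_times) - window, window):
        w_mean = sum(response_times[i:i + window]) / window
        if abs(w_mean - tail_mean) <= tolerance * tail_mean:
            return i                     # startup phase ends here
    return len(response_times)           # never converged: run longer
```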

9 Methodology (3): Interferences between consecutive runs
(Figure: setup experiment for the Mtron SSD.)

10 Proposed methodology
- Device state: enforce a well-defined device state by performing random write IOs of random size over the whole device.
  - The alternative, sequential IOs, is less stable and thus more difficult to enforce.
- Startup and running phase: run experiments to define:
  - IOIgnore: number of IOs ignored when computing statistics
  - IOCount: number of measurements needed for those statistics to converge
- Interferences: introduce a pause between runs.
  - Run the following experiment: SR, then RW, then SR (with a large IOCount).
  - Measure the interferences (about 3,000 IOs in the previous experiment).
  - Overestimate the length of the pause.
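A sketch of how the device-state step could be scripted (a hypothetical helper, not the released uFLIP tool): random writes of random size until the whole target has been covered at least once. A real implementation would open the raw device with O_DIRECT and sector-aligned buffers so that the OS cache does not hide device behavior.

```python
import os, random

def enforce_device_state(path, device_size, min_io=512, max_io=64 * 1024):
    """Put the device in a well-defined state: issue random writes of
    random (sector-multiple) size over the whole address space until at
    least device_size bytes have been written. Sketch only: real runs
    should target the raw device with O_DIRECT and aligned buffers."""
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < device_size:
            size = random.randrange(min_io, max_io + 1, 512)    # multiple of 512 B
            offset = random.randrange(0, device_size - size, 512)
            os.pwrite(fd, os.urandom(size), offset)
            written += size
    finally:
        os.close(fd)
```

Between two measured runs, the benchmark would then sleep for the over-estimated pause (longer than the roughly 3,000 IOs of interference observed above), so that the asynchronous work triggered by one run does not pollute the measurements of the next.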

11 uFLIP (1): Basic construct: the IO pattern
- An IO pattern is a sequence of IOs.
- An IO is defined by 4 attributes: (time, size, LBA, mode).
- Baseline patterns: Sequential Read (SR), Random Read (RR), Sequential Write (SW), Random Write (RW).
- More patterns are obtained by using parameterized functions for each attribute:
  - time: Consecutive, Pause (Pause), Burst (Pause, Burst)
  - size: Size
  - LBA: Sequential, Random, Ordered (Incr), Partitioned (Partitions)
  - mode: Read, Write

12 uFLIP (1): Basic construct: the IO pattern (continued)
An IO pattern is a sequence of IOs; an IO is defined by 4 attributes (time, size, LBA, mode); baseline patterns (SR, RR, SW, RW); more patterns via parameterized functions for each attribute.
Potentially relevant IO patterns:
- Basic patterns: one function for each attribute
- Mixed patterns: combining basic patterns
- Parallel patterns: replicating a basic pattern, or mixing basic patterns in parallel
Problems:
- The space of IO patterns is too large!
- Mixed and parallel patterns may be too complex to analyze
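As an illustration of the attribute model, a hypothetical generator for the four baseline patterns might look like this (names and signature are mine, not taken from the uFLIP software; the LBA is expressed as a byte offset for simplicity):

```python
import random

def baseline_pattern(mode, sequential, io_size, target_size, count, start=0):
    """Yield (pause_before, size, lba, mode) tuples for one baseline pattern:
    SR / RR / SW / RW depending on mode ('R' or 'W') and sequential."""
    slots = target_size // io_size            # number of IO-sized slots in the target
    for i in range(count):
        if sequential:
            lba = start + (i % slots) * io_size
        else:
            lba = start + random.randrange(slots) * io_size
        yield (0, io_size, lba, mode)         # pause_before = 0: consecutive IOs

# Example: 1,024 random writes of 16 KB over a 1 GB target area.
rw_pattern = baseline_pattern('W', sequential=False, io_size=16 * 1024,
                              target_size=1 << 30, count=1024)
```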

13 uFLIP (2): What is a uFLIP micro-benchmark?
- An execution of a reference pattern is a run.
  - Measure the response time of individual IOs.
  - Compute statistics (min, max, mean, standard deviation) to summarize it.
- A collection of runs of the same pattern is an experiment.
  - Restricted to a single varying parameter, for sound analysis.
- A collection of related experiments is a micro-benchmark.
  - Defined over the baseline patterns with the same varying parameter.
- 9 varying parameters, hence 9 micro-benchmarks:
  - Basic patterns: IOSize, IOShift, TargetSize, Partitions, Incr, Pause, Burst
  - Mixed patterns: Ratio (mix only two baseline patterns)
  - Parallel patterns: ParallelDegree (replicate each baseline pattern in parallel)
(Nesting: a micro-benchmark contains experiments, an experiment contains runs, a run contains IOs.)
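Each run could then be summarized as below (a sketch assuming the per-IO response times have already been collected; IOIgnore and IOCount are the parameters defined on slide 10):

```python
import statistics

def summarize_run(response_times, io_ignore, io_count):
    """Skip the first io_ignore startup measurements, then summarize the
    next io_count response times with the statistics named on the slide:
    min, max, mean and standard deviation."""
    running = response_times[io_ignore:io_ignore + io_count]
    return {
        "min": min(running),
        "max": max(running),
        "mean": statistics.mean(running),
        "stdev": statistics.stdev(running),
    }
```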

14 uFLIP (3): The 9 micro-benchmarks

#  Micro-benchmark  Varying parameter  Question
1  Granularity      IOSize             Basic performance? Device latency?
2  Alignment        IOShift            Penalty for badly aligned IOs?
3  Locality         TargetSize         IOs focused on a reduced area?
4  Partitioning     Partitions         IOs in several partitions?
5  Order            Incr               Reverse pattern, in-place pattern, IOs with gaps?
6  Parallelism      ParallelDegree     IOs in parallel?
7  Mix              Ratio              Mixing two baseline patterns?
8  Pause            Pause              Device capacity to benefit from idle periods?
9  Bursts           Burst              Asynchronous overhead accumulation over time?

15 Results
Granularity for the Memoright SSD:
- For SR, SW and RR: linear behavior, almost no latency, good throughput with large IO sizes.
- For RW: about 5 ms per IO for 16 KB–128 KB IOs.
Locality for the Samsung, Memoright and Mtron SSDs:
- When limited to a focused area, RW performs very well.

16 Results: summary
- SR, RR and SW are very efficient.
- Flash devices incur a large latency for RW.
- Random writes should be limited to a focused area.
- Sequential writes should be limited to a few partitions.
- Good support for reverse and in-place patterns.
- Surprisingly, no device supports parallel IO submission.

17 Conclusion
The uFLIP benchmark:
- Sound methodology: device preparation and setup for stable measurements
- Broad: 9 micro-benchmarks, 39 experiments
- Detailed: 1,400 runs, 1 to 5 million IOs … for a single device!
- Simple: an experiment = a 2-dimensional graph
- Publicly available
First results: flash devices exhibit similar behaviors, despite their differences in cost / complexity / interface.
Current & future work:
- Short term: visualization tool with several levels of summarization
- Enhance the software: setup parameters, benchmark duration, …
- Exploit the benchmark results!

18 Experiment color analysis (figure: experiment, run, and IO views)

19 Details

20 Conclusion
The uFLIP benchmark:
- Sound methodology: device preparation and setup for stable measurements
- Broad: 9 micro-benchmarks, 39 experiments
- Detailed: 1,400 runs, 1 to 5 million IOs … for a single device!
- Simple: an experiment = a 2-dimensional graph
- Publicly available
First results: flash devices exhibit similar behaviors, despite their differences in cost / complexity / interface.
Current & future work:
- Short term: visualization tool with several levels of summarization
- Enhance the software: setup parameters, benchmark duration, …
- Exploit the benchmark results!

21 Questions?

22

23 Selecting a device

24 Device result summary
Micro-benchmarks and the baseline patterns covered:
- Granularity: SR, RR, SW, RW
- Alignment: SR, RR, SW, RW
- Locality: SR, RR, SW, RW
- Partitioning: RW
- Order: RW
- Parallelism: SR, RR, SW, RW
- Mix: RR/RW, SW/RW, SR/RR, SR/SW, SR/RW, RR/SW
- Pause: SR, RR, SW, RW
- Bursts: SR, RR, SW, RW
Legend: experiment marked as interesting / experiment marked as not interesting / not performed.

25 Experiment analysis

26 Experiment color analysis (figure: experiment, run, and IO views)

27 Details