
Design Patterns for Tunable and Efficient SSD-based Indexes
Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella

Large hash-based indexes
– WAN optimizers [Anand et al. SIGCOMM ’08]
– De-duplication systems [Quinlan et al. FAST ’02]
– Video proxy [Anand et al. HotNets ’12]
Requirements: ≈20K lookups and inserts per second (for a 1Gbps link) and a hash table of ≥32GB.

Use of large hash-based indexes
– WAN optimizers
– De-duplication systems
– Video proxy
Where to store the indexes?

[Figure: comparison of storage options; SSD is annotated "8x less" and "25x less".]

What’s the problem?
The assumed need: domain/workload-specific optimizations are required to get an SSD-based index with high performance and low overhead. (This talk argues that is a false assumption.)
Existing designs have…
– Poor flexibility: they target a specific point in the cost-performance spectrum
– Poor generality: they only apply to specific workloads or data structures

Our contributions
Design patterns that ensure:
– High performance
– Flexibility
– Generality
Indexes based on these principles:
– SliceHash
– SliceBloom
– SliceLSH

Outline
– Problem statement
– Limitations of the state of the art
– SSD architecture
– Parallelism-friendly design patterns: SliceHash (a streaming hash table)
– Evaluation

State-of-the-art SSD-based index: BufferHash [Anand et al. NSDI ’10]
– Designed for high throughput
[Figure: an in-memory incarnation of the hash table plus older incarnations on the SSD, each guarded by a Bloom filter; a lookup hashes the key and probes the incarnations.]
– 4 bytes per K/V pair!
– 16 page reads in worst case! (average: ≈1)

State-of-the-art SSD-based index: SILT [Lim et al. SOSP ’11]
– Designed for low memory + high throughput
[Figure: Log, Hash table, and Sorted stores on the SSD, with an in-memory index.]
– ≈0.7 bytes per K/V pair
– 33 page reads in worst case! (average: 1)
– High CPU usage!
Takeaway: existing designs target specific workloads and objectives → poor flexibility and generality, and they do not leverage the SSD’s internal parallelism.

SSD Architecture
[Figure: an SSD controller fans out over 32 channels; each channel connects several flash memory packages (128 in total); each package contains dies, each die contains planes with a data register, each plane contains blocks, and each block contains pages.]
How does the SSD architecture inform our design patterns?
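To make the hierarchy concrete, here is a minimal Python sketch of the geometry described above (not part of the talk). The channel and package counts mirror the figure; the dies-per-package, page size, and block size are illustrative assumptions.

    from dataclasses import dataclass

    # Minimal model of the SSD geometry in the figure: a controller fans out over
    # channels; each channel hosts flash packages; packages contain dies, dies
    # contain planes (each with a data register), planes contain blocks, and
    # blocks contain pages.
    @dataclass
    class SSDGeometry:
        channels: int = 32
        packages_per_channel: int = 4      # 32 * 4 = 128 packages, as in the figure
        dies_per_package: int = 2          # "Die 1 ... Die n" (n = 2 assumed)
        planes_per_die: int = 2
        pages_per_block: int = 64          # assumption
        page_size: int = 2048              # bytes, assumption

        @property
        def block_size(self) -> int:
            return self.pages_per_block * self.page_size

        @property
        def total_packages(self) -> int:
            return self.channels * self.packages_per_channel

    if __name__ == "__main__":
        geo = SSDGeometry()
        print(geo.total_packages, "packages,", geo.block_size, "bytes per block")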

Four design principles
I. Store related entries on the same page
II. Write to the SSD at block granularity
III. Issue large reads and large writes
IV. Spread small reads across channels
[Figure: SliceHash laid out over flash memory packages, blocks, pages, and channels.]

I. Store related entries on the same page
Many hash table incarnations, as in BufferHash.
[Figure: each SSD page holds sequential slots from a specific incarnation, so a lookup that hashes to slot 5 must read a page from every incarnation.]
Multiple page reads per lookup!

I. Store related entries on the same page
Slicing: store the same hash slot from all incarnations on the same page.
[Figure: each SSD page now holds a slice, i.e., a specific slot from all incarnations; a lookup for slot 5 touches a single page.]
Only 1 page read per lookup!
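A small sketch contrasting the two layouts (the slot-to-page arithmetic, NUM_INCARNATIONS, and SLOTS_PER_PAGE are illustrative assumptions, not the authors' code): with per-incarnation pages a lookup may touch one page per incarnation, while with slicing all incarnations' copies of a slot share one page.

    NUM_INCARNATIONS = 32
    SLOTS_PER_PAGE = 32          # entries that fit on one flash page (assumed)

    def pages_without_slicing(slot: int):
        """Per-incarnation layout: each page holds sequential slots of one
        incarnation, so a lookup may need one page read per incarnation."""
        page_within_incarnation = slot // SLOTS_PER_PAGE
        return [(i, page_within_incarnation) for i in range(NUM_INCARNATIONS)]

    def page_with_slicing(slot: int) -> int:
        """Sliced layout: one page holds slot `slot` from all incarnations
        (assuming one slice per page), so a single read suffices."""
        return slot

    if __name__ == "__main__":
        print(len(pages_without_slicing(5)), "page reads without slicing")   # 32
        print("1 page read with slicing, at page", page_with_slicing(5))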

II. Write to the SSD at block granularity
– Insert into a hash table incarnation in RAM.
– Divide the hash table so that all slices fit into one block; this unit is a SliceTable.
[Figure: an in-memory incarnation and a SliceTable occupying one SSD block.]
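A sketch of how the table can be partitioned so one SliceTable fits a single erase block, with inserts buffered in an in-memory incarnation. The entry size, block size, and 80% fullness threshold are assumptions for illustration.

    ENTRY_SIZE = 16           # bytes per key/value slot (assumed)
    BLOCK_SIZE = 128 * 1024   # bytes per erase block (assumed)
    NUM_INCARNATIONS = 32

    # Each SliceTable stores this many slots from every incarnation, so that all
    # of its slices together fill exactly one block:
    SLOTS_PER_SLICETABLE = BLOCK_SIZE // (NUM_INCARNATIONS * ENTRY_SIZE)   # 256 here

    class InMemoryIncarnation:
        """Inserts land in RAM; once the incarnation is (mostly) full, each
        SliceTable on the SSD is read, updated, and written back at block
        granularity (see the flush sketch later)."""
        def __init__(self, num_slots: int):
            self.num_slots = num_slots
            self.slots = {}                        # slot index -> (key, value)

        def insert(self, key: bytes, value: bytes) -> None:
            slot = hash(key) % self.num_slots      # stand-in for the real hash
            self.slots[slot] = (key, value)        # collisions ignored in this sketch

        def is_full(self, utilization: float = 0.8) -> bool:
            return len(self.slots) >= utilization * self.num_slots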

III. Issue large reads and large writes
[Figure: four packages, each with a page register, on two channels; annotated with page size, channel parallelism, and package parallelism.]

III. Issue large reads and large writes
The SSD assigns consecutive chunks (4 pages / 8KB) to different channels.
[Figure: block-sized requests exploit channel parallelism.]

III. Issue large reads and large writes
– Read the entire SliceTable (one block) into RAM.
– Update its slots with the buffered entries in memory.
– Write the entire SliceTable back onto the SSD.
[Figure: new key/value pairs K_A, K_D, K_F placed into slots of the SliceTable before the block is written back.]
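A sketch of this block-granularity update path. The byte-array "ssd", the sizes, and the choice of which incarnation gets overwritten are assumptions; the point is that the SSD sees exactly one large read and one large write per SliceTable update.

    # Illustrative block-granularity update of one SliceTable (not the authors' code).
    ENTRY_SIZE = 16
    NUM_INCARNATIONS = 32
    SLOTS_PER_SLICETABLE = 256
    BLOCK_SIZE = SLOTS_PER_SLICETABLE * NUM_INCARNATIONS * ENTRY_SIZE   # 128KB here

    ssd = bytearray(BLOCK_SIZE * 8)     # 8 blocks of pretend flash

    def flush_slicetable(block_index: int, new_entries: dict,
                         target_incarnation: int) -> None:
        """Read the whole SliceTable (one block), splice the newly buffered
        entries into the target incarnation's position inside every affected
        slice, then write the whole block back."""
        start = block_index * BLOCK_SIZE
        block = bytearray(ssd[start:start + BLOCK_SIZE])          # one large read
        for slot, entry in new_entries.items():                   # update in RAM
            slice_off = slot * NUM_INCARNATIONS * ENTRY_SIZE
            off = slice_off + target_incarnation * ENTRY_SIZE
            block[off:off + ENTRY_SIZE] = entry.ljust(ENTRY_SIZE, b"\0")[:ENTRY_SIZE]
        ssd[start:start + BLOCK_SIZE] = block                      # one large write

    flush_slicetable(0, {3: b"keyA:valA", 7: b"keyB:valB"}, target_incarnation=31)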

IV. Spread small reads across channels
Recall: the SSD writes consecutive chunks (4 pages) of a block to different channels.
– Use existing techniques to reverse engineer the mapping [Chen et al. HPCA ’11]
– The SSD uses write-order mapping: channel for chunk i = i modulo (# channels)

IV. Spread small reads across channels
– Estimate the channel using the slot # and chunk size: (slot # * pages per slot) modulo (# channels * pages per chunk)
– Attempt to schedule 1 read per channel.
[Figure: pending lookups spread across channels 0, 1, 2, …]
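A sketch of this estimate and a greedy one-read-per-channel scheduler (the channel count, pages per chunk, and pages per slot are assumptions). The channel follows from which chunk a slot's page lands in under write-order mapping, which is equivalent to the slide's modulo expression followed by locating the chunk within the stripe.

    from collections import defaultdict

    NUM_CHANNELS = 32
    PAGES_PER_CHUNK = 4      # SSD stripes 4-page (8KB) chunks across channels
    PAGES_PER_SLOT = 1       # one slice (a slot from all incarnations) = 1 page (assumed)

    def channel_for_chunk(chunk_index: int) -> int:
        # Write-order mapping, as on the previous slide.
        return chunk_index % NUM_CHANNELS

    def estimate_channel(slot: int) -> int:
        # The chunk that holds the slot's page determines the channel.
        page_index = slot * PAGES_PER_SLOT
        return channel_for_chunk(page_index // PAGES_PER_CHUNK)

    def schedule_reads(pending_slots: list) -> list:
        """Group pending lookups by estimated channel and issue at most one
        read per channel per round."""
        by_channel = defaultdict(list)
        for slot in pending_slots:
            by_channel[estimate_channel(slot)].append(slot)
        rounds = []
        while any(by_channel.values()):
            rounds.append([q.pop(0) for q in by_channel.values() if q])
        return rounds

    print(schedule_reads([0, 4, 8, 12, 128, 132]))   # [[0, 4, 8, 12], [128, 132]]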

SliceHash summary
[Figure: inserts go to an in-memory incarnation; on the SSD, each block holds a SliceTable whose pages are slices (a specific slot from all incarnations); the whole SliceTable is read and written when updating.]
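Tying the patterns together, a compact end-to-end sketch of the read and write paths summarized above. This is illustrative, not the released code: the sizes, the flat "flash" byte array, the 80% flush threshold, and the incarnation-rotation policy are all assumptions.

    NUM_INCARNATIONS = 32
    ENTRY_SIZE = 16
    SLOTS_PER_SLICETABLE = 256
    SLICE_SIZE = NUM_INCARNATIONS * ENTRY_SIZE          # one slice = one page here
    BLOCK_SIZE = SLOTS_PER_SLICETABLE * SLICE_SIZE

    class SliceHash:
        def __init__(self, num_slicetables: int = 4):
            self.num_slots = num_slicetables * SLOTS_PER_SLICETABLE
            self.flash = bytearray(num_slicetables * BLOCK_SIZE)   # stand-in for the SSD
            self.buffer = {}                                       # in-memory incarnation
            self.oldest = NUM_INCARNATIONS - 1                     # incarnation to overwrite

        def lookup(self, key: bytes):
            slot = hash(key) % self.num_slots
            if slot in self.buffer and self.buffer[slot].startswith(key + b":"):
                return self.buffer[slot]                 # newest incarnation is in RAM
            off = slot * SLICE_SIZE                      # one page read: the slot's slice
            slice_bytes = self.flash[off:off + SLICE_SIZE]
            for i in range(NUM_INCARNATIONS):            # scan incarnations within the slice
                entry = slice_bytes[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE]
                if entry.startswith(key + b":"):
                    return entry
            return None

        def insert(self, key: bytes, value: bytes) -> None:
            slot = hash(key) % self.num_slots
            entry = (key + b":" + value)[:ENTRY_SIZE].ljust(ENTRY_SIZE, b"\0")
            self.buffer[slot] = entry
            if len(self.buffer) >= 0.8 * self.num_slots:   # incarnation full
                self._flush()                              # block-granularity rewrite

        def _flush(self) -> None:
            # In practice this is done one SliceTable (block) at a time:
            # large read, update in RAM, large write.
            for slot, entry in self.buffer.items():
                off = slot * SLICE_SIZE + self.oldest * ENTRY_SIZE
                self.flash[off:off + ENTRY_SIZE] = entry
            self.buffer.clear()
            self.oldest = (self.oldest - 1) % NUM_INCARNATIONS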

Evaluation: throughput vs. overhead
Setup: Crucial M4 SSD, 2.26GHz 4-core machine; workload of 8B keys and 8B values, 50% inserts and 50% lookups.
[Figure: throughput and memory overhead compared against prior designs; annotations show ↑6.6x with ↓12%, and ↑2.8x with ↑15%.]
See paper for theoretical analysis.

Evaluation: flexibility
Trade off memory for throughput (50% insert / 50% lookup workload).
Use multiple SSDs for even lower memory use and higher throughput.

Evaluation: generality
The workload may change.
[Figure: memory use (bytes/entry) keeps decreasing; CPU utilization (%) stays constantly low.]

Summary
– We present design practices for low-cost and high-performance SSD-based indexes.
– We introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism.
– SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6B/entry) and CPU (12%) overhead.

Evaluation: theoretical analysis
Parameters:
– 16B key/value pairs
– 80% table utilization
– 32 incarnations
– 4GB of memory
– 128GB SSD
– 0.31ms to read a block
– 0.83ms to write a block
– 0.15ms to read a page
SliceHash:
– Memory overhead: 0.6 B/entry
– Insert cost: avg ≈5.7μs, worst 1.14ms
– Lookup cost: avg & worst 0.15ms

Evaluation: theoretical analysis (vs. BufferHash)
– Memory overhead: SliceHash 0.6 B/entry vs. BufferHash 4 B/entry
– Insert cost: SliceHash avg ≈5.7μs, worst 1.14ms vs. BufferHash avg ≈0.2μs, worst 0.83ms
– Lookup cost: SliceHash avg & worst 0.15ms vs. BufferHash avg ≈0.15ms, worst 4.8ms
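A back-of-envelope check of these numbers from the stated parameters. The 128KB erase-block size is an assumption (it is not on the slides); with it, one SliceTable holds 256 slots per incarnation and the figures roughly reproduce the slide: worst-case insert is a block read plus a block write, a lookup is one page read, and BufferHash's worst-case lookup is one page read per incarnation.

    ENTRY_SIZE = 16            # bytes per key/value pair
    UTILIZATION = 0.80
    NUM_INCARNATIONS = 32
    BLOCK_READ_MS = 0.31
    BLOCK_WRITE_MS = 0.83
    PAGE_READ_MS = 0.15
    BLOCK_SIZE = 128 * 1024    # assumed erase-block size

    # One SliceTable fills one block, so it holds this many slots per incarnation:
    slots_per_slicetable = BLOCK_SIZE // (NUM_INCARNATIONS * ENTRY_SIZE)     # 256

    lookup_ms = PAGE_READ_MS                            # one slice (page) read: 0.15ms
    insert_worst_ms = BLOCK_READ_MS + BLOCK_WRITE_MS    # read + rewrite a block: 1.14ms
    inserts_per_flush = UTILIZATION * slots_per_slicetable
    insert_avg_us = insert_worst_ms * 1000 / inserts_per_flush   # ≈5.6μs, vs ≈5.7μs on slide

    bufferhash_lookup_worst_ms = NUM_INCARNATIONS * PAGE_READ_MS # 32 * 0.15 = 4.8ms

    print(f"lookup {lookup_ms}ms, insert worst {insert_worst_ms:.2f}ms, "
          f"insert avg {insert_avg_us:.1f}us, "
          f"BufferHash lookup worst {bufferhash_lookup_worst_ms:.1f}ms")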