Cheap and Large CAMs for High Performance Data-Intensive Networked Systems Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella University of Wisconsin-Madison Suman Nath Microsoft Research
New data-intensive networked systems Large hash tables (10s to 100s of GBs)
New data-intensive networked systems: WAN optimizers sit between a branch office and a data center, across the WAN. Each WAN optimizer keeps an object store (~4 TB) and a hash table (~32 GB) that maps a key (20 B) per chunk (4 KB) to a chunk pointer, and looks up objects chunk by chunk. Requirements: large hash tables (32 GB), high-speed (~10 K/sec) inserts and evictions, and high-speed (~10 K/sec) lookups to sustain a 500 Mbps link.
New data-intensive networked systems: other systems
– De-duplication in storage systems (e.g., Data Domain)
– CCN cache (Jacobson et al., CoNEXT 2009)
– DONA directory lookup (Koponen et al., SIGCOMM 2007)
All need cost-effective large hash tables: Cheap Large CAMs (CLAMs).
Candidate options (cost for 128 GB):
– DRAM: ~300K random reads/sec, $120K+ → too expensive (2.5 ops/sec/$)
– Disk: ~250 random reads/sec, $30+ → too slow
– Flash-SSD: ~10K random reads/sec*, ~5K random writes/sec*, $225+ → slow writes
* Derived from latencies on Intel M-18 SSD in experiments; + price statistics from …
How to deal with the slow writes of Flash SSDs?
Our CLAM design: new data structure "BufferHash" + Flash
Key features:
– Avoid random writes; perform sequential writes in a batch (sequential writes are 2X faster than random writes on the Intel SSD, and batched writes reduce the number of writes going to Flash)
– Bloom filters for optimizing lookups
BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$.
Outline Background and motivation CLAM design – Key operations (insert, lookup, update) – Eviction – Latency analysis and performance tuning Evaluation
Flash/SSD primer
– Random writes are expensive: avoid random page writes
– Reads and writes happen at the granularity of a flash page: avoid I/O smaller than a page if possible
Conventional hash table on Flash/SSD
– Keys are likely to hash to random locations on flash → random writes
– The SSD's FTL handles random writes to some extent, but the garbage-collection overhead is high
– ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, much less than 10 K/s and 5 K/s
Conventional hash table on Flash/SSD: can't assume locality in requests – using DRAM as a cache won't work.
Our approach: buffering insertions
– Control the impact of random writes: maintain a small hash table (buffer) in DRAM
– As the in-memory buffer gets full, write it to flash; we call the in-flash copy an incarnation of the buffer
– Buffer: in-memory hash table. Incarnation: in-flash hash table.
Two-level memory hierarchy: the buffer lives in DRAM; incarnations live on flash and are tracked in an incarnation table, from oldest to latest. The net hash table is the buffer plus all incarnations.
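A minimal sketch of this buffer-plus-incarnations idea in Python (illustrative names only, not the authors' code; `flash.sequential_write` stands in for whatever batched-write interface the SSD layer provides; later this class becomes one partition of the full BufferHash):

```python
class BufferHashPartition:
    """One buffer (DRAM) plus its incarnations (flash), newest first."""

    def __init__(self, buffer_capacity, flash):
        self.buffer = {}                 # in-memory hash table
        self.capacity = buffer_capacity  # entries to hold before flushing
        self.flash = flash               # assumed to expose sequential_write()
        self.incarnations = []           # in-flash hash tables, latest first

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self._flush()

    def _flush(self):
        # Write the whole buffer to flash in one sequential batch instead
        # of issuing a random page write per key.
        incarnation = self.flash.sequential_write(self.buffer)
        self.incarnations.insert(0, incarnation)   # becomes the latest incarnation
        self.buffer = {}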
Lookups are impacted due to buffers: a lookup key that misses in the in-memory buffer may require multiple in-flash lookups, one per incarnation in the incarnation table. Can we limit this to only one?
Bloom filters for optimizing lookups: keep one in-memory Bloom filter per incarnation and probe flash only for incarnations whose filter matches the lookup key. Beware false positives – configure the filters carefully: ~2 GB of Bloom filters for 32 GB of Flash to keep the false positive rate below 0.01!
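Extending the sketch above (still illustrative): assume `_flush` also builds one Bloom filter per incarnation and keeps it in `self.blooms`, newest first; `might_contain` and `incarnation.read` are assumed interfaces. A lookup then touches flash only when a filter matches:

```python
def lookup(self, key):
    """Method for the BufferHashPartition sketch: DRAM first, then incarnations."""
    if key in self.buffer:                       # in-memory buffer
        return self.buffer[key]
    # self.blooms[i] is the Bloom filter built when incarnations[i] was flushed.
    for bloom, incarnation in zip(self.blooms, self.incarnations):
        if not bloom.might_contain(key):         # definite miss: skip the flash read
            continue
        value = incarnation.read(key)            # one in-flash lookup
        if value is not None:                    # None means it was a false positive
            return value
    return None
```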
Update: naïve approach – locate the key in its incarnation on flash and overwrite it in place. This causes expensive random writes, so we discard this naïve approach.
Lazy updates: treat an update as an insert of the key with its new value into the buffer; the old value stays behind in an older incarnation. Because lookups check the buffer and the latest incarnations first, they always return the new value.
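In code terms (continuing the sketch above), a lazy update is just an insert; the newest-first lookup order is what makes the stale copy harmless:

```python
def update(self, key, value):
    """Method for the BufferHashPartition sketch: lazy update, no in-place flash write."""
    # The new value lands in the DRAM buffer; the stale copy stays in an
    # older incarnation until that incarnation is eventually discarded.
    self.insert(key, value)
```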
Eviction for streaming apps
– Eviction policies may depend on the application: LRU, FIFO, priority-based eviction, etc.
– Two BufferHash primitives: Full Discard evicts all items (naturally implements FIFO); Partial Discard retains a few items (priority-based eviction by retaining high-priority items)
– BufferHash is best suited for FIFO, since incarnations are arranged by age; other useful policies come at some additional cost (details in paper)
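A sketch of the two discard primitives on one partition (illustrative only; it assumes an incarnation can be read back with `items()` and that `self.blooms` tracks one filter per incarnation, as in the earlier sketches):

```python
def full_discard(self):
    """Evict every item in the oldest incarnation (FIFO falls out naturally)."""
    if self.incarnations:
        self.incarnations.pop()          # drop the oldest in-flash table
        self.blooms.pop()                # and its Bloom filter

def partial_discard(self, keep):
    """Retain selected items (e.g., high priority) from the oldest incarnation."""
    if not self.incarnations:
        return
    oldest = self.incarnations.pop()
    self.blooms.pop()
    for key, value in oldest.items():    # extra flash reads: the additional cost
        if keep(key, value):
            self.insert(key, value)      # re-buffered, rewritten sequentially later
```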
Issues with using one buffer
– A single buffer in DRAM serves all operations and eviction policies
– High worst-case insert latency: flushing a 1 GB buffer takes a few seconds, and new lookups stall meanwhile
Partitioning buffers
– Partition buffers based on the first few bits of the key space
– Per-buffer size > page: avoids I/O smaller than a page
– Per-buffer size >= block: avoids random page writes
– Reduces worst-case latency; eviction policies apply per buffer
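A sketch of how a key could be routed to its buffer; the prefix width is an illustrative choice, and keys here are raw bytes (e.g., the 20 B chunk hashes from the WAN optimizer, which are already uniformly distributed):

```python
PREFIX_BITS = 5   # illustrative: 2**5 = 32 partitioned buffers

def partition_of(key: bytes) -> int:
    # The first few bits of the key pick the buffer, so load spreads evenly
    # and each buffer can be flushed (and evicted) independently.
    return key[0] >> (8 - PREFIX_BITS)
```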
BufferHash: putting it all together
– Multiple buffers in memory
– Multiple incarnations per buffer in flash
– One in-memory Bloom filter per incarnation
– Net hash table = all buffers + all incarnations
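Putting the sketches together: a top-level wrapper that owns the partitions and routes each operation (again illustrative; it reuses `BufferHashPartition` and `partition_of` from the sketches above):

```python
class BufferHash:
    """Net hash table = all buffers (DRAM) + all incarnations (flash)."""

    def __init__(self, num_partitions, buffer_capacity, flash):
        self.partitions = [BufferHashPartition(buffer_capacity, flash)
                           for _ in range(num_partitions)]

    def _partition(self, key):
        return self.partitions[partition_of(key) % len(self.partitions)]

    def insert(self, key, value):
        self._partition(key).insert(key, value)

    def lookup(self, key):
        return self._partition(key).lookup(key)

    def update(self, key, value):
        self._partition(key).update(key, value)
```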
Outline Background and motivation Our CLAM design – Key operations (insert, lookup, update) – Eviction – Latency analysis and performance tuning Evaluation
Latency analysis
– Insertion latency: worst case ∝ size of the buffer (a flush writes the entire buffer); average case is constant for buffers larger than a block
– Lookup latency: average case ∝ number of incarnations × false positive rate of the Bloom filters
Parameter tuning: total size of buffers
– Given fixed DRAM, how much should be allocated to the buffers B1 … BN?
– Total size of buffers = B1 + B2 + … + BN
– #Incarnations = flash size / total buffer size
– Expected in-flash lookups ∝ #incarnations × false positive rate
– Total Bloom filter size = DRAM − total size of buffers, so the false positive rate increases as the Bloom filters shrink
– Too small is not optimal; too large is not optimal either. Optimal = 2 * SSD/entry
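One way to write this trade-off down, combining #incarnations = flash size / total buffer size with the standard Bloom-filter false-positive approximation. The notation is assumed here, not from the deck: F = flash size, B = total buffer size, M = DRAM size (all in bytes), s = bytes per entry.

```latex
% Expected in-flash lookups per lookup, with b = Bloom-filter bits per entry
\[
\text{expected in-flash lookups}
  \;\approx\; \underbrace{\frac{F}{B}}_{\#\text{incarnations}}
  \times \underbrace{\left(\tfrac{1}{2}\right)^{\,b \ln 2}}_{\text{false positive rate}},
\qquad
b \;=\; \frac{8\,(M - B)}{F/s}
\]
```

Growing B shrinks the number of incarnations but leaves fewer Bloom-filter bits per entry, so the false-positive factor grows; hence neither a very small nor a very large buffer allocation is optimal.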
Parameter tuning: per-buffer size
– What should the size of a partitioned buffer (e.g., B1) be?
– It affects worst-case insertion latency
– Adjusted according to application requirements (128 KB – 1 block)
Outline Background and motivation Our CLAM design – Key operations (insert, lookup, update) – Eviction – Latency analysis and performance tuning Evaluation
Configuration – 4 GB DRAM, 32 GB Intel SSD, Transcend SSD – 2 GB buffers, 2 GB bloom filters, 0.01 false positive rate – FIFO eviction policy
BufferHash performance
– WAN optimizer workload: random key lookups followed by inserts, 40% hit rate; workloads from real packet traces were also used
– Comparison with BerkeleyDB (a traditional hash table) on the Intel SSD
– Table: average lookup latency (ms) and average insert latency (ms), BufferHash vs. BerkeleyDB – better lookups! better inserts!
Insert performance: CDF of insert latency (ms) on the Intel SSD. For BufferHash, 99% of inserts take < 0.1 ms – the buffering effect. For BerkeleyDB, 40% of inserts take > 5 ms – random writes are slow!
Lookup performance: CDF of lookup latency (ms) for the 40%-hit workload. For BufferHash, 99% of lookups take < 0.2 ms – 60% of lookups don't go to Flash at all, and those that do see the ~0.15 ms Intel SSD read latency. For BerkeleyDB, 40% of lookups take > 5 ms, due to garbage-collection overhead from the interleaved writes.
Performance in ops/sec/$
– 16K lookups/sec and 160K inserts/sec
– Relative to the overall system cost: 420 inserts/sec/$, with the corresponding lookups/sec/$ at one-tenth the insert throughput
– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables
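A quick arithmetic check using only numbers that appear in the deck (the DRAM figures from the candidate-options slide, and the fact that lookups and inserts share one system cost):

```latex
% DRAM baseline (candidate-options slide): ~300K random reads/sec at $120K+ for 128 GB
\[ \frac{300{,}000\ \text{ops/s}}{\$120{,}000} \approx 2.5\ \text{ops/sec/\$} \]
% BufferHash: lookups and inserts are divided by the same system cost, so
\[ \frac{16{,}000\ \text{lookups/s}}{160{,}000\ \text{inserts/s}} \times 420\ \text{inserts/sec/\$} = 42\ \text{lookups/sec/\$} \]
```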
Other workloads: varying fractions of lookups (results on the Transcend SSD)
Average latency per operation:
  Lookup fraction | BufferHash | BerkeleyDB
  …               | … ms       | 18.4 ms
  …               | … ms       | 10.3 ms
  …               | … ms       | 0.3 ms
BufferHash is ideally suited for write-intensive workloads.
Evaluation summary
– BufferHash performs orders of magnitude better in ops/sec/$ than traditional hash tables on DRAM (and disks)
– BufferHash is best suited for a FIFO eviction policy; other policies can be supported at additional cost (details in paper)
– A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps achieved with BerkeleyDB (details in paper)
Related work
– FAWN (Vasudevan et al., SOSP 2009): a cluster of wimpy nodes with flash storage, where each wimpy node keeps its hash table in DRAM. We target hash tables much bigger than DRAM, and low-latency as well as high-throughput systems.
– HashCache (Badam et al., NSDI 2009): an in-memory hash table for objects stored on disk.
Conclusion
– We have designed a new data structure, BufferHash, for building CLAMs
– Our CLAM on the Intel SSD achieves high ops/sec/$ for today's data-intensive systems
– Our CLAM can support useful eviction policies
– It dramatically improves the performance of WAN optimizers
Thank you
ANCS 2010: ACM/IEEE Symposium on Architectures for Networking and Communications Systems. Estancia La Jolla Hotel & Spa (near UCSD), October 25-26, 2010. Paper registration & abstract: May 10, 2010. Submission deadline: May 17, 2010.
Backup slides
WAN optimizer using BufferHash With BerkeleyDB, throughput up to 10 Mbps With BufferHash, throughput up to 200 Mbps with Transcend SSD – 500 Mbps with Intel SSD At 10 Mbps, average throughput per object improves by 65% with BufferHash
Conventional hash table on Flash/SSD: the entry size (20 B) is smaller than a flash page.