1 Flash-based (cloud) storage systems Lecture 25 Aditya Akella

2 BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers). SILT: a more “traditional” key-value store.

3 Cheap and Large CAMs for High Performance Data-Intensive Networked Systems Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella University of Wisconsin-Madison Suman Nath Microsoft Research

4 New data-intensive networked systems Large hash tables (10s to 100s of GBs)

5 New data-intensive networked systems: WAN optimizers sit between a branch office and a data center, connected over the WAN. Each optimizer keeps an object store (~4 TB) on disk and a hash table (~32 GB) that maps chunk keys (20 B, computed over 4 KB chunks) to chunk pointers. Requirements: large hash tables (32 GB), high-speed (~10 K/sec) inserts and evictions, and high-speed (~10 K/sec) lookups to sustain a 500 Mbps link.

6 New data-intensive networked systems. Other systems: – de-duplication in storage systems (e.g., Data Domain) – CCN cache (Jacobson et al., CoNEXT 2009) – DONA directory lookup (Koponen et al., SIGCOMM 2007). The common need: cost-effective large hash tables – CLAMs (Cheap and Large CAMs).

7 Candidate options for a 128 GB hash table (price statistics from 2008-09):
Option | random reads/sec | random writes/sec | cost (128 GB)
DRAM | 300K | 300K | $120K+ (too expensive: 2.5 ops/sec/$)
Disk | 250 | 250 | $30+ (too slow)
Flash-SSD | 10K* | 5K* | $225+ (slow writes)
* Derived from latencies on Intel M-18 SSD in experiments.
The question: how to deal with the slow writes of Flash SSDs?

8 CLAM design: new data structure “BufferHash” + Flash. Key features: – avoid random writes and perform sequential writes in a batch (sequential writes are 2X faster than random writes on the Intel SSD, and batched writes reduce the number of writes going to Flash) – Bloom filters for optimizing lookups. BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$.

9 Flash/SSD primer: random writes are expensive, so avoid random page writes. Reads and writes happen at the granularity of a flash page; I/O smaller than a page should be avoided, if possible.

10 Conventional hash table on Flash/SSD: keys are likely to hash to random locations, causing random writes. The SSD's FTL handles random writes to some extent, but garbage-collection overhead is high: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required 10 K/s and 5 K/s.

11 Conventional hash table on Flash/SSD with DRAM in front: can't assume locality in requests – DRAM as a cache won't work.

12 Our approach: buffering insertions. To control the impact of random writes, maintain a small hash table (buffer) in DRAM; as the in-memory buffer fills up, write it to flash in one batch. We call the in-flash copy an incarnation of the buffer. Buffer: in-memory hash table (DRAM). Incarnation: in-flash hash table (Flash SSD).
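
To make the buffering idea concrete, here is a minimal Python sketch (an illustration, not the paper's implementation; the class name, BUFFER_CAPACITY, and the list-of-dicts model of flash are assumptions):

```python
# Toy model of buffered insertion: the in-memory buffer absorbs inserts and is
# written to "flash" (modeled as a list of frozen dicts) in one sequential batch.

BUFFER_CAPACITY = 4  # tiny for illustration; real buffers are at least a flash block

class BufferedHash:
    def __init__(self):
        self.buffer = {}          # in-memory hash table (DRAM)
        self.incarnations = []    # in-flash hash tables, oldest first

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= BUFFER_CAPACITY:
            self._flush()

    def _flush(self):
        # One large sequential write of the whole buffer instead of many
        # random page writes; the flushed copy becomes a new incarnation.
        self.incarnations.append(dict(self.buffer))
        self.buffer = {}

    def lookup(self, key):
        # Net hash table = buffer + all incarnations; check newest first.
        if key in self.buffer:
            return self.buffer[key]
        for incarnation in reversed(self.incarnations):
            if key in incarnation:
                return incarnation[key]
        return None
```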

13 Two-level memory hierarchy: the buffer lives in DRAM; incarnations live on flash, arranged from oldest to latest and tracked by an incarnation table. The net hash table is the buffer plus all incarnations.

14 Lookups are impacted by buffering: a lookup key that misses the in-memory buffer may require multiple in-flash lookups, one per incarnation. Can we limit it to only one?

15 Bloom filters for optimizing lookups: keep one in-memory Bloom filter per incarnation and consult the filters before going to flash. False positives still cause wasted flash reads, so the filters must be configured carefully – about 2 GB of Bloom filters for 32 GB of Flash keeps the false positive rate under 0.01.
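
As a rough sketch of how per-incarnation Bloom filters gate the flash lookups, the code below extends the toy BufferedHash model above; the simple BloomFilter class and its parameters are assumptions for illustration, not the paper's implementation. (As a rule of thumb, a Bloom filter needs roughly 10 bits per entry to get below a 1% false positive rate.)

```python
import hashlib

class BloomFilter:
    # Simple Bloom filter sketch: m bits, k hash positions derived from SHA-1.
    def __init__(self, m_bits=1024, k=7):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def lookup_with_filters(bh, filters, key):
    # filters[i] summarizes bh.incarnations[i]; check newest first and only
    # read flash when a filter says the key may be present (false positives
    # cause occasional wasted flash reads).
    if key in bh.buffer:
        return bh.buffer[key]
    for incarnation, bf in zip(reversed(bh.incarnations), reversed(filters)):
        if key in bf and key in incarnation:
            return incarnation[key]
    return None
```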

16 Update, naïve approach: overwrite the key's entry in its incarnation on flash. This causes expensive random writes – discard this naïve approach.

17 Lazy updates: treat an update as an insert of (key, new value) into the in-memory buffer, leaving (key, old value) in its older incarnation. Lookups check the buffer and the latest incarnations first, so they return the new value.
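
A lazy update then needs no flash I/O at all; in the toy model above it is literally just an insert (illustrative sketch):

```python
def lazy_update(bh, key, new_value):
    # Insert (key, new_value) into the in-memory buffer; the stale
    # (key, old_value) stays in an older incarnation but is never returned,
    # because lookups check the buffer and newer incarnations first.
    bh.insert(key, new_value)
```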

18 Eviction for streaming apps. Eviction policies may depend on the application – LRU, FIFO, priority-based eviction, etc. Two BufferHash primitives: – Full discard: evict all items (naturally implements FIFO) – Partial discard: retain a few items (supports priority-based eviction by retaining high-priority items). BufferHash is best suited for FIFO – incarnations are arranged by age; other useful policies come at some additional cost (details in paper).
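
A hedged sketch of the two eviction primitives on the same toy model (the retention predicate and function names are assumptions):

```python
def full_discard(bh):
    # Evict every item of the oldest incarnation in one shot -- naturally
    # FIFO, since incarnations are arranged by age.
    if bh.incarnations:
        bh.incarnations.pop(0)

def partial_discard(bh, keep):
    # Evict the oldest incarnation but re-insert the items selected by
    # `keep` (e.g., high-priority entries), at the cost of extra writes.
    if not bh.incarnations:
        return
    oldest = bh.incarnations.pop(0)
    for key, value in oldest.items():
        if keep(key, value):
            bh.insert(key, value)
```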

19 Issues with using one buffer: a single buffer in DRAM handles all operations and eviction policies and has high worst-case insert latency – flushing a 1 GB buffer takes a few seconds, during which new lookups stall.

20 Partitioning buffers: partition the buffer based on the first few bits of the key space. Each partition should be larger than a flash page (to avoid I/O smaller than a page) and at least a flash block (to avoid random page writes). Partitioning reduces worst-case latency, and eviction policies apply per buffer.
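
Partition selection can be as simple as taking the top few bits of the hashed key; a small sketch with assumed parameters (16 partitions, SHA-1 as the key hash):

```python
import hashlib

PREFIX_BITS = 4
NUM_PARTITIONS = 1 << PREFIX_BITS   # 16 partitioned buffers

def partition_index(key: bytes) -> int:
    # Use the top PREFIX_BITS bits of the hashed key, so each partition
    # owns a contiguous 1/NUM_PARTITIONS slice of the key space.
    digest = hashlib.sha1(key).digest()
    return digest[0] >> (8 - PREFIX_BITS)
```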

21 BufferHash, putting it all together: multiple buffers (Buffer 1 … Buffer K) in memory, multiple incarnations per buffer in flash, and one in-memory Bloom filter per incarnation. Net hash table = all buffers + all incarnations.

22 Latency analysis. Insertion latency: the worst case is proportional to the size of the buffer; the average case is constant for buffers larger than a block. Lookup latency: the average case depends on the number of incarnations and on the false positive rate of the Bloom filters.
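
As a rough back-of-the-envelope estimate (not a formula stated on the slide): with N incarnations per buffer and a per-filter false positive rate f, a lookup for a key that is absent from flash costs about N * f flash reads in expectation, while a lookup for a present key costs at most about 1 + (N - 1) * f flash reads.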

23 Parameter tuning: total size of buffers. Given fixed DRAM, how much should be allocated to the buffers B1..BN? Total size of buffers = B1 + B2 + … + BN, and total Bloom filter size = DRAM – total size of buffers. #Incarnations = Flash size / total buffer size, and lookup cost grows with #incarnations * false positive rate, where the false positive rate increases as the Bloom filters shrink. So too small is not optimal, and too large is not optimal either (optimal = 2 * SSD/entry).
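
The buffer-size tradeoff can be explored numerically. The script below is a toy model with assumed numbers (4 GB DRAM, 32 GB flash, 24 B entries, and the standard Bloom-filter false-positive approximation); it is not the paper's analysis, but it shows the same shape of curve: both very small and very large buffer allocations increase the expected flash reads per lookup.

```python
DRAM_BYTES  = 4 * 2**30    # DRAM split between buffers and Bloom filters (assumed)
FLASH_BYTES = 32 * 2**30   # flash capacity (assumed)
ENTRY_BYTES = 24           # assumed entry size (e.g., 20 B key + small pointer)

def expected_flash_reads(total_buffer_bytes):
    incarnations = FLASH_BYTES / total_buffer_bytes            # flash / total buffer size
    filter_bits_per_entry = ((DRAM_BYTES - total_buffer_bytes) * 8 *
                             ENTRY_BYTES / FLASH_BYTES)        # leftover DRAM goes to filters
    false_positive = 0.6185 ** filter_bits_per_entry           # classic Bloom filter approximation
    return incarnations * false_positive                       # absent-key lookup cost

for gb in (0.05, 0.1, 0.25, 0.5, 1, 2, 3):
    reads = expected_flash_reads(gb * 2**30)
    print(f"total buffers = {gb:5.2f} GB -> ~{reads:.4f} flash reads per lookup")
```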

24 Parameter tuning: per-buffer size. What should the size of a partitioned buffer (e.g., B1) be? It affects worst-case insertion latency and is adjusted according to application requirements (128 KB – 1 block).

25 SILT: A Memory-Efficient, High-Performance Key-Value Store Hyeontaek Lim, Bin Fan, David G. Andersen Michael Kaminsky† Carnegie Mellon University †Intel Labs 2011-10-24

26 Key-Value Store: clients issue PUT(key, value), value = GET(key), and DELETE(key) against a key-value store cluster. Examples: e-commerce (Amazon), web server acceleration (Memcached), data deduplication indexes, photo storage (Facebook).
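
For concreteness, the client-facing interface is just these three operations; a minimal Python stub (illustrative only, backed here by a plain dict rather than flash):

```python
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)   # returns None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)
```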

27 SILT goal: use much less memory than previous systems while retaining high performance.

28 Three metrics to minimize:
Memory overhead = index size per entry; ideally 0 (no memory overhead).
Read amplification = flash reads per query; limits query throughput; ideally 1 (no wasted flash reads).
Write amplification = flash writes per entry; limits insert throughput and reduces flash life expectancy; must be small enough for the flash to last a few years.
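
These metrics are simple ratios over observed counters; a hedged sketch of how they might be computed (the argument names and the bytes-based definition of write amplification are assumptions for illustration):

```python
def silt_metrics(index_bytes, num_entries,
                 flash_reads, num_queries,
                 flash_bytes_written, bytes_inserted):
    memory_overhead     = index_bytes / num_entries             # bytes of index per entry
    read_amplification  = flash_reads / num_queries             # flash reads per query
    write_amplification = flash_bytes_written / bytes_inserted  # flash writes per inserted byte
    return memory_overhead, read_amplification, write_amplification
```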

29 Landscape before SILT: a plot of read amplification vs. memory overhead (bytes/entry) showing FAWN-DS, HashCache, BufferHash, FlashStore, and SkimpyStash.

30 Solution preview: (1) three stores with (2) new index data structures. In memory: the SILT Sorted Index (memory efficient), the SILT Filter, and the SILT Log Index (write friendly); the corresponding stores live on flash. Inserts only go to the Log; data are moved to the other stores in the background; queries look up the stores in sequence (from new to old).

31 LogStore: no control over data layout. Inserted entries are appended to an on-flash log, so write amplification is 1, but memory overhead is 6.5+ bytes/entry for the in-memory SILT Log Index (still far better than the 48+ B/entry of a naive hashtable).
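
A minimal sketch of the LogStore idea: an append-only on-flash log plus an in-memory index from key to log offset. This is an illustrative model, not SILT's code; the in-memory index here is a plain dict rather than the compact partial-key cuckoo hash table described on slide 36.

```python
class LogStoreSketch:
    def __init__(self):
        self.log = bytearray()   # stands in for the append-only on-flash log
        self.index = {}          # in memory: key -> (offset, record_length)

    def put(self, key: bytes, value: bytes):
        record = key + value
        offset = len(self.log)
        self.log += record                    # sequential append: write amplification ~ 1
        self.index[key] = (offset, len(record))

    def get(self, key: bytes):
        loc = self.index.get(key)
        if loc is None:
            return None
        offset, length = loc
        return bytes(self.log[offset + len(key):offset + length])
```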

32 SortedStore: space-optimized layout. Entries are kept in an on-flash sorted array indexed by the in-memory SILT Sorted Index, so memory overhead is 0.4 bytes/entry, but write amplification is high; bulk inserts are needed to amortize the cost.

33 Combining SortedStore and LogStore: the on-flash log (indexed by the SILT Log Index) is periodically merged into the on-flash sorted array (indexed by the SILT Sorted Index).
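
The merge itself is essentially one sequential pass that folds the log's entries (sorted by key) into the existing sorted array; a hedged sketch, assuming (key, value) tuples and letting the newer log value win on duplicate keys:

```python
def merge_into_sorted_store(sorted_array, log_entries):
    # sorted_array: list of (key, value) sorted by key (the old SortedStore).
    # log_entries:  dict key -> value accumulated in the LogStore.
    # Returns the new sorted array; in a real system this one sequential
    # rewrite of flash is what amortizes the write cost of SortedStore.
    updates = sorted(log_entries.items())
    merged, i, j = [], 0, 0
    while i < len(sorted_array) and j < len(updates):
        if sorted_array[i][0] < updates[j][0]:
            merged.append(sorted_array[i]); i += 1
        elif sorted_array[i][0] > updates[j][0]:
            merged.append(updates[j]); j += 1
        else:                                  # same key: newer (log) value wins
            merged.append(updates[j]); i += 1; j += 1
    merged.extend(sorted_array[i:])
    merged.extend(updates[j:])
    return merged
```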

34 Achieving both low memory overhead and low write amplification: SortedStore alone has low memory overhead but high write amplification; LogStore alone has high memory overhead but low write amplification. Combined, we can achieve simultaneously: write amplification = 5.4 (a 3-year flash life) and memory overhead = 1.3 B/entry. With “HashStores”, memory overhead drops to 0.7 B/entry!

35 SILT's design (recap): new entries are appended to the on-flash log (indexed by the SILT Log Index); logs are converted into on-flash hashtables (indexed by the SILT Filter); hashtables are merged into the on-flash sorted array (indexed by the SILT Sorted Index). Overall: memory overhead 0.7 bytes/entry, read amplification 1.01, write amplification 5.4.

36 New index data structures in SILT. Partial-key cuckoo hashing (SILT Filter & Log Index): used for HashStore & LogStore; compact (2.2 & 6.5 B/entry) and very fast (> 1.8 M lookups/sec). Entropy-coded tries (SILT Sorted Index): used for SortedStore; highly compressed (0.4 B/entry).
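
A hedged sketch of the partial-key cuckoo hashing idea (illustrative, not SILT's code): each in-memory slot stores only a short tag plus the entry's flash offset, where the tag is the key's other candidate bucket, so an entry can be displaced to its alternate bucket without reading the full key from flash. The bucket count, single-slot buckets, and the hash functions are assumptions for the example.

```python
import hashlib

NUM_BUCKETS = 1 << 15     # toy table size (power of two)
MAX_KICKS = 128           # give up (and e.g. flush/convert) after this many displacements

def _bucket(key: bytes, seed: int) -> int:
    digest = hashlib.sha1(bytes([seed]) + key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

class PartialKeyCuckoo:
    def __init__(self):
        self.slots = [None] * NUM_BUCKETS      # bucket -> (tag, flash_offset) or None

    def insert(self, key: bytes, flash_offset: int) -> bool:
        b1, b2 = _bucket(key, 1), _bucket(key, 2)
        bucket, tag = b1, b2                   # try b1 first, remember b2 as the tag
        for _ in range(MAX_KICKS):
            if self.slots[bucket] is None:
                self.slots[bucket] = (tag, flash_offset)
                return True
            # Evict the resident entry: it moves to its alternate bucket (its
            # stored tag), and its new tag becomes the bucket it just left.
            victim_tag, victim_offset = self.slots[bucket]
            self.slots[bucket] = (tag, flash_offset)
            bucket, tag, flash_offset = victim_tag, bucket, victim_offset
        return False                           # table too full for this sketch

    def lookup(self, key: bytes):
        # A matching tag may be a false positive; the caller verifies the
        # full key by reading the record at the returned flash offset.
        b1, b2 = _bucket(key, 1), _bucket(key, 2)
        for bucket, other in ((b1, b2), (b2, b1)):
            slot = self.slots[bucket]
            if slot is not None and slot[0] == other:
                return slot[1]
        return None
```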

37 Compression in entropy-coded tries: hashed keys have random bits, so the number of leaves in the left (or right) subtrie of any node follows Binomial(# all leaves, 0.5); this known distribution makes the per-node counts compressible with entropy coding (Huffman coding and more). (More details of the new indexing schemes are in the paper.)
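
To make the trie idea concrete, here is a hedged sketch (not SILT's implementation) of building a trie over sorted hashed keys and using it to locate a key's position in the on-flash sorted array. In SILT, the per-node left-subtrie counts, which follow the Binomial distribution above, are entropy-coded into a compact bit string instead of being stored as Python tuples.

```python
def build_trie(sorted_keys, depth=0):
    # sorted_keys: sorted list of equal-length bit strings (hashed keys).
    if len(sorted_keys) <= 1:
        return None                            # leaf: zero or one key
    left  = [k for k in sorted_keys if k[depth] == "0"]
    right = [k for k in sorted_keys if k[depth] == "1"]
    # Only the size of the left subtrie needs to be recorded per node.
    return (len(left), build_trie(left, depth + 1), build_trie(right, depth + 1))

def index_of(trie, key):
    # Returns the key's position in the sorted array (a candidate position if
    # the key is absent; the full key on flash is read once to confirm).
    pos, depth, node = 0, 0, trie
    while node is not None:
        left_count, left, right = node
        if key[depth] == "0":
            node = left
        else:
            pos += left_count                  # skip everything in the left subtrie
            node = right
        depth += 1
    return pos

# Tiny example with four 4-bit hashed keys in sorted order.
keys = ["0010", "0111", "1001", "1100"]
trie = build_trie(keys)
assert [index_of(trie, k) for k in keys] == [0, 1, 2, 3]
```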

38 Landscape with SILT: the same plot of read amplification vs. memory overhead (bytes/entry), now showing SILT alongside FAWN-DS, HashCache, BufferHash, FlashStore, and SkimpyStash.

39 BufferHash: Backup

40 Outline Background and motivation Our CLAM design – Key operations (insert, lookup, update) – Eviction – Latency analysis and performance tuning Evaluation

41 Configuration – 4 GB DRAM, 32 GB Intel SSD, Transcend SSD – 2 GB buffers, 2 GB bloom filters, 0.01 false positive rate – FIFO eviction policy

42 BufferHash performance. WAN optimizer workload: random key lookups followed by inserts, 40% hit rate; a workload from real packet traces was also used. Comparison with BerkeleyDB (a traditional hash table) on the Intel SSD:
Average latency | BufferHash | BerkeleyDB
Lookup (ms) | 0.06 | 4.6
Insert (ms) | 0.006 | 4.8
Better lookups! Better inserts!

43 Insert performance (CDF of insert latency in ms on the Intel SSD): with BufferHash, 99% of inserts take < 0.1 ms – the buffering effect; with BerkeleyDB, 40% of inserts take > 5 ms – random writes are slow!

44 Lookup performance (CDF of lookup latency in ms for the 40%-hit workload): with BufferHash, 99% of lookups take < 0.2 ms, since 60% of lookups don't go to Flash at all (the Intel SSD read latency is about 0.15 ms); with BerkeleyDB, 40% of lookups take > 5 ms due to garbage-collection overhead from writes.

45 Performance in ops/sec/$: 16K lookups/sec and 160K inserts/sec at an overall cost of $400 gives 42 lookups/sec/$ and 420 inserts/sec/$ – orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables.

46 Other workloads: varying fractions of lookups, results on the Transcend SSD (average latency per operation):
Lookup fraction | BufferHash | BerkeleyDB
0 | 0.007 ms | 18.4 ms
0.5 | 0.09 ms | 10.3 ms
1 | 0.12 ms | 0.3 ms
BufferHash is ideally suited for write-intensive workloads.

47 Evaluation summary. BufferHash performs orders of magnitude better in ops/sec/$ than traditional hashtables on DRAM (and disks). BufferHash is best suited for a FIFO eviction policy – other policies can be supported at additional cost (details in paper). A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps with BerkeleyDB (details in paper).

48 Related Work. FAWN (Vasudevan et al., SOSP 2009): a cluster of wimpy nodes with flash storage, where each wimpy node keeps its hash table in DRAM; we target hash tables much bigger than DRAM, and low-latency as well as high-throughput systems. HashCache (Badam et al., NSDI 2009): an in-memory hash table for objects stored on disk.

49 WAN optimizer using BufferHash: with BerkeleyDB, throughput reaches up to 10 Mbps; with BufferHash, up to 200 Mbps with the Transcend SSD and 500 Mbps with the Intel SSD. At 10 Mbps, average throughput per object improves by 65% with BufferHash.

50 SILT Backup Slides

51 Evaluation: 1. various combinations of indexing schemes; 2. background operations (merge/conversion); 3. query latency. Experiment setup:
CPU: 2.80 GHz (4 cores)
Flash drive: SATA 256 GB (48 K random 1024-byte reads/sec)
Workload size: 20-byte key, 1000-byte value, ≥ 50 M keys
Query pattern: uniformly distributed (worst case for SILT)

52 LogStore alone: too much memory. Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys).

53 LogStore + SortedStore: still much memory. Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys).

54 Full SILT: very memory efficient. Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys).

55 Small impact from background operations: query throughput stays roughly between 33 K and 40 K ops/sec. Workload: 90% GET (~100 M keys) + 10% PUT. One dip ("Oops!") is caused by bursty TRIM from the ext4 FS.

56 Low query latency. Workload: 100% GET (100 M keys), varying the number of I/O threads; best throughput at 16 threads, with median latency = 330 μs and 99.9th percentile = 1510 μs.

57 Conclusion. SILT provides a memory-efficient and high-performance key-value store through – a multi-store approach – entropy-coded tries – partial-key cuckoo hashing. Full source code is available at https://github.com/silt/silt

58 Conventional hash table on Flash/SSD: the entry size (20 B) is smaller than a flash page.

