Memory-efficient Data Management Policy for Flash-based Key-Value Store
Wang Jiangtao

Outline
Introduction
Related work
Two works
  – BloomStore [MSST2012]
  – TBF [ICDE2013]
Summary

Key-Value Store
KV store efficiently supports simple operations: key lookup & KV pair insertion
  – Online multi-player gaming
  – Data deduplication
  – Internet services

Overview of Key-Value Store
A KV store system should provide high access throughput (> 10,000 key lookups/sec)
Replaces traditional relational DBs for its superior scalability & performance
  – KV stores are preferred for their simplicity and better scalability
Popular management (index + storage) solution for large volumes of records
  – Often implemented through an index structure mapping key -> value

Challenge
To meet high throughput demand, the performance of both index access and KV pair (data) access is critical
  – Index access: search for the KV pair associated with a given "key"
  – KV pair access: get/put the actual KV pair
Available memory space limits the maximum number of stored KV pairs
An in-RAM index structure addresses only the index access performance demand

DRAM Must Be Used Efficiently
1 TB of data, 4 bytes of DRAM per key-value pair

Key-value pair size (bytes)      Index size (GB)
32 B (data deduplication)        => 125 GB!
168 B (tweet)                    => 24 GB
1 KB (small image)               => 4 GB
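The index sizes on this slide follow from simple arithmetic, sketched below. The function name `index_size_gb` is made up for illustration, and the sketch assumes decimal units (1 TB = 10^12 bytes, 1 KB = 1000 bytes, 1 GB = 10^9 bytes), which is what reproduces the rounded figures.

```python
def index_size_gb(data_bytes: int, pair_bytes: int, dram_per_pair: int = 4) -> float:
    """DRAM needed for the index, in GB, if every KV pair costs dram_per_pair bytes."""
    num_pairs = data_bytes // pair_bytes      # how many pairs fit in the data set
    return num_pairs * dram_per_pair / 10**9  # bytes of DRAM -> GB (decimal)


ONE_TB = 10**12  # assumption: decimal terabyte

for label, size in [("data deduplication", 32), ("tweet", 168), ("small image", 1000)]:
    print(f"{size:>5} B ({label}): {index_size_gb(ONE_TB, size):.1f} GB")
```

The smaller the pair, the more pairs fit in 1 TB, so the index for 32-byte pairs dominates even at only 4 bytes per entry.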

Existing Approach to Speed Up Index & KV Pair Accesses
Maintain the index structure in RAM to map each key to its KV pair on SSD
  – RAM size cannot scale up linearly with flash size
Keep a minimal index structure in RAM, while storing the rest of the index structure on SSD
  – The on-flash index structure should be designed carefully
    • Space is precious
    • Random writes are slow and bad for flash lifetime (wear-out)


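Bloom Filter
A Bloom filter uses a bit array to represent a set and to test whether an element belongs to that set. Initially, every bit of the m-bit array is set to 0. The Bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1,…,m}. For any element x, the position h_i(x) given by the i-th hash function is set to 1 (1≤i≤k). Note that if a position is set to 1 several times, only the first time has any effect; later ones change nothing.
Error rate (false positive rate)
Bloom filter parameter selection
  – Number of hash functions k, bit array size m, number of elements n
  – Reduce the error rate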
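The description above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the systems in this talk: the k hash functions are simulated by salting SHA-256 (an assumption), positions run over 0..m-1 rather than 1..m, and one byte is spent per bit for readability. The two helper functions use the standard textbook approximations for the false positive rate and for the k that minimizes it.

```python
import hashlib
import math
from typing import List


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                # number of bits in the array
        self.k = k                # number of hash functions
        self.bits = bytearray(m)  # all bits start at 0 (one byte per bit here)

    def _positions(self, key: str) -> List[int]:
        # Assumption: hash function i is SHA-256 salted with the index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1      # setting an already-set bit has no further effect

    def __contains__(self, key: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(key))


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation (1 - e^(-kn/m))^k after inserting n elements."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> int:
    """k that minimizes the approximation above: (m / n) * ln 2, rounded."""
    return max(1, round(m / n * math.log(2)))
```

For example, with m = 1000 bits and n = 100 elements, `optimal_k` suggests 7 hash functions, for which the approximate error rate is below 1%.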

FlashStore [VLDB2010]
Flash as a cache
Components
  – Write buffer
  – Read cache
  – Recency bit vector
  – Disk-presence Bloom filter
  – Hash table index
Cons
  – 6 bytes of RAM per key-value pair

SkimpyStash [SIGMOD2011]
Components
  – Write buffer
  – Hash table
    • Bloom filter
    • Using linked list
    • A pointer to the beginning of the linked list on flash
Storing the linked lists on flash
  – Each pair has a pointer to earlier keys in the log
Cons
  – Multiple flash page reads for a key lookup
  – High garbage collection cost
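The linked-list-on-flash idea, and why it costs several flash reads per lookup, can be illustrated with a toy model. Everything here is hypothetical: the class name `ChainedLogStore`, the Python list standing in for the flash log, and the read counter are illustration devices, and the per-bucket Bloom filter is omitted.

```python
from typing import List, Optional, Tuple


class ChainedLogStore:
    def __init__(self, num_buckets: int = 8) -> None:
        # "Flash": append-only log of (key, value, offset of previous record in the bucket).
        self.log: List[Tuple[str, str, Optional[int]]] = []
        # RAM: one pointer per hash bucket, to the newest record of that bucket.
        self.heads: List[Optional[int]] = [None] * num_buckets
        self.flash_reads = 0

    def put(self, key: str, value: str) -> None:
        bucket = hash(key) % len(self.heads)
        self.log.append((key, value, self.heads[bucket]))  # link to earlier keys
        self.heads[bucket] = len(self.log) - 1

    def get(self, key: str) -> Optional[str]:
        offset = self.heads[hash(key) % len(self.heads)]
        while offset is not None:
            stored_key, value, previous = self.log[offset]
            self.flash_reads += 1   # each hop along the chain is one flash read
            if stored_key == key:
                return value
            offset = previous
        return None
```

With a single bucket and three keys inserted in order, looking up the oldest key walks the whole chain and performs three simulated flash reads, which is the lookup cost listed under Cons.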



BloomStore: Introduction
Key lookup throughput is the bottleneck for data-intensive applications
Keep a large in-RAM hash table
Move the index structure to secondary storage (SSD)
  – Expensive random writes
  – High garbage collection cost
  – Bigger storage space

BloomStore
Components
  – KV pair write buffer
  – Active Bloom filter
    • A flash page for the write buffer
  – Bloom filter chain
    • Many flash pages
  – Key-range partition
    • A flash "block"
[Figure: BloomStore architecture]
BloomStore design
  – An extremely low amortized RAM overhead
  – Provides high key lookup/insertion throughput

KV Store Operations
Key lookup
  – Active Bloom filter
  – Bloom filter chain
  – Lookup cost

Parallel Lookup
Key lookup
  – Read the entire BF chain
  – Bit-wise AND, resultant row
  – High read throughput
[Figure: the rows selected by h_1(e_i), ... are bit-wise ANDed to check the Bloom filters in parallel; e_i is found]
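One way to read the parallel lookup on this slide is sketched below. It assumes the chain is laid out so that row p holds bit p of every Bloom filter in the chain; ANDing the k rows selected by the hash functions then tests all filters at once. The class name `BloomFilterChain`, the salted SHA-256 hashing, and the use of Python integers as bit rows are illustrative assumptions, not details taken from the BloomStore paper.

```python
import hashlib
from typing import Iterable, List


def positions(key: str, m: int, k: int) -> List[int]:
    # Assumption: hash function i is SHA-256 salted with the index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class BloomFilterChain:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.rows = [0] * m   # rows[p]: bit j is bit p of the j-th filter in the chain
        self.length = 0       # number of filters in the chain (H)

    def append_filter(self, keys: Iterable[str]) -> int:
        """Add one Bloom filter covering `keys`; returns its index in the chain."""
        j = self.length
        for key in keys:
            for p in positions(key, self.m, self.k):
                self.rows[p] |= 1 << j
        self.length += 1
        return j

    def candidates(self, key: str) -> List[int]:
        """Indices of filters that may contain `key`, newest first."""
        result = (1 << self.length) - 1          # start with every filter as a candidate
        for p in positions(key, self.m, self.k):
            result &= self.rows[p]               # bit-wise AND of the selected rows
        return [j for j in reversed(range(self.length)) if (result >> j) & 1]
```

A bit that survives the AND marks a filter whose flash page is worth reading; as with any Bloom filter, a surviving bit can still be a false positive.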

KV Store Operations
KV pair insertion
KV pair update
  – Append a new key-value pair
KV pair deletion
  – Insert a null value for the key
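The update and deletion rules above amount to an append-only log in which the newest record for a key wins. The sketch below is a toy model under that reading: the class name `AppendOnlyKV` is hypothetical, a Python list stands in for the write buffer and flash pages, and `None` plays the role of the null value.

```python
from typing import List, Optional, Tuple


class AppendOnlyKV:
    def __init__(self) -> None:
        self.log: List[Tuple[str, Optional[str]]] = []  # never modified in place

    def insert(self, key: str, value: str) -> None:
        self.log.append((key, value))

    def update(self, key: str, value: str) -> None:
        self.log.append((key, value))   # an update is just a newer pair for the key

    def delete(self, key: str) -> None:
        self.log.append((key, None))    # a null value marks the key as deleted

    def lookup(self, key: str) -> Optional[str]:
        # Scan newest to oldest so the latest record for the key decides the answer.
        for stored_key, value in reversed(self.log):
            if stored_key == key:
                return value
        return None
```

Nothing is overwritten, which avoids random writes on flash; the stale records left behind are what garbage collection later has to reclaim.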

Experimental Evaluation
Experiment setup
  – 1 TB SSD (PCIe) / 32 GB (SATA)
Workload

Experimental Evaluation
Effectiveness of prefilter
  – 1.2 bytes per KV pair
[Figure panels: Linux workload; Vx workload]

Experimental Evaluation
Lookup throughput
  – Linux workload
    • H = 96 (BF chain length)
    • m = 128 (the size of a BF)
  – Vx workload
    • H = 96 (BF chain length)
    • m = 64 (the size of a BF)
    • A prefilter


TBF: Motivation
Using flash as an extension cache is cost-effective
The desired size of the RAM cache is too large
  – Caching policy is memory-efficient
Replacement algorithm achieves performance comparable to existing policies
Caching policy is agnostic to the organization of data on SSD

Defects of the Existing Policy
Recency-based caching algorithm
  – Clock or LRU
  – Access data structure and index


System View
DRAM buffer
  – An in-memory data structure to maintain access information (BF)
  – No special index to locate key-value pairs
Key-value store
  – Provides an iterator operation to traverse
  – Write-through
[Figure: key-value cache prototype architecture]

Bloom Filter with Deletion (BFD)
BFD
  – Removing a key from SSD
  – A Bloom filter with deletion
  – Resetting the bits at the corresponding hash values in a subset of the hash functions
[Figure: deleting X1]
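Deletion by resetting bits can be sketched as follows. The class name, the `reset_count` parameter (how many of the k hash positions are cleared), and the salted SHA-256 hashing are assumptions made for illustration; the slide only states that a subset of the hash functions is used.

```python
import hashlib
from typing import List


class BloomFilterWithDeletion:
    def __init__(self, m: int, k: int, reset_count: int = 1) -> None:
        self.m, self.k = m, k
        self.reset_count = reset_count  # hypothetical knob: positions cleared per delete
        self.bits = bytearray(m)

    def _positions(self, key: str) -> List[int]:
        # Assumption: hash function i is SHA-256 salted with the index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def remove(self, key: str) -> None:
        # Clear only a subset of the key's bits; one cleared bit is enough
        # for the membership test to fail for this key.
        for p in self._positions(key)[: self.reset_count]:
            self.bits[p] = 0

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))
```

Clearing a bit that another key also relies on makes that other key disappear as well. With m = 1 and k = 1 every key shares the single bit, so removing one key makes all others test negative, which is the false-negative problem in its most extreme form.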

Bloom Filter with Deletion (BFD)
Tracking recency information
Cons
  – False positives
    • Pollute the cache
  – False negatives
    • Poor hit ratio
[Figure: flow chart]

Two Bloom Sub-Filters (TBF)
Dropping many elements in bulk
Flip the filter periodically
Cons
  – Keeping rarely accessed objects
    • Pollutes the cache
  – Traversal length per eviction
[Figure: flow chart]
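A rough sketch of the two-sub-filter idea follows. It rests on assumptions that the slide does not spell out: marks go into a "current" sub-filter, a key counts as marked if either sub-filter contains it, and a flip discards the older sub-filter wholesale and starts a fresh one. The class name `TwoBloomSubFilters` and the `flip_every` trigger (number of marks between flips) are hypothetical.

```python
import hashlib
from typing import List


class TwoBloomSubFilters:
    def __init__(self, m: int, k: int, flip_every: int) -> None:
        self.m, self.k = m, k
        self.flip_every = flip_every   # hypothetical: flip after this many marks
        self.current = bytearray(m)
        self.previous = bytearray(m)
        self.marks_since_flip = 0

    def _positions(self, key: str) -> List[int]:
        # Assumption: hash function i is SHA-256 salted with the index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def mark(self, key: str) -> None:
        for p in self._positions(key):
            self.current[p] = 1
        self.marks_since_flip += 1
        if self.marks_since_flip >= self.flip_every:
            self.flip()

    def is_marked(self, key: str) -> bool:
        pos = self._positions(key)
        return all(self.current[p] for p in pos) or all(self.previous[p] for p in pos)

    def flip(self) -> None:
        # Drop everything recorded in the older sub-filter in bulk.
        self.previous = self.current
        self.current = bytearray(self.m)
        self.marks_since_flip = 0
```

No individual bit is ever reset, so deletions cannot cause false negatives for keys marked after the last flip; the price is that a key marked just before a flip stays marked for a full period even if it is never accessed again.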

Traversal Cost
Key-value store traversal
  – Unmarked on insertion
  – Marked on insertion
    • Longer stretches of marked objects
    • False positives
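The traversal cost can be made concrete with a small sketch. It assumes eviction works by walking the store's iterator and picking the first key that the recency filter does not report as marked; the function name `find_victim` and the use of a plain Python set as the filter are illustrative only, and resuming the traversal from the previous position is left out.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_victim(
    keys: Iterable[str], is_marked: Callable[[str], bool]
) -> Tuple[Optional[str], int]:
    """Return (first unmarked key or None, number of keys visited)."""
    visited = 0
    for key in keys:
        visited += 1
        if not is_marked(key):
            return key, visited   # eviction candidate found
    return None, visited          # every key looked marked


marked = {"a", "b"}
victim, cost = find_victim(["a", "b", "c", "d"], marked.__contains__)
print(victim, cost)  # c 3
```

The more keys appear marked, whether because objects are marked on insertion or because of false positives, the longer the stretch that has to be skipped before a victim is found.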

Evaluation
Experiment setup
  – Two 1 TB 7200 RPM SATA disks in RAID-0
  – 80 GB Fusion ioDrive, PCIe x4
  – A mixture of 95% read operations and 5% updates
  – Key-value pairs: 200 million (256 B)
Bloom filter
  – 4 bits per marked object
  – One byte per object in TBF
  – Hash functions: 3


Summary
KV stores are particularly suitable for some special applications
Flash improves the performance of KV stores thanks to its faster access
Some index structures need to be redesigned to minimize RAM usage
Don't just treat flash as a disk replacement

Thank You!