Fast Accesses to Big Data in Memory and Storage Systems Xiaodong Zhang Ohio State University
Fast Data Accesses in In-Memory Key-Value Stores
Information Processing without Tables: Key-Value Stores
A simple but effective method of data processing in which a data record (a value) is stored and retrieved with its associated key.
- Variable types and lengths of records (values)
- Simple or no schema
- Easy software development for many applications
Key-value stores have been widely used in production systems.
Simple and Easy Interfaces of Key-Value Stores
- value = get(key)
- put(key, value)
[Diagram: a client issues get("John_age"); the index locates the variable-length value and returns 38. Keys and values are variable-length.]
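A minimal sketch of this get/put interface in Python (the class name `KVStore` and the in-memory dict index are illustrative assumptions, not Mega-KV's implementation):

```python
class KVStore:
    """Toy key-value store: variable-length keys map to variable-length values."""

    def __init__(self):
        self._index = {}  # the index structure; a hash table in real KV stores

    def put(self, key, value):
        self._index[key] = value

    def get(self, key):
        return self._index.get(key)  # None if the key is absent


store = KVStore()
store.put("John_age", 38)  # keys and values can have any type and length
```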
Key-Value Store: A Data Model Supporting Scale-Out
Command / Operation:
- GET(key): read a value
- SET(key, value): write a KV item
- DEL(key): delete a KV item
A request is routed to a data server by Hash(key) -> server ID.
- It is horizontally scalable: it removes the scalability bottleneck constraining distributed databases and file systems.
- It can potentially deliver (very) high performance.
- Functionality-richer systems, such as relational databases and file systems, can be built above it.
- It allows finer control over availability.
Key-value stores meet the demand of infrastructures for data collecting, organizing, and management. (See http://www.couchbase.com/nosql-resources/what-is-no-sql)
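The Hash(key) -> server ID routing above can be sketched as follows (the use of MD5 here is an illustrative assumption; any uniformly distributed hash works):

```python
import hashlib


def server_for_key(key: bytes, num_servers: int) -> int:
    """Route a request to a data server by hashing its key."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % num_servers


# Every client computes the same server ID for the same key,
# so the store scales out horizontally without central coordination.
sid = server_for_key(b"John_age", 16)
```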
Workflow of an In-Memory Key-Value Store
Network Processing -> Memory Management -> Index Operations -> Access Value
Workflow of a Typical Key-Value Store
- Network processing: TCP/IP processing, request parsing
- GET: extract key -> search in index -> read & send value
- SET: extract key & value -> memory management (if memory is full, evict; if not full, allocate) -> insert into index
- DELETE: extract key -> delete from index
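The SET path above (evict when memory is full, otherwise allocate) can be sketched with a capacity-bounded store; the LRU eviction policy and the tiny capacity are assumptions for illustration, not the policy of any particular system:

```python
from collections import OrderedDict


class BoundedKV:
    """Toy store following the slide's workflow: GET, SET (with eviction), DELETE."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # kept oldest-first for LRU eviction

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:  # "Memory Full" branch
            self.items.popitem(last=False)      # evict the least recently used item
        self.items[key] = value                 # "Allocate" + "Insert into Index"

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)             # record the access for LRU
        return self.items[key]                  # "Read & Send Value"

    def delete(self, key):
        self.items.pop(key, None)               # "Delete from Index"
```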
Where Does Time Go in a KV-Store?
MICA [NSDI'14] breaks query time into network processing & memory management, index operations, and value access. Index operations become one of the major bottlenecks.
Data Workflow of Key-Value Stores
Query -> Network Processing & Memory Management -> Index Operation (hash table: random memory accesses) -> Access Value
Random Memory Accesses of Indexing are Expensive
[Chart: sequential vs. random memory access performance, measured on an Intel Xeon E5-2650v2 with 1600 MHz DDR3 memory]
Reason: Long Latency Comes from Row Buffer Misses
A DRAM access decomposes into a row access (activate), a column access (reading a cache line from the row buffer), and a precharge.
- Row buffer hits: repeated accesses within the same DRAM page, served directly from the row buffer
- Row buffer misses: random accesses to different pages, paying the full precharge/row-access latency
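A toy measurement of access order in Python (a sketch of the methodology only: interpreter overhead hides much of the DRAM effect, so the gap is far smaller than the C-level numbers behind the previous slide's chart):

```python
import random
import time

N = 1_000_000
data = list(range(N))

seq_idx = list(range(N))
rand_idx = seq_idx[:]
random.shuffle(rand_idx)  # same indices, random order -> more row-buffer/cache misses


def timed_sum(indices):
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start


seq_total, seq_secs = timed_sum(seq_idx)
rand_total, rand_secs = timed_sum(rand_idx)
# Both traversals touch exactly the same elements; only the access order differs.
```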
Why CPUs Cannot Accelerate the Random Memory Accesses of Index Operations
- Caching: the working set is large (~100 GB) while CPU caches are small (~10 MB)
- Prefetching: the next memory address is hard to predict
- Multithreading: the number of hardware-supported threads is limited
- Outstanding misses: limited by the number of Miss Status Holding Registers (MSHRs) when handling large volumes of requests
High Throughput is Desirable for Processing Big Data
Throughput measures the capability of a system to process a growing number of requests on an increasingly large data set.
Throughput of existing KV stores (MOPS), aiming for low latency:
- MICA (CMU research prototype): 71
- MassTree (MIT research prototype): 14
- RAMCloud (Stanford research prototype): 6
- Memcached (open-source software): 1.3
Key-value store latency: under 1 ms is acceptable in production, e.g., at Facebook and Amazon.
Our goal: maximize throughput subject to acceptable latency.
Accelerating Index Operations with GPUs
Mega-KV addresses two issues: a large number of requests and random-access delay.
- Network processing: user-space I/O, multiget, UDP
- Memory management: bitmap allocation, optimistic concurrent access
- Index operations: accelerated by GPUs
- Access value: prefetching
CPU vs. GPU
CPUs devote chip area to control logic and caches; GPUs devote it to massive ALUs.
- Intel Xeon E5-2650v2: 2.3 billion transistors, 8 cores, 59.7 GB/s memory bandwidth
- Nvidia GTX 780: 7 billion transistors, 2,304 cores, 288.4 GB/s memory bandwidth
Two Unique Advantages of GPUs for Key-Value Stores
- Massive parallel processing units: the operations in KV stores are simple, independent memory accesses, and the thousands of cores in a GPU are ideal for processing them in parallel.
- Massive hiding of random memory access latency: GPUs effectively hide memory latency with massive hardware threads and zero-overhead, hardware-supported thread scheduling. When a thread issues a memory request (e.g., on a cache miss), the GPU core switches to another ready thread from its instruction buffer instead of stalling.
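How many threads a core needs to fully hide memory latency can be estimated with a simple model; the 200 ns / 4 ns figures below are assumed for illustration, not measurements from the talk:

```python
import math


def threads_to_hide_latency(mem_latency_ns: float, compute_ns: float) -> int:
    """Each thread computes for compute_ns, then stalls for mem_latency_ns on a
    memory request. The core stays busy if enough other threads fill the stall
    window, i.e., (compute + latency) / compute threads in flight."""
    return math.ceil((compute_ns + mem_latency_ns) / compute_ns)


# e.g., 200 ns of DRAM latency hidden by threads doing 4 ns of work each
needed = threads_to_hide_latency(200, 4)
```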
Mega-KV System Framework
- Pre-Processing (CPU): network processing, memory management
- GPU Processing: index operations, executed in parallel on the GPU
- Post-Processing (CPU): read & send value, request TX
Challenges of Offloading Index Operations to GPUs
1. GPU memory capacity is small (~10 GB), while keys may take hundreds of GBs.
2. Low PCIe bandwidth: PCIe is generally the bottleneck of GPU processing when bulk data must be transferred.
3. Handling variable-length input is inefficient on GPUs: it requires searching along the input buffer or transferring an additional offset buffer, plus variable-length string comparisons.
4. Linked lists are inefficient on GPUs: the locks needed to insert/delete linked items hinder parallelism and cause more random memory accesses.
Our Solutions
- Compress each input key into a fixed-length signature, addressing challenges 2 and 3 (PCIe bandwidth and variable-length data).
- Use a GPU-optimized cuckoo hash table that stores only key signatures and value locations, addressing challenges 1, 3, and 4 (GPU memory capacity, variable-length data, and linked lists).
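A sketch of the two ideas combined: compress each key to a fixed-length signature, then store (signature, value location) pairs in a set-associative cuckoo hash table with two candidate buckets per entry. The bucket count, slot count, and SHA-1 as the compressor are illustrative assumptions; Mega-KV's actual GPU data layout differs.

```python
import hashlib

NUM_BUCKETS = 1024   # power of two, so the XOR trick below is an involution
SLOTS = 4            # set-associative buckets avoid linked lists and locks


def compress(key: bytes):
    """Compress a variable-length key into a fixed 32-bit signature + a bucket."""
    d = hashlib.sha1(key).digest()
    sig = int.from_bytes(d[:4], "little") or 1   # avoid a zero signature
    bucket = int.from_bytes(d[4:8], "little") % NUM_BUCKETS
    return sig, bucket


class SignatureCuckoo:
    def __init__(self):
        self.table = [[None] * SLOTS for _ in range(NUM_BUCKETS)]

    def _alt(self, bucket, sig):
        # Each entry's second candidate bucket; applying _alt twice returns
        # the original bucket because NUM_BUCKETS is a power of two.
        return (bucket ^ sig) % NUM_BUCKETS

    def insert(self, key, location, max_kicks=64):
        sig, bucket = compress(key)
        entry = (sig, location)
        for _ in range(max_kicks):
            slots = self.table[bucket]
            for i in range(SLOTS):
                if slots[i] is None or slots[i][0] == entry[0]:
                    slots[i] = entry
                    return True
            # Bucket full: kick a resident entry out to its alternate bucket.
            entry, slots[0] = slots[0], entry
            bucket = self._alt(bucket, entry[0])
        return False  # table too full

    def lookup(self, key):
        sig, bucket = compress(key)
        for b in (bucket, self._alt(bucket, sig)):
            for e in self.table[b]:
                if e is not None and e[0] == sig:
                    return e[1]   # value location; the full key is verified
        return None               # later, when the value itself is read
```

Because only fixed-size signatures cross PCIe and live in GPU memory, the table stays small and every lookup touches at most two buckets, with no pointer chasing.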
High Throughput Comes from Large Batch Sizes, Subject to Acceptable Latencies
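The batching trade-off can be made concrete with a toy cost model (all constants below are assumed for illustration, not Mega-KV measurements): each GPU batch pays a fixed launch/transfer overhead plus a per-item cost, so throughput rises with batch size while latency also grows with the time spent accumulating the batch.

```python
def batch_stats(batch_size, fixed_overhead_us=50.0, per_item_us=0.05,
                arrival_rate_per_us=10.0):
    """Return (throughput in ops/us, request latency in us) for one batch."""
    service_us = fixed_overhead_us + per_item_us * batch_size
    throughput = batch_size / service_us                # fixed overhead amortized
    accumulation_us = batch_size / arrival_rate_per_us  # time to fill the batch
    latency_us = accumulation_us + service_us
    return throughput, latency_us


small_tp, small_lat = batch_stats(100)
large_tp, large_lat = batch_stats(1000)
```

Under this model the right batch size is the largest one whose latency still meets the service-level target, which is the "subject to acceptable latencies" condition on the slide.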
System scalability is the result of balancing latency and throughput. [Chart: latency and throughput vs. the number of concurrent processes, with a sweet spot where throughput is high and latency is still acceptable]
A Significant Execution Time Reduction by Fast Indexing
[Chart: execution time of a query in a CPU-based KV store vs. in Mega-KV, broken into network processing & memory management, index operations, and value access]
Reaching a Record-High Throughput
Throughput is measured with 95% GET and 5% SET queries under uniform and skewed key distributions. With 8-byte keys and 8-byte values, Mega-KV reaches up to 160 MOPS. Across all data sets it is about twice as fast as the fastest CPU-based KV stores (speedups of 2.1x, 1.9x, and 2.8x).
Low and Acceptable Latency Compared with Facebook
Under its maximum throughput of 160 MOPS, the distribution of Mega-KV's round-trip latency shows 256 microseconds at the 50th percentile and 390 microseconds at the 95th percentile, versus 300 (50th) and 1,200 (95th) microseconds reported at Facebook. Mega-KV's latency is therefore well within the requirement of production systems.
The Mega-KV Software is Open to the Public (VLDB '15)
Mega-KV is open source at http://kay21s.github.io/megakv/
Summary
Restructuring basic data structures to respond to big data:
- Preserving buffer cache locality in fast writes: LSbM-tree
- Balancing network transfer and I/O bandwidth: RCFile/ORC
- Accelerating indexing operations by GPU: Mega-KV