MemcachedGPU: Scaling-up Scale-out Key-value Stores. Tayler Hetherington (The University of British Columbia), Mike O’Connor (NVIDIA / UT Austin), Tor M. Aamodt (The University of British Columbia)
Problem & Motivation: Data centers consume significant amounts of power. Demand for higher performance is continuously growing. Horizontal or vertical scaling: GP-GPUs. MemcachedGPU - SoCC'15
Why GPUs? Highly parallel. High energy efficiency (Green500: GPUs in 7 of the top 10 most energy-efficient supercomputers). General-purpose and programmable. [Figure: CPU vs. GPU]
Highlights: Network and Memcached processing on GPUs. 10 GbE line-rate at all request sizes. 95% latency < % peak throughput. 75% energy-efficiency of FPGA. Maintain Memcached QoS with other workloads.
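To put "10 GbE line-rate at all request sizes" in numbers, the sketch below computes the maximum frame rate a 10 Gb/s Ethernet link can carry for a given frame size. It uses only the standard Ethernet per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap). The function name and the sample frame sizes are illustrative assumptions and are not taken from the talk.

```python
# Per-frame wire overhead on Ethernet: 7 B preamble + 1 B start-of-frame
# delimiter + 12 B inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_fps(frame_bytes, link_bps=10_000_000_000):
    """Maximum frames per second a link can carry for a given frame size."""
    if frame_bytes < 64:
        raise ValueError("Ethernet frames are at least 64 bytes")
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    # Illustrative frame sizes only; the talk does not list its request sizes here.
    for size in (64, 128, 512, 1518):
        print(f"{size:5d} B frames: {line_rate_fps(size) / 1e6:6.2f} M frames/s")
```

The smaller the frame, the more frames per second the server must handle to saturate the link, which is why small requests are the demanding case for line-rate processing.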
GPU Network Offload Manager (GNoM). [Figure: GNoM data flow between the network card, the CPU (OS: kernel module & network driver; user-level: pre-processing and post-processing) and the GPU (networking, application). Labeled flows: packet data, packet metadata, response & recycle, receive, send.]
Challenges | Networking on GPUs: High throughput (efficient data movement; request-level parallelism through batching). Low latency (small batches; multiple concurrent batches; task-level parallelism).
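The trade-off on this slide can be sketched on the CPU: requests are grouped into small batches (request-level parallelism inside a batch), and several batches are in flight at once (task-level parallelism across batches). This is a minimal illustration only. The names `make_batches`, `process_batch` and `serve`, the thread pool, and the upper-casing stand-in for request processing are all assumptions, not GNoM's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(requests, batch_size):
    """Group requests into fixed-size batches; the last one may be short."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def process_batch(batch):
    # Hypothetical stand-in for one GPU kernel launch: on a GPU, each request
    # in the batch would be handled in parallel (request-level parallelism).
    return [req.upper() for req in batch]


def serve(requests, batch_size=4, max_in_flight=2):
    """Process several small batches concurrently (task-level parallelism)."""
    batches = make_batches(requests, batch_size)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() yields results in batch order even though batches overlap in time.
        results = pool.map(process_batch, batches)
        return [resp for batch in results for resp in batch]
```

Small batches bound the time a request waits for its batch to fill, while keeping more than one batch in flight recovers the throughput that a single small batch would leave unused.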
Application | Memcached. [Figure: Memcached distributed key-value store instances between the web tier and the storage tier, serving GET and SET requests.]
Challenges | MemcachedGPU: Limited GPU memory sizes. [Figure: the baseline keeps the hash table and key & value storage in CPU memory; MemcachedGPU keeps the hash table + key storage in GPU memory and the value storage in CPU memory.]
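The split shown on this slide can be modeled with two containers: a compact index (hash table plus keys) standing in for GPU memory, and a bulk value store standing in for CPU memory. The sketch below is a toy model under that assumption. The class name `SplitStore`, the slot-number indirection, and the Python dict used as the index are illustrative and not the talk's implementation.

```python
class SplitStore:
    """Toy model: small index + keys in 'GPU' memory, bulk values in 'CPU' memory."""

    def __init__(self):
        self.gpu_index = {}   # key -> slot number; stands in for GPU memory
        self.cpu_values = []  # slot number -> value bytes; stands in for CPU memory

    def set(self, key, value):
        slot = self.gpu_index.get(key)
        if slot is None:
            # New key: the index stores only a reference to the value's slot.
            self.gpu_index[key] = len(self.cpu_values)
            self.cpu_values.append(value)
        else:
            self.cpu_values[slot] = value  # existing key: overwrite in place

    def get(self, key):
        # The lookup touches only the index; the value is fetched by reference.
        slot = self.gpu_index.get(key)
        return None if slot is None else self.cpu_values[slot]
```

Because the index holds a fixed-size reference per key rather than the value itself, the capacity of the smaller memory limits the number of keys, not the total size of the stored values.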
Challenges | MemcachedGPU: Dynamic memory allocation (dynamic hash chaining). Reduce GET serialization. [Figure: static set-associative hash table with Set 0, Set 1, ..., Set N.]
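A static set-associative table replaces per-bucket chains with a fixed number of slots per set, so no memory is allocated after construction. The sketch below shows one way this could look. The CRC32 hash, the per-set least-recently-used eviction, the logical clock, and the class name are assumptions made for illustration, since the slide specifies only that the table is static and set-associative.

```python
import zlib


class SetAssociativeTable:
    def __init__(self, num_sets=4, ways=2):
        # Fixed storage allocated once: num_sets x ways slots, no chaining.
        self.sets = [[None] * ways for _ in range(num_sets)]
        self.clock = 0  # logical time used for per-set LRU (an assumption)

    def _set_for(self, key):
        # Assumed hash function: CRC32 of the key selects the set.
        return self.sets[zlib.crc32(key.encode()) % len(self.sets)]

    def _tick(self):
        self.clock += 1
        return self.clock

    def set(self, key, value):
        slots = self._set_for(key)
        # Prefer the slot already holding this key, then an empty slot,
        # otherwise evict the least recently used entry of this set only.
        idx = next((i for i, e in enumerate(slots) if e and e[0] == key), None)
        if idx is None:
            idx = next((i for i, e in enumerate(slots) if e is None), None)
        if idx is None:
            idx = min(range(len(slots)), key=lambda i: slots[i][2])
        slots[idx] = (key, value, self._tick())

    def get(self, key):
        slots = self._set_for(key)
        for i, e in enumerate(slots):
            if e and e[0] == key:
                slots[i] = (key, e[1], self._tick())  # refresh LRU timestamp
                return e[1]
        return None
```

The cost of the fixed layout is that a full set forces an eviction even when other sets have free slots, which is acceptable for a cache where a miss falls back to the storage tier.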
Experimental Methodology: Single client-server setup with 10 GbE NIC. High-performance NVIDIA Tesla K20c GPU (Kepler | TDP = 225 W | # Cores = 2496 | Cost = $2700). Low-power NVIDIA GTX 750 Ti GPU (Maxwell | TDP = 60 W | # Cores = 640 | Cost = $150).
Evaluation | Throughput
Evaluation | Latency
Evaluation | Power. [Figure label: High-performance GPU, 225 W TDP]
Evaluation | Energy-efficiency
Evaluation | Workload Consolidation: Limited multiprogramming on current GPUs. [Figure: a low-priority background task occupies the GPU while Memcached is blocked.]
Evaluation | Workload Consolidation: 18X maximum request latency. 50% low-priority background runtime. [Figure label: Background task running]
Conclusions: Network and Memcached processing on GPUs. 10 GbE line-rate at all request sizes. 95% latency < % peak throughput. 75% energy-efficiency of FPGA. Maintain Memcached QoS with other workloads. Code: