MemcachedGPU: Scaling-up Scale-out Key-value Stores. Tayler Hetherington – The University of British Columbia; Mike O’Connor – NVIDIA / UT Austin; Tor M. Aamodt – The University of British Columbia.


MemcachedGPU: Scaling-up Scale-out Key-value Stores
Tayler Hetherington – The University of British Columbia
Mike O’Connor – NVIDIA / UT Austin
Tor M. Aamodt – The University of British Columbia

Problem & Motivation
Data centers consume significant amounts of power
Continuously growing demand for higher performance
Horizontal or vertical scaling – GP-GPUs
MemcachedGPU - SoCC'15

Why GPUs?
Highly parallel
High energy-efficiency – Green500: GPUs in 7 of the top 10 most energy-efficient supercomputers
General-purpose & programmable
(Diagram: CPU vs. GPU)

Highlights
Network and Memcached processing on GPUs
10 GbE line-rate at all request sizes
95% latency < % peak throughput
75% energy-efficiency of an FPGA
Maintains Memcached QoS with other workloads

GPU Network Offload Manager (GNoM)
(Architecture diagram: on receive, the network card passes packet data to the GPU and packet metadata to the CPU; a kernel module & network driver on the CPU handle OS-level pre-processing, with post-processing at user level; the GPU runs the networking and application stages; on send, responses go out and buffers are recycled)
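The receive path above can be sketched at a high level. This is a hypothetical host-side Python model, not the actual GNoM implementation (which uses CUDA kernels and direct NIC-to-GPU transfers): packets are accumulated into a batch, the batch is handed to a routine standing in for the GPU kernel, and the resulting responses are post-processed and sent. The `Packet` fields and `BATCH_SIZE` value are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    metadata: dict   # headers handled on the CPU side (hypothetical field)
    payload: bytes   # request data processed by the "GPU" stage

BATCH_SIZE = 4  # illustrative; real batches hold many more requests

def gpu_kernel(batch):
    # Stand-in for the GPU networking + application kernel:
    # processes every request in the batch (in parallel on a real GPU).
    return [b"RESP:" + p.payload for p in batch]

def rx_path(packets):
    responses = []
    batch = []
    for pkt in packets:
        batch.append(pkt)                   # receive: accumulate a batch
        if len(batch) == BATCH_SIZE:
            responses += gpu_kernel(batch)  # launch per-batch GPU work
            batch = []                      # recycle the batch's buffers
    if batch:                               # flush a partial batch (bounds latency)
        responses += gpu_kernel(batch)
    return responses                        # post-process & send on the TX path

pkts = [Packet({}, f"get k{i}".encode()) for i in range(6)]
print(rx_path(pkts))  # 6 responses, produced in two batches
```

Small batches launched concurrently (as on the next slide) trade a little batching efficiency for lower per-request latency.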

Challenges | Networking on GPUs
High throughput
– Efficient data movement
– Request-level parallelism through batching
Low latency
– Small batches
– Multiple concurrent batches
– Task-level parallelism

Application | Memcached
(Diagram: the web tier issues GET and SET requests to Memcached, a distributed key-value store that sits in front of the storage tier)
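To make the GET/SET flow concrete, here is a minimal, hypothetical key-value cache sketch in Python; real Memcached is a networked C server with slab allocation, expiry, and LRU eviction, none of which are modeled here.

```python
class MiniKV:
    """Toy in-memory cache illustrating Memcached's GET/SET semantics."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value     # web tier caches a computed result

    def get(self, key):
        return self.store.get(key)  # hit returns the value, miss returns None

kv = MiniKV()
kv.set("user:42", "Alice")
print(kv.get("user:42"))  # hit: "Alice"
print(kv.get("user:99"))  # miss: None -> fall back to the storage tier
```

On a miss, the web tier fetches from the storage tier and SETs the result back into the cache.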

Challenges | MemcachedGPU
Limited GPU memory sizes
(Diagram: instead of keeping the hash table and key & value storage together in CPU memory, MemcachedGPU places the hash table + key storage in GPU memory and leaves value storage in CPU memory)
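One way to picture this partitioning: the compact hash table and keys live in GPU memory and resolve each GET to an index, while the large values stay in CPU memory and are fetched by that index on the send path. A hypothetical Python model (the variable names and index scheme are illustrative, not the paper's data layout):

```python
# "GPU memory": hash table mapping key -> value index (keys kept on this side)
gpu_hash_table = {}
# "CPU memory": large value storage, referenced by index
cpu_value_store = []

def kv_set(key, value):
    cpu_value_store.append(value)                   # value stays in CPU memory
    gpu_hash_table[key] = len(cpu_value_store) - 1  # GPU side keeps key -> index

def kv_get(key):
    idx = gpu_hash_table.get(key)   # GPU resolves the key to a value index
    if idx is None:
        return None                 # miss
    return cpu_value_store[idx]     # CPU attaches the value when sending

kv_set("k", b"large-value")
print(kv_get("k"))   # b'large-value'
```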

Challenges | MemcachedGPU
Dynamic memory allocation
– Dynamic hash chaining
Reduce GET serialization
(Diagram: a static set-associative hash table with sets 0 through N)
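The design point on this slide can be illustrated with a small sketch: instead of chained buckets that require dynamic allocation, a static set-associative table stores a fixed number of entries (ways) per set and evicts within the set when it is full. This is an illustrative Python model under assumed parameters, not the paper's GPU data structure; a real design would pick the eviction victim with a policy such as per-set LRU rather than always evicting way 0.

```python
NUM_SETS = 8   # assumed table geometry; the real table is far larger
WAYS = 4       # entries per set (associativity)

class SetAssocTable:
    def __init__(self):
        # Statically allocated: NUM_SETS sets of WAYS slots, no chaining,
        # so no dynamic memory allocation is ever needed on insert.
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    def _set_index(self, key):
        return hash(key) % NUM_SETS

    def insert(self, key, value):
        s = self.sets[self._set_index(key)]
        for i, entry in enumerate(s):
            if entry is None or entry[0] == key:
                s[i] = (key, value)   # fill a free way or update in place
                return
        s[0] = (key, value)           # set full: evict a victim (way 0 here)

    def lookup(self, key):
        for entry in self.sets[self._set_index(key)]:
            if entry is not None and entry[0] == key:
                return entry[1]
        return None                   # miss

t = SetAssocTable()
t.insert("k1", "v1")
print(t.lookup("k1"))  # "v1"
print(t.lookup("kX"))  # None
```

Because each lookup probes a fixed, bounded set of ways, concurrent GET threads avoid walking variable-length chains, which reduces serialization.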

Experimental Methodology
Single client-server setup with a 10 GbE NIC
High-performance NVIDIA Tesla K20c GPU
– Kepler | TDP = 225 W | # cores = 2496 | cost = $2700
Low-power NVIDIA GTX 750 Ti GPU
– Maxwell | TDP = 60 W | # cores = 640 | cost = $150

Evaluation | Throughput
(Throughput results figure)

Evaluation | Latency
(Latency results figure)

Evaluation | Power
(Power results figure; the high-performance GPU has a 225 W TDP)

Evaluation | Energy-efficiency
(Energy-efficiency results figure)

Evaluation | Workload Consolidation
Limited multiprogramming on current GPUs
(Diagram: Memcached shares the GPU with a low-priority background task, which can be blocked while Memcached runs)

Evaluation | Workload Consolidation
18X maximum request latency
50% low-priority background task runtime
(Figure: behavior with the background task running)

Conclusions
Network and Memcached processing on GPUs
10 GbE line-rate at all request sizes
95% latency < % peak throughput
75% energy-efficiency of an FPGA
Maintain Memcached QoS with other workloads

Code: