HyperLoop: Group-Based NIC Offloading to Accelerate Replicated Transactions in Multi-tenant Storage Systems
Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitendra Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, Srinivasan Seshan

Presentation transcript:

HyperLoop: Group-Based NIC Offloading to Accelerate Replicated Transactions in Multi-tenant Storage Systems
Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitendra Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, Srinivasan Seshan

Multi-tenant Storage Systems
- A storage frontend applies client updates to a group of replica servers through replicated transactions (e.g., chain replication).
- Replication provides data availability and integrity; transactions provide consistent and atomic updates.
- Multiple replica instances are co-located on the same server.

Problem: Large and Unpredictable Latency
Running YCSB against such a system shows two trends (chart omitted from the transcript):
- Both average and tail latencies increase.
- The gap between average and tail latencies also increases.

CPU Involvement on Replicas
On the critical path of a replicated write, each replica's CPU must (1) execute the logging operation and (2) forward it to the next member of the group (and, at the tail, forward the ACK back). The replica software, storage/network library, NIC, and storage device all sit on this path.
- CPUs are involved in executing and forwarding transactional operations.
- Arbitrary CPU scheduling leads to unpredictable latency.
- Replicas' CPU utilization hits 100%.

Our Goal
In today's storage systems, the frontend and replica software, all run by CPUs, sit on the critical path of operations. Our goal: remove replica CPUs from the critical path, so that DATA and ACKs flow between NICs and storage without ever invoking the replica software.

Our Work: HyperLoop
A framework for building fast, chain-replicated transactional storage systems, enabled by three key ideas:
1. RDMA NIC + NVM
2. Leveraging the programmability of RDMA NICs
3. APIs covering key transactional operations
It requires minimal modifications to existing applications (e.g., 866 lines for MongoDB out of ~500K lines of code) and achieves up to 24x tail latency reduction in storage applications.

Outline
- Motivation
- HyperLoop Design
- Implementation and Evaluation

Idea 1: RDMA NIC + NVM
Replace each replica's storage/network library, NIC, and storage device with an NVM/RDMA library, an RDMA NIC (RNIC), and NVM:
- RDMA (Remote Direct Memory Access) NICs enable direct memory access over the network without involving CPUs.
- NVM (Non-Volatile Memory) provides a durable storage medium that RDMA NICs can write to directly.
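As a concrete (if simplified) illustration, the sketch below shows how a replica might expose an NVM-backed log to its RNIC with standard libibverbs calls. It is a minimal sketch assuming a DAX- or tmpfs-backed file at `path`; `register_nvm_log` is an illustrative helper, not HyperLoop's actual API.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Map an NVM-backed file (a DAX mount in production; the talk's
 * evaluation emulates NVM with tmpfs) and register it with the RNIC,
 * so remote peers can WRITE log records into durable memory without
 * waking this replica's CPU. */
struct ibv_mr *register_nvm_log(struct ibv_pd *pd, const char *path,
                                size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *log = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (log == MAP_FAILED)
        return NULL;

    /* REMOTE_WRITE lets the frontend's (or the upstream replica's)
     * RNIC deposit data directly into this region. */
    return ibv_reg_mr(pd, log, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```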

Roadblock 1: Operation Forwarding
With RDMA NICs and NVM, the frontend's WRITE can log DATA directly into a replica's NVM, but each replica must still (1) execute the logging operation and (2) forward it (and, at the tail, the ACK) to the rest of the group.
- RDMA NICs can execute the logging operation by themselves.
- CPUs are still involved to request the RNICs to forward operations.

Roadblock 2: Transactional Operations
The same applies to commits: each replica must (1) execute the commit (apply the committed log to the data region) and (2) forward it down the group.
- CPUs are involved to execute and forward these operations.
- RDMA NIC primitives do not support some key transactional operations (e.g., locking, commit).

Can We Avoid the Roadblocks?
In today's storage systems the replica software stays on the critical path of operations. The question: can we push the replication functionality down into the RNICs, so the critical path runs entirely through RNICs and NVM?

Idea 2: Leveraging the Programmability of RNICs
Commodity RDMA NICs are not fully programmable, but the RDMA WAIT operation offers an opportunity:
- It is supported by commodity RDMA NICs.
- It allows a NIC to wait for the completion of an event (e.g., receiving a message).
- It triggers the NIC to perform a pre-posted operation upon that completion.
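The sketch below illustrates the WAIT-then-trigger pattern using the experimental (exp) verbs that MLNX_OFED exposed for ConnectX-3-era NICs. The opcode and field names (IBV_EXP_WR_CQE_WAIT, task.cqe_wait) follow that API but vary across OFED releases, so treat this as an assumption-laden sketch rather than drop-in code.

```c
#include <infiniband/verbs_exp.h>

/* Post a WAIT work request: the RNIC stalls this QP's send queue
 * until one completion lands on wait_cq (e.g., the frontend's SEND
 * arriving), then fires whatever WQE is queued behind it -- with no
 * CPU involvement on this replica. */
int post_wait(struct ibv_qp *qp, struct ibv_cq *wait_cq)
{
    struct ibv_exp_send_wr wr = {0}, *bad_wr;

    wr.exp_opcode             = IBV_EXP_WR_CQE_WAIT;  /* wait, then trigger */
    wr.task.cqe_wait.cq       = wait_cq;              /* CQ to watch        */
    wr.task.cqe_wait.cq_count = 1;                    /* one completion     */
    wr.exp_send_flags         = IBV_EXP_SEND_SIGNALED;

    return ibv_exp_post_send(qp, &wr, &bad_wr);
}
```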

Bootstrapping – Programming the NICs
Step 1: The frontend library collects the base addresses of the memory regions registered with the replica RNICs.
Step 2: The HyperLoop library programs each replica RNIC with an RDMA WAIT plus a template of the target operation: R1 is set to (1) WAIT for receiving, then (2) forward a WRITE with parameters (Src, Dst, Len); R2 (the tail) is set to (1) WAIT for receiving, then (2) forward the ACK.
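Step 1 amounts to a one-time metadata exchange. A hypothetical shape for the record each replica reports back (field names are illustrative, not from the paper):

```c
#include <stdint.h>

/* Per-replica bootstrap record collected by the frontend in Step 1.
 * A base address plus an rkey is the standard pair a peer needs to
 * target a registered region with one-sided RDMA. */
struct hl_region_info {
    uint64_t log_base;    /* base VA of the registered log region */
    uint32_t log_rkey;    /* remote key for the log MR */
    uint64_t param_base;  /* base VA of the parameter region read by
                           * the pre-posted WQE templates */
    uint32_t param_rkey;  /* remote key for the parameter MR */
};
```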

Forwarding Operations
Idea: manipulate the parameter regions of the programmed operations. To update the log, the frontend SENDs the parameters, e.g., Param(R1-0xA, R2-0xB, 64), and WRITEs the DATA to R1. R1's RNIC, released by its WAIT, forwards the WRITE from R1-0xA to R2-0xB; R2's RNIC forwards the ACK back. Replica NICs forward operations with the proper parameters while the replica CPUs stay idle.
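The slide's Param(R1-0xA, R2-0xB, 64) message can be pictured as one small struct plus two posts from the frontend, as sketched below. The layout and the zero-byte SEND trigger are an illustrative reconstruction under the bootstrap assumptions above, not the paper's wire format.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Parameters for one forwarded WRITE, deposited into R1's parameter
 * region so the pre-programmed WQE knows what to forward where. */
struct hl_fwd_param {
    uint64_t src_addr;   /* e.g., R1-0xA: log address on this replica */
    uint64_t dst_addr;   /* e.g., R2-0xB: log address on the next one */
    uint32_t len;        /* bytes to forward (64 in the slide) */
};

/* Frontend side: RDMA WRITE the parameters into R1's parameter
 * region, then issue a SEND; its receive completion on R1 satisfies
 * R1's WAIT and fires the pre-programmed forwarding WRITE toward R2. */
int arm_and_trigger(struct ibv_qp *qp, struct ibv_mr *param_mr,
                    struct hl_fwd_param *p,
                    uint64_t r1_param_addr, uint32_t r1_param_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)p,
        .length = sizeof(*p),
        .lkey   = param_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr;

    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = r1_param_addr;
    wr.wr.rdma.rkey        = r1_param_rkey;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_send_wr trig = {0};   /* zero-byte trigger SEND */
    trig.opcode     = IBV_WR_SEND;
    trig.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &trig, &bad_wr);
}
```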

Idea 3: APIs for Transactional Operations

  Transactional operation   Memory operation                        HyperLoop primitive
  Logging                   Memory write to the log region          Group Log
  Commit                    Memory copy from log to data region     Group Commit
  Locking/Unlocking         Compare-and-swap on the lock region     Group Lock/Unlock

See the paper for details!
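Illustrative C signatures for the primitives in the table; these are guesses at the shape of the library's API (the paper's artifact may name and parameterize them differently):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct hl_group hl_group_t;  /* opaque handle to a replica group */

/* Group Log: replicate `len` bytes into the log region of every
 * replica in the group (a group-wide memory write). */
int hl_group_log(hl_group_t *g, const void *buf, size_t len,
                 uint64_t log_off);

/* Group Commit: copy a committed record from the log region to the
 * data region on every replica (a group-wide memory copy). */
int hl_group_commit(hl_group_t *g, uint64_t log_off, uint64_t data_off,
                    size_t len);

/* Group Lock/Unlock: compare-and-swap on the lock region of every
 * replica; the lock is held only if all replicas swapped `expect`
 * for `desired`. */
int hl_group_lock(hl_group_t *g, uint64_t lock_off,
                  uint64_t expect, uint64_t desired);
int hl_group_unlock(hl_group_t *g, uint64_t lock_off);
```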

Transaction with HyperLoop Primitives
A replicated transaction becomes a sequence of group primitives issued by the frontend: (1) update the log (Group Log), (2) grab a lock (Group Lock), (3) commit the log (Group Commit), and (4) release the lock (Group Unlock). Replica NICs execute and forward every operation; replica software never runs on the critical path.
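Using the hypothetical signatures sketched above, the slide's four-step transaction could read as follows (the offsets and the lock-word location are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define HL_LOCK_OFF 0  /* illustrative: lock word at offset 0 */

/* One replicated transaction. Every call below is executed and
 * forwarded along the group by the replica RNICs; the frontend only
 * waits for the ACK from the tail of the chain. */
int replicate_txn(hl_group_t *g, const void *rec, size_t len,
                  uint64_t log_off, uint64_t data_off)
{
    if (hl_group_log(g, rec, len, log_off))          /* 1. update the log   */
        return -1;
    if (hl_group_lock(g, HL_LOCK_OFF, 0, 1))         /* 2. grab the lock    */
        return -1;
    if (hl_group_commit(g, log_off, data_off, len))  /* 3. commit the log   */
        return -1;
    return hl_group_unlock(g, HL_LOCK_OFF);          /* 4. release the lock */
}
```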

HyperLoop: Putting It All Together
- Idea 1: RDMA NIC + NVM.
- Idea 2: Leveraging the programmability of RNICs.
- Idea 3: APIs for transactional operations: Group Log, Group Commit, Group Lock/Unlock.

Outline
- Motivation
- HyperLoop Design
- Implementation and Evaluation

Implementation and Case Studies
HyperLoop library: C APIs for the HyperLoop group primitives, implemented by modifying the user-space RDMA verbs library and the NIC driver.
Case studies:
- RocksDB: added replication logic using the HyperLoop library (120 LOC modified).
- MongoDB: replaced the existing replication logic with the HyperLoop library (866 LOC modified).

Evaluation
Experimental setup: 8 servers, each with two Intel 8-core CPUs, 64 GB of RAM, and a Mellanox ConnectX-3 NIC; tmpfs emulates NVM.
Baseline implementation: provides the same APIs as HyperLoop but involves replica CPUs.
Experiments:
- Microbenchmark: latency reduction for group memory operations.
- Application benchmark: latency reduction in applications.

Microbenchmark – Write Latency
[Figure omitted: group write latency in μs on a log scale (1 to 10^5); the callout marks an 801.8x latency gap between the CPU-involved baseline and HyperLoop.]

Application Benchmark – MongoDB
[Figure omitted: MongoDB latency with HyperLoop vs. the baseline; the callout marks a 4.9x improvement.]

Other Results
- Latency reduction for other group memory operations: memory copy, 496–848x; compare-and-swap, 849x.
- Tail latency reduction for RocksDB: 5.7–24.2x.
- Latency reductions hold regardless of group size.

Conclusion
- Predictable low latency is lacking in replicated storage systems; the root cause is CPU involvement on the critical path.
- Our solution, HyperLoop, offloads transactional operations to commodity RNICs + NVM and requires minimal modifications to existing applications.
- Result: up to 24x tail latency reduction in in-memory storage applications.
- More opportunities: other data-center workloads, efficient remote memory utilization.