FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro, Microsoft Research (NSDI '14). Presented January 5th, 2016, by Cho, Hyojae
2 CONTENTS 1. Introduction 2. Background on RDMA 3. FaRM 4. Implementations (1. Shared Address Space, 2. Lock-free Operations, 3. Hashtable) 5. Evaluation 6. Conclusion 7. Criticism
3 1. Introduction Hardware trends: main memory is cheap, 100GB - 1TB per server, 10 - 100TBs in a small cluster. New data center networks: 40Gbps network throughput (approaching 100Gbps this year), 1-3 microsecond latency, RDMA primitives
4 1. Introduction So it is now feasible to store all the data in tens to hundreds of TBs of main memory, which removes the overhead of disk or flash. But the traditional TCP/IP protocol stack is too slow and becomes the bottleneck; RDMA can solve this problem.
5 2. RDMA RDMA: Remote Direct Memory Access; the NIC performs the DMA requests. FaRM uses RDMA extensively: RDMA reads to read data directly, and RDMA writes into remote buffers for messaging. Great performance: bypasses the kernel and the remote CPU (see the sketch below).
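To make the one-sided nature concrete, here is a minimal sketch of posting an RDMA read with libibverbs. It assumes the queue pair (qp), registered memory region (mr), and the remote address/rkey have already been set up and exchanged out of band; the helper name and variable names are illustrative, not FaRM's code.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA read: fetch `len` bytes at (remote_addr, rkey) into local_buf.
   The completion is reaped later with ibv_poll_cq() on the send completion queue. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                          uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;          /* where the NIC places the data */
    sge.length = len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided: remote CPU not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;       /* exchanged out of band */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);     /* NIC DMAs remote memory into local_buf */
}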
6 3. FaRM FaRM uses a circular buffer for RDMA messaging. The buffer is stored on the receiver, with one buffer for each sender/receiver pair (a receiver-side polling sketch follows).
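The sender RDMA-writes length-prefixed records into the receiver's buffer, and the receiver polls the head for new entries. A hedged sketch of the receiver side, assuming a simple header format and an application handler deliver(); FaRM's actual record layout and wrap-around handling differ.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg_hdr { uint32_t len; /* payload of `len` bytes follows */ };

void deliver(const void *payload, uint32_t len);    /* assumed application handler */

/* Poll one sender's ring: a non-zero length word at the head means the sender's
   RDMA write has landed. Process it, zero the slot for reuse, and advance. */
void poll_ring(char *ring, size_t ring_size, size_t *head)
{
    struct msg_hdr *h = (struct msg_hdr *)(ring + *head);
    while (h->len != 0) {
        deliver(h + 1, h->len);
        size_t total = sizeof(*h) + h->len;
        memset(h, 0, total);                        /* mark the slot free */
        *head = (*head + total) % ring_size;        /* sender learns the head lazily */
        h = (struct msg_hdr *)(ring + *head);
    }
}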
7 3. FaRM Random reads: request rate per machine
8 3. FaRM Random reads: latency
9 3. FaRM Applications that use FaRM: data center applications with irregular access patterns that are latency sensitive; data serving (key-value store, graph store); enabling new applications
10 4. Implementations How to program a modern cluster? Such clusters have TBs of DRAM, 100s of CPU cores, and an RDMA network. Desirable: keep data in memory, access data using RDMA, and co-locate data and computation.
11 4. Implementations Symmetric model: access to local memory is much faster, and server CPUs are mostly idle with RDMA.
12 4. Implementations Shared address space: supports direct RDMA of objects (addressing sketch below).
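As a rough illustration of the idea (the details are assumptions, not FaRM's actual layout): an object is named by a (region, offset) pair, the region maps to a machine, and a local hit is a plain memory access while a remote one becomes a one-sided RDMA read. region_to_machine, local_machine_id, local_region_base, and rdma_read are hypothetical helpers.

#include <stddef.h>
#include <stdint.h>

struct Addr { uint32_t region; uint64_t offset; };   /* illustrative address format */

int   region_to_machine(uint32_t region);            /* e.g. via consistent hashing */
int   local_machine_id(void);
char *local_region_base(uint32_t region);
void  rdma_read(int machine, uint32_t region, uint64_t offset, void *buf, size_t size);

/* Resolve an address: read locally if this machine owns the region, otherwise
   RDMA-read the object into the caller-supplied buffer. */
void *read_object(struct Addr a, void *buf, size_t size)
{
    if (region_to_machine(a.region) == local_machine_id())
        return local_region_base(a.region) + a.offset;    /* direct local access */
    rdma_read(region_to_machine(a.region), a.region, a.offset, buf, size);
    return buf;
}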
13 4. Implementations Transactions simplify programming and serve as a general primitive.
14 4. Implementations Optimizations: lock-free reads
15 4. Implementations FaRM's API
Tx* txCreate();
void txAlloc(Tx *t, int size, Addr a, Cont *c);
void txFree(Tx *t, Addr a, Cont *c);
void txRead(Tx *t, Addr a, int size, Cont *c);
void txWrite(Tx *t, ObjBuf *old, ObjBuf *new);
void txCommit(Tx *t, Cont *c);
Lf* lockFreeStart();
void lockFreeRead(Lf* op, Addr a, int size, Cont *c);
void lockFreeEnd(Lf *op);
Incarnation objGetIncarnation(ObjBuf *o);
void objIncrementIncarnation(ObjBuf *o);
void msgRegisterHandler(MsgId i, Cont *c);
void msgSend(Addr a, MsgId i, Msg *m, Cont *c);
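A hedged usage sketch of the transaction calls above, in continuation-passing style. The slide does not show the continuation signatures or the object-buffer accessors, so the callback shapes and the makeCont, objCopy, and objData helpers below are assumptions.

void onRead(Tx *t, ObjBuf *oldBuf);                  /* assumed continuation shape */
void onCommit(Tx *t, int committed);                 /* assumed continuation shape */
Cont *makeCont(void *fn);                            /* hypothetical: wrap a callback as a Cont */
ObjBuf *objCopy(ObjBuf *o);                          /* hypothetical: writable copy of an object buffer */
void *objData(ObjBuf *o);                            /* hypothetical: pointer to the object payload */

/* Increment a 64-bit counter stored at address a, as one transaction. */
void incrementCounter(Addr a) {
    Tx *t = txCreate();
    txRead(t, a, sizeof(uint64_t), makeCont(onRead));    /* async read; continuation runs on completion */
}

void onRead(Tx *t, ObjBuf *oldBuf) {
    ObjBuf *newBuf = objCopy(oldBuf);
    (*(uint64_t *)objData(newBuf))++;                     /* modify the local copy */
    txWrite(t, oldBuf, newBuf);                           /* buffer the write */
    txCommit(t, makeCont(onCommit));                      /* validate and commit */
}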
16 4. Implementations Traditional lock-free reads
17 4. Implementations Traditional lock-free reads. Problem: a read requires 3 network accesses (version, data, version again), so it is not well suited to RDMA (see the sketch below).
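A sketch of the traditional optimistic scheme implied here, with Addr as in the API above; remote_read_version and remote_read_data are hypothetical helpers, each costing one network access.

#include <stddef.h>
#include <stdint.h>

uint64_t remote_read_version(Addr a);                      /* hypothetical: one network access */
void     remote_read_data(Addr a, void *buf, size_t size); /* hypothetical: one network access */

/* Read version, read data, re-read version; retry if the object changed in
   between. Three network accesses per successful read. */
void traditional_read(Addr a, void *buf, size_t size)
{
    for (;;) {
        uint64_t v1 = remote_read_version(a);   /* access 1 */
        remote_read_data(a, buf, size);         /* access 2 */
        uint64_t v2 = remote_read_version(a);   /* access 3 */
        if (v1 == v2)
            return;                             /* consistent snapshot */
    }
}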
18 4. Implementations FaRM lock-free reads: a single RDMA read fetches the whole object, and per-cache-line versions let the reader detect and retry inconsistent (torn) reads (sketch below).
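A hedged sketch of the consistency check, assuming the object has already been fetched into obj with one RDMA read and that version_of() extracts the version word embedded in a cache line; the exact field layout in FaRM differs.

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

uint64_t version_of(const char *cache_line);   /* hypothetical: extract the embedded version */

/* The read is consistent if every cache line carries the same version as the
   first one; otherwise the RDMA read raced with a writer and must be retried. */
int read_is_consistent(const char *obj, size_t size)
{
    uint64_t v0 = version_of(obj);
    for (size_t off = CACHE_LINE; off < size; off += CACHE_LINE)
        if (version_of(obj + off) != v0)
            return 0;
    return 1;
}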
19 4. Implementations Hashtable FaRM provides a general key-value store interface, implemented as a hashtable on top of the shared address space. They designed a new algorithm, chained associative hopscotch hashing, which strikes a good balance between space efficiency and the size and number of RDMA reads (lookup sketch below).
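A hedged sketch of why hopscotch-style placement suits RDMA: a key must sit near its home bucket, so a lookup can fetch the home bucket and a neighbor with a single RDMA read. The bucket layout, hash(), and rdma_read_buckets() are illustrative, not FaRM's actual structures, and the overflow chain is omitted.

#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 8

struct Bucket { uint64_t keys[SLOTS_PER_BUCKET]; uint64_t vals[SLOTS_PER_BUCKET]; };

extern size_t num_buckets;
uint64_t hash(uint64_t key);                                      /* hypothetical hash function */
void rdma_read_buckets(size_t index, struct Bucket *out, int n);  /* one RDMA read of n adjacent buckets */

/* Lookup: a single RDMA read covers the home bucket and its neighbor; scan both. */
int lookup(uint64_t key, uint64_t *val_out)
{
    size_t home = hash(key) % num_buckets;
    struct Bucket pair[2];
    rdma_read_buckets(home, pair, 2);
    for (int b = 0; b < 2; b++)
        for (int s = 0; s < SLOTS_PER_BUCKET; s++)
            if (pair[b].keys[s] == key) { *val_out = pair[b].vals[s]; return 1; }
    return 0;    /* not found in the neighborhood (overflow chain not shown) */
}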
20 5. Evaluation Test environment: an isolated cluster with 20 machines. Each machine had a 40Gbps Mellanox ConnectX-3 RoCE NIC, a 240GB Intel 520 SSD for logging, and 128GB of DRAM, configured as 28GB of private memory and 100GB of shared memory.
21 5. Evaluation
22 5. Evaluation TAO: Facebook's in-memory graph store. Workload: read-dominated (99.8%), 10 operation types. FaRM implementation: nodes and edges are FaRM objects, lock-free reads for lookups, transactions for updates. Operation throughput: 10x improvement; latency: 40-50x improvement.
23 6. Conclusion FaRM is a platform for distributed computing: data is kept in memory and accessed with RDMA. It provides a shared memory abstraction with transactions and lock-free reads, delivers order-of-magnitude performance improvements, and enables new applications.
24 7. Criticism