FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro, Microsoft Research (NSDI '14). Presented January 5th, 2016, by Cho, Hyojae
2 CONTENTS 1. Introduction 2. Background on RDMA 3. FaRM 4. Implementations (1. Shared Address Space, 2. Lock-free Operations, 3. Hashtable) 5. Evaluation 6. Conclusion 7. Criticism
3 1. Introduction Hardware trends: main memory is cheap, 100GB - 1TB per server, 10 - 100TBs in a small cluster. New data center networks: 40Gbps network throughput (approaching 100Gbps this year), 1-3 microsecond latency, RDMA primitives
4 1. Introduction So it is now feasible to store all the data in tens to hundreds of TBs of main memory, which removes the overhead of disk or flash. But the traditional TCP/IP protocol stack is too slow and becomes the bottleneck; RDMA can solve this problem.
5 2. RDMA RDMA: Remote Direct Memory Access; the NIC performs the DMA requests. FaRM uses RDMA extensively: RDMA reads to read data directly, and RDMA writes into remote buffers for messaging. Great performance: bypasses the kernel and the remote CPU (see the sketch below).
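To make the one-sided nature concrete, here is a minimal sketch of posting an RDMA read with libibverbs. It assumes the queue pair (qp), registered memory region (mr), and the remote address/rkey have already been set up and exchanged out of band; the helper name and variable names are illustrative, not FaRM's code.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA read: fetch `len` bytes at (remote_addr, rkey) into local_buf.
   The completion is reaped later with ibv_poll_cq() on the send completion queue. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                          uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;          /* where the NIC places the data */
    sge.length = len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided: remote CPU not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;       /* exchanged out of band */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);     /* NIC DMAs remote memory into local_buf */
}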
6 3. FaRM FaRM uses a circular buffer for RDMA messaging. The buffer is stored on the receiver, with one buffer for each sender/receiver pair (a receiver-side polling sketch follows).
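The sender RDMA-writes length-prefixed records into the receiver's buffer, and the receiver polls the head for new entries. A hedged sketch of the receiver side, assuming a simple header format and an application handler deliver(); FaRM's actual record layout and wrap-around handling differ.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg_hdr { uint32_t len; /* payload of `len` bytes follows */ };

void deliver(const void *payload, uint32_t len);    /* assumed application handler */

/* Poll one sender's ring: a non-zero length word at the head means the sender's
   RDMA write has landed. Process it, zero the slot for reuse, and advance. */
void poll_ring(char *ring, size_t ring_size, size_t *head)
{
    struct msg_hdr *h = (struct msg_hdr *)(ring + *head);
    while (h->len != 0) {
        deliver(h + 1, h->len);
        size_t total = sizeof(*h) + h->len;
        memset(h, 0, total);                        /* mark the slot free */
        *head = (*head + total) % ring_size;        /* sender learns the head lazily */
        h = (struct msg_hdr *)(ring + *head);
    }
}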
7 3. FaRM Random reads: request rate per machine
8 3. FaRM Random reads: latency
9 3. FaRM Applications that use FaRM: data center applications with irregular access patterns that are latency sensitive; data serving (key-value store, graph store); enabling new applications
10 4. Implementations How to program a modern cluster? Such clusters have TBs of DRAM, 100s of CPU cores, and an RDMA network. Desirable: keep data in memory, access data using RDMA, and co-locate data and computation.
11 4. Implementations Symmetric model: access to local memory is much faster, and server CPUs are mostly idle with RDMA.
12 4. Implementations Shared address space: supports direct RDMA of objects (addressing sketch below).
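As a rough illustration of the idea (the details are assumptions, not FaRM's actual layout): an object is named by a (region, offset) pair, the region maps to a machine, and a local hit is a plain memory access while a remote one becomes a one-sided RDMA read. region_to_machine, local_machine_id, local_region_base, and rdma_read are hypothetical helpers.

#include <stddef.h>
#include <stdint.h>

struct Addr { uint32_t region; uint64_t offset; };   /* illustrative address format */

int   region_to_machine(uint32_t region);            /* e.g. via consistent hashing */
int   local_machine_id(void);
char *local_region_base(uint32_t region);
void  rdma_read(int machine, uint32_t region, uint64_t offset, void *buf, size_t size);

/* Resolve an address: read locally if this machine owns the region, otherwise
   RDMA-read the object into the caller-supplied buffer. */
void *read_object(struct Addr a, void *buf, size_t size)
{
    if (region_to_machine(a.region) == local_machine_id())
        return local_region_base(a.region) + a.offset;    /* direct local access */
    rdma_read(region_to_machine(a.region), a.region, a.offset, buf, size);
    return buf;
}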
13 4. Implementations Transactions simplify programming and serve as a general primitive.
14 4. Implementations Optimizations: lock-free reads
15 4. Implementations FaRM's API
Tx* txCreate();
void txAlloc(Tx *t, int size, Addr a, Cont *c);
void txFree(Tx *t, Addr a, Cont *c);
void txRead(Tx *t, Addr a, int size, Cont *c);
void txWrite(Tx *t, ObjBuf *old, ObjBuf *new);
void txCommit(Tx *t, Cont *c);
Lf* lockFreeStart();
void lockFreeRead(Lf* op, Addr a, int size, Cont *c);
void lockFreeEnd(Lf *op);
Incarnation objGetIncarnation(ObjBuf *o);
void objIncrementIncarnation(ObjBuf *o);
void msgRegisterHandler(MsgId i, Cont *c);
void msgSend(Addr a, MsgId i, Msg *m, Cont *c);
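A hedged usage sketch of the transaction calls above, in continuation-passing style. The slide does not show the continuation signatures or the object-buffer accessors, so the callback shapes and the makeCont, objCopy, and objData helpers below are assumptions.

void onRead(Tx *t, ObjBuf *oldBuf);                  /* assumed continuation shape */
void onCommit(Tx *t, int committed);                 /* assumed continuation shape */
Cont *makeCont(void *fn);                            /* hypothetical: wrap a callback as a Cont */
ObjBuf *objCopy(ObjBuf *o);                          /* hypothetical: writable copy of an object buffer */
void *objData(ObjBuf *o);                            /* hypothetical: pointer to the object payload */

/* Increment a 64-bit counter stored at address a, as one transaction. */
void incrementCounter(Addr a) {
    Tx *t = txCreate();
    txRead(t, a, sizeof(uint64_t), makeCont(onRead));    /* async read; continuation runs on completion */
}

void onRead(Tx *t, ObjBuf *oldBuf) {
    ObjBuf *newBuf = objCopy(oldBuf);
    (*(uint64_t *)objData(newBuf))++;                     /* modify the local copy */
    txWrite(t, oldBuf, newBuf);                           /* buffer the write */
    txCommit(t, makeCont(onCommit));                      /* validate and commit */
}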
16 4. Implementations Traditional lock-free reads
17 4. Implementations Traditional lock-free reads. Problem: a read requires 3 network accesses (version, data, version again), so it is not well suited to RDMA (see the sketch below).
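A sketch of the traditional optimistic scheme implied here, with Addr as in the API above; remote_read_version and remote_read_data are hypothetical helpers, each costing one network access.

#include <stddef.h>
#include <stdint.h>

uint64_t remote_read_version(Addr a);                      /* hypothetical: one network access */
void     remote_read_data(Addr a, void *buf, size_t size); /* hypothetical: one network access */

/* Read version, read data, re-read version; retry if the object changed in
   between. Three network accesses per successful read. */
void traditional_read(Addr a, void *buf, size_t size)
{
    for (;;) {
        uint64_t v1 = remote_read_version(a);   /* access 1 */
        remote_read_data(a, buf, size);         /* access 2 */
        uint64_t v2 = remote_read_version(a);   /* access 3 */
        if (v1 == v2)
            return;                             /* consistent snapshot */
    }
}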
18 4. Implementations FaRM lock-free reads: a single RDMA read fetches the whole object, and per-cache-line versions let the reader detect and retry inconsistent (torn) reads (sketch below).
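A hedged sketch of the consistency check, assuming the object has already been fetched into obj with one RDMA read and that version_of() extracts the version word embedded in a cache line; the exact field layout in FaRM differs.

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

uint64_t version_of(const char *cache_line);   /* hypothetical: extract the embedded version */

/* The read is consistent if every cache line carries the same version as the
   first one; otherwise the RDMA read raced with a writer and must be retried. */
int read_is_consistent(const char *obj, size_t size)
{
    uint64_t v0 = version_of(obj);
    for (size_t off = CACHE_LINE; off < size; off += CACHE_LINE)
        if (version_of(obj + off) != v0)
            return 0;
    return 1;
}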
19 4. Implementations Hashtable FaRM provides a general key-value store interface, implemented as a hashtable on top of the shared address space. They designed a new algorithm, chained associative hopscotch hashing, which strikes a good balance between space efficiency and the size and number of RDMA reads (lookup sketch below).
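A hedged sketch of why hopscotch-style placement suits RDMA: a key must sit near its home bucket, so a lookup can fetch the home bucket and a neighbor with a single RDMA read. The bucket layout, hash(), and rdma_read_buckets() are illustrative, not FaRM's actual structures, and the overflow chain is omitted.

#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 8

struct Bucket { uint64_t keys[SLOTS_PER_BUCKET]; uint64_t vals[SLOTS_PER_BUCKET]; };

extern size_t num_buckets;
uint64_t hash(uint64_t key);                                      /* hypothetical hash function */
void rdma_read_buckets(size_t index, struct Bucket *out, int n);  /* one RDMA read of n adjacent buckets */

/* Lookup: a single RDMA read covers the home bucket and its neighbor; scan both. */
int lookup(uint64_t key, uint64_t *val_out)
{
    size_t home = hash(key) % num_buckets;
    struct Bucket pair[2];
    rdma_read_buckets(home, pair, 2);
    for (int b = 0; b < 2; b++)
        for (int s = 0; s < SLOTS_PER_BUCKET; s++)
            if (pair[b].keys[s] == key) { *val_out = pair[b].vals[s]; return 1; }
    return 0;    /* not found in the neighborhood (overflow chain not shown) */
}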
20 5. Evaluation Test environment: an isolated cluster with 20 machines. Each machine had a 40Gbps Mellanox ConnectX-3 RoCE NIC, a 240GB Intel 520 SSD for logging, and 128GB of DRAM, configured as 28GB of private memory and 100GB of shared memory.
21 5. Evaluation
22 5. Evaluation TAO: Facebook's in-memory graph store. Workload: read-dominated (99.8%), 10 operation types. FaRM implementation: nodes and edges are FaRM objects, lock-free reads for lookups, transactions for updates. Operation throughput: 10x improvement; latency: 40-50x improvement.
23 6. Conclusion FaRM is a platform for distributed computing: data is kept in memory and accessed with RDMA. It provides a shared memory abstraction with transactions and lock-free reads, delivers order-of-magnitude performance improvements, and enables new applications.
24 7. Criticism