SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing

Presentation transcript:

SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
Alexandros Daglis, Dmitrii Ustiugov, Stanko Novakovic, Edouard Bugnion, Babak Falsafi, Boris Grot

Online Services
Powered by datacenters: PBs of data, millions of users
Mostly reads, occasional writes
Massive datasets, immense reader concurrency

Remote Memory Access
Frequent remote memory accesses within the datacenter
Networking traditionally expensive; RDMA technology provides fast one-sided ops
Object semantics: need to read objects atomically
One-sided operations do not provide object read atomicity

Object Atomicity in Software
One-sided ops' atomicity is limited to a single cache line
Object atomicity is provided by SW: embedded metadata, checked by the CPU after the transfer
SW checks are expensive: up to 50% of a remote object read!
SABRes: minimal NI extensions, up to 2x faster object reads
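As a concrete illustration, here is a minimal C sketch of the software pattern the slide refers to: metadata embedded in the object, with a post-transfer consistency check and retry loop on the reader's CPU. The layout (a replicated head/tail version), the one_sided_read primitive, and all constants are illustrative assumptions, not the API of any particular system; FaRM's actual scheme uses per-cache-line versions (next slide), since a one-sided transfer need not read the buffer in address order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAYLOAD_SIZE 1024
#define MAX_RETRIES  8

/* Illustrative object layout with embedded metadata: the version is
 * replicated at the head and tail. A writer bumps the head version,
 * updates the payload, then bumps the tail, so a torn read shows
 * head != tail (or an odd, write-in-progress version). */
typedef struct {
    uint64_t head_version;
    uint8_t  payload[PAYLOAD_SIZE];
    uint64_t tail_version;
} object_t;

/* Assumed one-sided read primitive (e.g., a wrapper over RDMA READ). */
void one_sided_read(void *local_buf, uint64_t remote_addr, size_t len);

/* Post-transfer atomicity check, performed in software by the CPU. */
bool read_object_atomic(uint64_t remote_addr, object_t *buf) {
    for (int i = 0; i < MAX_RETRIES; i++) {
        one_sided_read(buf, remote_addr, sizeof *buf);
        if (buf->head_version == buf->tail_version &&
            (buf->head_version & 1) == 0)
            return true;        /* consistent snapshot */
    }
    return false;               /* caller falls back to locks/RPC */
}
```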

Outline
Overview
Background
SABRes design
Results
Summary

Modern Distributed Object Stores
Infrequent writes (over RPC); optimistic one-sided reads (RDMA)
Rare conflicts (read-dominated workloads)
High concurrency & fast remote memory accesses
Hardware does not provide one-sided atomic object reads, so object read atomicity falls to software mechanisms
Examples: Pilaf [Mitchell'13], FaRM [Dragojevic'14,'15]
SW mechanisms enable concurrent atomic object reads

State-of-the-art Atomic Reads
Per-cache-line versions (FaRM [Dragojevic'14,'15])
Object layout in memory: application data interleaved with a version word per cache line
A writer bumps every line's version; the destination CPU checks that all versions match, then copies the data out (the same check runs even for LOCAL accesses!)
✗ CPU overhead for version embedding/stripping
✗ Intermediate buffering for all reads & writes
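A rough sketch of that check-and-strip step, under the stated layout (a version word at the start of each 64B line); this is a simplified rendering of the idea, not FaRM's actual code or exact layout:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE   64
#define VERSION_SIZE sizeof(uint64_t)
#define DATA_PER_CL  (CACHE_LINE - VERSION_SIZE)

/* Post-transfer check-and-strip, as done by the destination CPU:
 * verify every line in the wire buffer carries the same version,
 * then copy the payload out, dropping the embedded versions. Both
 * steps cost CPU time, and every read needs an intermediate buffer. */
bool check_and_strip(const uint8_t *wire_buf, size_t n_lines,
                     uint8_t *app_buf) {
    uint64_t v0;
    memcpy(&v0, wire_buf, VERSION_SIZE);
    for (size_t i = 0; i < n_lines; i++) {
        const uint8_t *line = wire_buf + i * CACHE_LINE;
        uint64_t v;
        memcpy(&v, line, VERSION_SIZE);
        if (v != v0)
            return false;    /* torn read: versions disagree, retry */
        memcpy(app_buf + i * DATA_PER_CL,
               line + VERSION_SIZE, DATA_PER_CL);
    }
    return true;
}
```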

Emerging Networking Technologies
RDMA: lean user-level protocol, high-performance fabrics, one-sided ops
Scale-Out NUMA [ASPLOS'14]: all of the above, plus tight integration with an on-chip NI
Object read latency breakdown: the SW atomicity check is ~15% on RDMA, but ~50% on soNUMA, where the transfer itself is much faster
Faster remote memory access exacerbates SW overheads

Software Atomicity Overhead
soNUMA: remote access latency ~4x of local [ASPLOS'14]
FaRM KV store on ARM Cortex-A57 cores: the SW atomicity check takes up to 50% of a remote object read, while the transfer itself is a small fraction
Need hardware support for object atomicity

Design Goals
Low latency & high throughput: low single-transfer latency, high inter-transfer concurrency
Lightweight HW support: small hardware footprint
Detect atomicity violations early, unlike the post-transfer checks of SW solutions
Can meet these goals with hardware extensions in the NI

Towards Hardware Atomic Reads
Leverage the NI's coherent on-chip integration to detect atomicity violations
End-to-end guarantees: objects are structured SW constructs
- Range of consecutive addresses
- Object header w/ version
Writers always increment the version before & after a data update
HW/SW co-design for efficient hardware
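That writer-side contract is essentially a seqlock. A minimal sketch, assuming a single local writer per object and illustrative field names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Object layout assumed by the HW/SW contract: a version header
 * followed by the object's data at consecutive addresses. */
typedef struct {
    volatile uint64_t version;   /* odd while a write is in progress */
    uint8_t data[];
} object_t;

/* Seqlock-style local write: increment the version before and after
 * the data update, with release fences so a coherent observer (here,
 * the NI) never sees new data paired with the old, even version. */
void write_object(object_t *obj, const void *src, size_t len) {
    obj->version++;                           /* odd: write in progress */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    memcpy(obj->data, src, len);
    __atomic_thread_fence(__ATOMIC_RELEASE);
    obj->version++;                           /* even: write complete */
}
```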

SABRes: Atomic Remote Object Reads in Hardware
Our goal: simple hardware for zero-overhead atomic object reads
The baseline design leverages the "version+data" object layout; on a remote read, the NI logic:
1. reads the object's version
2. reads the object's data
3. re-reads the version & compares; on a version mismatch it aborts, otherwise it replies with the data
Serialized version/data reads increase latency
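The same control flow as C-style pseudocode for the destination NI's one-sided ops controller; every ni_* primitive is a hypothetical stand-in for NI-internal machinery, not a real API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed NI-side primitives: coherent reads from the local memory
 * hierarchy and a reply path back to the requesting node. */
uint64_t ni_read_version(uint64_t obj_base);
void     ni_read_data(uint64_t obj_base, void *buf, size_t len);
void     ni_reply_data(uint64_t req_id, const void *buf, size_t len);
void     ni_reply_abort(uint64_t req_id);

/* Baseline SABRe: the NI serializes version read, data read, and a
 * version re-read. Correct, but the serialization adds latency. */
void handle_remote_object_read(uint64_t req_id, uint64_t obj_base,
                               void *buf, size_t len) {
    uint64_t v1 = ni_read_version(obj_base);     /* step 1 */
    if (v1 & 1) {                                /* write in flight */
        ni_reply_abort(req_id);
        return;
    }
    ni_read_data(obj_base, buf, len);            /* step 2 */
    uint64_t v2 = ni_read_version(obj_base);     /* step 3 */
    if (v1 == v2)
        ni_reply_data(req_id, buf, len);         /* consistent */
    else
        ni_reply_abort(req_id);                  /* version mismatch */
}
```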

SABRes: Atomic Remote Object Reads in Hardware
Leverage:
- "version+data" object layout
- Coherent NI integration & object address contiguity
Speculatively read version & data in parallel; address range tracking snoops coherence invalidations, and an invalidation from a concurrent local write aborts the read
Simple hardware, atomicity w/ zero latency overhead

Address Range Tracking Implementation
A set of stream buffers in the NI logic, one per in-flight object read, each tracking the object's base address and its outstanding cache-line requests
Stream buffer entry state: Free, Used (awaiting reply from memory), Completed (memory reply received)
Example: an object read of 6 cache lines completes and its payload is sent back to the requester; if an invalidation hits the tracked range first, the read conflicts and the NI replies with an abort
Speculation: no serialization & early conflict detection
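A behavioral sketch of that tracking structure in C (buffer sizes taken from the results slide; everything else, including the function names, is an illustrative assumption about what is really implemented as NI hardware, not software):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUFFERS 16    /* evaluated configuration: 16 stream buffers */
#define N_ENTRIES 32    /* of 32 entries each                         */
#define LINE_SIZE 64

typedef enum { FREE, USED, COMPLETED } entry_state_t;

/* One stream buffer tracks one in-flight SABRe: the object's base
 * address plus the state of each outstanding cache-line request. */
typedef struct {
    bool          active;
    uint64_t      base;                  /* object base address        */
    size_t        n_lines;               /* object size in cache lines */
    entry_state_t entry[N_ENTRIES];
    bool          aborted;               /* conflict detected early    */
} stream_buffer_t;

static stream_buffer_t bufs[N_BUFFERS];

/* Coherence invalidations snoop the stream buffers: an invalidation
 * inside a tracked range means a concurrent local write raced with
 * the read, so the SABRe is flagged to reply with an abort. */
void on_invalidation(uint64_t addr) {
    for (int i = 0; i < N_BUFFERS; i++) {
        stream_buffer_t *sb = &bufs[i];
        if (sb->active &&
            addr >= sb->base &&
            addr <  sb->base + sb->n_lines * LINE_SIZE)
            sb->aborted = true;          /* CONFLICT: abort this read  */
    }
}

/* Memory replies mark entries Completed; once every line is in and
 * no conflict was flagged, the payload goes back to the requester. */
bool on_memory_reply(stream_buffer_t *sb, size_t line_idx) {
    sb->entry[line_idx] = COMPLETED;
    for (size_t i = 0; i < sb->n_lines; i++)
        if (sb->entry[i] != COMPLETED)
            return false;                /* still awaiting replies     */
    return !sb->aborted;                 /* true: send payload back    */
}
```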

FaRM KV Store on soNUMA
SABRes w/ 16 32-entry stream buffers → 35-52% lower end-to-end latency than SW atomicity
- Removes SW atomicity check overhead
- Removes intermediate buffer management
Latency reduction → commensurate throughput improvements

Summary
Software object atomicity mechanisms are increasingly costly
SABRes: lightweight NI extensions for atomic remote object reads
Up to 2x latency & throughput benefit for distributed object stores

Thank you! Questions?
For more information, please visit us at parsa.epfl.ch/sonuma