MV-RLU: Scaling Read-Log-Update with Multi-Versioning


MV-RLU: Scaling Read-Log-Update with Multi-Versioning
Jaeho Kim, Ajit Mathew, Sanidhya Kashyap†, Madhava Krishnan Ramanathan, Changwoo Min

Core count continues to rise….

Synchronization mechanisms are essential
A single scalability bottleneck in a synchronization mechanism can result in a performance collapse with increasing core count [Boyd-Wickizer, 2010]
The scalability of a program depends on the scalability of its underlying synchronization mechanism

Synchronization mechanisms are essential building blocks of today's applications. Researchers at MIT have shown that a single scalability bottleneck in a synchronization mechanism can cause performance collapse at high core count. Hence the scalability of a program depends on its underlying synchronization mechanism, which makes the search for new scalable synchronization mechanisms an interesting problem.

Can synchronization mechanisms scale at high core count?
[Figure: ideal scaling vs. saturation]

Consider a microbenchmark of a concurrent lock-free hash table with 10% updates and 1K elements; throughput (MOPS) is on the y-axis and thread count on the x-axis. Ideally, performance should scale linearly, but the lock-free hash table saturates at about 120 cores. The reason for this saturation is non-scalable garbage collection.

Read-Copy-Update (RCU)
Widely used in the Linux kernel
Readers never block
Multi-pointer update is difficult
[Figure: list A → B → C, with B replaced by a new copy B']
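The single-pointer case RCU handles well can be sketched as follows. This is a minimal illustrative sketch, not the Linux kernel API: replacing node B in a list A → B → C takes one atomic pointer store, which is exactly why an update that must change several pointers at once has no equally simple RCU idiom.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal RCU-style list node: readers traverse via atomic loads. */
struct node { int val; _Atomic(struct node *) next; };

static struct node *make_node(int val, struct node *next) {
    struct node *n = malloc(sizeof *n);
    n->val = val;
    atomic_store(&n->next, next);
    return n;
}

/* Publish a copy of a->next in a single atomic store; concurrent
 * readers see either the old version or the new one, never a mix.
 * Real RCU defers freeing the old node until a grace period passes. */
static struct node *rcu_replace_next(struct node *a, int new_val) {
    struct node *old = atomic_load(&a->next);
    struct node *b2 = make_node(new_val, atomic_load(&old->next));
    atomic_store(&a->next, b2);   /* the one-pointer update */
    return old;                   /* caller frees after grace period */
}
```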

Read-Log-Update (RLU) [Matveev '15]
Readers do not block
Allows multi-pointer updates
Key idea: use a global clock and per-thread logs to make updates atomically visible

An improvement over RCU is RLU, or Read-Log-Update. In RLU, readers do not block, and RLU provides better programmability because it allows multi-pointer updates. It achieves this by combining a global clock with per-thread logs.
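The global-clock idea can be sketched as follows. This is an assumed simplification, not the RLU source: a writer stages its copy in a per-thread log, and commit makes the copy visible by stamping it with `global_clock + 1` and then incrementing the clock; a reader that sampled the clock before the commit never sees the copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned long global_clock = 0;

/* A staged copy in some thread's log, stamped at commit time. */
struct obj_copy { unsigned long write_clock; int data; };

/* Reader rule: dereference the log copy only if it was committed
 * at or before the reader's clock sample. */
static bool reader_sees_copy(unsigned long reader_clock,
                             const struct obj_copy *copy) {
    return copy->write_clock <= reader_clock;
}

/* Writer commit: stamp the copy, then publish the new clock value,
 * making the update atomically visible to later readers. */
static void rlu_commit(struct obj_copy *copy, int new_data) {
    copy->data = new_data;
    copy->write_clock = atomic_load(&global_clock) + 1;
    atomic_fetch_add(&global_clock, 1);
}
```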

Even RCU and RLU do not scale
[Figure: RCU saturates; RLU collapses]

Even non-linearizable algorithms like RCU, which is heavily used in the Linux kernel, do not scale: RCU's performance saturates at about 100 cores. RLU, which is similar to RCU in that it provides synchronization-free reads, and which offers better programmability through multi-pointer updates, shows a performance collapse at moderately high core counts. Since RLU has such excellent properties, making it scalable is an interesting problem.

Why does RLU not scale?
[Figure: RLU linked list A → B → C → D with version B' in a per-thread version log; legend: object, version of object, header; a second writer waits to reclaim B']

To understand why RLU does not scale, consider an RLU-based concurrent linked list. A thread modifies node B, which creates a new version of B. Another thread then tries to modify B but cannot proceed, because RLU restricts each object to at most two simultaneous versions. The second thread must first reclaim the older version of B, which means waiting for all RLU threads to exit their critical sections. This synchronous waiting, caused by the restriction on the number of versions per object, is the bottleneck in the RLU design.

How to scale RLU?
Problem: the restriction on the number of versions causes synchronous waiting
Solution: remove the restriction on the number of versions == multi-versioning

Scaling RLU through Multi-Versioning
Multi-Version Read-Log-Update (MV-RLU)
Allows multiple versions to exist at the same time
Removes synchronous waiting from the critical path
Scaling multi-versioning: scalable timestamp allocation; concurrent and autonomous garbage collection
MV-RLU shows a 4-8x performance improvement with KyotoCabinet

Outline
Background
Design: Overview; Reads/Writes in MV-RLU; Isolation; Garbage Collection
Evaluation
Conclusion

Design: Overview
[Figure: master object A with version chain of copy objects A'' (commit timestamp 55) → A' (commit timestamp 25), each copy stored in a per-thread version log]

In MV-RLU, an object, which we call a master object, can have zero or more versions, which we call copy objects. Each copy object stores the commit timestamp of when it was created, as well as a pointer to the next older version, forming a version chain. Copy objects are stored in the per-thread version log of the thread that created them, and the master object stores the head of the version chain.
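The layout above can be sketched with illustrative types (these names mirror the slide, not the MV-RLU source): a master object heads a chain of timestamped copy objects, newest first.

```c
#include <assert.h>
#include <stddef.h>

/* A committed version of an object, stored in some thread's log. */
struct copy_obj {
    unsigned long commit_ts;   /* when this version was committed */
    struct copy_obj *older;    /* next older version in the chain */
    int data;                  /* payload snapshot */
};

/* The master object holds the head of the version chain. */
struct master_obj {
    struct copy_obj *chain;    /* newest version, or NULL if none */
    int data;                  /* the master copy itself */
};

/* Link a new version committed at timestamp ts onto the chain head. */
static void push_version(struct master_obj *m, struct copy_obj *c,
                         unsigned long ts, int data) {
    c->commit_ts = ts;
    c->data = data;
    c->older = m->chain;
    m->chain = c;
}
```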

Updates in MV-RLU
[Figure: linked list A → B → C → D with copies B'' (55) and B' (25) in per-thread version logs; no synchronous waiting]

To understand how MV-RLU avoids synchronous waiting, consider how updates are done. In an MV-RLU-based concurrent linked list, a thread updates node B, which creates a new copy object of B with timestamp 25. Another thread then updates B, creating another copy object. Because MV-RLU does not restrict the number of versions of an object, the second thread does not need to synchronize with other reader or writer threads in the critical section.

Reads in MV-RLU
[Figure: reader with read clock 35 skips B'' because 35 < 55, and reads B' because 35 > 25]

A reader notes the global clock at the start of its critical section. Since MV-RLU keeps multiple versions of the same data, reads are slightly more involved: when a reader enters a critical section, it reads the global clock to obtain its read clock. It then traverses the version chain to find the first version whose commit timestamp is less than or equal to its read clock.
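The read rule above can be sketched as follows (types are illustrative, mirroring the overview slide): walk the chain newest to oldest, return the first version committed at or before the reader's clock, and fall back to the master object if no version qualifies.

```c
#include <assert.h>
#include <stddef.h>

struct copy_obj {
    unsigned long commit_ts;
    struct copy_obj *older;
    int data;
};

struct master_obj { struct copy_obj *chain; int data; };

/* Return the payload visible to a reader with the given read clock. */
static int mvrlu_read(const struct master_obj *m, unsigned long read_clock) {
    for (const struct copy_obj *v = m->chain; v != NULL; v = v->older)
        if (v->commit_ts <= read_clock)   /* committed before we began */
            return v->data;
    return m->data;                       /* no visible version yet */
}
```

With versions stamped 55 and 25, a reader whose clock is 35 skips the newer version and reads the one stamped 25, matching the slide's example.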

Writers never abort readers
Default isolation: snapshot isolation
Every reader gets a consistent snapshot of the data structure
Can cause the write-skew anomaly
Can be made serializable by locking extra nodes, e.g., locking the current and previous nodes in a linked list

In MV-RLU, every reader reads a consistent snapshot of the data structure, so the default isolation level of MV-RLU is snapshot isolation. Snapshot isolation can cause write skew, but this can be avoided by locking extra nodes; for example, in a linked list, locking the predecessor of the node being modified prevents write skew. For this, MV-RLU provides a special API, try_lock_const(). Note that writers never abort readers in MV-RLU, which leads to high performance.

Memory is limited! Garbage collection is needed
[Figure: Thread 1's per-thread log is full]
Reclaim obsolete versions
GC should be scalable

Challenges to garbage collection
How to detect obsolete versions in a scalable manner? Reference counting and hazard pointers do not scale well.
How to reclaim obsolete versions? A single thread is insufficient; GC should be multi-threaded.
When to trigger garbage collection? Eager: triggering too often wastes CPU cycles. Lazy: increases version-chain traversal cost.

Detecting obsolete versions
Quiescent-State-Based Reclamation (QSBR)
Grace period: a time interval during which every thread has been outside a critical section at least once
Grace-period detection is delegated to a special thread
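The detector's job can be sketched as follows. This is an assumed simplification of QSBR, not the MV-RLU implementation: each thread bumps a counter whenever it is outside a critical section, and a grace period has elapsed once every thread's counter has moved past a snapshot taken at the start.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 4
static _Atomic unsigned long qcount[NTHREADS];

/* Called by a worker thread between critical sections. */
static void quiescent_state(int tid) {
    atomic_fetch_add(&qcount[tid], 1);
}

/* Detector thread: snapshot all counters at the start... */
static void snapshot(unsigned long snap[NTHREADS]) {
    for (int i = 0; i < NTHREADS; i++)
        snap[i] = atomic_load(&qcount[i]);
}

/* ...then poll until every thread has advanced past its snapshot,
 * i.e., has passed through a quiescent state at least once. */
static bool grace_period_elapsed(const unsigned long snap[NTHREADS]) {
    for (int i = 0; i < NTHREADS; i++)
        if (atomic_load(&qcount[i]) == snap[i])
            return false;   /* thread i may still hold old references */
    return true;
}
```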

Reclaiming obsolete versions
Concurrent reclamation: every thread reclaims its own log
Cache friendly
Scales with increasing number of threads

Triggering garbage collection
Ideally, triggers should be workload-agnostic
Two conditional triggers, or watermarks: the capacity watermark and the dereference watermark

Writers block when the log is full
[Figure: Thread 1: "Start GC. I will wait." Per-thread log fully used]

In MV-RLU, a writer consumes log space as it performs updates. Eventually the log fills up, and the thread must wait for garbage collection to reclaim log space.

Writers block when the log is full
[Figure: Thread 1: "Log about to get full. Start GC." Log utilization crosses the capacity watermark]

Instead, if the thread triggers garbage collection when log utilization crosses a certain threshold, it can continue doing useful work, and GC can hopefully reclaim log space before the log fills up. This threshold is called the capacity watermark.

Prevent the log from filling up using the capacity watermark
Writers block when the log is full
To prevent blocking, start GC when the log is almost full
Capacity watermark: trigger GC when a thread's log is about to get full
Works well in write-heavy workloads
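The trigger can be sketched as follows; the 80% threshold and field names are illustrative assumptions, not values from the paper. A writer checks its log utilization after each append and requests GC once it crosses the watermark, well before the log is actually full.

```c
#include <assert.h>
#include <stdbool.h>

#define LOG_CAPACITY 1024u
#define CAPACITY_WATERMARK_PCT 80u   /* assumed threshold */

struct thread_log {
    unsigned used;       /* bytes currently consumed */
    bool gc_requested;   /* set when the watermark is crossed */
};

/* Append nbytes of update data; request GC at the watermark but
 * keep working, instead of blocking when the log is completely full. */
static void log_append(struct thread_log *log, unsigned nbytes) {
    log->used += nbytes;
    if (log->used * 100 >= LOG_CAPACITY * CAPACITY_WATERMARK_PCT)
        log->gc_requested = true;
}
```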

GC Example
[Figure: a grace-period detector thread reports the last grace period ("Okay! Here is the last GP") to Threads 1-3; a thread whose log crosses the capacity watermark requests GC ("I need GC"), and each thread reclaims its own log ("Done")]

Capacity watermark is not sufficient
The capacity watermark will not be triggered in read-mostly workloads
[Figure: linked list of master objects A → B → C → D, with one copy object A']

The capacity watermark is not sufficient, because it will not be triggered in a read-mostly workload. To see why, consider a linked list protected by MV-RLU. If no object has a version, readers just read the master objects to traverse the list. But if any object has a version chain, readers must access that version chain to traverse the list.

Capacity watermark is not sufficient
[Figure: every object A-D has a copy object A'-D'; garbage collection writes the copies back to the master objects]

In the worst case, most objects have a version, so every object read requires access to a version chain. This pointer chasing slows down read performance. To alleviate it, we write copy objects back to their master objects during the garbage-collection cycle.


Dereference watermark to reduce pointer chasing
Readers check whether they are accessing version chains too often; if so, they trigger GC
The combination of the capacity watermark and the dereference watermark makes GC triggering workload-agnostic

So, to reduce pointer chasing, we use the dereference watermark to trigger garbage collection, which performs the write-back: readers check whether they are accessing version chains too often.
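The check can be sketched as follows; the counters and threshold are illustrative assumptions, not the paper's exact mechanism. Readers keep a running ratio of accesses that had to chase a version chain and request GC write-back when that ratio crosses a threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-thread read statistics: total accesses vs. accesses that had
 * to traverse a version chain instead of reading the master object. */
struct deref_stats { unsigned long total, chased; };

static void record_access(struct deref_stats *s, bool used_chain) {
    s->total++;
    if (used_chain)
        s->chased++;
}

/* Trigger GC write-back when more than pct percent of reads chased
 * a version chain; integer math avoids floating point. */
static bool deref_watermark_hit(const struct deref_stats *s, unsigned pct) {
    return s->total > 0 && s->chased * 100 > s->total * (unsigned long)pct;
}
```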

More details in the paper
Scalable timestamp allocation
Version management
Proof of correctness
Implementation details

Outline Background Design Evaluation Conclusion

Evaluation Questions
Does MV-RLU scale?
What is the impact of our proposed approaches?
What is its impact on real-world workloads?

Evaluation Platform 8 socket, 448 core server Intel Xeon Platinum 8180 337 GB Memory Linux 4.17.3

Microbenchmark: 150x speedup over RLU (higher is better)
1K elements, load factor: 1

Factor Analysis
[Figure: bar chart in million operations per second; successive techniques contribute roughly 2x, 4x, and 2x improvements; without them, GC is the bottleneck]

Key-Value Benchmark: 8x speedup over RLU

Conclusion
MV-RLU: scaling RLU through multi-versioning
Multi-versioning removes synchronous waiting
Concurrent and autonomous garbage collection
MV-RLU shows unparalleled performance across a variety of benchmarks
https://github.com/cosmoss-vt/mv-rlu

Multi-pointer update
[Figure: linked list A → B → C → D; copies B' and D' are created in the per-thread version log with timestamp ∞ while uncommitted, then both stamped 55 at commit]

Snapshot Isolation
MV-RLU defaults to snapshot isolation and can be made serializable (serializable snapshot isolation)
Snapshot isolation works well for any application that can tolerate stale reads
RCU is widely used, which means many applications are candidates for MV-RLU

Log size
Memory in computer systems is increasing
Persistent memory can increase total main memory significantly
Log size is a tuning parameter