Read-Log-Update A Lightweight Synchronization Mechanism for Concurrent Programming Alexander Matveev (MIT) Nir Shavit (MIT and TAU) Pascal Felber (UNINE) Patrick Marlier (UNINE)

Multicore Revolution
– Need concurrent data-structures
– Need new programming frameworks for concurrency

The Key to Performance in Concurrent Data-Structures
– Unsynchronized traversals: sequences of reads without locks, memory fences or writes (90% of the time is spent traversing data)
– Multi-location atomic updates: hide race conditions from programmers

RCU
Read-Copy-Update (RCU), introduced by McKenney, is a programming framework that provides built-in support for unsynchronized traversals.

RCU
Pros:
– Very efficient (no overhead for readers)
– Popular: the Linux kernel has 6,500+ RCU calls
Cons:
– Hard to program (in non-trivial cases)
– Allows only single-pointer updates
RCU supports unsynchronized traversals but not multi-location atomic updates.

This Paper — RLU
Read-Log-Update (RLU), an extension to RCU that provides both unsynchronized traversals and multi-location atomic updates within a single framework
– Key benefit: simplifies RCU programming
– Key challenge: preserving RCU efficiency

RCU Overview: Key Idea
1. To modify objects: duplicate them and modify the copies → provides unsynchronized traversals
2. To commit: use a single pointer update to make the new copies reachable and the old copies unreachable → must happen all at once!
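
To make the copy-then-publish pattern concrete, here is a minimal sketch (not from the slides; node_t, the writer lock, and rcu_update_node are illustrative names, and rcu_assign_ptr follows the convention used in the list example later in this deck):

typedef struct node { int val; struct node *next; } node_t;

/* Sketch: replace the node reachable through *slot with an updated copy.
   Assumes writers are already serialized (e.g. by a writer lock) and that
   malloc comes from <stdlib.h>. */
void rcu_update_node(node_t **slot, int new_val) {
    node_t *old  = *slot;
    node_t *copy = malloc(sizeof(*copy));
    *copy = *old;                    /* (1) duplicate the object           */
    copy->val = new_val;             /*     and modify the private copy    */
    rcu_assign_ptr(slot, copy);      /* (2) single pointer update: the copy */
                                     /*     is now reachable, old is not    */
    /* 'old' cannot be freed yet: concurrent readers may still hold it
       (see the next slides). */
}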

RCU Key Idea
[Diagram: writer P takes the writer lock and performs Update(C) on the list A → B → C → D while reader Q performs Lookup(C).]
(1) Duplicate C as C'
(2) Single pointer update: make C' reachable and C unreachable
Open question: how to deallocate C?

How to free objects?
RCU-Epoch: a time interval after which it is safe to deallocate objects
– Waits for all current read operations to finish
RCU-Duplication + RCU-Epoch provide:
– Unsynchronized traversals AND
– Memory reclamation
→ This makes RCU efficient and practical
→ But RCU allows only single-pointer updates
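
Continuing the sketch above (names hedged: synchronize_rcu is the usual name for waiting out an RCU epoch, but the exact call varies between RCU implementations), the old object is reclaimed only after the epoch, while readers stay completely unsynchronized:

/* Sketch: reclaim only after every reader that might still see 'old'
   has finished its read-side critical section. */
void rcu_retire_node(node_t *old) {
    synchronize_rcu();    /* RCU-epoch: wait for all current readers */
    free(old);            /* now no reader can still reference 'old' */
}

/* Readers: plain pointer-chasing, no locks, fences or writes. */
int rcu_contains(node_t *head, int key) {
    int found = 0;
    rcu_reader_lock();
    for (node_t *n = rcu_dereference(head); n != NULL;
         n = rcu_dereference(n->next)) {
        if (n->val == key) { found = 1; break; }
    }
    rcu_reader_unlock();
    return found;
}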

The Problem: RCU Single Pointer Updates
[Diagram: writer P performs Update(even nodes) on the list A → B → C → D → E, creating copies B' and D', while reader Q performs Lookup(even nodes).]
Q sees B' but not D': an inconsistent mix.

RCU is Complex
Applying RCU beyond a linked list is worth a paper in a top conference:
– RCU resizable hash tables (Triplett, McKenney, Walpole => USENIX ATC-11)
– RCU balanced trees (Clements, Kaashoek, Zeldovich => ASPLOS-12)
– RCU Citrus trees (Arbel, Attiya => PODC-14; Arbel, Morrison => PPoPP-15)

Our Work
Read-Log-Update (RLU), an extension to RCU that adds support for multi-pointer atomic updates
Key idea: use a global clock + per-thread logs

AB CD PQ D’ E B’ A log/buffer to store copies (per-thread) Log RLU header Global Clock (22) Local Clock (22) Write Clock ( ∞ ) Read on start Used on commit RLU Clocks and Logs
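
The pieces in this diagram can be summarized as rough C declarations (a sketch based only on the slide; the type and field names are illustrative, not the paper's actual definitions):

long rlu_global_clock;                 /* shared version clock (22 above)        */

typedef struct rlu_thread rlu_thread_t;

typedef struct rlu_obj_header {        /* RLU header attached to each object     */
    rlu_thread_t *locked_by;           /* writer currently holding the object    */
    void *copy;                        /* that writer's copy in its log          */
} rlu_obj_header_t;

struct rlu_thread {
    long  local_clock;                 /* snapshot of the global clock, read on start */
    long  write_clock;                 /* normally INF; set on commit                 */
    char *write_log;                   /* per-thread log/buffer storing the copies    */
    char *write_log_swap;              /* second log, swapped in after a commit       */
};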

RLU Commit – Phase 1
[Diagram: writer P commits its log of copies while reader Q is active and a new reader Z starts.]
1. P updates the clocks: Global Clock 22 → 23, P's Write Clock ∞ → 23
2. P executes an RCU-epoch: waits for Q (Local Clock 22) to finish
A reader steals a copy when Local Clock >= Write Clock:
– Z (Local Clock 23) will read only new objects
– Q (Local Clock 22) will read only old objects
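
With these clocks, the read path can be sketched as follows (illustrative only; get_header is a hypothetical helper and the real rlu_dereference differs in details):

/* Sketch: decide whether a reader sees the original object or must
   "steal" the writer's copy from its log. */
void *rlu_dereference_sketch(rlu_thread_t *self, void *obj) {
    rlu_obj_header_t *h = get_header(obj);        /* hypothetical helper */
    rlu_thread_t *writer = h->locked_by;

    if (writer == NULL)
        return obj;                               /* unlocked: read the object   */
    if (writer == self)
        return h->copy;                           /* writer reads its own copy   */
    if (self->local_clock >= writer->write_clock)
        return h->copy;                           /* steal: Z reads new objects  */
    return obj;                                   /* too early: Q reads old ones */
}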

RLU Commit – Phase 2
[Diagram: P's copies from the log overwrite the original nodes in the list.]
3. P writes back the log
4. P resets its write clock (back to ∞)
5. P swaps logs (the current log is safe for re-use after the next commit)
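
Putting the two phases together, the writer's commit can be sketched as follows (simplified; wait_for_readers, write_back_log and swap_logs are hypothetical helpers standing in for the steps named on the slides):

/* Sketch of the two-phase RLU commit performed by writer 'self'.
   LONG_MAX (from <limits.h>) plays the role of INF. */
void rlu_commit_sketch(rlu_thread_t *self) {
    /* Phase 1 */
    self->write_clock = rlu_global_clock + 1;   /* 1. update clocks: new readers    */
    rlu_global_clock  = self->write_clock;      /*    will now steal the copies     */
    wait_for_readers(self);                     /* 2. RCU-epoch: wait for readers   */
                                                /*    that started on the old clock */
    /* Phase 2 */
    write_back_log(self);                       /* 3. copies overwrite the originals  */
    self->write_clock = LONG_MAX;               /* 4. reset write clock (back to INF) */
    swap_logs(self);                            /* 5. current log is safe to re-use   */
                                                /*    after the next commit           */
}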

RLU Programming
The RLU API extends the RCU API:
– rcu_dereference(..) / rlu_dereference(..)
– rcu_assign_pointer(..) / rlu_assign_pointer(..)
– …
RLU adds a new call: rlu_try_lock(..)
– To modify an object => lock it
– Provides multi-location atomic updates
– Hides object duplications and manipulations

Programming Example: List Delete with a Mutex

void RLU_list_delete(list_t *list, int val) {
    spin_lock(&writer_lock);                  /* acquire mutex and start */
    rlu_reader_lock();
    prev = rlu_dereference(list->head);       /* find node */
    curr = rlu_dereference(prev->next);
    while (curr->val < val) {
        prev = curr;
        curr = rlu_dereference(prev->next);
    }
    next = rlu_dereference(curr->next);
    rlu_try_lock(&prev);                      /* delete node */
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();                      /* finish and release mutex */
    spin_unlock(&writer_lock);
}

How can we eliminate the mutex?

RCU + Fine-Grained Locks
[Diagram: thread P performs Insert(D) into the list A → B → C → E while thread Q performs Delete(C).]
Locking "prev" and "curr" is not enough: thread Q may delete or insert nodes concurrently.
Programmers need to add custom post-lock validations. In this case, we need:
(1) C.next == E
(2) C is reachable from the head

List Delete without a Mutex

RCU version (needs custom post-lock validations):

void RCU_list_delete(list_t *list, int val) {
restart:
    rcu_reader_lock();
    ... find "prev" and "curr" ...
    /* Lock "prev" and "curr" */
    if (!try_lock(prev) || !try_lock(curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* Custom post-lock validations */
    if ((curr->is_invalid == 1) || (prev->is_invalid == 1) ||
        (rcu_dereference(prev->next) != curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* Delete "curr" and finish */
    next = rcu_dereference(curr->next);
    rcu_assign_ptr(&(prev->next), next);
    curr->is_invalid = 1;
    memory_fence();
    unlock(prev);
    unlock(curr);
    rcu_reader_unlock();
    rcu_free(curr);
}

RLU version (no post-lock validations necessary!):

void RLU_list_delete(list_t *list, int val) {
restart:
    rlu_reader_lock();
    ... find "prev" and "curr" ...
    /* Lock "prev" and "curr" */
    if (!rlu_try_lock(prev) || !rlu_try_lock(curr)) {
        rlu_reader_unlock();
        goto restart;
    }
    /* Delete "curr" and finish */
    next = rlu_dereference(curr->next);
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();
}

Performance
RLU is optimized for read-dominated workloads (like RCU):
– RLU object-lock checks are fast: locks are co-located with the objects, and stealing is usually rare
– RLU writers are more expensive than RCU writers, but this is not significant for read-dominated workloads
Tested in userspace and in the kernel.

Userspace Hash Table and Linked-List (Kernel is similar)

Applying RLU to Kyoto CacheDB
Kyoto CacheDB uses:
– A reader-writer lock
– A per-slot lock (the DB is broken into slots)
The reader-writer lock is a serial bottleneck → use RLU to eliminate this lock.
It was easy to apply:
– Use the slot locks to serialize writers to the same slot
– Simply lock each object before modification
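
A rough before/after of a write path (purely illustrative; the real Kyoto Cabinet code differs, and db_t, slot_t, record_t, hash and slot_lookup are hypothetical names):

/* Before: every writer funnels through the global reader-writer lock. */
void db_update_before(db_t *db, int key, int new_val) {
    rw_write_lock(&db->rwlock);                 /* serial bottleneck            */
    slot_t *s = &db->slots[hash(key) % db->nslots];
    spin_lock(&s->lock);
    record_t *rec = slot_lookup(s, key);        /* hypothetical lookup          */
    rec->val = new_val;
    spin_unlock(&s->lock);
    rw_write_unlock(&db->rwlock);
}

/* After: the reader-writer lock is gone; writers are serialized per slot
   and RLU-lock each object they modify, readers just use rlu_reader_lock. */
void db_update_after(db_t *db, int key, int new_val) {
    slot_t *s = &db->slots[hash(key) % db->nslots];
    spin_lock(&s->lock);                        /* serialize writers to the slot    */
    rlu_reader_lock();
    record_t *rec = slot_lookup(s, key);
    rlu_try_lock(&rec);                         /* lock the object before modifying */
    rec->val = new_val;                         /* modifies the log copy            */
    rlu_reader_unlock();                        /* commit                           */
    spin_unlock(&s->lock);
}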

RLU and Original Kyoto CacheDB

Conclusion
RLU adds multi-pointer atomic updates to RCU while maintaining efficiency, both in userspace and in the kernel.
Much more in the paper:
– Optimizations (deferral)
– Benchmarks (kernel, Citrus, resizable hash table)
RLU is available as open source (MIT license).

Thank You

Appendix
1. RLU-Defer
2. Kernel Tests
3. RCU vs. RLU resizable hash table

RLU-Defer
RLU writers are slower because they need to execute wait-for-readers; RLU-Defer reduces these costs (by 10x).
– Note that wait-for-readers writes back and unlocks the objects.
– But unlocking is only needed on a write-write conflict, so RLU-Defer executes wait-for-readers only when a write-write conflict occurs.
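
A sketch of the idea (based only on the description above; the helpers and the exact conflict-handling policy are assumptions, not the paper's code):

/* Sketch: with RLU-Defer a writer keeps its objects locked and its copies
   in the log after commit, instead of paying for wait-for-readers every time. */
int rlu_try_lock_deferred(rlu_thread_t *self, void **obj_ptr) {
    rlu_obj_header_t *h = get_header(*obj_ptr);  /* hypothetical helper */
    if (h->locked_by != NULL && h->locked_by != self) {
        /* Write-write conflict detected: only now execute wait-for-readers,
           write back the deferred log(s) and unlock the objects. */
        rlu_sync_deferred_writes(self);          /* hypothetical helper */
        return 0;                                /* caller restarts the operation */
    }
    /* No conflict: lock and duplicate the object as usual. */
    return rlu_try_lock(obj_ptr);
}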

RLU-Defer
The benefit of RLU-Defer is significant at high thread counts.

Kernel Tests

Resizable Hash Table Code Comparison

Resizable Hash Table Performance