Concurrent Data Structures Concurrent Algorithms 2016

Slides:

Advertisements

Similar presentations

Wait-Free Linked-Lists Shahar Timnat, Anastasia Braginsky, Alex Kogan, Erez Petrank Technion, Israel Presented by Shahar Timnat 469-+

Advertisements

Hazard Pointers: Safe Memory Reclamation of Lock-Free Objects

Wait-Free Queues with Multiple Enqueuers and Dequeuers

Transaction Management: Concurrency Control CS634 Class 17, Apr 7, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.

Enabling Speculative Parallelization via Merge Semantics in STMs Kaushik Ravichandran Santosh Pande College.

Concurrent programming: From theory to practice Concurrent Algorithms 2014 Vasileios Trigonakis Georgios Chatzopoulos.

Maged M. Michael, “Hazard Pointers: Safe Memory Reclamation for Lock- Free Objects” Presentation Robert T. Bauer.

CS510 Concurrent Systems Jonathan Walpole. What is RCU, Fundamentally? Paul McKenney and Jonathan Walpole.

Lock-free Cuckoo Hashing Nhan Nguyen & Philippas Tsigas ICDCS 2014 Distributed Computing and Systems Chalmers University of Technology Gothenburg, Sweden.

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

CS510 Advanced OS Seminar Class 10 A Methodology for Implementing Highly Concurrent Data Objects by Maurice Herlihy.

CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.

What is RCU, fundamentally? Sri Ramkrishna. Intro RCU stands for Read Copy Update  A synchronization method that allows reads to occur concurrently with.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

Relativistic Red Black Trees. Relativistic Programming Concurrent reading and writing improves performance and scalability – concurrent readers may disagree.

A Simple Optimistic skip-list Algorithm Maurice Herlihy Brown University & Sun Microsystems Laboratories Yossi Lev Brown University & Sun Microsystems.

Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects Maged M. Michael Presented by Abdulai Sei.

The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux Guniguntala et al.

CS510 Concurrent Systems Jonathan Walpole. RCU Usage in Linux.

SkipLists and Balanced Search The Art Of MultiProcessor Programming Maurice Herlihy & Nir Shavit Chapter 14 Avi Kozokin.

An algorithm of Lock-free extensible hash table Yi Feng.

Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects MAGED M. MICHAEL PRESENTED BY NURIT MOSCOVICI ADVANCED TOPICS IN CONCURRENT PROGRAMMING,

Read-Log-Update A Lightweight Synchronization Mechanism for Concurrent Programming Alexander Matveev (MIT) Nir Shavit (MIT and TAU) Pascal Felber (UNINE)

C++ Programming: Program Design Including Data Structures, Fourth Edition Chapter 17: Linked Lists.

C++ Programming: From Problem Analysis to Program Design, Fourth Edition Chapter 18: Linked Lists.

Read-Copy-Update Synchronization in the Linux Kernel 1 David Ferry, Chris Gill CSE 522S - Advanced Operating Systems Washington University in St. Louis.

Lecture 9 : Concurrent Data Structures Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

6.852 Lecture 21 ● Techniques for highly concurrent objects – coarse-grained mutual exclusion – read/write locking – fine-grained locking (mutex and read/write)

File-System Management

C++ Programming:. Program Design Including

Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

CHP - 9 File Structures.

Top 50 Data Structures Interview Questions

Hazard Pointers C++ Memory Ordering Issues

Review Deleting an Element from a Linked List Deletion involves:

Combining HTM and RCU to Implement Highly Efficient Balanced Binary Search Trees Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas and Nectarios.

Håkan Sundell Philippas Tsigas

Lock-Free Linked Lists Using Compare-and-Swap

Data Structures Interview / VIVA Questions and Answers

A Lock-Free Algorithm for Concurrent Bags

Expander: Lock-free Cache for a Concurrent Data Structure

CS510 Concurrent Systems Jonathan Walpole.

Practical Non-blocking Unordered Lists

Anders Gidenstam Håkan Sundell Philippas Tsigas

Map interface Empty() - return true if the map is empty; else return false Size() - return the number of elements in the map Find(key) - if there is an.

Arrays and Linked Lists

Concurrent Data Structures Concurrent Algorithms 2017

CS510 Concurrent Systems Jonathan Walpole.

Kernel Synchronization II

KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures

CS510 - Portland State University

Multicore programming

Hash Tables Chapter 12 discusses several ways of storing information in an array, and later searching for the information. Hash tables are a common.

Lecture 22: Consistency Models, TM

Multicore programming

Software Transactional Memory Should Not be Obstruction-Free

A Concurrent Lock-Free Priority Queue for Multi-Thread Systems

CS 3410, Spring 2014 Computer Science Cornell University

Kernel Synchronization II

Multicore programming

Hash Tables Chapter 12 discusses several ways of storing information in an array, and later searching for the information. Hash tables are a common.

Hash Tables Chapter 12 discusses several ways of storing information in an array, and later searching for the information. Hash tables are a common.

CS510 Advanced Topics in Concurrency

File Organization.

CS510 Concurrent Systems Jonathan Walpole.

File System Performance

MV-RLU: Scaling Read-Log-Update with Multi-Versioning

Presentation transcript:

Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis)

Constructs for efficiently storing and retrieving data Data Structures (DSs) Constructs for efficiently storing and retrieving data Different types: lists, hash tables, trees, queues, … Accessed through the DS interface Depends on the DS type, but always includes Store an element Retrieve an element Element Set: just one value Map: key/value pair

Concurrent Data Structures (CDSs) Concurrently accessed by multiple threads Through the CDS interface  linearizable operations! Really important on multi-cores Used in most software systems

What do we care about in practice? Progress of individual operations - sometimes More often: Number of operations per second (throughput) The evolution of throughput as we increase the number of threads (scalability)

DS Example: Linked List delete(6) 1 2 3 5 6 8 4 insert(4) A sequence of elements (nodes) Interface search (aka contains) insert remove (aka delete) struct node { value_t value; struct node* next; };

Search Data Structures Interface search insert remove Semantics read-only read-write search(k) k parse(k) update(k) modify(k) updates

Optimistic vs. Pessimistic Concurrency 20-core Xeon 1024 elements traverse optimistic lock traverse pessimistic (Lesson1) Optimistic concurrency is the only way to get scalability

The Two Problems in Optimistic Concurrency Concurrency Control How threads synchronize their writes to the shared memory (e.g., nodes) Locks CAS Transactional memory Memory Reclamation How and when threads free and reuse the shared memory (e.g., nodes) Garbage collectors Hazard pointers RCU Quiescent states

Tools for Optimistic Concurrency Control (OCC) RCU: slow in the presence of updates (also a memory reclamation scheme) STM: slow in general HTM: not ubiquitous, not very fast (yet) Wait-free algorithms: slow in general (Optimistic) Lock-free algorithms:  Optimistic lock-based algorithms:  We either need a lock-free or an optimistic lock-based algorithm

Parenthesis: Target platform 2-socket Intel Xeon E5-2680 v2 Ivy Bridge 20 cores @ 2.8 GHz, 40 hyper-threads 25 MB LLC (per socket) 256GB RAM c c

Concurrent Linked Lists – 5% Updates 1024 elements 5% updates Wait-free algorithm is slow 

Optimistic Concurrency in Data Structures operation optimistic prepare (non-synchronized) perform (synchronized) Pattern validate (synchronized) optimistic prepare perform failed detect conflicting concurrent operations find insertion spot validate Example linked list insert insert Validation plays a key role in concurrent data structures

Validation in Concurrent Data Structures Lock-free: atomic operations marking pointers, flags, helping, … Lock-based: lock  validate flags, pointer reversal, parsing twice, … validate & perform (atomic ops) optimistic prepare failed lock optimistic prepare perform unlock validate failed Validation is what differentiates algorithms

Let’s design two concurrent linked lists: A lock-free and a lock-based

Lock-free Sorted Linked List: Naïve find spot return Search find modification spot CAS Insert find modification spot CAS Delete Is this a correct (linearizable) linked list?

Lock-free Sorted Linked List: Naïve – Incorrect What is the problem? Insert involves one existing node; Delete involves two existing nodes P1:CAS P1: find modification spot P0: find modification spot P0:CAS P0: Insert(x) P1: Delete(y) y x Lost update! How can we fix the problem?

Lock-free Sorted Linked List: Fix Idea! To delete a node, make it unusable first… Mark it for deletion so that You fail marking if someone changes next pointer; An insertion fails if the predecessor node is marked. In other words: delete in two steps Mark for deletion; and then Physical deletion 2. CAS(remove) 1. CAS(mark) find modification spot Delete(y) y

1. Failing Deletion (Marking) P1:CAS(mark)  false P1: find modification spot P0: find modification spot P0:CAS P0: Insert(x) P1: Delete(y) y x Upon failure  restart the operation Restarting is part of “all” state-of-the-art-data structures

1. Failing Insertion due to Marked Node P1:CAS(remove) P1:CAS(mark) P1: find modification spot P0: find modification spot P0:CAS  false P0: Insert(x) P1: Delete(y) y Upon failure  restart the operation Restarting is part of “all” state-of-the-art-data structures How can we implement marking?

Implementing Marking (C Style) Pointers in 64 bit architectures Word aligned - 8 bit aligned! next pointer node boolean mark(node_t* n) uintptr_t unmarked = n->next & ~0x1L; uintptr_t marked = n->next | 0x1L; return CAS(&n->next, unmarked, marked) == unmarked;

Lock-free List: Putting Everything Together Traversal: traverse (requires unmarking nodes) Search: traverse Insert: traverse  CAS to insert Delete: traverse  CAS to mark  CAS to remove Garbage (marked) nodes Cleanup while traversing (helping in this course’s terms) What happers if this CAS fails?? A pragmatic implementation of lock-free linked lists

What is not Perfect with the Lock-free List? Garbage nodes Increase path length; and Increase complexity if (is_marked_node(n)) … Unmarking every single pointer Increase complexity curr = get_unmark_ref(curr->next) Can we simplify the design with locks?

Lock-based Sorted Linked List: Naïve find spot return Search find modification spot lock Insert lock(target) find modification spot lock(predecessor) Delete Is this a correct (linearizable) linked list?

Lock-based List: Validate After Locking find spot return Search validate !pred->marked && pred->next did not change find modification spot lock Insert mark(curr) lock(curr) find modification spot lock(predecessor) Delete !pred->marked && !curr->marked && pred->next did not change

Concurrent Linked Lists – 0% updates 1024 elements 0% updates Just because the lock-based is not unmarking! (Lesson2) Sequential complexity matters  Simplicity 

Optimistic Concurrency Control: Summary Lock-free: atomic operations marking pointers, flags, helping, … Lock-based: lock  validate flags, pointer reversal, parsing twice, … validate & perform (atomic ops) optimistic prepare failed lock optimistic prepare perform unlock validate failed

Word of caution: lock-based algorithms Search data structures  Queues, stacks, counters, ... 

Memory Reclamation: OCC’s Side Effect Delete a node  free and reuse this memory Subset of the garbage collection problem Who is accessing that memory? Can we just directly do free(node)? P0: pointer on x P0: search x P1: delete(x) P1: free(x) We cannot directly free the memory! Need memory reclamation

Memory Reclamation Schemes Reference counting Count how many references exist on a node Hazard pointers Tell to others what exactly you are reading Quiescent states Wait until it is certain than no one holds references Read-Copy Update (RCU) Quiescent states – The extreme approach

1. Reference Counting Pointer + Counter rc_pointer Pointer + Counter Dereference: rc_dereference(rc_pointer* rcp) atomic_increment(&rcp->counter); return *pointer; “Release”: rc_release(rc_pointer* rcp) atomic_decrement(&rcp->counter); Free: iff counter = 0 counter pointer (Lesson3) Readers cannot write on the shared nodes Bad bad bad idea: Readers write on shared nodes!

Depends on the data structure type 2. Hazard pointers (1/2) Reference counter  property of the node Hazard pointer  property of the thread A Multi-Reader Single-Writer (MRSW) register Protect: hp_protect(node* n) hazard_pointer* hp = hp_get(n); hp->address = n; Release: hp_release(hazard_pointer* hp) hp->address = NULL; hazard_pointer address Depends on the data structure type

O(data structure size) hazard pointers hp_protect Free memory x Collect all hazard pointers Check if x is accessed by any thread If yes, buffer the free for later If not, free the memory Buffering the free is implementation specific + lock-free - not scalable hazard_pointer address O(data structure size) hazard pointers hp_protect

The data written in qs_[un]safe must be local-mostly 3. Quiescent States Keep the memory until it is certain it is not accessed Can be implemented in various ways Example implementation search / insert / delete qs_unsafe(); I’m accessing shared data … qs_safe(); I’m not accessing shared data return … The data written in qs_[un]safe must be local-mostly

3. Quiescent States: qs_[un]safe Implementation List of “thread-local” (mostly) counters qs_state (initialized to 0) even : in safe mode (not accessing shared data) odd : in unsafe mode qs_safe / qs_unsafe qs_state++; (id = 0) qs_state (id = x) qs_state (id = y) qs_state How do we free memory?

3. Quiescent States: Freeing memory List of “thread-local” (mostly) counters Upon qs_free: Timestamp memory (vector_ts) Can do this for batches of frees Safe to reuse the memory vector_tsnow >> vector_tsmem (id = 0) qs_state (id = x) qs_state (id = y) qs_state for t in thread_ids if (vts_mem[t] is odd && vts_now[t] = vts_mem[t]) return false; return true; How do the schemes we have seen perform?

Hazard Pointers vs. Quiescent States 1024 elements 0% updates Quiescent-state reclamation is as fast as it gets

4. Read-Copy Update (RCU) Quiescent states at their extreme Deletions wait all readers to reach a safe state Introduced in the Linux kernel in ~2002 More than 10000 uses in the kernel! (Example) Interface rcu_read_lock (= qs_unsafe) rcu_read_unlock (= qs_safe) synchronize_rcu  wait all readers

+ read-only workloads - bad for writes 4. Using RCU Search / Traverse rcu_read_lock() … rcu_read_unlock() + simple + read-only workloads - bad for writes Delete … physical deletion of x synchronize_rcu() free(x)

Memory Reclamation: Summary How and when to reuse freed memory Many techniques, no silver bullet Reference counting Hazard pointers Quiescent states Read-Copy Update (RCU)

Concurrent data structures are very important Summary Concurrent data structures are very important Optimistic concurrency necessary for scalability Only recently a lot of active work for CDSs Memory reclamation is Inherent to optimistic concurrency; A difficult problem; A potential performance/scalability bottleneck