Wait-Free Linked-Lists Shahar Timnat, Anastasia Braginsky, Alex Kogan, Erez Petrank. Technion, Israel. Presented by Shahar Timnat

Our Contribution A fast, wait-free linked-list The first wait-free list fast enough to be used in practice

Agenda What is a wait-free linked-list? Related work and existing tools Wait-Free Linked-List design Performance

Concurrent Data Structures Allow several threads to read or modify the data structure simultaneously. Demand is increasing due to highly parallel systems.

Progress Guarantees Obstruction Free – a thread running exclusively will make progress. Lock Free – at least one of the running threads will make progress. Wait Free – every thread that gets the CPU will make progress.

Wait Free Algorithms Provide the strongest progress guarantee. Always desirable, particularly in real-time systems. Relatively rare, hard to design, and typically slower.

The Linked List Interface Following the traditional choice: a sorted list-based set of integers. insert(int x); delete(int x); contains(int x);
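
As a rough Java sketch (the interface name WaitFreeIntSet is an illustrative assumption; the slides only name the three operations), the set interface looks like this:

// A sorted set of integers with the usual set semantics:
// each operation reports whether it changed / found anything.
interface WaitFreeIntSet {
    boolean insert(int x);     // true iff x was absent and has been added
    boolean delete(int x);     // true iff x was present and has been removed
    boolean contains(int x);   // true iff x is currently in the set
}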

Prior Wait-Free Lists Only universal constructions. Non-scalable (by nature ?). Achieve good complexity, but poor performance. The state-of-the-art construction (Chuong, Ellen, Ramachandran) significantly under-performs our construction.

Our wait-free versus a universal construction

Linked-Lists with Progress Guarantee No practical wait-free linked-lists available. Lock-free linked-lists exist, most notably Harris's linked-list.

Existing Lock-Free List (by Harris) Deletion in two steps Logical: Mark the next field using a CAS Physical: Remove the node

Existing Lock-Free List (by Harris) Use the least significant bit of each next field as a mark bit. The mark bit signals that a node is logically deleted. The node's next field cannot be changed (the CAS will fail) if it is logically deleted.
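
In Java the mark bit is commonly modeled with AtomicMarkableReference instead of stealing a pointer bit. A minimal sketch of the marking step, with illustrative names (Node, logicallyDelete) that are not the paper's code:

import java.util.concurrent.atomic.AtomicMarkableReference;

class Node {
    final int key;
    // Holds the successor reference and the logical-deletion mark together,
    // standing in for the stolen least-significant pointer bit.
    final AtomicMarkableReference<Node> next;

    Node(int key, Node succ) {
        this.key = key;
        this.next = new AtomicMarkableReference<>(succ, false);
    }
}

class HarrisStyleMarking {
    // Logical deletion: set the mark bit on victim.next.
    // Once marked, any CAS on victim.next fails, so the node is frozen;
    // physical removal (unlinking at the predecessor) happens later.
    static boolean logicallyDelete(Node victim) {
        boolean[] marked = { false };
        while (true) {
            Node succ = victim.next.get(marked);   // read reference and mark together
            if (marked[0]) return false;           // someone already deleted it
            if (victim.next.compareAndSet(succ, succ, false, true)) return true;
        }
    }
}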

Help Mechanism A common technique to achieve wait-freedom. Each thread declares the operation it desires in a designated state array. Many threads may attempt to execute it.
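
A rough Java sketch of such a state array, reusing the Node class from the sketch above; the descriptor layout and the names OpDesc, announce are illustrative assumptions, not the paper's exact record:

import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical operation descriptor published by a thread that wants help.
class OpDesc {
    enum Status { PENDING, SUCCESS, FAILURE }
    enum Type { INSERT, DELETE }

    final Status status;
    final Type type;
    final int key;
    final Node node;   // the node to insert, when type == INSERT

    OpDesc(Status status, Type type, int key, Node node) {
        this.status = status; this.type = type; this.key = key; this.node = node;
    }
}

class HelpState {
    // One descriptor slot per thread. Helpers scan the array and try to
    // complete any PENDING operation; descriptors are immutable and are
    // swapped by CAS, so all helpers agree on the reported result.
    final AtomicReferenceArray<OpDesc> state;

    HelpState(int numThreads) { state = new AtomicReferenceArray<>(numThreads); }

    void announce(int tid, OpDesc op) { state.set(tid, op); }

    OpDesc get(int tid) { return state.get(tid); }
}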

Help Mechanism - Difficulties Multiple threads should be able to work concurrently on the same operation Many potential races Difficult to design Usually slower

Help Mechanism – Usually slower Much more synchronization needed At times many threads are attempting to help the same operation (Can we use help without suffering slower execution ?)

Means of Using Help Serialize everything (non-scalable) Eager help Delayed help Fast-Path-Slow-Path

Complication: Deletion Owning T1, T2 both attempt delete(6). Both declare their operation in the state array. T3 sees T1's declaration and tries to help it, while T4 helps T2.

Complication: Deletion Owning If both helpers T3, T4 go to sleep after the mark was done, which thread (T1 or T2) should return true and which false?

Means of Using Help Serialize everything (non-scalable) Eager help Delayed help Fast-Path-Slow-Path With serialized help, this could not happen!

Solution: use a success bit Each node holds an extra success bit (initially 0). Potential owners compete to CAS it from 0 to 1 (no help in this part). Note that the node is deleted before it is decided which thread owns its deletion.

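A minimal Java sketch of this ownership race; the field successBit and the method claimDeletion are illustrative names for the idea, not the paper's code:

import java.util.concurrent.atomic.AtomicBoolean;

class DeletableNode {
    final int key;
    // Extra per-node flag, initially false (0): the deleter that CASes it
    // to true (1) first owns the deletion and returns true to its caller.
    final AtomicBoolean successBit = new AtomicBoolean(false);

    DeletableNode(int key) { this.key = key; }
}

class DeletionOwnership {
    // Called by every thread whose delete(key) targeted this (already marked)
    // node. Exactly one caller wins the CAS; all others lose and return false.
    static boolean claimDeletion(DeletableNode victim) {
        return victim.successBit.compareAndSet(false, true);
    }
}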
Means of Using Help Serialize everything (non-scalable) Eager help Delayed help Fast-Path-Slow-Path Opportunistic help

Helping an Insert Operation Search Direct Insert Report

Helping an Insert Operation Search, Direct, Insert, Report: the operation record reads Status: Pending, Operation: Insert, New node: 7. The helper searches for the insertion point, directs the new node's next field, CASes the node into the list, and finally CASes the record to Status: Success.

Helping an Insert Operation
helpInsert(state s) {
    (pred, succ) = search(s.key)            /* Search */
    if (found key)
        CAS(state[tid], s, failure);        /* Report Failure */
    else {
        s.node.next = succ;                 /* Direct */
        if (CAS(pred.next, succ, s.node))   /* Insert */
            CAS(state[tid], s, success);    /* Report */
    }
}

Good Enough ?

Incorrect Result Returned Consider 2 threads helping insert(7). T1: found (6,9), set node.next = &9, inserted the new node, CAS(state[tid], s, success). T2: searched after the insertion, found (6,7), i.e. key 7 already present, so it CAS(state[tid], s, failure). The same insert(7) operation is thus reported as a success by one helper and as a failure by the other.

Fix:
helpInsert(state s) {
    (pred, succ) = search(s.key)
    if (found key && foundNode != s.node)
        CAS(state[tid], s, failure);
    else {
        s.node.next = succ;
        if (CAS(pred.next, succ, s.node))
            CAS(state[tid], s, success);
    }
}
Good Enough ?

Incorrect Result Returned 2 T1: found (6,9), set node.next = &9, inserted the new node, CAS(->success). T3: Delete(7) followed by Insert(7), so a different node with key 7 is now in the list. T2: found (6,7), and since the found node is not s.node it CAS(->failure), even though s.node was in fact inserted (and only later deleted).

Fix:
helpInsert(state s) {
    (pred, succ) = search(s.key)
    if (found key && foundNode != s.node && !s.node.marked)
        CAS(state[tid], s, failure);
    else {
        s.node.next = succ;
        if (CAS(pred.next, succ, s.node))
            CAS(state[tid], s, success);
    }
}
Good Enough ?

Ill-timed Direct Consider 2 threads helping insert(7). T1: found (6,9), then stalled before its Direct step. T2: found (6,9), set node.next = &9, inserted the new node, CAS(->success), and later performed Insert(8) (after 7). When T1 resumes and executes node.next = &9 (its Direct step), it overwrites 7's next field, disconnecting the newly inserted 8.

Fix:
helpInsert(state s) {
    node_next = s.node.next;
    (pred, succ) = search(s.key)                 /* Search; removes marked nodes */
    if (found key && foundNode != s.node && !s.node.marked)
        CAS(state[tid], s, failure);
    else {
        if (!CAS(s.node.next, node_next, succ))  /* Direct, now by CAS */
            restart;
        if (CAS(pred.next, succ, s.node))        /* Insert */
            CAS(state[tid], s, success);         /* Report */
    }
}
Good Enough ?

More Races Exist Additional races were handled in both the delete and insert operations. We constructed a formal proof of the algorithm's correctness.

Main Invariant Each modification of a node's next field belongs to one of four categories: Marking (changing the mark bit to true), Snipping (removing a marked node), Redirection (of an infant node), Insertion (linking a non-infant node to an infant node). Proved by induction, following the code lines.

Fast-Path-Slow-Path (Kogan and Petrank, PPoPP 2012) Each thread tries to complete the operation without help, and asks for help only if it failed due to contention. (Almost) as fast as the lock-free version, yet gives the stronger wait-free guarantee.

Fast-Path-Slow-Path Previously implemented for a queue. Requires the wait-free algorithm and the lock-free one to work concurrently. Our algorithm was carefully chosen to allow a fast-path-slow-path execution.
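
A rough Java sketch of the fast-path-slow-path pattern; the retry bound MAX_FAST_TRIES and the helper methods fastInsert (lock-free attempt) and announceAndHelp (wait-free slow path) are illustrative placeholders, not the paper's code:

class FastPathSlowPathInsert {
    private static final int MAX_FAST_TRIES = 5;   // assumed contention-failure bound

    enum Outcome { SUCCESS, FAILURE, CONTENTION }

    boolean insert(int tid, int key) {
        // Fast path: run the lock-free insert a bounded number of times.
        for (int attempt = 0; attempt < MAX_FAST_TRIES; attempt++) {
            Outcome r = fastInsert(key);
            if (r != Outcome.CONTENTION) {
                return r == Outcome.SUCCESS;       // finished without asking for help
            }
        }
        // Slow path: publish an operation record in the state array and help
        // until it completes, which bounds the total number of steps (wait-freedom).
        return announceAndHelp(tid, key);
    }

    // Placeholder stubs standing in for the lock-free and wait-free code paths.
    private Outcome fastInsert(int key) { return Outcome.CONTENTION; }
    private boolean announceAndHelp(int tid, int key) { return true; }
}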

Proof Structure Basic Invariants: A node's key never changes. A marked node is never unmarked. A marked node's next field never changes. (And more…)

Proving the Main Invariant Examine each code line that modifies a node Use induction on the execution steps

Performance We measured our algorithm against Harris's lock-free algorithm. We measured our algorithm using: immediate help, deferred help, FPSP.

Performance We report the results of a micro-benchmark: 1024 possible keys, 512 in the list on average; 60% contains, 20% insert, 20% delete. Measured on: Intel Xeon (8 concurrent threads), Sun UltraSPARC (32 concurrent threads).
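
A minimal Java sketch of such a workload driver, using the WaitFreeIntSet interface from the sketch near the beginning; the class name MicroBenchmark and the driver structure are illustrative, not the authors' harness:

import java.util.concurrent.ThreadLocalRandom;

class MicroBenchmark implements Runnable {
    static final int KEY_RANGE = 1024;         // 1024 possible keys, ~512 present on average

    private final WaitFreeIntSet set;
    private volatile boolean stop = false;
    long operations = 0;                       // per-thread throughput counter

    MicroBenchmark(WaitFreeIntSet set) { this.set = set; }

    public void run() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        while (!stop) {
            int key = rnd.nextInt(KEY_RANGE);
            int op = rnd.nextInt(100);
            if (op < 60)      set.contains(key);   // 60% contains
            else if (op < 80) set.insert(key);     // 20% insert
            else              set.delete(key);     // 20% delete
            operations++;
        }
    }

    void requestStop() { stop = true; }
}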

Performance

When employing the FPSP technique together with our algorithm: 0-2% difference on Intel Xeon, 9-11% difference on UltraSPARC.

Conclusions We designed the first practical wait-free linked-list. Performance measurements show that our algorithm works almost as fast as the lock-free list while giving a stronger progress guarantee. A formal correctness proof is available.

Questions ?