Non-blocking Data Structures for High- Performance Computing Håkan Sundell, PhD.

Slides:



Advertisements
Similar presentations
Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems Håkan Sundell Philippas Tsigas.
Advertisements

Wait-Free Linked-Lists Shahar Timnat, Anastasia Braginsky, Alex Kogan, Erez Petrank Technion, Israel Presented by Shahar Timnat 469-+
Operating Systems Part III: Process Management (Process Synchronization)
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
CS510 – Advanced Operating Systems 1 The Synergy Between Non-blocking Synchronization and Operating System Structure By Michael Greenwald and David Cheriton.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 6: Process Synchronization.
Scalable and Lock-Free Concurrent Dictionaries
(C) Ph. Tsigas © Ph. Tsigas Algorithm Engineering of Parallel Algorithms and Parallel Data Structures Philippas Tsigas.
Wait-Free Reference Counting and Memory Management Håkan Sundell, Ph.D.
Håkan Sundell, Chalmers University of Technology 1 Space Efficient Wait-free Buffer Sharing in Multiprocessor Real-time Systems Based.
Scalable Synchronous Queues By William N. Scherer III, Doug Lea, and Michael L. Scott Presented by Ran Isenberg.
E81 CSE 532S: Advanced Multi-Paradigm Software Development Chris Gill Department of Computer Science and Engineering Washington University in St. Louis.
Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.
Locality-Conscious Lock-Free Linked Lists Anastasia Braginsky & Erez Petrank 1.
Concurrent Data Structures in Architectures with Limited Shared Memory Support Ivan Walulya Yiannis Nikolakopoulos Marina Papatriantafilou Philippas Tsigas.
ParMarkSplit: A Parallel Mark- Split Garbage Collector Based on a Lock-Free Skip-List Nhan Nguyen Philippas Tsigas Håkan Sundell Distributed Computing.
TOWARDS A SOFTWARE TRANSACTIONAL MEMORY FOR GRAPHICS PROCESSORS Daniel Cederman, Philippas Tsigas and Muhammad Tayyab Chaudhry.
Lock-free Cuckoo Hashing Nhan Nguyen & Philippas Tsigas ICDCS 2014 Distributed Computing and Systems Chalmers University of Technology Gothenburg, Sweden.
Introduction to Lock-free Data-structures and algorithms Micah J Best May 14/09.
Computer Laboratory Practical non-blocking data structures Tim Harris Computer Laboratory.
CS510 Advanced OS Seminar Class 10 A Methodology for Implementing Highly Concurrent Data Objects by Maurice Herlihy.
CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
SUPPORTING LOCK-FREE COMPOSITION OF CONCURRENT DATA OBJECTS Daniel Cederman and Philippas Tsigas.
Algorithms for Synchronization and Consistency in Concurrent System Services Anders Gidenstam Distributed Computing and Systems group, Department of Computer.
1 Lock-Free Linked Lists Using Compare-and-Swap by John Valois Speaker’s Name: Talk Title: Larry Bush.
Practical and Lock-Free Doubly Linked Lists Håkan Sundell Philippas Tsigas.
Parallel Programming Philippas Tsigas Chalmers University of Technology Computer Science and Engineering Department © Philippas Tsigas.
Simple Wait-Free Snapshots for Real-Time Systems with Sporadic Tasks Håkan Sundell Philippas Tsigas.
CS510 Concurrent Systems Jonathan Walpole. A Lock-Free Multiprocessor OS Kernel.
Håkan Sundell, Chalmers University of Technology 1 Using Timing Information on Wait-Free Algorithms in Real-Time Systems (2 papers)
Understanding Performance of Concurrent Data Structures on Graphics Processors Daniel Cederman, Bapi Chatterjee, Philippas Tsigas Distributed Computing.
Håkan Sundell, Chalmers University of Technology 1 NOBLE: A Non-Blocking Inter-Process Communication Library Håkan Sundell Philippas.
November 15, 2007 A Java Implementation of a Lock- Free Concurrent Priority Queue Bart Verzijlenberg.
Håkan Sundell, Chalmers University of Technology 1 Applications of Non-Blocking Data Structures to Real-Time Systems Seminar for the.
Håkan Sundell, Chalmers University of Technology 1 Simple and Fast Wait-Free Snapshots for Real-Time Systems Håkan Sundell Philippas.
A Consistency Framework for Iteration Operations in Concurrent Data Structures Yiannis Nikolakopoulos A. Gidenstam M. Papatriantafilou P. Tsigas Distributed.
Challenges in Non-Blocking Synchronization Håkan Sundell, Ph.D. Guest seminar at Department of Computer Science, University of Tromsö, Norway, 8 Dec 2005.
Maged M.Michael Michael L.Scott Department of Computer Science Univeristy of Rochester Presented by: Jun Miao.
CSE 425: Concurrency II Semaphores and Mutexes Can avoid bad inter-leavings by acquiring locks –Guard access to a shared resource to take turns using it.
A Methodology for Creating Fast Wait-Free Data Structures Alex Koganand Erez Petrank Computer Science Technion, Israel.
Wait-Free Multi-Word Compare- And-Swap using Greedy Helping and Grabbing Håkan Sundell PDPTA 2009.
Practical concurrent algorithms Mihai Letia Concurrent Algorithms 2012 Distributed Programming Laboratory Slides by Aleksandar Dragojevic.
Darko Makreshanski Department of Computer Science ETH Zurich
Skiplist-based Concurrent Priority Queues Itay Lotan Stanford University Nir Shavit Sun Microsystems Laboratories.
A Simple Optimistic skip-list Algorithm Maurice Herlihy Brown University & Sun Microsystems Laboratories Yossi Lev Brown University & Sun Microsystems.
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects Maged M. Michael Presented by Abdulai Sei.
The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux Guniguntala et al.
CS510 Concurrent Systems Jonathan Walpole. A Methodology for Implementing Highly Concurrent Data Objects.
CS510 Concurrent Systems Jonathan Walpole. RCU Usage in Linux.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
November 27, 2007 Verification of a Concurrent Priority Queue Bart Verzijlenberg.
SkipLists and Balanced Search The Art Of MultiProcessor Programming Maurice Herlihy & Nir Shavit Chapter 14 Avi Kozokin.
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects MAGED M. MICHAEL PRESENTED BY NURIT MOSCOVICI ADVANCED TOPICS IN CONCURRENT PROGRAMMING,
Scalable lock-free Stack Algorithm Wael Yehia York University February 8, 2010.
1 Critical Section Problem CIS 450 Winter 2003 Professor Jinhua Guo.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
Håkan Sundell Philippas Tsigas
Lock-Free Linked Lists Using Compare-and-Swap
The University of Adelaide, School of Computer Science
Challenges in Concurrent Computing
A Lock-Free Algorithm for Concurrent Bags
Expander: Lock-free Cache for a Concurrent Data Structure
Anders Gidenstam Håkan Sundell Philippas Tsigas
Concurrent Data Structures Concurrent Algorithms 2017
NOBLE: A Non-Blocking Inter-Process Communication Library
A Concurrent Lock-Free Priority Queue for Multi-Thread Systems
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Presentation transcript:

Non-blocking Data Structures for High- Performance Computing Håkan Sundell, PhD

5 August 2005EPCC Outline Shared Memory Synchronization Methods Memory Management Shared Data Structures Dictionary Performance Conclusions

5 August 2005EPCC Shared Memory CPU Cache Cache bus Memory... - Uniform Memory Access (UMA) - Non-Uniform Memory Access (NUMA)

5 August 2005EPCC Synchronization Shared data structures needs synchronization! Accesses and updates must be coordinated to establish consistency. P1 P2 P3

5 August 2005EPCC Hardware Synchronization Primitives Consensus 1 Atomic Read/Write Consensus 2 Atomic Test-And-Set (TAS), Fetch-And-Add (FAA), Swap Consensus Infinite Atomic Compare-And-Swap (CAS) Atomic Load-Linked/Store-Conditionally Read Write Read M=f(M,…)

5 August 2005EPCC Mutual Exclusion Access to shared data will be atomic because of lock Reduced Parallelism by definition Blocking, Danger of priority inversion and deadlocks. Solutions exists, but with high overhead, especially for multi-processor systems P1 P2 P3

5 August 2005EPCC Non-blocking Synchronization Perform operation/changes using atomic primitives Lock-Free Synchronization Optimistic approach Retries until succeeding Guarantees progress of at least one operation Wait-Free Synchronization Always finishes in a finite number of its own steps Coordination with all participants

5 August 2005EPCC Memory Management Dynamic data structures need dynamic memory management Concurrent D.S. need concurrent M.M.!

5 August 2005EPCC Concurrent Memory Management Concurrent Memory Allocation i.e. malloc/free functionality Concurrent Garbage Collection Questions (among many): When to re-use memory? How to de-reference pointers safely? P2P1P3

5 August 2005EPCC Lock-Free Memory Management Memory Allocation Valois 1995: fixed block-size, fixed purpose Michael 2004: Gidenstam et al. 2004, any size, any purpose Garbage Collection Valois 1995, Detlefs et al. 2001: reference counting Michael 2002, Herlihy et al. 2002: hazard pointers Gidenstam, Papatriantafilou, Sundell and Tsigas 2005: hazard pointer + reference counting

5 August 2005EPCC Lock-Free Reference Counting De-referencing links 1. Read the link contents, i.e. a pointer. 2. Increment (FAA) the reference count on the corresponding object. What if the link is changed between step 1 and 2? Solution by Detlefs et al: Use DCAS on step 2 that operates on two arbitrary memory words. Retries if link is changed after step 2. Solution by Valois et al: The reference count field is present indefinitely. Decrement reference count and retries if link is changed after step 2.

5 August 2005EPCC Lock-Free Hazard Pointers (Michael 2002) De-referencing links 1. Read the link contents, i.e. a pointer. 2. Set a hazard pointer to the read pointer value. 3. Read the link contents again; if not same as in step 1 then restart from step 1. Deletion After deleted from data structure, put node on a local list. When the local list reaches a certain size; scan all hazard pointers globally, reclaim memory of all nodes which address does not match the scan.

5 August 2005EPCC Lock-Free Memory Allocation Solution (lock-free), IBM freelists: Create a linked-list of the free nodes, allocate/reclaim using CAS Needs some mechanism to avoid the ABA problem. HeadMem 1Mem 2Mem n … Used 1 Reclaim Allocate

5 August 2005EPCC Shared Data Structure: Dictionaries (Sets) Fundamental data structure Works on a set of pairs Three basic operations: Insert(k,v): Adds a new item v=FindKey(k): Finds the item v=DeleteKey(k): Finds and removes the item

5 August 2005EPCC Randomized Algorithm: Skip Lists William Pugh: ”Skip Lists: A Probabilistic Alternative to Balanced Trees”, 1990 Layers of ordered lists with different densities, achieves a tree-like behavior Time complexity: O(log 2 N) – probabilistic! HeadTail 50% 25% …

5 August 2005EPCC New Lock-Free Concurrent Skip List Define node state to depend on the insertion status at lowest level as well as a deletion flag Insert from lowest level going upwards Set deletion flag. Delete from highest level going downwards DDDDDDD p p D

5 August 2005EPCC Overlapping operations on shared data Example: Insert operation - which of 2 or 3 gets inserted? Solution: Compare-And-Swap atomic primitive: CAS(p:pointer to word, old:word, new:word):boolean atomic do if *p = old then *p := new; return true; else return false; Insert 3 Insert 2

5 August 2005EPCC Concurrent Insert vs. Delete operations Problem: - both nodes are deleted! Solution (Harris et al): Use bit 0 of pointer to mark deletion status Delete Insert a) b) * a) b) c)

5 August 2005EPCC Helping Scheme Threads need to traverse safely Need to remove marked-to-be-deleted nodes while traversing – Help! Finds previous node, finish deletion and continues traversing from previous node 1 42* 1 42* or ? ? 1 42*

5 August 2005EPCC Lock-Free Skip List - Techniques Summary The Skip List is treated as layers of ordered lists Uses CAS atomic primitive Lock-Free memory management IBM Freelists Reference counting (Valois+Michael&Scott) Helping scheme Back-Off strategy All together proved to be linearizable

5 August 2005EPCC Lock-Free Skip List publications First publications in literature: H. Sundell and P. Tsigas, ”Fast and Lock- Free Concurrent Priority Queues for Multi- thread Systems”, IPDPS 2003 H. Sundell and P. Tsigas, ”Scalable and Lock-Free Concurrent Dictionaries”, SAC 2004 Later publications: M. Fomitchev and E. Ruppert, “Lock-free linked lists and skip lists”, PODC 2004 K. Fraser, “Practical lock-freedom”, PhD thesis, 2004

5 August 2005EPCC New Lock-Free Skip List ! The thread that fulfils the deletion of a node removes the next pointer when finished. Allows other threads to traverse through even marked next pointers. If not possible to traverse forward, go back to the remembered position on previous (upper) levels. Helps deletions-in-progress only when absolutely necessary. Works with a modified version of Michael’s Hazard Pointer memory management!

5 August 2005EPCC Correctness Linearizability (Herlihy 1991) In order for an implementation to be linearizable, for every concurrent execution, there should exist an equal sequential execution that respects the partial order of the operations in the concurrent execution

5 August 2005EPCC Correctness Define precise sequential semantics Define abstract state and its interpretation Show that state is atomically updated Define linearizability points Show that operations take effect atomically at these points with respect to sequential semantics Creates a total order using the linearizability points that respects the partial order The algorithm is linearizable

5 August 2005EPCC Memory Consistency and Out-Of-Order execution Models on actual multiprocessor architectures: Relaxed Memory Order etc. Must insert special machine instructions (memory barriers) to enforce stronger memory consistency models! t W(x,1) TiTi TjTj TkTk W(y,0) R(y)=0W(x,0) R(x)=1 W(y,1) R(y)=1 R(x)=1 R(x)=0 R(y)=1

5 August 2005EPCC Experiments Experiment with 1-32 threads performed on Sun Fire 15K with 48 cpu’s. Each thread performs operations, whereof the first total operations are Insert’s, remaining are equally randomly distributed over Insert, FindKey and DeleteKey’s. Fixed Skiplist maximum level of 10. Compare with implementations of other skip list-based dictionaries and a singly linked list by Michael, using same scenarios. Averaged execution time of 10 experiments.

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC Multi-Word Compare-And- Swap Operations: bool CASN(int *p1, int o1, int n1, …); int Read(int *p); Standard algoritmic approach: 1. Try to acquire a lock on all positions of interest. 2. If already taken, help corresponding operation 3. If all taken and all match change status of operation 4. Remove locks and possibly write new values My approach: Wait-free memory management (IPDPS 2005) Lock stealing and lock hand-over Allow un-sorted pointers

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC Lock-Free Deque Practical algorithms in literature: Michael 2003, ”CAS-based lock-free algorithm for shared deques”, Euro-Par 2003 Sundell and Tsigas, ”Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap”, OPODIS 2004 Approach Apply new memory management on lock- free deque

5 August 2005EPCC

5 August 2005EPCC

5 August 2005EPCC Conclusions Work performed at EPCC Improved algorithm of lock-free skip list Improved Michael’s hazard pointer algorithm Experiments comparing with other recent dictionary algorithms New implementation of CASN. Experiments comparing with other recent CASN algorithms. Experiments comparing a lock-free deque algorithm using different memory management techniques. Future work Implement new lock-free/ wait-free dynamic data structures. More experiments.

5 August 2005EPCC Questions? Contact Information: Address: Håkan Sundell Computing Science Chalmers University of Technology Web: