Anders Gidenstam Håkan Sundell Philippas Tsigas

Presentation transcript:

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency. Anders Gidenstam, Håkan Sundell, Philippas Tsigas. School of Business and Informatics, University of Borås; Distributed Computing and Systems group, Department of Computer Science and Engineering, Chalmers University of Technology.

Outline: Introduction; Lock-free synchronization; The Problem & Related work; The new lock-free queue algorithm; Experiments; Conclusions. 07/11/2018 Anders Gidenstam, University of Borås

Synchronization on a shared object. Lock-free synchronization: concurrent operations without enforcing mutual exclusion. Avoids blocking (or busy waiting), convoy effects, priority inversion and the risk of deadlock. Progress guarantee: at least one operation always makes progress. [Figure: processes P1 to P4 performing operations Op A and Op B.]

Correctness of a concurrent object. Desired semantics of a shared data object: linearizability [Herlihy & Wing, 1990]. For each operation invocation there must be a single time instant during its duration at which the operation appears to take effect. The observed effects should be consistent with a sequential execution of the operations in that order. [Figure: overlapping operations O1, O2, O3.]

System Model. Processes can read/write single memory words. Synchronization primitives are built into the CPU and memory system: atomic read-modify-write (i.e. a critical section of one instruction). Examples: Compare-and-Swap, Load-Linked / Store-Conditional. [Figure: two CPUs connected to shared memory.]
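The read-modify-write primitive named on this slide can be tried out with C++ `std::atomic`. The following is a minimal sketch, not code from the talk; the helper names `cas` and `fetch_increment` are made up for illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: change `word` from `expected` to `desired` in one
// atomic step. Returns true if the swap happened, false otherwise.
inline bool cas(std::atomic<long>& word, long expected, long desired) {
    return word.compare_exchange_strong(expected, desired);
}

// Hypothetical helper: a lock-free increment built from a CAS retry loop.
// Returns the value that was replaced.
inline long fetch_increment(std::atomic<long>& word) {
    long seen = word.load();
    // On failure, compare_exchange_weak writes the current value into
    // `seen`, so the loop retries with fresh data.
    while (!word.compare_exchange_weak(seen, seen + 1)) {
    }
    return seen;
}
```

A failed CAS changes nothing in memory, which is what lets several threads race on the same word without a lock.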

System Model: Memory Consistency. A process' reads/writes may reach memory out of order, but reads of its own writes appear in program order. Atomic synchronization primitive/instruction: single-word Compare-and-Swap. It is atomic and acts as a memory barrier for the process' own reads and writes: all own reads/writes before it are done before it, and all own reads/writes after it are done after it. The affected cache block is held exclusively. [Figure: CPU core, store buffer, cache, shared memory (or shared cache).]

The Problem. A concurrent FIFO queue shared data object. Basic operations: enqueue and dequeue. Desired properties: linearizable and lock-free; dynamic size (maximum only limited by available memory); bounded memory usage (in terms of live contents); fast on real systems. [Figure: queue holding elements A to G between Tail and Head.]

Related Work: Lock-free Multi-P/C Queues. [Michael & Scott, 1996]: linked list, one element per node; global shared head and tail pointers. [Tsigas & Zhang, 2001]: static circular array of elements; two different NULL values for distinguishing initially empty from dequeued elements; global shared head and tail indices, lazily updated. [Michael & Scott, 1996] + Elimination [Moir, Nussbaum, Shalev & Shavit, 2005]: same as [Michael & Scott, 1996], plus elimination of concurrent pairs of enqueue and dequeue when the queue is near empty. [Hoffman, Shalev & Shavit, 2007]: baskets queue; reduces contention between concurrent enqueues after a conflict; needs stronger memory management than M&S (SLFRC or Beware&Cleanup).

The Algorithm. Basic idea: cut and unroll the circular array queue. Primary synchronization is on the elements, using Compare-And-Swap (the life cycle NULL1 -> Value -> NULL2 avoids the ABA problem). Head and tail both move to the right. This needs an "infinite" array of elements.
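The element life cycle NULL1 -> Value -> NULL2 can be sketched as two CAS steps on a single slot. The encoding below (two reserved integers `kNull1` and `kNull2`, and the names `Slot`, `try_put`, `try_take`) is an assumption for illustration only; the talk does not specify how the two NULL values are represented.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed encoding: two reserved words stand in for NULL1 and NULL2.
constexpr std::uintptr_t kNull1 = 0;  // slot has never been used
constexpr std::uintptr_t kNull2 = 1;  // slot was used and then emptied

using Slot = std::atomic<std::uintptr_t>;

// NULL1 -> value: succeeds only on a slot that has never been used.
inline bool try_put(Slot& s, std::uintptr_t value) {
    assert(value != kNull1 && value != kNull2);
    std::uintptr_t expected = kNull1;
    return s.compare_exchange_strong(expected, value);
}

// value -> NULL2: succeeds only if the slot still holds `seen`.
inline bool try_take(Slot& s, std::uintptr_t seen) {
    assert(seen != kNull1 && seen != kNull2);
    return s.compare_exchange_strong(seen, kNull2);
}
```

Because a slot only moves forward through these three states and never returns to NULL1, a delayed `try_put` cannot succeed on a slot that has already been used and emptied.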

The Algorithm. Basic idea: creating an "infinite" array of elements. Divide it into blocks of elements and link them together. New empty blocks are added as needed. Emptied blocks are marked deleted and eventually reclaimed. Block fields: elements, next, (filled, emptied flags), deleted flag. This gives a linked chain of dynamically allocated blocks. Lock-free memory management is needed for safe reclamation: Beware&Cleanup [Gidenstam, Papatriantafilou, Sundell & Tsigas, 2009]. [Figure: chain of blocks with globalTailBlock and globalHeadBlock.]
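One possible shape for such a block is sketched below. The block size, the omission of the optional filled/emptied flags, and the helper `link_next` (which appends a block with a CAS on `next`) are assumptions for illustration; memory reclamation is left out entirely.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 8;  // assumed; not stated in the talk

struct Block {
    std::atomic<std::uintptr_t> elements[kBlockSize];
    std::atomic<Block*> next{nullptr};
    std::atomic<bool> deleted{false};
    // All slots start in the "never used" state (assumed to be 0 here).
    Block() {
        for (auto& e : elements) e.store(0, std::memory_order_relaxed);
    }
};

// Hypothetical helper: link `fresh` after `last` unless another thread has
// already linked a block there. Returns the block that follows `last`.
inline Block* link_next(Block* last, Block* fresh) {
    Block* expected = nullptr;
    if (last->next.compare_exchange_strong(expected, fresh)) return fresh;
    return expected;  // another thread won; caller still owns `fresh`
}
```

If two threads try to extend the chain at once, exactly one CAS succeeds and both continue with the same successor block.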

The Algorithm. Thread-local storage holds the last used head block/index for enqueue and tail block/index for dequeue. This reduces the need to read/update global shared variables. [Figure: chain of blocks with globalTailBlock and globalHeadBlock; Thread B's local headBlock/head and tailBlock/tail.]
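The thread-local hints could look roughly like the sketch below, using C++ `thread_local`. The struct `ThreadHints` and the functions `my_hints` and `start_block_for_enqueue` are hypothetical names; only the four cached fields come from the slide.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct Block {};  // placeholder; the real block layout is not needed here

// Per-thread cache of the last positions this thread used.
struct ThreadHints {
    Block* headBlock = nullptr;
    std::size_t head = 0;
    Block* tailBlock = nullptr;
    std::size_t tail = 0;
};

inline ThreadHints& my_hints() {
    thread_local ThreadHints hints;
    return hints;
}

// Prefer the cached block; read the shared global pointer only when the
// thread has no hint yet.
inline Block* start_block_for_enqueue(std::atomic<Block*>& globalHeadBlock) {
    ThreadHints& h = my_hints();
    if (h.headBlock == nullptr) {
        h.headBlock = globalHeadBlock.load();
        h.head = 0;
    }
    return h.headBlock;
}
```

A hint may be out of date, so an actual implementation still has to validate it and fall back to the global pointers.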

The Algorithm: Enqueue. Find the right block (first via TLS, then TLS->next or globalHeadBlock). Search the block for the first empty element. Update the element with CAS (this is also the linearization point). [Figure: Thread B scans from its local headBlock/head and adds the element with CAS.]
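The scan-then-CAS step inside a single block might be sketched as follows. This covers one block only: finding the right block, adding new blocks and memory reclamation are omitted, and the names `enqueue_in_block`, `kNull1`, `kNull2` and `kBlockSize` are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 4;    // assumed
constexpr std::uintptr_t kNull1 = 0;     // assumed: never-used slot
constexpr std::uintptr_t kNull2 = 1;     // assumed: emptied slot

struct Block {
    std::atomic<std::uintptr_t> elements[kBlockSize];
    Block() {
        for (auto& e : elements) e.store(kNull1, std::memory_order_relaxed);
    }
};

// Try to enqueue into one block, scanning from the caller's hint `head`.
// Returns false when the block is full, so the caller must move on.
inline bool enqueue_in_block(Block& b, std::size_t& head,
                             std::uintptr_t value) {
    assert(value != kNull1 && value != kNull2);
    for (; head < kBlockSize; ++head) {
        if (b.elements[head].load(std::memory_order_relaxed) != kNull1) {
            continue;  // slot already used, keep scanning
        }
        std::uintptr_t expected = kNull1;
        if (b.elements[head].compare_exchange_strong(expected, value)) {
            return true;  // the successful CAS is the linearization point
        }
        // Lost the race for this slot; continue with the next one.
    }
    return false;
}
```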

The Algorithm: Dequeue. Find the right block (first via TLS, then TLS->next or globalTailBlock). Search the block for the first valid element. Remove it with CAS, replacing it with NULL2 (linearization point). [Figure: Thread B scans from its local tailBlock/tail and removes the element with CAS.]
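A single-block version of the dequeue scan is sketched below under the same kind of assumptions: the names `dequeue_in_block`, `kNull1`, `kNull2` and `kBlockSize` are hypothetical, and treating a never-used slot as "nothing to dequeue here" is a simplification of whatever the full algorithm does at block boundaries.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr std::size_t kBlockSize = 4;    // assumed
constexpr std::uintptr_t kNull1 = 0;     // assumed: never-used slot
constexpr std::uintptr_t kNull2 = 1;     // assumed: emptied slot

struct Block {
    std::atomic<std::uintptr_t> elements[kBlockSize];
    Block() {
        for (auto& e : elements) e.store(kNull1, std::memory_order_relaxed);
    }
};

// Try to dequeue from one block, scanning from the caller's hint `tail`.
// Returns the removed value, or nullopt if nothing could be taken here.
inline std::optional<std::uintptr_t> dequeue_in_block(Block& b,
                                                      std::size_t& tail) {
    while (tail < kBlockSize) {
        std::uintptr_t seen = b.elements[tail].load(std::memory_order_acquire);
        if (seen == kNull1) return std::nullopt;  // nothing enqueued here yet
        if (seen != kNull2 &&
            b.elements[tail].compare_exchange_strong(seen, kNull2)) {
            return seen;  // the successful CAS is the linearization point
        }
        ++tail;  // slot already emptied, by this or another thread
    }
    return std::nullopt;  // block exhausted; caller moves to the next block
}
```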

The Algorithm: maintaining the chain of blocks. A helping scheme is used when moving between blocks. Invariants to be maintained: globalHeadBlock points to the newest block or the block before it; globalTailBlock points to the oldest active (not deleted) block or the block before it.

Maintaining the chain of blocks: updating globalTailBlock. Case 1, "Leader": the thread finds the block empty. If needed, it helps to ensure that globalTailBlock points to tailBlock (or a newer block). Once helping is done, it sets the delete mark, updates the globalTailBlock pointer, and moves its own tailBlock pointer. [Figure: Thread A's local headBlock/head and tailBlock/tail against the chain of blocks.]
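The leader's steps could be sketched as below. This is an illustration only: the helping step and memory reclamation are left out, the function name `retire_and_advance` is made up, and the "no next block means the queue is empty" check follows a backup slide later in the deck.

```cpp
#include <atomic>
#include <cassert>

struct Block {
    std::atomic<Block*> next{nullptr};
    std::atomic<bool> deleted{false};
};

// Called by a thread that found `tailBlock` empty. Returns the block to
// continue with, or nullptr if there is no next block (queue empty).
inline Block* retire_and_advance(std::atomic<Block*>& globalTailBlock,
                                 Block* tailBlock) {
    Block* next = tailBlock->next.load();
    if (next == nullptr) return nullptr;   // no next block: queue is empty
    tailBlock->deleted.store(true);        // set the delete mark
    Block* expected = tailBlock;
    // A failed CAS is harmless: the global pointer was already moved.
    globalTailBlock.compare_exchange_strong(expected, next);
    return next;                           // caller moves its own tailBlock
}
```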

Maintaining the chain of blocks: updating globalTailBlock. Case 2, "Way out of date": tailBlock->next is marked deleted, so the thread restarts with globalTailBlock. [Figure: Thread A's local tailBlock lagging behind globalTailBlock.]

Minding the Cache. Blocks occupy one cache line. Cache lines for enqueue vs. dequeue are disjoint (except when the queue is near empty). Enqueue/dequeue will cause coherence traffic for the affected block. Scanning for the head/tail involves one cache line. [Figure: blocks mapped to cache lines; Thread A's local headBlock/head and tailBlock/tail.]
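One way to make a block occupy exactly one cache line is sketched here. The 64-byte line size and the field layout are assumptions (the talk gives neither), and the `static_assert` relies on `std::atomic<std::uintptr_t>` having the same size as `std::uintptr_t`, which holds on mainstream platforms.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

struct alignas(kCacheLine) Block {
    // Leave room for the `next` pointer and the `deleted` flag (plus padding).
    static constexpr std::size_t kSlots =
        (kCacheLine - 2 * sizeof(void*)) / sizeof(std::uintptr_t);
    std::atomic<std::uintptr_t> elements[kSlots];
    std::atomic<Block*> next{nullptr};
    std::atomic<bool> deleted{false};
    Block() {
        for (auto& e : elements) e.store(0, std::memory_order_relaxed);
    }
};

// Fails to compile if the layout does not fit one cache line exactly.
static_assert(sizeof(Block) == kCacheLine, "Block must fill one cache line");
static_assert(alignof(Block) == kCacheLine, "Block must be line-aligned");
```

With this layout a scan over the elements of one block touches a single cache line, and enqueuers and dequeuers working on different blocks do not share lines.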

Scanning in a weak memory model? Key observations: the life cycle for element values is NULL1 -> value -> NULL2. Elements are updated with CAS, which requires the old value to be the expected one. Scanning only skips values later in the life cycle. Reading an old value is safe (the thread will try CAS and fail). [Figure: scanning for an empty element and scanning for the first item to dequeue.]
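The safety argument can be illustrated by acting on a possibly stale observation of a slot. In this sketch the stale read is simulated by passing `observed` explicitly; the names `Outcome`, `enqueue_with_observation`, `kNull1` and `kNull2` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kNull1 = 0;  // assumed: never-used slot
constexpr std::uintptr_t kNull2 = 1;  // assumed: emptied slot

using Slot = std::atomic<std::uintptr_t>;

enum class Outcome { Enqueued, Skip, Stale };

// `observed` is what the scanning thread believes the slot holds; under a
// weak memory model it may lag behind the slot's real content.
inline Outcome enqueue_with_observation(Slot& s, std::uintptr_t observed,
                                        std::uintptr_t value) {
    assert(value != kNull1 && value != kNull2);
    // A slot seen later in its life cycle can never return to NULL1.
    if (observed != kNull1) return Outcome::Skip;
    std::uintptr_t expected = kNull1;
    if (s.compare_exchange_strong(expected, value)) return Outcome::Enqueued;
    return Outcome::Stale;  // the read was out of date; the CAS failed safely
}
```

A stale read can only make the thread attempt a CAS that fails; it can never make it skip a slot that is still empty, because values only move forward in the life cycle.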

Experimental evaluation: micro benchmark. Threads execute enqueue and dequeue operations on a shared queue under high contention. Test configurations: random 50% / 50%, initial size 0; random 50% / 50%, initial size 1000; 1 producer / N-1 consumers; N-1 producers / 1 consumer. Measured: throughput in items/sec, counted as the number of dequeues not returning EMPTY.

Experimental evaluation: micro benchmark. Algorithms: [Michael & Scott, 1996]; [Michael & Scott, 1996] + Elimination [Moir, Nussbaum, Shalev & Shavit, 2005]; [Hoffman, Shalev & Shavit, 2007]; [Tsigas & Zhang, 2001]; the new Cache-Aware Queue [Gidenstam, Sundell & Tsigas, 2010]. PC platform: CPU Intel Core i7 920 @ 2.67 GHz, 4 cores with 2 hardware threads each; RAM 6 GB DDR3 @ 1333 MHz; Windows 7 64-bit.

Experimental evaluation (i)

Experimental evaluation (ii)

Experimental evaluation (iii)

Experimental evaluation (iv)

Conclusions. The Cache-Aware Lock-free Queue is the first lock-free queue algorithm for multiple producers/consumers with all of the properties below. It is designed to be cache-friendly. It is designed for the weak memory consistency provided by contemporary hardware. It is disjoint-access parallel (except when near empty). It uses thread-local storage for reduced communication. It uses a linked list of array blocks for efficient dynamic size support.

Thank you for listening! Questions?

Experimental evaluation. Scalable? No. FIFO order limits the possibilities for concurrency.

The Algorithm. Basic idea: an "infinite" array of elements. Primary synchronization is on the elements, using Compare-And-Swap (NULL1 -> Value -> NULL2 avoids ABA). The array is divided into blocks of elements. New empty blocks are added as needed. Emptied blocks are removed and eventually reclaimed. Block fields: elements, next, (filled, emptied), deleted. This gives a linked chain of dynamically allocated blocks. Lock-free memory management is needed for safe reclamation: Beware&Cleanup [Gidenstam, Papatriantafilou, Sundell & Tsigas, 2009].

The Algorithm. [Figure: globalTailBlock and globalHeadBlock, with Thread A's local headBlock/head and tailBlock/tail.]

System Model: Memory Consistency. A process' reads/writes may reach memory out of order, but reads of its own writes appear in program order. Atomic synchronization primitive/instruction: single-word Compare-and-Swap. It acts as a memory barrier for the process' own reads and writes: all own reads/writes before it are done before it, and all own reads/writes after it are done after it. The affected cache block is held exclusively (cache coherency). [Figure: CPU core, store buffer, cache, shared memory (or shared cache).]

Maintaining the chain of blocks: updating globalTailBlock. Case 1, "Leader": the thread finds the block empty. If there is no next block, the queue is empty (done). [Figure: Thread A's local headBlock/head and tailBlock/tail.]