Algorithm Engineering of Parallel Algorithms and Parallel Data Structures
Philippas Tsigas
© Ph. Tsigas 2003-2004

Presentation transcript:


NOBLE: A Library of Non-Blocking Concurrent Data Structures
Philippas Tsigas
Results jointly with: Håkan Sundell and Yi Zhang

Overview
- Introduction
  - Synchronization
  - Non-blocking Synchronization
- Is Non-blocking Synchronization performance-beneficial for Parallel Applications?
- NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
- Lock-free Skip lists
- Conclusions, Future Work

Systems
- SMP
- Cache-coherent distributed shared memory multiprocessor systems: UMA, NUMA

Synchronization
- Barriers
- Locks, semaphores, ... (mutual exclusion)
- “A significant part of the work performed by today’s parallel applications is spent on synchronization.” ...

Lock-Based Synchronization: Sequential

Non-blocking Synchronization: Lock-Free Synchronization
- Optimistic approach
- Assumes it is alone and prepares an operation which later takes place (unless interfered with) in one atomic step, using hardware atomic primitives
- Interference is detected via shared memory
- Retries until it is not interfered with by other operations
- Can cause starvation
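The optimistic pattern described above can be sketched with C11 atomics. This is a minimal illustration, not code from the talk: the function name `lockfree_store_max` and the choice of a "store the maximum" operation are assumptions made for the example. The thread reads the shared value, prepares its update privately, and tries to publish it in one atomic compare-and-swap step, retrying whenever another thread interfered.

```c
#include <stdatomic.h>

/* Hypothetical helper (not from the slides): raise *target to at least
 * `candidate` without taking a lock. Returns the value stored afterwards. */
long lockfree_store_max(_Atomic long *target, long candidate)
{
    long seen = atomic_load(target);            /* optimistic read */
    while (seen < candidate) {
        /* Try to publish in one atomic step. If another thread changed
         * *target in the meantime, the CAS fails, `seen` is refreshed with
         * the current value, and the loop decides whether to retry. */
        if (atomic_compare_exchange_weak(target, &seen, candidate))
            return candidate;
    }
    return seen;    /* the shared value is already >= candidate */
}
```

Nothing bounds the number of retries for one particular thread, which is the starvation risk named in the last bullet: some thread always makes progress, but not necessarily this one.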

Example: Shared Queue
The usual approach is to implement operations using retry loops. Here’s an example:

    type Qtype = record v: valtype; next: pointer to Qtype end
    shared var Tail: pointer to Qtype;
    local var old, new: pointer to Qtype

    procedure Enqueue (input: valtype)
        new := (input, NIL);
        repeat old := Tail
        until CAS2(&Tail, &(old->next), old, NIL, new, new)

[Figure: the queue before and after the enqueue, with Tail, old and new marked]
Slide provided by Jim Anderson

Non-blocking Synchronization
Lock-Free Synchronization
- Avoids problems that locks have
- Fast
- Starvation? (not in the context of HPC)
Wait-Free Synchronization
- Always finishes in a finite number of its own steps
- Complex algorithms
- Memory consuming
- Less efficient on average than lock-free

Overview
- Introduction
  - Synchronization
  - Non-blocking Synchronization
- Is Non-blocking Synchronization performance-beneficial for Parallel Scientific Applications?
- NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
- Conclusions, Future Work

Non-blocking Synchronisation
- Synchronisation: an alternative approach for synchronisation, introduced 25 years ago. Many theoretical results.
- Evaluation: micro-benchmarks show better performance than mutual exclusion in real or simulated multiprocessor systems.

Practice
- Non-blocking synchronization is still not used in practical applications
- Non-blocking solutions are often:
  - complex
  - equipped with non-standard or unclear interfaces
  - non-practical

Practice: Question
“How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?”

Answers
How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?
- The identification of the basic locking operations that parallel programmers use in their applications.
- The efficient non-blocking implementation of these synchronisation operations.
- The architectural implications on the design of non-blocking synchronisation.
- Comparison of the lock-based and lock-free versions of the respective applications.

Applications
- Ocean: simulates eddy currents in an ocean basin.
- Radiosity: computes the equilibrium distribution of light in a scene using the radiosity method.
- Volrend: renders 3D volume data into an image using a ray-casting method.
- Water: evaluates forces and potentials that occur over time between water molecules.
- Spark98: a collection of sparse matrix kernels. Each kernel performs a sequence of sparse matrix vector product operations using matrices that are derived from a family of three-dimensional finite element earthquake applications.

Removing Locks in Applications
- Many locks are “Simple Locks”. CAS, FAA and LL/SC can be used to implement a non-blocking version.
- Many critical sections contain shared floating-point variables. Floating-point synchronization primitives are needed. A Double-Fetch-and-Add primitive was designed.
- Large critical sections. Efficient non-blocking implementations of big ADTs are used.

Experimental Results: Speedup
[Figure: speedup plots, panels labelled 58P, 32P and 24P]

SPARK98
Before:

    spark_setlock(lockid);
    w[col][0] += A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2];
    w[col][1] += A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2];
    w[col][2] += A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2];
    spark_unsetlock(lockid);

After:

    dfad(&w[col][0], A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2]);
    dfad(&w[col][1], A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2]);
    dfad(&w[col][2], A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2]);
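The slides do not show how `dfad` is implemented. One plausible way to get a double-precision fetch-and-add on hardware that only offers integer CAS is sketched below. Everything here is an assumption for illustration: the names `shared_double`, `shared_double_store`, `shared_double_load` and `dfad_sketch` are hypothetical, and the shared value is stored as the raw 64-bit pattern of a double rather than as a plain `double` as in the SPARK98 code.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes 64-bit double");

/* Assumption: the shared value is kept as the bit pattern of a double so
 * that an ordinary 64-bit integer CAS can be applied to it. */
typedef _Atomic uint64_t shared_double;

static uint64_t to_bits(double d)     { uint64_t b; memcpy(&b, &d, sizeof b); return b; }
static double   from_bits(uint64_t b) { double d;   memcpy(&d, &b, sizeof d); return d; }

void shared_double_store(shared_double *addr, double v)
{
    atomic_store(addr, to_bits(v));
}

double shared_double_load(shared_double *addr)
{
    return from_bits(atomic_load(addr));
}

/* Atomically adds delta to *addr and returns the previous value. */
double dfad_sketch(shared_double *addr, double delta)
{
    uint64_t old_bits = atomic_load(addr);
    for (;;) {
        double old_val = from_bits(old_bits);
        uint64_t new_bits = to_bits(old_val + delta);
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits))
            return old_val;
        /* old_bits now holds the value another thread wrote; recompute. */
    }
}
```

With a primitive of this shape, each of the three updates in the "After" version is atomic on its own, so the lock around the whole group is no longer needed as long as no reader requires the three components to change together.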

Overview
- Introduction
  - Synchronization
  - Non-blocking Synchronization
- Is Non-blocking Synchronization beneficial for Parallel Scientific Applications?
- NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
- Conclusions, Future Work


NOBLE: Brings Non-blocking closer to Practice
Create a non-blocking inter-process communication interface with the properties:
- Attractive functionality
- Programmer friendly
- Easy to adapt existing solutions
- Efficient
- Portable
- Adaptable for different programming languages

NOBLE Design: Portable
- Exported definitions, identical for all platforms: Noble.h (#define NBL...)
- Platform independent: QueueLF.c, StackLF.c, ... (each with #include “Platform/Primitives.h”)
- Platform dependent: SunHardware.asm, IntelHardware.asm, ... (CAS, TAS, Spin-Locks, ...)

Using NOBLE
First create a global variable handling the shared data object, for example a stack:

    Globals:   #include ...
               NBLStack* stack;

Create the stack with the appropriate implementation:

    Main:      stack=NBLStackCreateLF(10000);
               ...

When some thread wants to do some operation:

    Threads:   NBLStackPush(stack, item);
               or
               item=NBLStackPop(stack);

Using NOBLE
When the data structure is not in use anymore:

    Globals:   #include ...
               NBLStack* stack;

    Main:      stack=NBLStackCreateLF(10000);
               ...
               NBLStackFree(stack);

Using NOBLE
To change the synchronization mechanism, only one line of code has to be changed!

    Globals:   #include ...
               NBLStack* stack;

    Main:      stack=NBLStackCreateLB();
               ...
               NBLStackFree(stack);

    Threads:   NBLStackPush(stack, item);
               or
               item=NBLStackPop(stack);

Design: Attractive functionality
- Data structures for multi-threaded usage:
  - FIFO Queues
  - Priority Queues
  - Dictionaries
  - Stacks
  - Singly linked lists
  - Snapshots
  - MWCAS
  - ...
- Clear specifications

Status
- Multiprocessor support:
  - Sun Solaris (Sparc)
  - Win32 (Intel x86)
  - SGI (Mips)
  - Linux (Intel x86)
- Available for academic use:

Did our Work have any Impact?
1) Industry has initiated contacts and uses a test version of NOBLE.
2) Freeware developers have shown interest.
3) Interest from research organisations.
NOBLE is freely available for research and educational purposes.

A Lock-Free Skip list
- H. Sundell, Ph. Tsigas. Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS '03), May 2003 (TR 2002). Best Paper Award.
- A very similar skip list algorithm will be presented this August at the ACM Symposium on Principles of Distributed Computing (PODC 2004): “Lock-Free Linked Lists and Skip Lists”, Mikhail Fomitchev, Eric Ruppert.

Randomized Algorithm: Skip Lists
- William Pugh: “Skip Lists: A Probabilistic Alternative to Balanced Trees”, 1990
- Layers of ordered lists with different densities achieve a tree-like behavior
- Time complexity: O(log₂ N), probabilistic!
[Figure: skip list running from Head to Tail, with layer densities 50%, 25%, ...]
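The 50% / 25% layer densities come from a coin-flipping rule for choosing each new node's height. A minimal sketch of that rule follows; the name `random_level` and the cap `MAX_LEVEL` are assumptions for the example, and `rand()` stands in for whatever random source a real implementation would use.

```c
#include <stdlib.h>

#define MAX_LEVEL 16   /* assumed cap on node height, not taken from the slides */

/* Picks a height for a new node: every node is on level 1, about half of
 * them also on level 2, about a quarter on level 3, and so on. */
int random_level(void)
{
    int level = 1;
    while (level < MAX_LEVEL && rand() > RAND_MAX / 2)
        level++;
    return level;
}
```

Because each level holds roughly half as many nodes as the one below it, a search that starts at the top level skips about half of the remaining candidates per level, which is where the expected logarithmic search cost comes from.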

Our Lock-Free Concurrent Skip List
- Define node state to depend on the insertion status at the lowest level as well as a deletion flag
- Insert from the lowest level going upwards
- Set the deletion flag. Delete from the highest level going downwards

Concurrent Insert vs. Delete operations
- Problem: both nodes are deleted!
- Solution (Harris et al): use bit 0 of the pointer to mark deletion status
[Figure: a Delete and an Insert on neighbouring nodes, steps a) to c)]
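The pointer-marking idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the type `node` and the functions `is_marked`, `link_target`, `mark_deleted` and `insert_after` are hypothetical names, and the link is stored as an integer so that bit 0 can be set. It relies on nodes being aligned to at least two bytes, so that bit 0 of a real node address is always zero.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct node {
    int key;
    _Atomic uintptr_t next;   /* successor address; bit 0 is the deletion mark */
} node;

bool is_marked(uintptr_t link)    { return (link & 1u) != 0; }
node *link_target(uintptr_t link) { return (node *)(link & ~(uintptr_t)1); }

/* Logically deletes n by setting the mark on its own next link.
 * Returns false if some thread had already marked it. */
bool mark_deleted(node *n)
{
    uintptr_t link = atomic_load(&n->next);
    while (!is_marked(link)) {
        if (atomic_compare_exchange_weak(&n->next, &link, link | 1u))
            return true;
    }
    return false;
}

/* Tries to link new_node between pred and succ. The CAS expects an
 * unmarked link to succ, so it fails if pred has been marked as deleted
 * or if pred's successor has changed. */
bool insert_after(node *pred, node *succ, node *new_node)
{
    atomic_store(&new_node->next, (uintptr_t)succ);
    uintptr_t expected = (uintptr_t)succ;
    return atomic_compare_exchange_strong(&pred->next, &expected,
                                          (uintptr_t)new_node);
}
```

Since the mark lives in the same word as the successor pointer, a single CAS checks both at once: an insert can no longer attach a node behind a predecessor that is in the middle of being removed.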

Dynamic Memory Management
- Problem: system memory allocation functionality is blocking!
- Solution (lock-free), IBM freelists: pre-allocate a number of nodes, link them into a dynamic stack structure, and allocate/reclaim using CAS
[Figure: free list Head -> Mem 1 -> Mem 2 -> ... -> Mem n, with Allocate and Reclaim operations and one node in use]
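A free list of this kind can be sketched as a stack of pre-allocated nodes whose head is updated with CAS. The sketch below is an assumption-laden illustration: `pool_node`, `pool_init`, `pool_alloc` and `pool_free` are hypothetical names, nodes are addressed by array index, and the head word carries a version tag next to the index so that a head that was popped and pushed back in the meantime is not mistaken for an unchanged one. That tag is a technique swapped in for this example; the talk itself uses reference counting for that purpose.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 1024u          /* assumed pool size */
#define NIL       0xFFFFFFFFu    /* "no node" index */

typedef struct {
    int value;
    _Atomic uint32_t next_free;  /* index of the next free node */
} pool_node;

static pool_node pool[POOL_SIZE];
static _Atomic uint64_t free_head;   /* high 32 bits: tag, low 32 bits: index */

static uint64_t pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

/* Links all pre-allocated nodes into the free stack. Call once, before
 * any thread uses the pool. */
void pool_init(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        atomic_store(&pool[i].next_free, (i + 1 < POOL_SIZE) ? i + 1 : NIL);
    atomic_store(&free_head, pack(0, 0));
}

/* Pops a node from the free stack; returns NULL if the pool is empty. */
pool_node *pool_alloc(void)
{
    uint64_t head = atomic_load(&free_head);
    for (;;) {
        uint32_t idx = (uint32_t)head;
        if (idx == NIL)
            return NULL;
        uint32_t next = atomic_load(&pool[idx].next_free);
        uint64_t desired = pack((uint32_t)(head >> 32) + 1, next);
        if (atomic_compare_exchange_weak(&free_head, &head, desired))
            return &pool[idx];
        /* head was refreshed by the failed CAS; retry with the new value. */
    }
}

/* Pushes a node back onto the free stack. */
void pool_free(pool_node *n)
{
    uint32_t idx = (uint32_t)(n - pool);
    uint64_t head = atomic_load(&free_head);
    for (;;) {
        atomic_store(&n->next_free, (uint32_t)head);
        uint64_t desired = pack((uint32_t)(head >> 32) + 1, idx);
        if (atomic_compare_exchange_weak(&free_head, &head, desired))
            return;
    }
}
```

Neither operation ever waits for another thread, so allocation stays non-blocking, at the price of a fixed pool size chosen up front.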

The ABA problem
Problem: because of concurrency (pre-emption in particular), the same pointer value does not always mean the same node (i.e. the CAS succeeds even though the node has changed)!
[Figure: Step 1 and Step 2 of an ABA scenario]

The ABA problem
Solution (Valois et al): add reference counting to each node, in order to prevent nodes that are of interest to some thread from being reclaimed until all threads have left the node.
[Figure: new Step 2, in which the CAS fails]

Helping Scheme
- Threads need to traverse safely
- Need to remove marked-to-be-deleted nodes while traversing: Help!
- Finds the previous node, finishes the deletion and continues traversing from the previous node

Overlapping operations on shared data
- Example: Insert operation. Which of 2 or 3 gets inserted?
- Solution: Compare-And-Swap atomic primitive:

    CAS(p: pointer to word, old: word, new: word): boolean
        atomic do
            if *p = old then
                *p := new; return true;
            else
                return false;

[Figure: two overlapping operations, Insert 3 and Insert 2, on the same list position]
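The CAS primitive above maps directly onto C11's `atomic_compare_exchange_strong`. The sketch below wraps it in a function with the slide's signature and uses it for an insert-only sorted list; `lnode`, `cas` and `insert_sorted` are names chosen for this example, and deletion is deliberately left out so that the node being followed can never disappear.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct lnode {
    int key;
    struct lnode *_Atomic next;
} lnode;

/* CAS as on the slide: if *p equals old, store new_value and return true;
 * otherwise leave *p untouched and return false. */
bool cas(lnode *_Atomic *p, lnode *old, lnode *new_value)
{
    return atomic_compare_exchange_strong(p, &old, new_value);
}

/* Inserts n in key order behind the sentinel node head.
 * Assumption: nodes are only ever inserted, never removed. */
void insert_sorted(lnode *head, lnode *n)
{
    for (;;) {
        lnode *pred = head;
        lnode *succ = atomic_load(&pred->next);
        while (succ != NULL && succ->key < n->key) {
            pred = succ;
            succ = atomic_load(&pred->next);
        }
        atomic_store(&n->next, succ);
        if (cas(&pred->next, succ, n))
            return;     /* linked in one atomic step */
        /* A concurrent insert changed pred->next first; search again. */
    }
}
```

When Insert 2 and Insert 3 both read the same predecessor link, only one CAS can succeed. The loser sees `false`, searches again and links itself relative to the winner, so both values end up in the list.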

Experiments
- 1-30 threads on platforms with different levels of real concurrency
- Insert vs. DeleteMin operations by each thread. 100 vs initial inserts
- Compare with other implementations:
  - Lotan and Shavit, 2000
  - Hunt et al., “An Efficient Algorithm for Concurrent Priority Queue Heaps”, 1996

Full Concurrency

Medium Pre-emption

High Pre-emption

Lessons Learned
- The non-blocking synchronization paradigm can be suitable and beneficial to large-scale parallel applications.
- Experimental, reproducible work. Many results claimed by simulation are not consistent with what we observed.
- Applications gave us nice problems to look at and do theoretical work on (IPDPS 2003 Algorithmic Best Paper Award).
- NOBLE helped programmers to trust our implementations.

Future Work
- Extend NOBLE for loosely coupled systems.
- Extend the set of data structures supported by NOBLE based on the needs of the applications.
- Reactive synchronisation

Questions?
Contact Information:
Address: Philippas Tsigas, Computing Science, Chalmers University of Technology
cs.chalmers.se
Web:

Pointers
- NOBLE: A Non-Blocking Inter-Process Communication Library. ACM Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers (LCR '02).
- Evaluating The Performance of Non-Blocking Synchronization on Shared Memory Multiprocessors. ACM SIGMETRICS 2001/Performance 2001 Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2001).
- Integrating Non-blocking Synchronization in Parallel Applications: Performance Advantages and Methodologies. ACM Workshop on Software and Performance (WOSP '01).
- A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems. ACM Symposium on Parallel Algorithms and Architectures (SPAA '01).
- Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS '03).
- Fast, Reactive and Lock-free Multi-word Compare-and-swap Algorithms. 12th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '03).
- Scalable and Lock-free Concurrent Dictionaries. Proceedings of the 19th ACM Symposium on Applied Computing (SAC '04).