Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming Philippas Tsigas Chalmers University of Technology Computer Science and Engineering Department.


Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming Philippas Tsigas Chalmers University of Technology Computer Science and Engineering Department © Philippas Tsigas

2 WHY PARALLEL PROGRAMMING IS ESSENTIAL IN DISTRIBUTED SYSTEMS AND NETWORKING © Philippas Tsigas

3 Philippas Tsigas How did we get here? Picture from Pat Gelsinger, Intel Developer Forum, Spring 2004 (Pentium at 90W)

4 Concurrent Software Becomes Essential
(chart: per-core clock speed stays near 3 GHz while core counts double – 1, 2, 4, 8 cores giving 3, 6, 12, 24 GHz of aggregate compute)
1) Scalability becomes an issue for all software.
2) Modern software development relies on the ability to compose libraries into larger programs.
Our work is to help the programmer develop efficient parallel programs and also survive the multicore transition. © Philippas Tsigas

5 DISTRIBUTED APPLICATIONS © Philippas Tsigas

6 Distributed Applications Demand Quite High-Level Data Sharing:
 Commercial computing (media and information processing)
 Control computing (on-board flight-control systems)
© Philippas Tsigas

7 Data Sharing: Gameplay Simulation as an Example
This is the hardest problem…
 10,000s of objects
 Each one contains mutable state
 Each one updated 30 times per second
 Each update touches 5-10 other objects
Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO, Epic Games, POPL 2006 © Philippas Tsigas

8 NETWORKING © Philippas Tsigas

9 40 Multithreaded Packet-Processing Engines
 On chip, there are 40 packet-processing engines running at 1.2 GHz. Each engine works on a packet from birth to death within the Aggregation Services Router.
 Each multithreaded engine handles four threads (each thread handles one packet at a time), so each QuantumFlow Processor chip has the ability to work on 160 packets concurrently.
© Philippas Tsigas

10 DATA SHARING © Philippas Tsigas

11 Data Sharing: Gameplay Simulation as an Example
This is the hardest problem…
 10,000s of objects
 Each one contains mutable state
 Each one updated 30 times per second
 Each update touches 5-10 other objects
Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO, Epic Games, POPL 2006 © Philippas Tsigas

12 Philippas Tsigas Blocking Data Sharing
A typical Counter implementation:

class Counter {
    int next = 0;
    synchronized int getNumber() {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}

Execution trace (diagram): next = 0; Thread1: getNumber() acquires the lock, reads t = 0, sets next = 1, returns result = 0, lock released; Thread2: getNumber() acquires the lock, sets next = 2, returns result = 1.
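The synchronized Counter above can be exercised end to end. This is a minimal runnable sketch, assuming the slide's class unchanged; the demo class name is illustrative, not from the slides:

```java
// A runnable sketch of the slide's synchronized Counter. The lock serializes
// getNumber(), so no increment is ever lost, even under contention.
class SyncCounter {
    private int next = 0;

    synchronized int getNumber() {
        int t = next;
        next = t + 1;
        return t;
    }
}

public class SyncCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        SyncCounter c = new SyncCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.getNumber();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // 4 threads x 10,000 calls each: the next call returns exactly 40000
        System.out.println(c.getNumber()); // prints 40000
    }
}
```

The cost of this guarantee is that every call takes the same lock, which is exactly the serialization the following slides criticize.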

13 Philippas Tsigas Do we need Synchronization?

class Counter {
    int next = 0;
    int getNumber() {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}

What can go wrong here?
Trace (diagram): next = 0; Thread1: getNumber() reads t = 0; Thread2: getNumber() also reads t = 0; next = 1; both return result = 0 – one increment is lost.
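What goes wrong is the lost update in the trace above. The interleaving can be replayed deterministically, with no real threads, by splitting getNumber() into its read and write steps; the class names here are illustrative:

```java
// A deterministic replay of the slide's bad interleaving. getNumber() is
// split into its two steps so both "threads" can read next before either
// writes back -- no real threads or timing luck needed.
class RacyCounter {
    int next = 0;

    int read() { return next; }          // step 1 of getNumber(): t = next
    void write(int t) { next = t + 1; }  // step 2: next = t + 1
}

public class LostUpdateDemo {
    public static void main(String[] args) {
        RacyCounter c = new RacyCounter();
        int t1 = c.read();   // Thread1 reads t = 0
        int t2 = c.read();   // Thread2 also reads t = 0
        c.write(t1);         // Thread1 writes next = 1
        c.write(t2);         // Thread2 writes next = 1, overwriting the update
        // Both callers got 0 and one increment was lost:
        System.out.println(t1 + " " + t2 + " " + c.next); // prints 0 0 1
    }
}
```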

14 Blocking Synchronization = Sequential Behavior © Philippas Tsigas

15 Blocking Synchronization -> Priority Inversion
 A high-priority task is delayed because a low-priority task holds a shared resource, and the low-priority task is in turn delayed by a medium-priority task executing.
 Solutions: priority inheritance protocols
 These work OK for single processors, but for multiple processors …
(diagram: timelines for Task H, Task M, Task L)
© Philippas Tsigas

16 Critical Sections + Multiprocessors
 Reduced parallelism: several tasks with overlapping critical sections will cause waiting processors to go idle.
(diagram: timelines for Tasks 1-4)
© Philippas Tsigas

17  The BIGGEST Problem with Locks? Blocking Locks are not composable. All code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides. © Philippas Tsigas

18 Interprocess Synchronization = Data Sharing
 Synchronization is required for concurrency
 Mutual exclusion (semaphores, mutexes, spin-locks, disabling interrupts) protects critical sections
 - Locks limit concurrency
 - Busy waiting – repeated checks to see if the lock has been released or not
 - Convoying – processes stack up before locks
 - Blocking
 - Locks are not composable: all code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides.
 A better approach is … not to lock

19 Philippas Tsigas A Lock-free Implementation
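The implementation shown on this slide is not captured in the transcript. As a minimal sketch of the lock-free idea, assuming nothing beyond the earlier Counter example, getNumber() can retry a compare-and-swap instead of taking a lock; the class names are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// A minimal lock-free counter sketch: instead of a lock, getNumber() retries
// a compare-and-swap until it wins. No thread ever blocks; if a CAS fails,
// some other thread's CAS must have succeeded, so the system makes progress.
class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        while (true) {
            int t = next.get();                  // read current value
            if (next.compareAndSet(t, t + 1)) {  // try to install t + 1
                return t;                        // CAS won: t is ours alone
            }
            // CAS lost: another thread incremented first; retry
        }
    }
}

public class CasCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.getNumber();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.getNumber()); // no lost updates: prints 40000
    }
}
```

Unlike the synchronized version, a thread that is preempted inside getNumber() cannot delay any other thread.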

How did it start?
 "Synchronization is an enforcing mechanism used to impose constraints on the order of execution of threads. … Synchronization is used to coordinate thread execution and manage shared data."
Does it have to be like that? When we share data, do we have to impose constraints on the execution of threads?

21 HOW SAFE IS IT: LET US START FROM THE BEGINNING © Philippas Tsigas

22 Shared Abstract Data Types
 Object in memory
 - Supports some set of operations (ADT)
 - Concurrent access by many processes/threads
 - Useful to e.g.
  Exchange data between threads
  Coordinate thread activities
(diagram: processes P1-P4 invoking operations Op A and Op B on a shared object)

23 Executing Operations (diagram: processes P1, P2, P3; each operation spans an invocation and a response) Borrowed from H. Attiya

24 Interleaving Operations Concurrent execution

25 Interleaving Operations (External) behavior

26 Interleaving Operations, or Not Sequential execution

27 Interleaving Operations, or Not
Sequential behavior: invocations & responses alternate and match (on process & object)
Sequential specification: all the legal sequential behaviors, satisfying the semantics of the ADT
 - E.g., for a (LIFO) stack: pop returns the last item pushed

28 Correctness: Sequential consistency [Lamport, 1979]
 For every concurrent execution there is a sequential execution that
 - Contains the same operations
 - Is legal (obeys the sequential specification)
 - Preserves the order of operations by the same process

29 Sequential Consistency: Examples
Concurrent (LIFO) stack (diagram): push(4), pop():4, push(7)
A valid sequential order: push(4), pop():4, push(7)
Last In First Out

30 Sequential Consistency: Examples
Concurrent (LIFO) stack (diagram): push(4), pop():7, push(7)
Last In First Out

31 Safety: Linearizability
 Linearizable ADTs
 - Sequential specification defines legal sequential executions
 - Concurrent operations allowed to be interleaved
 - Operations appear to execute atomically
 External observer gets the illusion that each operation takes effect instantaneously at some point between its invocation and its response (preserves the order of all operations)
(diagram: concurrent LIFO stack, threads T1 and T2, timeline with push(4), push(7), pop():4 and their linearization points) Last In First Out

32 Safety II An accessible node is never freed.

33 Liveness
Non-blocking implementations:
 - Wait-free implementation of an ADT [Lamport, 1977]
  Every operation finishes in a finite number of its own steps.
 - Lock-free (≠ FREE of LOCKS) implementation [Lamport, 1977]
  At least one operation (from a set of concurrent operations) finishes in a finite number of steps (the data structure as a system always makes progress).
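These progress properties can be made concrete with the classic Treiber stack, a minimal lock-free LIFO sketch (not one of the specific implementations from this deck): each operation retries a single CAS on the top pointer, so whenever a CAS fails, some other thread's operation has succeeded.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the classic lock-free (Treiber) stack. push and pop each retry
// one CAS on the top pointer: a failed CAS means another thread's CAS won,
// so the structure as a whole always makes progress (lock-free, though an
// individual operation may retry forever, i.e. it is not wait-free).
// Memory reclamation -- the "an accessible node is never freed" safety
// property -- is left to the garbage collector here; a C/C++ version would
// need hazard pointers or similar and must also guard against ABA.
class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T value) {
        Node<T> oldTop;
        Node<T> newTop;
        do {
            oldTop = top.get();
            newTop = new Node<>(value, oldTop);
        } while (!top.compareAndSet(oldTop, newTop)); // retry until CAS wins
    }

    T pop() {
        Node<T> oldTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null;          // empty stack
        } while (!top.compareAndSet(oldTop, oldTop.next));
        return oldTop.value;
    }
}
```

A quick LIFO check: after push(4) then push(7), pop() returns 7 and then 4, matching the sequential specification of a stack from the earlier slides.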

34 Liveness II  Every garbage node is eventually collected.

35 Abstract Data Types (ADT)
 Cover most concurrent applications
 - At least encapsulate their data needs
 An object-oriented programming point of view
 - Abstract representation of data & set of methods (operations) for accessing it
 - Signature
 - Specification

36 Implementing a High-Level ADT Using lower-level ADTs & procedures (diagram)

37 Lower-Level Operations
 High-level operations translate into primitives on base objects that are available in H/W
 Obvious: read, write
 Common: compare&swap (CAS), LL/SC, fetch&add (FAA)

38 CAN I FIND A JOB IF I STUDY THIS? © Philippas Tsigas

 8 Feb 2002: Release of NOBLE version 1.0
 23 Jan 2002: Expert Group Formation (JSR: Java Concurrency Utilities)
 8 Jan 2004: JSR first release
 29 Aug 2006: Intel's TBB release 1.0

40 ERLANG
OTP_R15A: R15 pre-release – written by Kenneth, 23 Nov 2011
We have recently pushed a new master to GitHub tagged OTP_R15A. This is a stabilized snapshot of the current R15 development (to be released as R15B on December 14th) which, among other things, includes:
 OTP-9468 'Line numbers in exceptions'
 OTP-9451 'Parallel make'
 OTP-4779 A new GUI for Observer, integrating pman, etop and tv into observer with tracing facilities.
 OTP-7775 A number of memory allocation optimizations have been implemented. Most optimizations reduce contention caused by synchronization between threads during allocation and deallocation of memory. Most notably:
 - Synchronization of memory management in scheduler-specific allocator instances has been rewritten to use lock-free synchronization.
 - Synchronization of memory management in scheduler-specific pre-allocators has been rewritten to use lock-free synchronization.
 - The 'mseg_alloc' memory segment allocator now uses scheduler-specific instances instead of one instance. Apart from reducing contention, this also ensures that memory allocators always create memory segments on the local NUMA node on a NUMA system.
 OTP-9632 An ERTS-internal, generic, many-to-one, lock-free queue for communication between threads has been introduced. The many-to-one scenario is very common in ERTS, so it can be used in a lot of places in the future. Currently it is used by scheduling of certain jobs and the async thread pool, but more uses are planned.
 - Drivers using the driver_async functionality are not automatically locked to the system anymore, and can be unloaded like any dynamically linked-in driver.
 - Scheduling of ready async jobs is now also interleaved in between other jobs. Previously all ready async jobs were performed at once.
 OTP-9631 The ERTS-internal system block functionality has been replaced by new functionality for blocking the system. The old system block functionality had contention and complexity issues. The new functionality piggy-backs on the thread-progress tracking needed by the newly introduced lock-free synchronization in the runtime system. When the functionality for blocking the system isn't used, there is more or less no overhead at all, since the functionality for tracking thread progress is there and needed anyway.
… and much, much more.
This is not a full release of R15 but rather a pre-release. Feel free to try our R15A release and get back to us with your findings. Your feedback is important to us and highly welcomed.
Regards,
The OTP Team
© Philippas Tsigas

41 © Philippas Tsigas

42 © Philippas Tsigas

Locks are not supported
 Not in CUDA, not in OpenCL
 Fairness of the hardware scheduler is unknown
 - A thread block holding a lock might be swapped out indefinitely, for example

No Fairness Guarantees

…
while (atomicCAS(&lock, 0, 1) != 0);  // spin until the lock is acquired
ctr++;                                // critical section
lock = 0;                             // release
…

Thread holding the lock is never scheduled!

Where do we stand?