Contention in shared memory multiprocessors. Multiprocessor synchronization algorithms (20225241). Lecturer: Danny Hendler.


Contention in shared memory multiprocessors (20225241). Lecturer: Danny Hendler. Outline: definitions; lower bound for consensus; lower bounds for counters, stacks and queues.

Contention in shared-memory systems. Contention: the extent to which processes access the same memory locations simultaneously. When multiple processes simultaneously write to the same memory location, all but one of them stall. High contention hurts performance!

Memory Stalls & Write-Contention (model: Dwork, Herlihy, Waarts, 1993). [Figure: processes p0, p1, p2, …, pj are all about to write to the same variable; the last of them, pj, incurs j stalls.] Write-contention is the maximum number of processes that can be enabled to perform a write or read-modify-write operation to the same memory location simultaneously. What is the write-contention of the combining-tree counter?

Recall the consensus implementation we saw. We use a single object, C, initially null, that supports the compare&swap and read operations.

Decide(v)   ; code for p_i, i=0,1
1. CAS(C, null, v)
2. return C

What is the write-contention of this algorithm? It is n: all processes may simultaneously be enabled to apply CAS to C. It can be shown that this is the write-contention of any wait-free consensus algorithm.
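The slide's pseudocode maps directly onto a compare&swap primitive. Below is a minimal Python sketch; since Python exposes no hardware CAS, the `CAS` class simulates one with a lock, and the names `CAS` and `decide` are illustrative rather than from the source:

```python
import threading

class CAS:
    """Simulated compare&swap object (real hardware does this in one atomic instruction)."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def cas(self, expected, new):
        """Atomically: if the current value equals `expected`, install `new`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        return self._value

def decide(c, v):
    """Decide(v): try to install our proposal, then return whatever won."""
    c.cas(None, v)   # only the first CAS on C succeeds
    return c.read()  # every caller returns the winner's value

c = CAS()            # initially C = null (None)
print(decide(c, 7))  # first proposal wins: prints 7
print(decide(c, 42)) # later proposals lose: prints 7
```

Note that every process performs its CAS on the same object C, which is exactly why the write-contention of this algorithm is n.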

What can we say about the worst-case contention-aware time complexity of objects such as counters, stacks and queues?

Naïve Counter Implementation: a single fetch&increment (FAI) object shared by all processes. The last processes to succeed incur Θ(n) time complexity! Can we do much better?
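For concreteness, here is what the naïve counter looks like. This is a sketch: the single fetch&increment word is simulated with a lock, so the lock queue plays the role of the memory stalls, and all names are illustrative:

```python
import threading

class NaiveCounter:
    """One shared fetch&increment word; every process contends on this single location."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # models serialization (stalls) at one memory word

    def fetch_and_increment(self):
        with self._lock:               # up to n-1 other processes may be queued here
            v = self._value
            self._value += 1
            return v

counter = NaiveCounter()
results = []
threads = [threading.Thread(target=lambda: results.append(counter.fetch_and_increment()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# each of the values 0..7 is handed out exactly once, but the last thread
# to reach the shared word waited behind all the others
```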

We will see a time lower bound of √n on lock-free implementations of counters, stacks, queues, … (Hendler, Shavit, 2003). Any algorithm either (a) suffers high contention or (b) suffers high latency (step complexity).

The Memory-Steps Metric. #read-objects: the number of distinct base objects read by an operation. Memory stalls: the total number of memory stalls incurred by an operation. memory-steps = #read-objects + memory-stalls. We investigate the worst-case number of memory-steps incurred by a single high-level operation.
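As a sanity check of the metric on the naïve counter (the numbers below are illustrative, assuming n = 64 processes):

```python
def memory_steps(read_objects, memory_stalls):
    # memory-steps = #read-objects + memory-stalls
    return read_objects + memory_stalls

n = 64
# Naive FAI counter: an operation touches one shared word and, in the
# worst case, waits behind the other n-1 processes at that word.
naive = memory_steps(1, n - 1)
print(naive)  # 64, i.e. Theta(n)
```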

Capturing influence between processes. Time complexity is determined by the extent to which operations by different processes influence each other.

Influence level (w.r.t. p). [Figure: a shared FAI counter holding 17; process p thinks "Hmmm… I will soon request a value"; the other processes say "Each of us may precede you and modify the value you will get!"] The influence level w.r.t. p is the number of processes that may change the response of p's pending operation.

Modifying steps. [Figure: the same shared counter, now holding 18; a specific process q among the influencers applies its FAI before p.] There is an atomic step in which q modifies p's solo-execution response: a modifying step. We bring all the 'influencers' to be on the verge of performing a modifying step.

Space/Write-contention tradeoff. We bring all influencers to be on the verge of a modifying step. Each modifying step is necessarily a write/RMW operation. Hence S ≥ I/C, where S is the space complexity, I the influence level, and C the write-contention.

Latency/Contention tradeoff. [Figure: the base objects on which there are outstanding modifying steps; p, about to request a value from the shared counter, thinks "I will soon request a value".] Process p can be made to read all these variables in the course of its operation. Hence L_R ≥ I/C, where L_R is the number of base objects read, I the influence level, and C the write-contention.

Time lower bound: L_R · C ≥ I, hence time complexity is at least √I.
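Why the square root: combining the latency/contention tradeoff with the cost of the stalls themselves gives the bound by a geometric-mean argument. A sketch, using L_R (number of base objects read), C (write-contention) and I (influence level) as defined on the previous slides:

```latex
L_R \cdot C \;\ge\; I
\quad\Longrightarrow\quad
\max(L_R,\, C) \;\ge\; \sqrt{L_R \cdot C} \;\ge\; \sqrt{I}.
```

An operation pays for both its L_R reads and for the stalls it may incur (up to roughly C at a single word), so its worst-case memory-steps complexity is at least on the order of max(L_R, C) ≥ √I. For objects such as counters, every other process can influence p's response, so I can be as large as n-1, which yields the √n bound stated earlier.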

The Influence(n) objects class. The above lower bound holds for Influence(n), a large class of objects that includes stacks, queues, hash-tables, pools, linearizable counters, consensus, approximate-agreement, and more. It holds also for one-time implementations of these objects. Finding the tight bound is a challenging open question.

A linear lower bound on the number of stalls for long-lived objects (Fich, Hendler, Shavit, 2005). The metric here is slightly different: we count the total number of stalls an operation incurs while accessing multiple objects.

Theorem: Consider any n-process implementation of an obstruction-free counter. Then the worst-case number of stalls incurred by a process as it performs a fetch&increment operation is at least n-1. (An implementation is obstruction-free if every process is guaranteed to terminate its operation if it runs solo for long enough.)
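To make "obstruction-free" concrete: the standard CAS-retry counter is obstruction-free (in fact lock-free). A process running solo succeeds on its first CAS, so a solo run always terminates. A Python sketch, again simulating the atomic word with a lock (names illustrative):

```python
import threading

class CasWord:
    """Simulated memory word supporting read and compare&swap."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def fetch_and_increment(word):
    # Retry loop: an interfering writer can make the CAS fail and force a
    # retry, but a process running solo succeeds on its first attempt --
    # exactly the obstruction-freedom guarantee.
    while True:
        v = word.read()
        if word.cas(v, v + 1):
            return v

w = CasWord()
print(fetch_and_increment(w))  # 0
print(fetch_and_increment(w))  # 1
```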

Worst-case stalls number ≥ n-1, for any obstruction-free counter implementation. Start from an initial state. Fix a process p about to perform a fetch&increment operation. Consider the path p takes if it runs uninterrupted, where only first accesses to shared words are considered. [Figure: p's solo path visits shared words 1, 2, 3, 4 in order.]

Worst-case stalls number ≥ n-1. Let O1 be the first word along p's path that is written by some other process in some p-free execution. There must be such a word (otherwise no other process could influence p's response).

Worst-case stalls number ≥ n-1. Let E1 be an execution that maximizes the number of processes that are about to write to O1, over all p-free executions. Denote this number by K1 and this set of processes by G1.

Worst-case stalls number ≥ n-1. If K1 = n-1 then we are done. Otherwise, we show that p must access yet another word that may be written by other processes.

Worst-case stalls number ≥ n-1. What happens if p incurs the stalls on O1? Each of the K1 processes in G1 applies its write to O1 while p waits. But now the rest of p's path may change. Let p then continue, and assume p gets value v. [Figure: after the stalls on O1, p's path through the remaining shared words is re-routed.]

Worst-case stalls number ≥ n-1. Let v be the value returned by p if we let it run and incur the stalls, and let c be the number of fetch&increment operations completed before p starts its operation. We have v ∈ {c, …, c+K1}.

Worst-case stalls number ≥ n-1. We select some process q ∉ G1 ∪ {p} and let q perform K1+1 fetch&increment operations. Then q must write to a word read by p after O1: otherwise p, running after the stalls, would still return a value in {c, …, c+K1}, even though more than c+K1 fetch&increment operations have already completed, violating counter semantics.

Worst-case stalls number ≥ n-1. Let O2 be the first word accessed by p after it incurs the K1 stalls that is written by some process ∉ G1 ∪ {p}. Let E2 be an execution that maximizes the number of processes that are about to write to O2, over all (G1 ∪ {p})-free executions.

Worst-case stalls number ≥ n-1. Continuing with this construction, we obtain words O1, O2, …, Om and disjoint process sets G1, G2, …, Gm with |G1| = K1, |G2| = K2, …, |Gm| = Km and K1 + K2 + … + Km = n-1. Process p incurs Ki stalls on each word Oi, for a total of n-1 stalls.

Conclusion: the "naïve" implementation, a single FAI object, is best possible (in terms of worst-case number of stalls)!