Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for.

Slides:



Advertisements
Similar presentations
Operating Systems Semaphores II
Advertisements

Scalable Flat-Combining Based Synchronous Queues Danny Hendler, Itai Incze, Nir Shavit and Moran Tzafrir Presentation by Uri Golani.
1 Synchronization A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Types of Synchronization.
Operating Systems Part III: Process Management (Process Synchronization)
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.
Mutual Exclusion By Shiran Mizrahi. Critical Section class Counter { private int value = 1; //counter starts at one public Counter(int c) { //constructor.
Chapter 6: Process Synchronization
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 6: Process Synchronization.
Multiprocessor Synchronization Algorithms ( ) Lecturer: Danny Hendler The Mutual Exclusion problem.
Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.
Monitors & Blocking Synchronization 1. Producers & Consumers Problem Two threads that communicate through a shared FIFO queue. These two threads can’t.
Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
Advanced Topics in Algorithms and Data Structures 1 Lecture 4 : Accelerated Cascading and Parallel List Ranking We will first discuss a technique called.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
04 Trees Part I CS 310 photo ©Oregon Scenics used with permissionOregon Scenics All figures labeled with “Figure X.Y” Copyright © 2006 Pearson Addison-Wesley.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
1 Testing Concurrent Programs Why Test?  Eliminate bugs?  Software Engineering vs Computer Science perspectives What properties are we testing for? 
Synchronization (Barriers) Parallel Processing (CS453)
Art of Multiprocessor Programming 1 Programming Paradigms for Concurrency Pavol Černý, Vasu Singh, Thomas Wies.
1 Joe Meehean.  Problem arrange comparable items in list into sorted order  Most sorting algorithms involve comparing item values  We assume items.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
CSC 211 Data Structures Lecture 13
Data Structures Trees Phil Tayco Slide version 1.0 Apr. 23, 2015.
Concurrency: Mutual Exclusion and Synchronization Chapter 5.
Jeremy Denham April 7,  Motivation  Background / Previous work  Experimentation  Results  Questions.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
Lecture 3: Uninformed Search
ICS 313: Programming Language Theory Chapter 13: Concurrency.
Priority Queues and Heaps. October 2004John Edgar2  A queue should implement at least the first two of these operations:  insert – insert item at the.
Week 8 - Wednesday.  What did we talk about last time?  Level order traversal  BST delete  2-3 trees.
Self Stabilizing Smoothing and Counting Maurice Herlihy, Brown University Srikanta Tirthapura, Iowa State University.
CIS 842: Specification and Verification of Reactive Systems Lecture INTRO-Examples: Simple BIR-Lite Examples Copyright 2004, Matt Dwyer, John Hatcliff,
Monitors and Blocking Synchronization Dalia Cohn Alperovich Based on “The Art of Multiprocessor Programming” by Herlihy & Shavit, chapter 8.
Futures, Scheduling, and Work Distribution Speaker: Eliran Shmila Based on chapter 16 from the book “The art of multiprocessor programming” by Maurice.
Priority Queues Dan Dvorin Based on ‘The Art of Multiprocessor Programming’, by Herlihy & Shavit, chapter 15.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
CS 367 Introduction to Data Structures Lecture 8.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Counting and Distributed Coordination BASED ON CHAPTER 12 IN «THE ART OF MULTIPROCESSOR PROGRAMMING» LECTURE BY: SAMUEL AMAR.
15.082J & 6.855J & ESD.78J September 30, 2010 The Label Correcting Algorithm.
CMPSC 16 Problem Solving with Computers I Spring 2014 Instructor: Tevfik Bultan Lecture 4: Introduction to C: Control Flow.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
CS4315A. Berrached:CMS:UHD1 Process Synchronization Chapter 8.
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Concurrency and Performance Based on slides by Henri Casanova.
Agenda  Quick Review  Finish Introduction  Java Threads.
(c) University of Washington20c-1 CSC 143 Binary Search Trees.
Concurrent Programming in Java Based on Notes by J. Johns (based on Java in a Nutshell, Learning Java) Also Java Tutorial, Concurrent Programming in Java.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
Background on the need for Synchronization
Shared Counters and Parallelism
Shared Counters and Parallelism
Ivy Eva Wu.
Algorithm Analysis CSE 2011 Winter September 2018.
Counting, Sorting, and Distributed Coordination
Algorithm design and Analysis
Multiprocessor Synchronization Algorithms ( )
Barrier Synchronization
Barrier Synchronization
Shared Counters and Parallelism
Pools, Counters, and Parallelism
CSC 143 Binary Search Trees.
CSE 153 Design of Operating Systems Winter 19
Shared Counters and Parallelism
Shared Counters and Parallelism
Presentation transcript:

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for CIS640 University of Pennsylvania

Art of Multiprocessor Programming 2 A Shared Pool Put –Inserts object –blocks if full Remove –Removes & returns an object –blocks if empty public interface Pool { public void put(Object x); public Object remove(); } Unordered set of objects

Art of Multiprocessor Programming 3 put Simple Locking Implementation put

Art of Multiprocessor Programming 4 put Simple Locking Implementation put Problem: hot- spot contention

Art of Multiprocessor Programming 5 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put

Art of Multiprocessor Programming 6 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put Solution: Queue Lock

Art of Multiprocessor Programming 7 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put Solution: Queue Lock Solution ???

Art of Multiprocessor Programming 8 Counting Implementation remove put

Art of Multiprocessor Programming 9 Counting Implementation Only the counters are sequential remove put

Art of Multiprocessor Programming 10 Shared Counter

Art of Multiprocessor Programming 11 Shared Counter No duplication

Art of Multiprocessor Programming 12 Shared Counter No duplication No Omission

Art of Multiprocessor Programming 13 Shared Counter Not necessarily linearizable No duplication No Omission

Art of Multiprocessor Programming 14 Shared Counters Can we build a shared counter with –Low memory contention, and –Real parallelism? Locking –Can use queue locks to reduce contention –No help with parallelism issue …

Art of Multiprocessor Programming 15 Software Combining Tree 4 Contention: All spinning local Parallelism: Potential n/log n speedup

Art of Multiprocessor Programming 16 Combining Trees 0

Art of Multiprocessor Programming 17 Combining Trees 0 +3

Art of Multiprocessor Programming 18 Combining Trees

Art of Multiprocessor Programming 19 Combining Trees Two threads meet, combine sums

Art of Multiprocessor Programming 20 Combining Trees Two threads meet, combine sums +5

Art of Multiprocessor Programming 21 Combining Trees Combined sum added to root

Art of Multiprocessor Programming 22 Combining Trees Result returned to children

Art of Multiprocessor Programming 23 Combining Trees Results returned to threads

Art of Multiprocessor Programming 24 Devil in the Details What if –threads don’t arrive at the same time? Wait for a partner to show up? –How long to wait? –Waiting times add up … Instead –Use multi-phase algorithm –Try to wait in parallel …

Art of Multiprocessor Programming 25 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT};

Art of Multiprocessor Programming 26 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; Nothing going on

Art of Multiprocessor Programming 27 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 1 st thread partner for combining, will return soon to check for 2 nd thread

Art of Multiprocessor Programming 28 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 2 nd thread arrived with value for combining

Art of Multiprocessor Programming 29 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 1 st thread has completed operation & deposited result for 2 nd thread

Art of Multiprocessor Programming 30 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; Special case: root node

Art of Multiprocessor Programming 31 Node Synchronization Short-term –Synchronized methods –Consistency during method call Long-term –Boolean locked field –Consistency across calls

Art of Multiprocessor Programming 32 Phases Precombining –Set up combining rendez-vous Combining –Collect and combine operations Operation –Hand off to higher thread Distribution –Distribute results to waiting threads

Art of Multiprocessor Programming 33 Precombining Phase 0 Examine status IDLE

Art of Multiprocessor Programming 34 Precombining Phase 0 0 If IDLE, promise to return to look for partner FIRST

Art of Multiprocessor Programming 35 Precombining Phase 0 At ROOT, turn back FIRST

Art of Multiprocessor Programming 36 Precombining Phase 0 FIRST

Art of Multiprocessor Programming 37 Precombining Phase 0 0 SECOND If FIRST, I’m willing to combine, but lock for now

Art of Multiprocessor Programming 38 Code Tree class –In charge of navigation Node class –Combining state –Synchronization state –Bookkeeping

Art of Multiprocessor Programming 39 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node;

Art of Multiprocessor Programming 40 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Start at leaf

Art of Multiprocessor Programming 41 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Move up while instructed to do so

Art of Multiprocessor Programming 42 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Remember where we stopped

Art of Multiprocessor Programming 43 Precombining Node synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() }

Art of Multiprocessor Programming 44 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Short-term synchronization

Art of Multiprocessor Programming 45 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Synchronization Wait while node is locked

Art of Multiprocessor Programming 46 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Check combining status

Art of Multiprocessor Programming 47 Node was IDLE synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } I will return to look for combining value

Art of Multiprocessor Programming 48 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Continue up the tree

Art of Multiprocessor Programming 49 I’m the 2 nd Thread synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If 1 st thread has promised to return, lock node so it won’t leave without me

Art of Multiprocessor Programming 50 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Prepare to deposit 2 nd value

Art of Multiprocessor Programming 51 Precombining Node synchronized boolean phase1() { while (sStatus==SStatus.BUSY) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } End of phase 1, don’t continue up tree

Art of Multiprocessor Programming 52 Node is the Root synchronized boolean phase1() { while (sStatus==SStatus.BUSY) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If root, phase 1 ends, don’t continue up tree

Art of Multiprocessor Programming 53 Precombining Node synchronized boolean phase1() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Always check for unexpected values!

Art of Multiprocessor Programming 54 Combining Phase 0 0 SECOND 1 st thread locked out until 2 nd provides value +3

Art of Multiprocessor Programming 55 Combining Phase 0 0 SECOND 2 nd thread deposits value to be combined, unlocks node, & waits … 2 +3 zzz

Art of Multiprocessor Programming 56 Combining Phase SECOND st thread moves up the tree with combined value … zzz

Art of Multiprocessor Programming 57 Combining (reloaded) nd thread has not yet deposited value … FIRST

Art of Multiprocessor Programming 58 Combining (reloaded) 0 +3 FIRST 1 st thread is alone, locks out late partner

Art of Multiprocessor Programming 59 Combining (reloaded) 0 +3 FIRST Stop at root

Art of Multiprocessor Programming 60 Combining (reloaded) 0 +3 FIRST 2 nd thread’s phase 1 visit locked out

Art of Multiprocessor Programming 61 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; }

Art of Multiprocessor Programming 62 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Start at leaf

Art of Multiprocessor Programming 63 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Add 1

Art of Multiprocessor Programming 64 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Revisit nodes visited in phase 1

Art of Multiprocessor Programming 65 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Accumulate combined values, if any

Art of Multiprocessor Programming 66 node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Combining Navigation We will retraverse path in reverse order …

Art of Multiprocessor Programming 67 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Move up the tree

Art of Multiprocessor Programming 68 Combining Phase Node synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … }

Art of Multiprocessor Programming 69 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Wait until node is unlocked

Art of Multiprocessor Programming 70 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Lock out late attempts to combine

Art of Multiprocessor Programming 71 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Remember our contribution

Art of Multiprocessor Programming 72 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Check status

Art of Multiprocessor Programming 73 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node 1 st thread is alone

Art of Multiprocessor Programming 74 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Node Combine with 2 nd thread

Art of Multiprocessor Programming 75 Operation Phase Add combined value to root, start back down (phase 4) zzz

Art of Multiprocessor Programming 76 Operation Phase (reloaded) 5 Leave value to be combined … SECOND 2

Art of Multiprocessor Programming 77 Operation Phase (reloaded) 5 +2 Unlock, and wait … SECOND 2 zzz

Art of Multiprocessor Programming 78 Operation Phase Navigation prior = stop.op(combined);

Art of Multiprocessor Programming 79 Operation Phase Navigation prior = stop.op(combined); Get result of combining

Art of Multiprocessor Programming 80 Operation Phase Node synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: …

Art of Multiprocessor Programming 81 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … At Root Add sum to root, return prior value

Art of Multiprocessor Programming 82 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Deposit value for later combining …

Art of Multiprocessor Programming 83 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node, notify 1 st thread

Art of Multiprocessor Programming 84 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Wait for 1 st thread to deliver results

Art of Multiprocessor Programming 85 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node & return

Art of Multiprocessor Programming 86 Distribution Phase 5 0 zzz Move down with result SECOND

Art of Multiprocessor Programming 87 Distribution Phase 5 zzz Leave result for 2 nd thread & lock node SECOND 2

Art of Multiprocessor Programming 88 Distribution Phase 5 0 zzz Move result back down tree SECOND 2

Art of Multiprocessor Programming 89 Distribution Phase 5 2 nd thread awakens, unlocks, takes value IDLE 3

Art of Multiprocessor Programming 90 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior;

Art of Multiprocessor Programming 91 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Traverse path in reverse order

Art of Multiprocessor Programming 92 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Distribute results to waiting 2 nd threads

Art of Multiprocessor Programming 93 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Return result to caller

Art of Multiprocessor Programming 94 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: …

Art of Multiprocessor Programming 95 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: … No combining, unlock node & reset

Art of Multiprocessor Programming 96 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: … Notify 2 nd thread that result is available

Art of Multiprocessor Programming 97 Bad News: High Latency Log n

Art of Multiprocessor Programming 98 Good News: Real Parallelism threads 1 thread

Art of Multiprocessor Programming 99 Throughput Puzzles Ideal circumstances –All n threads move together, combine –n increments in O(log n) time Worst circumstances –All n threads slightly skewed, locked out –n increments in O(n · log n) time

Art of Multiprocessor Programming 100 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }}

Art of Multiprocessor Programming 101 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} How many iterations

Art of Multiprocessor Programming 102 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Expected time between incrementing counter

Art of Multiprocessor Programming 103 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Take a number

Art of Multiprocessor Programming 104 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Pretend to work (more work, less concurrency)

Art of Multiprocessor Programming 105 Performance Benchmarks Alewife –NUMA architecture –Simulated Throughput: –average number of inc operations in 1 million cycle period. Latency: –average number of simulator cycles per inc operation.

Art of Multiprocessor Programming 106 Performance Latency: Throughput: Splock Ctree[n] num of processors cycles per operation Splock Ctree[n] operations per million cycles num of processors work = 0

Art of Multiprocessor Programming 107 Performance Latency: Throughput: Splock Ctree[n] num of processors cycles per operation Splock Ctree[n] operations per million cycles num of processors work = 0

Art of Multiprocessor Programming 108 The Combining Paradigm Implements any RMW operation When tree is loaded –Takes 2 log n steps –for n requests Very sensitive to load fluctuations: –if the arrival rates drop –the combining rates drop –overall performance deteriorates!

Art of Multiprocessor Programming 109 Conclusions Combining Trees –Work well under high contention –Sensitive to load fluctuations –Can be used for getAndMumble() ops Next lecture –Counting networks –A different approach …

640 Class Project Can be done in groups of 2 Presentation during week of 4/27 Written report (5-10 pages) Topics –Parallelization –Verification –Some relevant topic of interest to you

Parallelization How to efficiently solve a problem in parallel –Lonestar Benchmark (e.g. Delauney triangulation, N-body simulation) –Diffusion Implement a solution –Evaluate speed-up –Try optimizations using fine-grained synchronization

Verification Debugging parallel programs is hard –Use MSR’s Chess tool for state-space exporation of C/C++/C# programs –Produce “correct” implementations of some of the concurrent data structures we studied in C# Formal verification of concurrent data structures –Can we prove correctness? –In some theorem prover, e.g., Coq

A Balancer Input wires Output wires

Tokens Traverse Balancers Token i enters on any wire leaves on wire i mod (fan-out)

Tokens Traverse Balancers

Arbitrary input distribution Balanced output distribution

Smoothing Network k-smooth property

Counting Network step property

Counting Networks Count! 0, 4, , 5, , 6, ,

Bitonic[4]

Counting Networks Good for counting number of tokens low contention no sequential bottleneck high throughput practical networks depth

Bitonic[k] is not Linearizable

2

2 0

2 0 Problem is: Red finished before Yellow started Red took 2 Yellow took 0

Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; }

Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } state

Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } Output connections to balancers

Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } Get-and-complement

Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; }

Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Stop when we get to the end

Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Flip state

Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Exit on wire

Bitonic[2k] Schematic Bitonic[k] Merger[2k]

Bitonic[2k] Layout

Unfolded Bitonic Network

Merger[8]

Unfolded Bitonic Network

Merger[4]

Unfolded Bitonic Network

Merger[2]

Bitonic[k] Depth Width k Depth is (log 2 k)(log 2 k + 1)/2

Merger[2k]

Merger[2k] Schematic Merger[k]

Merger[2k] Layout

Lemma If a sequence has the step property …

Lemma So does its even subsequence

Lemma And its odd subsequence

Merger[2k] Schematic Merger[k] Bitonic[k] even odd even

Proof Outline Outputs from Bitonic[k] Inputs to Merger[k] even odd even

Proof Outline Inputs to Merger[k] even odd even Outputs of Merger[k]

Proof Outline Outputs of Merger[k]Outputs of last layer

Periodic Network Block

Block[2k] Schematic Block[k]

Block[2k] Layout

Periodic[8]

Network Depth Each block[k] has depth log 2 k Need log 2 k blocks Grand total of (log 2 k) 2

Lower Bound on Depth Theorem: The depth of any width w counting network is at least Ω(log w). Theorem: there exists a counting network of θ(log w) depth. Unfortunately, proof is non-constructive and constants in the 1000s.

Sequential Theorem If a balancing network counts –Sequentially, meaning that –Tokens traverse one at a time Then it counts –Even if tokens traverse concurrently

Red First, Blue Second (2)

Blue First, Red Second (2)

Either Way Same balancer states

Order Doesn’t Matter Same balancer states Same output distribution

Index Distribution Benchmark void indexBench(int iters, int work) { while (int i = 0 < iters) { i = fetch&inc(); Thread.sleep(random() % work); }

Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput Higher is better!

Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Higher is better!

Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Combining and counting are pretty close

Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network But they beat the hell out of the competition!

Throughput vs. Size Bitonic[16] Bitonic[4] Bitonic[8] Number processors Throughput

Shared Pool Put counting network Remove counting network

Put/Remove Network Guarantees never: –Put waiting for item, while –Get has deposited item Otherwise OK to wait –Put delayed while pool slot is full –Get delayed while pool slot is empty

What About Decrements Adding arbitrary values Other operations –Multiplication –Vector addition..

Anti-Tokens

Tokens & Anti-Tokens Cancel

As if nothing happened

Tokens vs Antitokens Tokens –read balancer –flip –proceed Antitokens –flip balancer –read –proceed

Pumping Lemma Keep pumping tokens through one wire Eventually, after Ω tokens, network repeats a state

Anti-Token Effect token anti-token

Observation Each anti-token on wire i –Has same effect as Ω-1 tokens on wire i –So network still in legal state Moreover, network width w divides Ω –So Ω-1 tokens

Before Antitoken

Balancer states as if … Ω-1 Ω-1 is one brick shy of a load

Post Antitoken Next token shows up here

Implication Counting networks with –Tokens (+1) –Anti-tokens (-1) Give –Highly concurrent –Low contention getAndIncrement + getAndDecrement methods QED