Barrier Synchronization

Slides:

Advertisements

Similar presentations

Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.

Advertisements

Relaxed Consistency Models. Outline Lazy Release Consistency TreadMarks DSM system.

Mutual Exclusion By Shiran Mizrahi. Critical Section class Counter { private int value = 1; //counter starts at one public Counter(int c) { //constructor.

Chapter 6: Process Synchronization

Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 6: Process Synchronization.

Multiprocessor Synchronization Algorithms ( ) Lecturer: Danny Hendler The Mutual Exclusion problem.

Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.

Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.

Slides 8d-1 Programming with Shared Memory Specifying parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

1 Sharing Objects – Ch. 3 Visibility What is the source of the issue? Volatile Dekker’s algorithm Publication and Escape Thread Confinement Immutability.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Operating Systems CSE 411 CPU Management Oct Lecture 13 Instructor: Bhuvan Urgaonkar.

Synchronization (Barriers) Parallel Processing (CS453)

Jeremy Denham April 7,  Motivation  Background / Previous work  Experimentation  Results  Questions.

Monitors and Blocking Synchronization Dalia Cohn Alperovich Based on “The Art of Multiprocessor Programming” by Herlihy & Shavit, chapter 8.

Concurrent Queues Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

13-1 Chapter 13 Concurrency Topics Introduction Introduction to Subprogram-Level Concurrency Semaphores Monitors Message Passing Java Threads C# Threads.

Multiprocessor Architecture Basics Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

The Relative Power of Synchronization Operations Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Multiprocessors – Locks

CS703 – Advanced Operating Systems

Background on the need for Synchronization

Lecture 19: Coherence and Synchronization

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

Atomic Operations in Hardware

Lecture 18: Coherence and Synchronization

Concurrent Objects Companion slides for

Task Scheduling for Multicore CPUs and NUMA Systems

Distributed Algorithms (22903)

Distributed Algorithms (22903)

Counting, Sorting, and Distributed Coordination

Designing Parallel Algorithms (Synchronization)

Barrier Synchronization

Theory of Computation Turing Machines.

Race Conditions & Synchronization

Lecture 21: Synchronization and Consistency

Combinatorial Topology and Distributed Computing

Lecture 22: Consistency Models, TM

Lecture: Coherence and Synchronization

Barrier Synchronization

Software Transactional Memory Should Not be Obstruction-Free

Programming with Shared Memory Specifying parallelism

The University of Adelaide, School of Computer Science

Lecture 1: Introduction

Lecture 17 Multiprocessors and Thread-Level Parallelism

CSE 153 Design of Operating Systems Winter 19

Lecture: Coherence, Synchronization

Programming with Shared Memory - 3 Recognizing parallelism

Foundations and Definitions

Shared Counters and Parallelism

Lecture 17 Multiprocessors and Thread-Level Parallelism

Programming with Shared Memory Specifying parallelism

Lecture: Coherence and Synchronization

Lecture 19: Coherence and Synchronization

Lecture 18: Coherence and Synchronization

The University of Adelaide, School of Computer Science

Software Engineering and Architecture

Turing Machines Everything is an Integer

Nir Shavit Multiprocessor Synchronization Spring 2003

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Art of Multiprocessor Programming Simple Video Game Prepare frame for display By graphics coprocessor “soft real-time” application Need at least 35 frames/second OK to mess up rarely Art of Multiprocessor Programming

Art of Multiprocessor Programming Simple Video Game while (true) { frame.prepare(); frame.display(); } Art of Multiprocessor Programming

Art of Multiprocessor Programming Simple Video Game while (true) { frame.prepare(); frame.display(); } What about overlapping work? 1st thread displays frame 2nd prepares next frame Art of Multiprocessor Programming

Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; Art of Multiprocessor Programming

Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; even phases Art of Multiprocessor Programming

Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; odd phases Art of Multiprocessor Programming

Synchronization Problems How do threads stay in phase? Too early? “we render no frame before its time” Too late? Recycle memory before frame is displayed Art of Multiprocessor Programming

Ideal Parallel Computation 1 1 1 Art of Multiprocessor Programming

Ideal Parallel Computation 2 2 2 1 1 1 Art of Multiprocessor Programming

Real-Life Parallel Computation zzz… 1 1 Art of Multiprocessor Programming

Real-Life Parallel Computation 2 zzz… 1 1 Uh, oh Art of Multiprocessor Programming

Barrier Synchronization Art of Multiprocessor Programming

Barrier Synchronization 1 1 1 Art of Multiprocessor Programming

Barrier Synchronization Until every thread has left here No thread enters here Art of Multiprocessor Programming

Art of Multiprocessor Programming Why Do We Care? Mostly of interest to Scientific & numeric computation Elsewhere Garbage collection Less common in systems programming Still important topic Art of Multiprocessor Programming

Art of Multiprocessor Programming Duality Dual to mutual exclusion Include others, not exclude them Same implementation issues Interaction with caches … Invalidation? Local spinning? Art of Multiprocessor Programming

Example: Parallel Prefix b c d before a a+b a+b+c a+b+c +d after Art of Multiprocessor Programming

Art of Multiprocessor Programming Parallel Prefix One thread Per entry a b c d Art of Multiprocessor Programming

Parallel Prefix: Phase 1 b c d a a+b b+c c+d Art of Multiprocessor Programming

Parallel Prefix: Phase 2 b c d a a+b a+b+c a+b+c +d Art of Multiprocessor Programming

Art of Multiprocessor Programming Parallel Prefix N threads can compute Parallel prefix Of N entries In log2 N rounds What if system is asynchronous? Why we need barriers Art of Multiprocessor Programming

Art of Multiprocessor Programming Prefix class Prefix extends Thread { int[] a; int i; Barrier b; void Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Art of Multiprocessor Programming

Art of Multiprocessor Programming Prefix class Prefix extends Thread { int[] a; int i; Barrier b; void Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Array of input values Art of Multiprocessor Programming

Art of Multiprocessor Programming Prefix class Prefix extends Thread { int[] a; int i; Barrier b; void Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Thread index Art of Multiprocessor Programming

Art of Multiprocessor Programming Prefix class Prefix extends Thread { int[] a; int i; Barrier b; void Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Shared barrier Art of Multiprocessor Programming

Art of Multiprocessor Programming Prefix class Prefix extends Thread { int[] a; int i; Barrier b; void Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Initialize fields Art of Multiprocessor Programming

Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; a[i] += sum; d = d * 2; }}} Art of Multiprocessor Programming

Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Make sure everyone reads before anyone writes Art of Multiprocessor Programming

Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Make sure everyone reads before anyone writes Make sure everyone writes before anyone reads Art of Multiprocessor Programming

Barrier Implementations Cache coherence Spin on locally-cached locations? Spin on statically-defined locations? Latency How many steps? Symmetry Do all threads do the same thing? Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor Programming

Number of threads not yet arrived Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Number of threads not yet arrived Art of Multiprocessor Programming

Number of threads participating Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Number of threads participating Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Initialization Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Principal method Art of Multiprocessor Programming

If I’m last, reset fields for next time Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} If I’m last, reset fields for next time Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Otherwise, wait for everyone else Art of Multiprocessor Programming

What’s wrong with this protocol? Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} What’s wrong with this protocol? Art of Multiprocessor Programming

Art of Multiprocessor Programming Reuse Barrier b = new Barrier(n); while ( mumble() ) { work(); b.await() } do work repeat synchronize Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor Programming

Waiting for Phase 1 to finish Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 1 to finish Art of Multiprocessor Programming

Waiting for Phase 1 to finish Barriers Phase 1 is so over public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 1 to finish Art of Multiprocessor Programming

Art of Multiprocessor Programming Barriers Prepare for phase 2 public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} ZZZZZ…. Art of Multiprocessor Programming

Waiting for Phase 2 to finish Waiting for Phase 1 to finish Uh-Oh public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 2 to finish Waiting for Phase 1 to finish Art of Multiprocessor Programming

Art of Multiprocessor Programming Basic Problem One thread “wraps around” to start phase 2 While another thread is still waiting for phase 1 One solution: Always use two barriers Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Art of Multiprocessor Programming

Sense-Reversing Barriers Completed odd or even-numbered phase? public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Sense must be volatile because of Java memory model so that JIT does not remove loop spinning on sense Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Store sense for next phase Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Get new sense determined by last phase Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} If I’m last, reverse sense for next time Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Otherwise, wait for sense to flip Art of Multiprocessor Programming

Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; volatile boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Prepare sense for next phase Art of Multiprocessor Programming

Combining Tree Barriers Art of Multiprocessor Programming

Combining Tree Barriers Art of Multiprocessor Programming

Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming

Combining Tree Barrier Parent barrier in tree public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming

Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Am I last? Art of Multiprocessor Programming

Combining Tree Barrier Proceed to parent barrier public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await(); count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming

Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Prepare for next phase Art of Multiprocessor Programming

Combining Tree Barrier Notify others at this node public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming

Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await() count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} I’m not last, so wait for notification Art of Multiprocessor Programming

Combining Tree Barrier No sequential bottleneck Parallel getAndDecrement() calls Low memory contention Same reason Cache behavior Local spinning on bus-based architecture Not so good for NUMA Art of Multiprocessor Programming

Art of Multiprocessor Programming Remarks Everyone spins on sense field Local spinning on bus-based (good) Network hot-spot on distributed architecture (bad) Not really scalable Art of Multiprocessor Programming

Tournament Tree Barrier If tree nodes have fan-in 2 Don’t need to call getAndDecrement() Winner chosen statically At level i If i-th bit of id is 0, move up Otherwise keep back Art of Multiprocessor Programming

Tournament Tree Barriers root winner loser winner loser winner loser Art of Multiprocessor Programming

Tournament Tree Barriers All flags blue Art of Multiprocessor Programming

Tournament Tree Barriers Loser thread sets winner’s flag Art of Multiprocessor Programming

Tournament Tree Barriers Loser spins on own flag Art of Multiprocessor Programming

Tournament Tree Barriers Winner spins on own flag Art of Multiprocessor Programming

Tournament Tree Barriers Winner sees own flag, moves up, spins Art of Multiprocessor Programming

Tournament Tree Barriers Bingo! Art of Multiprocessor Programming

Tournament Tree Barriers Sense-reversing: next time use blue flags Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier class TBarrier { volatile boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Art of Multiprocessor Programming

Notifications delivered here Tournament Barrier class TBarrier { volatile boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Notifications delivered here Art of Multiprocessor Programming

Other thead at same level Tournament Barrier class TBarrier { volatile boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Other thead at same level Art of Multiprocessor Programming

Parent (winner) or null (loser) Tournament Barrier class TBarrier { volatile boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Parent (winner) or null (loser) Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier class TBarrier { volatile boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Am I the root? Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier Current sense void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Le root, c’est moi Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} I am already a winner Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Wait for partner Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Synchronize upstairs Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Inform partner Art of Multiprocessor Programming

Order is important (why?) Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Inform partner Order is important (why?) Art of Multiprocessor Programming 85 85

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Natural-born loser Art of Multiprocessor Programming

Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Tell partner I’m here Art of Multiprocessor Programming

Wait for notification from partner Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Wait for notification from partner Art of Multiprocessor Programming

Art of Multiprocessor Programming Remarks No need for read-modify-write calls Each thread spins on fixed location Good for bus-based architectures Good for NUMA architectures Art of Multiprocessor Programming

Dissemination Barrier At round i Thread A notifies thread A+2i (mod n) Requires log n rounds Art of Multiprocessor Programming

Dissemination Barrier +1 +2 +4 Art of Multiprocessor Programming

Art of Multiprocessor Programming Remarks Elegant Good source of homework problems Not cache-friendly Art of Multiprocessor Programming

Art of Multiprocessor Programming Ideas So Far Sense-reversing Reuse without reinitializing Combining tree Like counters, locks … Tournament tree Optimized combining tree Dissemination barrier Intellectually Pleasing (matter of taste) Art of Multiprocessor Programming

Which is best for Multicore? On a cache coherent multicore chip: perhaps none of the above… Here is another (arguably) better algorithm … Art of Multiprocessor Programming

One node per thread, statically assigned Static Tree Barrier We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent One node per thread, statically assigned Art of Multiprocessor Programming

Art of Multiprocessor Programming Static Tree Barrier Sense-reversing flag We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 96 96

Node has count of missing children Static Tree Barrier 2 2 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Node has count of missing children Art of Multiprocessor Programming 97 97

Art of Multiprocessor Programming Static Tree Barrier 2 Spin until zero … 2 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 98 98

My counter is zero, decrement parent Static Tree Barrier 2 2 1 My counter is zero, decrement parent We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 99 99

Art of Multiprocessor Programming Static Tree Barrier 2 2 1 Spin on done flag We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 100 100

Art of Multiprocessor Programming Static Tree Barrier 1 2 2 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 101 101

Art of Multiprocessor Programming Static Tree Barrier 1 2 2 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 102 102

Art of Multiprocessor Programming Static Tree Barrier 1 2 2 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 103 103

Art of Multiprocessor Programming Static Tree Barrier 1 2 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 104 104

Art of Multiprocessor Programming Static Tree Barrier 1 2 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 105 105

Art of Multiprocessor Programming Static Tree Barrier yowzah! 1 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 106 106

Art of Multiprocessor Programming Static Tree Barrier yowzah! 1 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 107 107

Art of Multiprocessor Programming Static Tree Barrier yowzah! 1 1 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 108 108

Art of Multiprocessor Programming Static Tree Barrier yes! yes! 1 yes! 1 yes! yes! We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 109 109

Art of Multiprocessor Programming Static Tree Barrier 2 1 1 2 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming 110 110

Art of Multiprocessor Programming Remarks Very little cache traffic Minimal space overhead On message-passing architecture Send notification & sense down tree Art of Multiprocessor Programming

The Nature of Progress* Some of the material in this lecture appears in chapter 3 of the textbook. The rest appear in the article: On the nature of progress, Herlihy and Shavit, 2008 which can be found at http: //www.cs.tau.ac.il/~shanir/progress.pdf Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Programming Many real-word data structures blocking (lock-based) implementations & non-blocking (no locks) implementations For example: linked lists, queues, stacks, hash maps,… Does this make sense?

Concurrent Programming Many data structures combine blocking & non-blocking methods Java™ concurrency package skiplists, hash tables, exchangers on 10 million desktops. Can seemingly contradictory conditions co-exist in same alg?…

Progress Conditions Deadlock-free: Starvation-free: Lock-free: Some thread eventually acquires lock. Starvation-free: Every thread eventually acquires lock. Lock-free: Some method call returns. Wait-free: Every method call returns. Obstruction-free: Every method call returns if it executes in isolation We will show an example shortly

List-Based Sets Unordered collection of elements No duplicates Methods Add() a new element Remove() an element Contains() if element is present

Coarse Grained Locking b d c Lock is starvation-free: every attempt to acquire the lock eventually succeeds.

Fine Grained (Lock Coupling) b d Overlapping locks detect overlapping operations Deadlock-free: some thread eventually acquires lock.

Optimistic Fine Grained b c e add(), remove(), contains() lock destination nodes in order Deadlock-free: some thread trying to acquire the locks eventually succeeds.

Obstruction-free contains() d Snapshot: if all nodes traversed twice are the same Obstruction-free: the method returns if it executes in isolation for long enough.

The Simple Snapshot is Obstruction-Free Put increasing labels on each entry Collect twice If both agree, We’re done Otherwise, Try again Collect1 Collect2 1 22 7 13 18 12 1 22 7 13 18 12 Re call: if none of the labels (timestamps) changed, then there was a point, after the end of the first collect, and before the start of the next collect, in which none of the registers were written to. The values collected correspond to the values that were all together in memory at that point in time. =

Obstruction-freedom In the simple snapshot alg: The update method is wait-free But scan is obstruction-free Completes if it executes in isolation (no concurrent updates).

Wait-free contains() a a b 1 d c e Use mark bit + list ordering b 1 d c e Use mark bit + list ordering Not marked  in the set Marked or missing  not in the set

Lazy List-based Set Alg b 1 d c e Combine blocking and non-blocking: deadlock-free add() and remove() and wait-free contains()

Lock-free List-Based Set Logical Removal = Set Mark Bit a a b 1 c c e d mark and reference CASed together CAS will fail We say that "an alg is lock-free/starvation-free/.../wait-free" if all its methods together provide the given property". For individual methods, the rule should apply to calls of the given type of method only (as is currently defined). Thus, our version of Michael's lock-free linked list has a wait-free contains(), obstruction-free add() and remove(), and the algorithm as a whole is lock-free. What it means is that if you want to guarantee successful add() calls you need to add backoff on the remove() calls... Another example is the Obs-free snapshot consisting of two collects, snapshot only fails because updates succeed and yet we say its obs-free. So the snapshot algorithm as a whole is lock-free, updates are wait-free, and the snapshot method as a whole is obstruction-free. Lock-free add() and remove() and wait-free contains()

So how can this make sense? Why have methods with different progress conditions? Let us try to understand this… Art of Multiprocessor Programming© Copyright Herlihy-Shavit 2007

Progress Conditions Deadlock-free: Starvation-free: Lock-free: Some thread eventually acquires lock. Starvation-free: Every thread eventually acquires lock. Lock-free: Some method call returns. Wait-free: Every method call returns. Obstruction-free: Every method call returns if it executes in isolation

A “Periodic Table” of Progress Conditions Non-Blocking Blocking All make progress Wait- free Obstruction- free Starvation- free Some make progress Lock- free Deadlock- free

More Formally Standard notion of abstract object Progress conditions relate to method calls of an object A thread is active if it takes an infinite number of concrete (machine level) steps And is suspended if not.

Flags courtesy of www.theodora.com/flags used with permission Maximal vs. Minimal Minimal progress some call eventually completes System matters, not individuals Maximal progress every call eventually completes. Individuals matter In some sense, the weakest interesting notion of progress requires that the system as a whole continues to advance. Consider a fixed history $H$. A collection of methods of a given object provides \emph{minimal progress} in $H$ if, in every suffix of $H$, some pending active invocation of one of the methods in the collection has a matching response. In other words, there is no point in the history where all threads that called abstract methods in the collection take an infinite number of concrete steps without returning. This condition might, for example, be useful for a thread pool, where we care about advancing the overall computation, but do not care whether individual threads are underutilized. The strongest notion of progress, and arguably the one most programmers actually want, requires that each individual thread continues to advance. A collection of methods of a given object provides \emph{maximal progress} in a history $H$ if in every suffix of $H$, every pending active invocation of a method in the collection has a matching response. In other words, there is no point in the history where a thread that calls the abstract method in the collection takes an infinite number of concrete steps without returning. This condition might be useful for a web server, where each thread represents a customer request, and we care about advancing each individual computation. The condition is the difference between the requirements of a thread pool versus those of a web server. This condition might, for example, be useful for a thread In the latter case the condition might be useful for a web server, where Flags courtesy of www.theodora.com/flags used with permission

The “Periodic Table” of Progress Conditions Non-Blocking Blocking Maximal progress Wait- free Obstruction- free Starvation- free Minimal progress Although these progress conditions may have seemed quite different, each provides either minimal or maximal progress with respect to some set of histories. The result is a simple and regular structure illustrated in the ``periodic table'' shown in Figure~\ref{figure:progress} (and its more complete counterpart in Figure~\ref{figure:clash}). These observations may appear so simple as to be obvious in retrospect, but we have never seen them described in this way. There are three dividing lines, two vertical and one horizontal, that split the five conditions. The leftmost vertical line separates dependent conditions from the rest. The lock-free and wait-free properties apply to any histories, while obstruction-freedom, starvation-freedom, and deadlock-freedom require some kind of external scheduler support to guarantee progress. The rightmost vertical line separates the blocking and non-blocking conditions. The lock-free, wait-free, and obstruction-free conditions are non-blocking: if a suspended thread stops at an arbitrary point in a method call, at least some active threads can make progress. The deadlock-free and starvation-free conditions do not have this property. Finally, the horizontal line separates the minimal and maximal progress conditions. The minimal conditions guarantee the system as a whole makes progress while the maximal conditions guarantee that each thread makes progress. For brevity, \emph{minimal} progress properties encompass the lock-free and deadlock-free properties, while \emph{maximal} properties encompass the wait-free, starvation-free, and obstruction-free properties. Later we will see several ways to cross this line. One way is ``helping'' (for lack of space not included in this extended abstract), an algorithmic technique that has threads help others so each and every thread makes progress. However, in many cases, algorithms that employ helping are costly. An alternative and less costly approach is to make additional assumptions on scheduling. Lock- free Deadlock- free

The Scheduler’s Role Multiprocessor progress properties: Are not about the guarantees a method's implementation provides. Are about scheduling needed to provide minimal or maximal progress. Thus, the various progress conditions are not about the progress guarantees their implementations must provide. All the properties in the table imply the same thing, maximal progress, yet they differ in the combination of scheduling assumptions necessary for an implementation to provide it. Put differently, programmers design lock-free, obstruction-free, or deadlock-free algorithms, but what they are implicitly assuming is that because of how schedulers on modern multiprocessors work, all method calls eventually complete as if they were wait-free.

Fair Scheduling A history is fair if each thread takes an infinite number of steps A method implementation is deadlock-free if it guarantees minimal progress in every fair history. The restriction to fair histories captures the informal requirement that each thread eventually leaves its critical section. The definition does not mention locks or critical sections because progress should be defined in terms of completed method calls, not low-level mechanisms. Moreover, as noted, not all deadlock-free object implementations will have easily recognizable locks and critical sections. The requirement that the implementation provide maximal progress in some fair history is intended to rule out certain pathological cases. For example, the first thread to access an object might lock it and never release the lock. Such an implementation guarantees minimal progress (for the thread holding the lock) in every fair execution, but does not provide maximal progress in any execution. Clearly, such an implementation would not be considered acceptable in practice and is of no interest to us.

Starvation Freedom A method implementation is starvation-free if it guarantees maximal progress in every fair history. Progress extends to an object by considering all its methods together .

Dependent Progress Dependent progress conditions Independent ones do. Do not guarantee minimal progress in every history Independent ones do. Blocking progress conditions deadlock-freedom, Starvation-freedom are dependent. We say that a Progress is dependent if progress requires scheduler support

Non-blocking Independent Conditions A lock-free method guarantees minimal progress in every history. A wait-free method guarantees maximal progress The restriction to fair histories captures the informal requirement that each thread eventually leaves its critical section. The de¯nition does not mention locks or critical sections because progress should be de¯ned in terms of completed method calls, not low-level mechanisms. Moreover, as noted, not all deadlock-free object implementations will have easily recognizable locks and critical sections. The requirement that the implementation provide maximal progress in some fair history is intended to rule out certain pathological cases. For example, the ¯rst thread to access an object might lock it and never release the lock. Such an implementation guarantees minimal progress (for the thread holding the lock) in every fair execution, but does not provide maximal progress in any execution. Clearly, such an implementation would not be considered acceptable in practice and is of no interest to us.

The “Periodic Table” of Progress Conditions Non-Blocking Blocking Maximal progress Wait- free Obstruction- free Starvation- free Minimal progress On multiprocessors progress properties are not about the guarantees a method's implementation provides. Rather, they are about the assumptions one needs to make on the scheduler so that a method's implementation provides minimal or maximal progress. Lock- free Deadlock- free Dependent Independent

Uniformly Isolating Schedules A history is uniformly isolating if any thread eventually runs by itself for “long enough” Modern systems do this with backoff, yield, etc. Later is in the next chapter on spin locks

A Non-blocking Dependent Condition A method implementation is obstruction-free if it guarantees maximal progress in every uniformly isolating history. The restriction to fair histories captures the informal requirement that each thread eventually leaves its critical section. The de¯nition does not mention locks or critical sections because progress should be de¯ned in terms of completed method calls, not low-level mechanisms. Moreover, as noted, not all deadlock-free object implementations will have easily recognizable locks and critical sections. The requirement that the implementation provide maximal progress in some fair history is intended to rule out certain pathological cases. For example, the ¯rst thread to access an object might lock it and never release the lock. Such an implementation guarantees minimal progress (for the thread holding the lock) in every fair execution, but does not provide maximal progress in any execution. Clearly, such an implementation would not be considered acceptable in practice and is of no interest to us.

The “Periodic Table” of Progress Conditions Non-Blocking Blocking Uniform iso scheduler Maximal progress Wait- free Fair scheduler Obstruction- free Starvation- free In other words, there is no difference if we use blocking or non-blocking, they guarantee the same thing under the right scheduling assumptions. If I write a starvation-free alg I am assuming that I will get maximal progress, but that it will have to run on a machine with fair scheduling. Fair scheduler Minimal progress Lock- free Deadlock- free Independent Dependent

The “Periodic Table” of Progress Conditions Non-Blocking Blocking Maximal progress Wait- free Obstruction- free Starvation- free In other words, there is no difference if we use blocking or non-blocking, they guarantee the same thing under the right scheduling assumptions. If I write a starvation-free alg I am assuming that I will get maximal progress, but that it will have to run on a machine with fair scheduling. Minimal progress Lock- free Clash- free ? Deadlock- free Independent Dependent

Clash-Freedom: the “Einsteinium” of Progress A method implementation is clash-free if it guarantees minimal progress in every uniformly isolating history. Thm: clash-freedom strictly weaker than obstruction-freedom Like Einsteinium, symbol Es, atomic number 99, it does not occur naturally in any measurable quantities and has no commercial value. In the full paper we will show that being clash-free is strictly weaker than being obstruction-free, a result omitted from this extended abstract for lack of space. Clash-freedom thus answers the open question raised by Herlihy, Luchangco, and Moir [6], whether obstruction-freedom is the weakest natural non-blocking progress condition. Unlike Einsteinium is not radioactive but like it has of no commercial importance…

Getting from Minimal to Maximal Non-Blocking Blocking Maximal progress Wait- free Obstruction- free Starvation- free ? But helping is expensive In other words, there is no difference if we use blocking or non-blocking, they guarantee the same thing under the right scheduling assumptions. If I write a starvation-free alg I am assuming that I will get maximal progress, but that it will have to run on a machine with fair scheduling. Minimal progress Lock- free Clash- free ? Deadlock- free Helping Independent Dependent

Universal Constructions Lock-free universal construction provides minimal progress A scheduler is benevolent if it guarantees maximal progress anyway Real-world OS schedulers are benevolent mostly They do not persecute any individual thread

Getting from Minimal to Maximal Universal Lock-free Construction Getting from Minimal to Maximal Universal Wait-free Construction Non-Blocking Blocking Maximal progress Wait- free Obstruction- free Starvation- free ? For a one time object like consensus where each thread executes a method once, wait-free and lock-free are the same… In other words, there is no difference if we use blocking or non-blocking, they guarantee the same thing under the right scheduling assumptions. If I write a starvation-free alg I am assuming that I will get maximal progress, but that it will have to run on a machine with fair scheduling. Minimal progress Lock- free Clash- free ? Deadlock- free Helping Use Wait-free/Lock-free Consensus Objects Independent Dependent

Getting from Minimal to Maximal Universal Wait-free Construction Non-Blocking Blocking Universal Lock-free Construction Maximal progress Wait- free Obstruction- free Starvation- free In other words, there is no difference if we use blocking or non-blocking, they guarantee the same thing under the right scheduling assumptions. If I write a starvation-free alg I am assuming that I will get maximal progress, but that it will have to run on a machine with fair scheduling. Minimal progress Lock- free Clash- free ? Deadlock- free If we use Starvation-free/Deadlock-free Consensus Objects result is respectively Starvation-free/Deadlock-free Independent Dependent

Maximal Progress Postulate Programmers want maximal progress. Methods’ progress conditions define What we expect from the scheduler For example Don’t halt in critical section Let me run in isolation long enough …

Art of Multiprocessor Programming Why Lock-Free is OK We all want maximal progress Wait-free Yet we often write lock-free or deadlock-free lock-based algorithms OK if we expect the scheduler to be benevolent Often true (not always!) Art of Multiprocessor Programming

Shared-Memory Computability 10011 What is (and is not) concurrently computable Wait-free Atomic Registers Lock-free/Wait-free Hierarchy and Universal Constructions In the same way, we make little attempt to make our initial constructions efficient. We are interested in understanding whether such constructions exist, and how they work, but they are not intended to be a practical model for computation, so we prefer easy-to-understand but inefficient constructions over complicated but efficient ones.

Troubling Intellectual Question… I think I think, therefore I think I am (Ambrose Bierce) Why use non-blocking lock-free and wait-free conditions when most code uses locks?

The Answer Not about being non-blocking… About being independent! Do not rely on the good behavior of the scheduler.

Reads and Writes Infinite tape Finite State Controller By Analogy to Church-Turing abanbnan 1 Reads and Writes Infinite tape Finite State Controller Using a dependent condition is like relying on an oracle to recognize languages… The dependency masks the true power of the concurrent object… The classical theory of sequential computing proceeds in stages. It starts with finite-state automata, moves on to push-down automata, and culminates in Turing Machines. A Turing machine is an idealized model of computation, consisting of a finite-state controller and an infinite tape. It is safe to say that anything you can’t do on a Turing Machine is something you can’t do period. If you can devise a Turing machine program to do something, it doesn’t mean you can do it in a practical sense, but it gives you a place to start working on the problem.

Shared-Memory Computability 10011 Independent progress: use Lock-free and Wait-free Memory Hierarchy and Universal Constructions In the same way, we make little attempt to make our initial constructions efficient. We are interested in understanding whether such constructions exist, and how they work, but they are not intended to be a practical model for computation, so we prefer easy-to-understand but inefficient constructions over complicated but efficient ones.

Programmers Expect the Best Programmers expect maximal progress. Progress conditions define scheduler requirements necessary to achieve it.

This Concludes Our Course Material Principles: Mathematical foundations of multicore programming Practice: How multicore software and the architectures it runs on embody these principles Art of Multiprocessor Programming

Art of Multiprocessor Programming This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share — to copy, distribute and transmit the work to Remix — to adapt the work Under the following conditions: Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Art of Multiprocessor Programming