Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Art of Multiprocessor Programming Simple Video Game Prepare frame for display By graphics coprocessor “soft real-time” application Need at least 35 frames/second OK to mess up rarely Art of Multiprocessor Programming
Art of Multiprocessor Programming Simple Video Game while (true) { frame.prepare(); frame.display(); } Art of Multiprocessor Programming
Art of Multiprocessor Programming Simple Video Game while (true) { frame.prepare(); frame.display(); } What about overlapping work? 1st thread displays frame 2nd prepares next frame Art of Multiprocessor Programming
Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; Art of Multiprocessor Programming
Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; Even phases Art of Multiprocessor Programming
Art of Multiprocessor Programming Two-Phase Rendering while (true) { if (phase) { frame[0].display(); } else { frame[1].display(); } phase = !phase; while (true) { if (phase) { frame[1].prepare(); } else { frame[0].prepare(); } phase = !phase; odd phases Art of Multiprocessor Programming
Synchronization Problems How do threads stay in phase? Too early? “we render no frame before its time” Too late? Recycle memory before frame is displayed Art of Multiprocessor Programming
Ideal Parallel Computation 1 1 1 Art of Multiprocessor Programming
Ideal Parallel Computation 2 2 2 1 1 1 Art of Multiprocessor Programming
Real-Life Parallel Computation zzz… 1 1 Art of Multiprocessor Programming
Real-Life Parallel Computation 2 zzz… 1 1 Uh, oh Art of Multiprocessor Programming
Barrier Synchronization barrier Art of Multiprocessor Programming
Barrier Synchronization 1 1 1 Art of Multiprocessor Programming
Barrier Synchronization Until every thread has left here No thread enters here Art of Multiprocessor Programming
Art of Multiprocessor Programming Why Do We Care? Mostly of interest to Scientific & numeric computation Elsewhere Garbage collection Less common in systems programming Still important topic Art of Multiprocessor Programming
Art of Multiprocessor Programming Duality Dual to mutual exclusion Include others, not exclude them Same implementation issues Interaction with caches … Invalidation? Local spinning? Art of Multiprocessor Programming
Example: Parallel Prefix b c d before a a+b a+b+c a+b+c +d after Art of Multiprocessor Programming
Art of Multiprocessor Programming Parallel Prefix One thread Per entry a b c d Art of Multiprocessor Programming
Parallel Prefix: Phase 1 b c d a a+b b+c c+d Art of Multiprocessor Programming
Parallel Prefix: Phase 2 b c d a a+b a+b+c a+b+c +d Art of Multiprocessor Programming
Art of Multiprocessor Programming Parallel Prefix N threads can compute Parallel prefix Of N entries In log2 N rounds What if system is asynchronous? Why we need barriers Art of Multiprocessor Programming
Art of Multiprocessor Programming Prefix class Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Art of Multiprocessor Programming
Art of Multiprocessor Programming Prefix class Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Array of input values Art of Multiprocessor Programming
Art of Multiprocessor Programming Prefix class Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Thread index Art of Multiprocessor Programming
Art of Multiprocessor Programming Prefix class Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Shared barrier Art of Multiprocessor Programming
Art of Multiprocessor Programming Prefix class Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; } Initialize fields Art of Multiprocessor Programming
Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; a[i] += sum; d = d * 2; }}} Art of Multiprocessor Programming
Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Art of Multiprocessor Programming
Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Make sure everyone reads before anyone writes Art of Multiprocessor Programming
Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Make sure everyone reads before anyone writes Art of Multiprocessor Programming
Where Do the Barriers Go? public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); a[i] += sum; d = d * 2; }}} Make sure everyone reads before anyone writes Make sure everyone writes before anyone reads Art of Multiprocessor Programming
Barrier Implementations Cache coherence Spin on locally-cached locations? Spin on statically-defined locations? Latency How many steps? Symmetry Do all threads do the same thing? Art of Multiprocessor Programming
Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor Programming
Number threads not yet arrived Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Number threads not yet arrived Art of Multiprocessor Programming
Number threads participating Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Number threads participating Art of Multiprocessor Programming
Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Initialization Art of Multiprocessor Programming
Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Principal method Art of Multiprocessor Programming
If I’m last, reset fields for next time Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} If I’m last, reset fields for next time Art of Multiprocessor Programming
Otherwise, wait for everyone else Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Otherwise, wait for everyone else Art of Multiprocessor Programming
What’s wrong with this protocol? Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} What’s wrong with this protocol? Art of Multiprocessor Programming
Art of Multiprocessor Programming Reuse Barrier b = new Barrier(n); while ( mumble() ) { work(); b.await() } Do work repeat synchronize Art of Multiprocessor Programming
Art of Multiprocessor Programming Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor Programming
Waiting for Phase 1 to finish Barriers public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 1 to finish Art of Multiprocessor Programming
Waiting for Phase 1 to finish Barriers Phase 1 is so over public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 1 to finish Art of Multiprocessor Programming
Art of Multiprocessor Programming Barriers Prepare for phase 2 public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} ZZZZZ…. Art of Multiprocessor Programming
Waiting for Phase 2 to finish Waiting for Phase 1 to finish Uh-Oh public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Waiting for Phase 2 to finish Waiting for Phase 1 to finish Art of Multiprocessor Programming
Art of Multiprocessor Programming Basic Problem One thread “wraps around” to start phase 2 While another thread is still waiting for phase 1 One solution: Always use two barriers Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Completed odd or even-numbered phase? Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Store sense for next phase Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Get new sense determined by last phase Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} If I’m last, reverse sense for next time Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Otherwise, wait for sense to flip Art of Multiprocessor Programming
Sense-Reversing Barriers public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>… public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}} Prepare sense for next phase Art of Multiprocessor Programming
Combining Tree Barriers Art of Multiprocessor Programming
Combining Tree Barriers Art of Multiprocessor Programming
Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming
Combining Tree Barrier Parent barrier in tree public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming
Combining Tree Barrier Am I last? public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming
Combining Tree Barrier Proceed to parent barrier public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await();} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming
Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Prepare for next phase Art of Multiprocessor Programming
Combining Tree Barrier Notify others at this node public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} Art of Multiprocessor Programming
Combining Tree Barrier public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense; public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}} I’m not last, so wait for notification Art of Multiprocessor Programming
Combining Tree Barrier No sequential bottleneck Parallel getAndDecrement() calls Low memory contention Same reason Cache behavior Local spinning on bus-based architecture Not so good for NUMA Art of Multiprocessor Programming
Art of Multiprocessor Programming Remarks Everyone spins on sense field Local spinning on bus-based (good) Network hot-spot on distributed architecture (bad) Not really scalable Art of Multiprocessor Programming
Tournament Tree Barrier If tree nodes have fan-in 2 Don’t need to call getAndDecrement() Winner chosen statically At level i If i-th bit of id is 0, move up Otherwise keep back Art of Multiprocessor Programming
Tournament Tree Barriers root winner loser winner loser winner loser Art of Multiprocessor Programming
Tournament Tree Barriers All flags blue Art of Multiprocessor Programming
Tournament Tree Barriers Loser thread sets winner’s flag Art of Multiprocessor Programming
Tournament Tree Barriers Loser spins on own flag Art of Multiprocessor Programming
Tournament Tree Barriers Winner spins on own flag Art of Multiprocessor Programming
Tournament Tree Barriers Winner sees own flag, moves up, spins Art of Multiprocessor Programming
Tournament Tree Barriers Bingo! Art of Multiprocessor Programming
Tournament Tree Barriers Sense-reversing: next time use blue flags Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier class TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Art of Multiprocessor Programming
Notifications delivered here Tournament Barrier class TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Notifications delivered here Art of Multiprocessor Programming
Other thead at same level Tournament Barrier class TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Other thead at same level Art of Multiprocessor Programming
Parent (winner) or null (loser) Tournament Barrier class TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Parent (winner) or null (loser) Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier class TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; … } Am I the root? Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Art of Multiprocessor Programming
Parameter to save space… Tournament Barrier Sense is a Parameter to save space… void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Le root, c’est moi Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} I am already a winner Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Wait for partner Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Synchronize upstairs Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Inform partner Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Natural-born loser Art of Multiprocessor Programming
Art of Multiprocessor Programming Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Tell partner I’m here Art of Multiprocessor Programming
Wait for notification from partner Tournament Barrier void await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { }}} Wait for notification from partner Art of Multiprocessor Programming
Art of Multiprocessor Programming Remarks No need for read-modify-write calls Each thread spins on fixed location Good for bus-based architectures Good for NUMA architectures Art of Multiprocessor Programming
Dissemination Barrier At round i Thread A notifies thread A+2i (mod n) Requires log n rounds Art of Multiprocessor Programming
Dissemination Barrier +1 +2 +4 Art of Multiprocessor Programming
Art of Multiprocessor Programming Remarks Great for lovers of combinatorics Not so much fun for those who track memory footprints Art of Multiprocessor Programming
Art of Multiprocessor Programming Ideas So Far Sense-reversing Reuse without reinitializing Combining tree Like counters, locks … Tournament tree Optimized combining tree Dissemination barrier Intellectually Pleasing (matter of taste) Art of Multiprocessor Programming
Which is best for Multicore? On a cache coherent multicore chip: none of the above… Use a static tree barrier! Art of Multiprocessor Programming
Art of Multiprocessor Programming Static Tree Barrier All spin on cached copy of Done flag 2 Counter in parent: Num of children that have not arrived 2 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming
Art of Multiprocessor Programming Static Tree Barrier All spin on cached copy of Done flag All detect change in Done flag 1 2 1 2 We describe the static tree using counters in the nodes but could use simple array of locations, each written by a child, all spun on (repeatedly read) by the parent Art of Multiprocessor Programming
Art of Multiprocessor Programming Remarks Very little cache traffic Minimal space overhead Can be laid out optimally on NUMA Instead of Done bit, notification must be performed by passing sense down the static tree Art of Multiprocessor Programming
So…How will we make use of multicores? Back to Amdahl’s Law: Speedup = 1/(ParallelPart/N + SequentialPart) Pay for N = 8 cores SequentialPart = 25% Speedup = only 2.9 times! Must parallelize applications on a very fine grain!
Need Fine-Grained Locking The reason we get only 2.9 speedup Coarse Grained Fine Grained c 25% Shared 25% Shared c c c 75% Unshared 75% Unshared
Traditional Scaling Process 7x Speedup 3.6x 1.8x User code Recall the traditional scaling process for software: write it once, trust Intel to make the CPU faster to improve performance. Traditional Uniprocessor Time: Moore’s law Art of Multiprocessor Programming
Multicore Scaling Process Speedup 1.8x 7x 3.6x User code Multicore With multicores, we will have to parallelize the code to make software faster, and we cannot do this automatically (except in a limited way on the level of individual instructions). As noted, not so simple… Art of Multiprocessor Programming
Real-World Scaling Process Speedup 2.9x 2x 1.8x User code Multicore This is because splitting the application up to utilize the cores is not simple, and coordination among the various code parts requires care. Parallelization and Synchronization will require great care… Art of Multiprocessor Programming
Future: Smoother Scaling … Speedup 1.8x 7x 3.6x User code Abstractions (like TM) Multicore
Multicore Programming We are at the dawn of a new era… Still a lot of research and development to be done And a body of multicore programming experience to be accumelated by us all © 2006 Herlihy & Shavit
Thanks for Attending! Good luck with your programming… If you have questions or run into interesting problems, just email us: mph@cs.brown.edu shanir@cs.tau.ac.il © 2006 Herlihy & Shavit
Art of Multiprocessor Programming This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share — to copy, distribute and transmit the work to Remix — to adapt the work Under the following conditions: Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Art of Multiprocessor Programming