Barrier Synchronization


Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Simple Video Game
Prepare frame for display
  By graphics coprocessor
“Soft real-time” application
  Need at least 35 frames/second
  OK to mess up rarely

Simple Video Game
while (true) {
  frame.prepare();
  frame.display();
}
What about overlapping work?
  1st thread displays frame
  2nd prepares next frame

Two-Phase Rendering
Display thread:
while (true) {
  if (phase) {
    frame[0].display();
  } else {
    frame[1].display();
  }
  phase = !phase;
}
Prepare thread:
while (true) {
  if (phase) {
    frame[1].prepare();
  } else {
    frame[0].prepare();
  }
  phase = !phase;
}
Even phases: one frame is displayed while the other is prepared
Odd phases: the two frames swap roles

Synchronization Problems
How do threads stay in phase?
Too early?
  “We render no frame before its time”
Too late?
  Recycle memory before frame is displayed

Ideal Parallel Computation
(figure: every thread finishes phase 1, then every thread runs phase 2)

Real-Life Parallel Computation
(figure: one thread sleeps, “zzz…”, while the others finish phase 1; a thread then starts phase 2 while the sleeper is still behind: “Uh, oh”)

Barrier Synchronization
(figure: a barrier separates one phase from the next)
No thread enters here (the next phase)
Until every thread has left here (the previous phase)

Why Do We Care?
Mostly of interest to
  Scientific & numeric computation
Elsewhere
  Garbage collection
Less common in systems programming
Still important topic

Duality
Dual to mutual exclusion
  Include others, not exclude them
Same implementation issues
  Interaction with caches …
    Invalidation?
    Local spinning?

Example: Parallel Prefix
before: a, b, c, d
after: a, a+b, a+b+c, a+b+c+d
One thread per entry
Phase 1: a, b, c, d becomes a, a+b, b+c, c+d
Phase 2: a, a+b, b+c, c+d becomes a, a+b, a+b+c, a+b+c+d

Parallel Prefix
N threads can compute
  Parallel prefix
  Of N entries
  In log2 N rounds
What if system is asynchronous?
  Why we need barriers

Prefix
class Prefix extends Thread {
  private int[] a;    // array of input values
  private int i;      // thread index
  private Barrier b;  // shared barrier
  public Prefix(int[] a, Barrier b, int i) {
    // initialize fields; "this." is needed because "a = a" only reassigns the parameter
    this.a = a;
    this.b = b;
    this.i = i;
  }

Where Do the Barriers Go?
public void run() {
  int d = 1, sum = 0;
  while (d < N) {
    if (i >= d)
      sum = a[i-d];
    b.await();    // make sure everyone reads before anyone writes
    if (i >= d)   // guard the update so a stale sum is not added again in later rounds
      a[i] += sum;
    b.await();    // make sure everyone writes before anyone reads
    d = d * 2;
  }
}
}
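The two barrier placements can be tried out with a small runnable sketch. It assumes the JDK's `java.util.concurrent.CyclicBarrier` in place of the deck's `Barrier` class, and the names `PrefixDemo` and `parallelPrefix` are made up for illustration. The input array must be non-empty, because one thread is created per entry.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class PrefixDemo {
    // Computes prefix sums of a in place, using one thread per entry.
    static int[] parallelPrefix(int[] a) {
        final int n = a.length;
        final CyclicBarrier barrier = new CyclicBarrier(n); // stand-in for the deck's Barrier
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int i = t;
            workers[t] = new Thread(() -> {
                try {
                    for (int d = 1; d < n; d *= 2) {
                        int sum = (i >= d) ? a[i - d] : 0;
                        barrier.await();   // everyone reads before anyone writes
                        a[i] += sum;
                        barrier.await();   // everyone writes before anyone reads
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return a;
    }

    public static void main(String[] args) {
        // prints [1, 3, 6, 10]
        System.out.println(Arrays.toString(parallelPrefix(new int[] {1, 2, 3, 4})));
    }
}
```

Every thread calls `await()` twice per round whether or not it has anything to add, so the barrier's party count always matches the number of threads.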

Barrier Implementations
Cache coherence
  Spin on locally-cached locations?
  Spin on statically-defined locations?
Latency
  How many steps?
Symmetry
  Do all threads do the same thing?

Barriers
public class Barrier {
  AtomicInteger count;   // number of threads not yet arrived
  int size;              // number of threads participating
  public Barrier(int n) {   // initialization
    count = new AtomicInteger(n);
    size = n;
  }
  public void await() {     // principal method
    if (count.getAndDecrement() == 1) {
      count.set(size);              // if I'm last, reset fields for next time
    } else {
      while (count.get() != 0);     // otherwise, wait for everyone else
    }
  }
}
What’s wrong with this protocol?

Reuse
Barrier b = new Barrier(n);
while (mumble()) {
  work();       // do work
  b.await();    // synchronize
}               // repeat

Barriers (the same Barrier code, stepped through one reuse)
A waiting thread spins in the while loop: “Waiting for Phase 1 to finish”
The last thread arrives and count reaches 0: “Phase 1 is so over”
The last thread calls count.set(size): “Prepare for phase 2”, while the waiting thread is asleep (“ZZZZZ…”) and never sees count == 0
Uh-oh: one thread is “Waiting for Phase 2 to finish” while another is still “Waiting for Phase 1 to finish”

Basic Problem
One thread “wraps around” to start phase 2
While another thread is still waiting for phase 1
One solution:
  Always use two barriers

Sense-Reversing Barriers
public class Barrier {
  AtomicInteger count;
  int size;
  volatile boolean sense = false;            // completed odd or even-numbered phase?
  threadSense = new ThreadLocal<Boolean>…    // store sense for next phase (assumed to start as the opposite of sense)
  public void await() {
    boolean mySense = threadSense.get();     // get new sense determined by last phase
    if (count.getAndDecrement() == 1) {
      count.set(size);
      sense = mySense;                       // if I'm last, reverse sense for next time
    } else {
      while (sense != mySense) {}            // otherwise, wait for sense to flip
    }
    threadSense.set(!mySense);               // prepare sense for next phase
  }
}
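A runnable sketch of the sense-reversing idea follows. It assumes each thread's sense starts as the opposite of the shared `sense` (the slide elides that initializer), and it calls `Thread.yield()` in the spin loop so it also behaves on machines with few cores. `SenseBarrierDemo.violations` is a hypothetical check that counts how often a thread leaves the barrier before a peer has finished the phase.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class SenseBarrier {
    private final AtomicInteger count;
    private final int size;
    private volatile boolean sense = false;
    // Assumption: per-thread sense starts as the opposite of the shared sense.
    private final ThreadLocal<Boolean> threadSense = ThreadLocal.withInitial(() -> true);

    SenseBarrier(int n) {
        count = new AtomicInteger(n);
        size = n;
    }

    void await() {
        boolean mySense = threadSense.get();
        if (count.getAndDecrement() == 1) {
            count.set(size);   // reset the counter before releasing anyone
            sense = mySense;   // flip the shared sense, which releases the waiters
        } else {
            while (sense != mySense) Thread.yield();
        }
        threadSense.set(!mySense);
    }
}

class SenseBarrierDemo {
    // Returns how many times a thread passed the barrier ahead of an unfinished peer.
    static int violations(int threads, int rounds) {
        SenseBarrier barrier = new SenseBarrier(threads);
        AtomicIntegerArray done = new AtomicIntegerArray(threads);
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    done.set(id, r);   // the "work" of phase r
                    barrier.await();
                    for (int j = 0; j < threads; j++)
                        if (done.get(j) < r) bad.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return bad.get();
    }
}
```

Because the counter is reset before the sense flips, a thread that races ahead into the next phase finds a fresh counter and a sense value that cannot be confused with the previous phase.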

Combining Tree Barriers
(figure)

Combining Tree Barrier
public class Node {
  AtomicInteger count;
  int size;
  Node parent;               // parent barrier in tree
  volatile boolean sense;
  public void await() {…
    if (count.getAndDecrement() == 1) {   // am I last?
      if (parent != null) {
        parent.await();                   // proceed to parent barrier
      }
      count.set(size);                    // prepare for next phase
      sense = mySense;                    // notify others at this node
    } else {
      while (sense != mySense) {}         // I'm not last, so wait for notification
    }
    …
  }
}
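One way to assemble such nodes into a complete barrier is sketched below. The tree-building code, the explicit thread id passed to `await(int id)`, and the restriction that the thread count be a power of the radix are all assumptions made for this illustration; the slides show only the node's `await()`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class TreeBarrier {
    private final int radix;
    private final Node[] leaf;
    private int leaves = 0;
    private final ThreadLocal<Boolean> threadSense = ThreadLocal.withInitial(() -> true);

    // Assumption: n is a power of radix (n = radix^depth with depth >= 1).
    TreeBarrier(int n, int radix) {
        this.radix = radix;
        this.leaf = new Node[n / radix];
        int depth = 0;
        for (int m = n; m > 1; m /= radix) depth++;
        build(new Node(null), depth - 1);
    }

    private void build(Node node, int depth) {
        if (depth == 0) { leaf[leaves++] = node; return; }
        for (int i = 0; i < radix; i++) build(new Node(node), depth - 1);
    }

    // Thread ids 0..n-1 are supplied by the caller; radix threads share each leaf.
    void await(int id) { leaf[id / radix].await(); }

    private final class Node {
        final AtomicInteger count = new AtomicInteger(radix);
        final Node parent;
        volatile boolean sense = false;
        Node(Node parent) { this.parent = parent; }

        void await() {
            boolean mySense = threadSense.get();
            if (count.getAndDecrement() == 1) {       // am I last at this node?
                if (parent != null) parent.await();   // proceed to parent barrier
                count.set(radix);                     // prepare for next phase
                sense = mySense;                      // notify others at this node
            } else {
                while (sense != mySense) Thread.yield();
            }
            threadSense.set(!mySense);
        }
    }
}

class TreeBarrierDemo {
    // Returns how many times a thread passed the barrier ahead of an unfinished peer.
    static int violations(int threads, int radix, int rounds) {
        TreeBarrier barrier = new TreeBarrier(threads, radix);
        AtomicIntegerArray done = new AtomicIntegerArray(threads);
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    done.set(id, r);
                    barrier.await(id);
                    for (int j = 0; j < threads; j++)
                        if (done.get(j) < r) bad.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return bad.get();
    }
}
```

A thread that climbs to the parent toggles its thread-local sense there and again on the way back down; both writes store the same value, so the repetition is harmless.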

Combining Tree Barrier
No sequential bottleneck
  Parallel getAndDecrement() calls
Low memory contention
  Same reason
Cache behavior
  Local spinning on bus-based architecture
  Not so good for NUMA

Remarks
Everyone spins on sense field
  Local spinning on bus-based (good)
  Network hot-spot on distributed architecture (bad)
Not really scalable

Tournament Tree Barrier
If tree nodes have fan-in 2
  Don’t need to call getAndDecrement()
  Winner chosen statically
At level i
  If i-th bit of id is 0, move up
  Otherwise keep back

Tournament Tree Barriers
(figure: a binary tree with a root; each pair of threads at a level has a winner and a loser)
All flags blue
Loser thread sets winner’s flag
Loser spins on own flag
Winner spins on own flag
Winner sees own flag, moves up, spins
Bingo!
Sense-reversing: next time use blue flags

Tournament Barrier
class TBarrier {
  boolean flag;        // notifications delivered here
  TBarrier partner;    // other thread at same level
  TBarrier parent;     // parent (winner) or null (loser)
  boolean top;         // am I the root?
  …
}

Tournament Barrier
void await(boolean mySense) {     // sense is a parameter to save space…
  if (top) {                      // le root, c’est moi (I am the root)
    return;
  } else if (parent != null) {    // I am already a winner
    while (flag != mySense) {};   // wait for partner
    parent.await(mySense);        // synchronize upstairs
    partner.flag = mySense;       // inform partner
  } else {                        // natural-born loser
    partner.flag = mySense;       // tell partner I’m here
    while (flag != mySense) {};   // wait for notification from partner
  }
}}
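The slides leave out how the `TBarrier` nodes are wired together. The sketch below is one possible wiring, assuming the number of threads is a power of two, that thread ids 0..n-1 are passed in by the caller, and that each thread flips its own sense after every phase. `TournamentBarrier` and `TournamentDemo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class TBarrier {
    volatile boolean flag;   // notifications delivered here
    TBarrier partner;        // other thread at same level
    TBarrier parent;         // next level up (winner) or null (loser)
    boolean top;             // true only for the root

    void await(boolean mySense) {
        if (top) {
            return;
        } else if (parent != null) {                  // winner
            while (flag != mySense) Thread.yield();   // wait for partner
            parent.await(mySense);                    // synchronize upstairs
            partner.flag = mySense;                   // release partner
        } else {                                      // loser
            partner.flag = mySense;                   // tell partner I'm here
            while (flag != mySense) Thread.yield();   // wait to be released
        }
    }
}

class TournamentBarrier {
    private final TBarrier[] leaf;   // leaf[id] is where thread id enters

    // Assumption: n is a power of two and n >= 2.
    TournamentBarrier(int n) {
        if (n < 2 || Integer.bitCount(n) != 1) throw new IllegalArgumentException("n must be a power of two >= 2");
        int levels = Integer.numberOfTrailingZeros(n);
        TBarrier[][] node = new TBarrier[levels][n];
        TBarrier root = new TBarrier();
        root.top = true;
        for (int l = 0; l < levels; l++)
            for (int id = 0; id < n; id += 1 << l)    // ids still playing at level l
                node[l][id] = new TBarrier();
        for (int l = 0; l < levels; l++)
            for (int id = 0; id < n; id += 1 << l) {
                TBarrier b = node[l][id];
                b.partner = node[l][id ^ (1 << l)];
                if ((id & (1 << l)) == 0)             // bit l is 0: winner moves up
                    b.parent = (l + 1 < levels) ? node[l + 1][id] : root;
            }
        leaf = node[0];
    }

    void await(int id, boolean mySense) { leaf[id].await(mySense); }
}

class TournamentDemo {
    // Returns how many times a thread passed the barrier ahead of an unfinished peer.
    static int violations(int threads, int rounds) {
        TournamentBarrier barrier = new TournamentBarrier(threads);
        AtomicIntegerArray done = new AtomicIntegerArray(threads);
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                boolean sense = true;                 // flags start false
                for (int r = 1; r <= rounds; r++) {
                    done.set(id, r);
                    barrier.await(id, sense);
                    sense = !sense;
                    for (int j = 0; j < threads; j++)
                        if (done.get(j) < r) bad.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return bad.get();
    }
}
```

Each flag has exactly one writer (the partner), which is why no read-modify-write operation is needed.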

Remarks
No need for read-modify-write calls
Each thread spins on fixed location
  Good for bus-based architectures
  Good for NUMA architectures

Dissemination Barrier
At round i
  Thread A notifies thread A + 2^i (mod n)
Requires log n rounds
(figure: notification distances +1, +2, +4 in successive rounds)
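The round structure can be sketched as follows. This is a simplified variant, not the layout from the slides or the book: instead of sense-reversed boolean flags it stores the phase number in each notification slot, which makes reuse easy to reason about at the cost of one int per thread per round. The class names and the explicit thread id parameter are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class DisseminationBarrier {
    private final int n, rounds;
    // inbox[id].get(r) is the latest phase in which id was notified in round r.
    private final AtomicIntegerArray[] inbox;
    private final int[] phase;   // phase[id] is read and written only by thread id

    DisseminationBarrier(int n) {
        this.n = n;
        this.rounds = 32 - Integer.numberOfLeadingZeros(n - 1);   // ceil(log2 n)
        this.inbox = new AtomicIntegerArray[n];
        for (int id = 0; id < n; id++) inbox[id] = new AtomicIntegerArray(rounds);
        this.phase = new int[n];
    }

    void await(int id) {
        int p = ++phase[id];
        for (int r = 0; r < rounds; r++) {
            int partner = (id + (1 << r)) % n;        // thread A notifies A + 2^r (mod n)
            inbox[partner].set(r, p);
            // A fast notifier may already be in a later phase, hence "<" rather than "!=".
            while (inbox[id].get(r) < p) Thread.yield();
        }
    }
}

class DisseminationDemo {
    // Returns how many times a thread passed the barrier ahead of an unfinished peer.
    static int violations(int threads, int rounds) {
        DisseminationBarrier barrier = new DisseminationBarrier(threads);
        AtomicIntegerArray done = new AtomicIntegerArray(threads);
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    done.set(id, r);
                    barrier.await(id);
                    for (int j = 0; j < threads; j++)
                        if (done.get(j) < r) bad.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return bad.get();
    }
}
```

Unlike the tree barriers, no thread plays a special role and the thread count need not be a power of two; every thread performs the same rounds.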

Remarks
Great for lovers of combinatorics
Not so much fun for those who track memory footprints

Ideas So Far
Sense-reversing
  Reuse without reinitializing
Combining tree
  Like counters, locks …
Tournament tree
  Optimized combining tree
Dissemination barrier
  Intellectually pleasing (matter of taste)

Which is best for Multicore?
On a cache-coherent multicore chip: none of the above…
Use a static tree barrier!

Static Tree Barrier
(figure: a tree of nodes, each holding a counter)
Counter in parent: number of children that have not arrived
All spin on cached copy of Done flag
All detect change in Done flag
Notes: We describe the static tree using counters in the nodes, but it could use a simple array of locations, each written by a child and all spun on (repeatedly read) by the parent.

Remarks
Very little cache traffic
Minimal space overhead
Can be laid out optimally on NUMA
Instead of Done bit, notification must be performed by passing sense down the static tree
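A compact sketch of a static tree barrier follows. It assumes an implicit binary tree laid out like a heap (node i has children 2i+1 and 2i+2), thread ids 0..n-1 supplied by the caller, and a single sense-reversed flag playing the role of the Done flag. These layout choices are illustrative and not taken from the slides.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class StaticTreeBarrier {
    private final int n;
    private final AtomicInteger[] pending;    // pending[i]: children of node i not yet arrived
    private volatile boolean sense = false;   // shared "Done" flag, sense-reversed
    private final boolean[] mySense;          // mySense[i] is touched only by thread i

    StaticTreeBarrier(int n) {
        this.n = n;
        pending = new AtomicInteger[n];
        mySense = new boolean[n];
        for (int i = 0; i < n; i++) {
            pending[i] = new AtomicInteger(children(i));
            mySense[i] = true;
        }
    }

    // Heap-style layout (an assumption): node i has children 2i+1 and 2i+2.
    private int children(int i) {
        int c = 0;
        if (2 * i + 1 < n) c++;
        if (2 * i + 2 < n) c++;
        return c;
    }

    void await(int id) {
        boolean s = mySense[id];
        while (pending[id].get() > 0) Thread.yield();   // wait for my whole subtree
        pending[id].set(children(id));                  // reset counter for next phase
        if (id != 0) {
            pending[(id - 1) / 2].getAndDecrement();    // tell parent I have arrived
            while (sense != s) Thread.yield();          // spin on the shared flag
        } else {
            sense = s;                                  // root flips the flag for everyone
        }
        mySense[id] = !s;
    }
}

class StaticTreeDemo {
    // Returns how many times a thread passed the barrier ahead of an unfinished peer.
    static int violations(int threads, int rounds) {
        StaticTreeBarrier barrier = new StaticTreeBarrier(threads);
        AtomicIntegerArray done = new AtomicIntegerArray(threads);
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    done.set(id, r);
                    barrier.await(id);
                    for (int j = 0; j < threads; j++)
                        if (done.get(j) < r) bad.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return bad.get();
    }
}
```

Each node resets its counter before notifying its parent, so the reset always completes before the root can flip the flag and let children arrive for the next phase.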

So… How will we make use of multicores?
Back to Amdahl’s Law:
  Speedup = 1 / (ParallelPart/N + SequentialPart)
Pay for N = 8 cores
SequentialPart = 25%
Speedup = only 2.9 times!
Must parallelize applications on a very fine grain!
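Substituting the slide's numbers (ParallelPart = 0.75, SequentialPart = 0.25, N = 8) into the formula gives the quoted figure, and letting N grow without bound shows the ceiling that the sequential part imposes:

```latex
\[
\text{Speedup} = \frac{1}{\frac{0.75}{8} + 0.25}
               = \frac{1}{0.09375 + 0.25}
               = \frac{1}{0.34375} \approx 2.9
\]
\[
\lim_{N \to \infty} \frac{1}{\frac{0.75}{N} + 0.25} = \frac{1}{0.25} = 4
\]
```

With a 25% sequential part, no number of cores can deliver more than a 4x speedup.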

Need Fine-Grained Locking
The reason we get only 2.9 speedup
(figure: coarse-grained vs. fine-grained locking, each with 25% shared and 75% unshared)

Traditional Scaling Process
(figure: speedup of user code on a traditional uniprocessor over time, 1.8x, 3.6x, 7x; Moore’s law)
Notes: Recall the traditional scaling process for software: write it once, trust Intel to make the CPU faster to improve performance.

Multicore Scaling Process
(figure: speedup of user code on multicore, 1.8x, 3.6x, 7x)
Notes: With multicores, we will have to parallelize the code to make software faster, and we cannot do this automatically (except in a limited way on the level of individual instructions).
As noted, not so simple…

Real-World Scaling Process
(figure: speedup of user code on multicore, 1.8x, 2x, 2.9x)
Notes: This is because splitting the application up to utilize the cores is not simple, and coordination among the various code parts requires care.
Parallelization and synchronization will require great care…

Future: Smoother Scaling …
(figure: speedup of user code on multicore with abstractions (like TM), 1.8x, 3.6x, 7x)

Multicore Programming
We are at the dawn of a new era…
Still a lot of research and development to be done
And a body of multicore programming experience to be accumulated by us all
© 2006 Herlihy & Shavit

Thanks for Attending! Good luck with your programming… If you have questions or run into interesting problems, just email us: mph@cs.brown.edu shanir@cs.tau.ac.il © 2006 Herlihy & Shavit

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
You are free:
  to Share: to copy, distribute and transmit the work
  to Remix: to adapt the work
Under the following conditions:
  Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
Any of the above conditions can be waived if you get permission from the copyright holder.
Nothing in this license impairs or restricts the author's moral rights.