Shared Counters and Parallelism Multiprocessor Synchronization Nir Shavit Spring 2003
A Shared Pool: an unordered set of objects.
public interface Pool {
  public void put(Object x);
  public Object remove();
}
put inserts an object, blocking if the pool is full; remove removes and returns an object, blocking if the pool is empty.
Simple Locking Implementation: guard the pool with a single lock. Problem: hot-spot contention. Solution: a queue lock. Problem: still a sequential bottleneck. Solution???
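A minimal sketch of such a lock-based pool, assuming an unbounded pool so the blocks-if-full case can be ignored (class and field names here are illustrative, not from the slides):

import java.util.ArrayDeque;
import java.util.Deque;

// Every put and remove goes through one lock, so the lock is both
// a hot spot and a sequential bottleneck.
class LockedPool implements Pool {
    private final Deque<Object> items = new ArrayDeque<>();

    public synchronized void put(Object x) {
        items.push(x);
        notifyAll();                          // wake threads blocked in remove()
    }

    public synchronized Object remove() {
        while (items.isEmpty()) {             // block if empty
            try { wait(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return items.pop();
    }
}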
Counting Implementation: a put counter and a remove counter hand out array slots (19, 20, 21, … in the figure); puts and removes then proceed in parallel on different slots, so only the counters are sequential.
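A minimal sketch of this idea, assuming the counters are AtomicLong fields and that blocking is done by spinning on the assigned slot (names are illustrative, not from the slides):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Only the two counters are sequential; threads assigned different
// slots complete their put/remove operations in parallel.
class CountingPool implements Pool {
    private final AtomicLong putCounter = new AtomicLong();
    private final AtomicLong removeCounter = new AtomicLong();
    private final AtomicReferenceArray<Object> slots;

    CountingPool(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    public void put(Object x) {
        int i = (int) (putCounter.getAndIncrement() % slots.length());
        while (!slots.compareAndSet(i, null, x)) {          // block (spin) until my slot is free
            Thread.onSpinWait();
        }
    }

    public Object remove() {
        int i = (int) (removeCounter.getAndIncrement() % slots.length());
        Object x;
        while ((x = slots.getAndSet(i, null)) == null) {    // block (spin) until my slot holds an item
            Thread.onSpinWait();
        }
        return x;
    }
}

The getAndIncrement calls on the two shared counters are exactly the operations the rest of the lecture tries to make scalable.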
Shared Counter: no duplication, no omission, but not necessarily linearizable.
Shared Counters: can we build a shared counter with low memory contention and real parallelism? Locking: queue locks can reduce contention, but they are no help with the parallelism issue…
Software Combining Tree: combine increment requests up the tree, and wait for the winners to propagate answers back down. Contention: only local spinning. Parallelism: a high combining rate implies n/log n speed-up.
Combining Trees: two threads meet at a node and combine their sums (+2 and +3 become +5).
Combining Trees: the combined sum (+5) is added to the root.
Combining Trees: the new value (5) is returned to the children.
Combining Trees: prior values are returned to the waiting threads.
Tree Node Status: FREE (nothing going on), COMBINE (waiting to combine), RESULT (results ready), ROOT (stop here).
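A sketch of the state a tree node might carry, assuming the field names below (they are illustrative; the slides name only the four statuses):

// Per-node state for a software combining tree.
enum Status { FREE, COMBINE, RESULT, ROOT }

class TreeNode {
    Status status = Status.FREE;   // which of the four states the node is in
    boolean locked;                // short-term lock protecting the fields below
    int firstValue;                // increment deposited by the first thread to arrive
    int secondValue;               // increment deposited by a thread combining here
    int result;                    // value handed back down after the root is updated
    TreeNode parent;               // null at the root

    synchronized void lock() throws InterruptedException {
        while (locked) wait();
        locked = true;
    }

    synchronized void unlock() {
        locked = false;
        notifyAll();
    }
}

The four phases described on the following slides walk this structure up and then back down the tree.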
Phase One: find the path up to the first COMBINE or ROOT node; at each node, lock it and examine its status.
Phase One: found a FREE node: set it to COMBINE, unlock, and move up.
Phase One: found a RESULT node: unlock and wait for the status to change.
Phase One: found a COMBINE or ROOT node: unlock and start phase two.
Phase Two: revisit the path, combining values where combining was requested.
Phase Three: stopped at a COMBINE node: lock it, deposit my value, unlock, and wait for my value to be combined by the other thread.
Phase Three: stopped at the ROOT: add my value to the root and start back down.
Phase Four: propagate the values back down the tree.
Bad News: High Latency: every operation traverses a path of length log n up and back down.
Good News: Real Parallelism: one thread does the root update on behalf of several combined requests (+2 and +3 combined into +5), so threads make progress in parallel.
Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work (more work, less concurrency)
  }
}
Performance Benchmarks: simulated Alewife DSM architecture. Throughput: average number of inc operations in a 1-million-cycle period. Latency: average number of simulator cycles per inc operation.
The Alewife Topology (* posters courtesy of the MIT Architecture Group).
Performance: simulated 64-node Alewife, work = 0. The graph plots throughput against number of processors for a 64-leaf combining tree, an MCS queue lock, and a spin lock on the counting benchmark (ignoring startup and wind-down times); the spin lock is best at low concurrency but drops off rapidly.
The Combining Paradigm: implements any RMW operation. When the tree is loaded it takes 2 log n steps for n requests, but it is very sensitive to load fluctuations: if the arrival rates drop, the combining rates drop, and overall performance deteriorates!
Combining Load Sensitivity (work = 500): notice the load fluctuations.
Combining Load Sensitivity: combining rate by concurrency and work parameter W.
Concurrency   W=100   W=1000   W=5000
 1              0.0
 2             10.3      2.0      0.3
 4             26.1      8.5      2.2
 8             40.3     19.9     10.6
16             50.4     31.7
32             55.3     39.5     18.5
48             54.2     40.0     15.6
64             65.5              18.7
As work increases, concurrency decreases, and combining levels drop significantly…
Better to Wait Longer. The figure compares combining-tree latency when work is high under three waiting policies: a short wait (16 cycles), a medium wait (256 cycles), and an indefinite wait. When the number of processors exceeds 64, indefinite waiting is by far the best policy: an un-combined token message blocks later token messages from progressing until it returns from traversing the root, so a large performance penalty is paid for each un-combined message. Because the chances of combining are good at higher arrival rates, when work = 0 simulations with more than four processors already justify indefinite waiting.
Can we do better? The combining tree is centralized (one slow process delays everybody!) and synchronized (threads wait for others to bring their numbers). We want something distributed (spread the work out across the machine) and coordinated (coordinate, but do not wait).
Counting Networks: how do we coordinate access to the counters? Processors P1 … Pn send tokens through a counting network; the counters on the output wires hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, … respectively. No duplication, no omission.
A Balancer: input wires and output wires.
Tokens Traverse Balancers: the i-th token may enter on any input wire but leaves on output wire i mod (fan-out).
Tokens Traverse Balancers: an arbitrary input distribution becomes a balanced output distribution.
Formally: A Balancer. If x0 and x1 tokens arrive on the two input wires, then in any quiescent state y0 = ⌈(x0 + x1)/2⌉ tokens have left on the top output wire and y1 = ⌊(x0 + x1)/2⌋ on the bottom. Def: a quiescent state is one in which all tokens input on the input wires have exited on the output wires.
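A small illustrative check of this property (not from the slides; it simply tabulates the quiescent-state outputs for a few token counts):

// Quiescent-state behavior of a single balancer:
// y0 = ceil((x0 + x1) / 2), y1 = floor((x0 + x1) / 2).
public class BalancerProperty {
    public static void main(String[] args) {
        for (int x0 = 0; x0 <= 4; x0++) {
            for (int x1 = 0; x1 <= 4; x1++) {
                int total = x0 + x1;
                int y0 = (total + 1) / 2;   // ceiling of total/2
                int y1 = total / 2;         // floor of total/2
                // No token is lost or duplicated, and the outputs differ by at most one.
                assert y0 + y1 == total && y0 - y1 <= 1;
                System.out.printf("x0=%d x1=%d  ->  y0=%d y1=%d%n", x0, x1, y0, y1);
            }
        }
    }
}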
Smoothing Network: the outputs satisfy the k-smooth property.
Counting Network: the outputs satisfy the step property.
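Stated formally, for output wire counts y0 … y_{w-1} in a quiescent state (these are the standard definitions; the slides show them only in the figures):

\[
\text{$k$-smooth:}\quad |y_i - y_j| \le k \ \text{ for all } i, j,
\qquad\qquad
\text{step:}\quad 0 \le y_i - y_j \le 1 \ \text{ for all } i < j .
\]

The step property implies the 1-smooth property, so every counting network is also a smoothing network.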
Counting Networks Count! Put a local counter on each output wire: for a width-4 network they hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, ….
Bitonic[4]
Counting Networks: good for counting the number of tokens: low contention, no sequential bottleneck, high throughput.
Shared Memory Implementation: each balancer is a toggle bit (0/1), and local counters on the four output wires hand out 1+4k, 2+4k, 3+4k, and 4+4k respectively.
Shared Memory Implementation
class Balancer {
  boolean toggle;       // which output wire the next token takes
  Balancer[] next;      // next[0], next[1]: the balancers on the two output wires (null at a leaf)
  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }
  boolean isLeaf() { return next == null; }   // leaves stand for the network's output wires
}
Shared Memory Implementation
Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;
}
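A hypothetical usage sketch (not from the slides) of how this traversal plus one local counter per output wire yields a shared counter; here a leaf Balancer stands for a single output wire, and the class and field names are illustrative:

import java.util.Map;

// The counter on wire i of a width-w counting network hands out
// the values i, i + w, i + 2w, ...
class NetworkCounter {
    private final Balancer[] entry;              // first-layer balancers, one per pair of input wires
    private final Map<Balancer, int[]> wireOf;   // leaf -> {wire index, tokens taken so far}
    private final int width;                     // network width w

    NetworkCounter(Balancer[] entry, Map<Balancer, int[]> wireOf, int width) {
        this.entry = entry;
        this.wireOf = wireOf;
        this.width = width;
    }

    public int getAndIncrement() {
        // spread threads over the input wires, then traverse as on the previous slide
        Balancer b = entry[(int) (Thread.currentThread().getId() % entry.length)];
        while (!b.isLeaf()) {
            b = b.flip() ? b.next[0] : b.next[1];
        }
        int[] wire = wireOf.get(b);
        synchronized (wire) {                    // short, per-wire critical section
            return wire[0] + (wire[1]++) * width;
        }
    }
}

Contention is spread across the balancers and the per-wire counters rather than concentrated on a single shared variable.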
Message-Passing Implementation
The Bitonic Counting Network. Bitonic[2k] inductive construction: the inputs x feed two side-by-side Bitonic[k] networks whose outputs feed a Merger[2k], which produces the outputs y. We need only show how to construct Merger[2k].
Merger[2k]: the even-numbered wires of the top input half and the odd-numbered wires of the bottom half feed one Merger[k]; the odd wires of the top half and the even wires of the bottom half feed a second Merger[k]; a final layer of k balancers joins output i of the first merger with output i of the second to produce y0, y1, …, y2k-2, y2k-1.
Merger[4]: the smallest case: fed by two Bitonic[2]s, its even/odd wires go to two Merger[2]s (each a single balancer), followed by a final layer of balancers.
Merger[8]: two Bitonic[4] networks feed a Merger[8] built from two Merger[4]s and a final layer of balancers. Theorem: in any quiescent state, the outputs of Bitonic[w] have the step property.
Merger[2k]
Proof Outline: the outputs from each Bitonic[k] have the step property, so within each half the even and odd subsequences differ by at most one token; these subsequences are the inputs to the two Merger[k]s.
Proof Outline: by induction, each Merger[k] turns its inputs into outputs with the step property, and the two mergers' totals differ by at most one token.
Proof Outline: the final layer of balancers evens out that at-most-one-token difference, so the outputs of the last layer have the step property.
Depth of the Bitonic Network: a width-w Bitonic network is two width-w/2 Bitonic networks followed by a Merger[w] of depth log w.
Unfolded Bitonic Network: Merger[8]
Unfolded Bitonic Network: Merger[4] Merger[4]
Unfolded Bitonic Network: Merger[2] Merger[2] Merger[2] Merger[2]
Bitonic[w] Depth: with width w, the depth is log 2 + log 4 + … + log(w/2) + log w = (log w)(log w + 1)/2.
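A small illustrative check (not from the slides) that the recursion depth(w) = depth(w/2) + log w from the previous slide matches this closed form:

public class BitonicDepth {
    static int log2(int w) { return 31 - Integer.numberOfLeadingZeros(w); }

    // depth(Bitonic[2]) = 1; depth(Bitonic[w]) = depth(Bitonic[w/2]) + log w,
    // since Merger[w] has depth log w.
    static int depth(int w) {
        if (w == 2) return 1;
        return depth(w / 2) + log2(w);
    }

    public static void main(String[] args) {
        for (int w = 2; w <= 64; w *= 2) {
            int closedForm = log2(w) * (log2(w) + 1) / 2;
            System.out.printf("w=%2d  depth=%2d  (log w)(log w + 1)/2=%2d%n",
                              w, depth(w), closedForm);
        }
    }
}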
Lower Bound on Depth. Theorem: the depth of any width-w counting network is at least Ω(log w). Theorem: there exists a counting network of Θ(log w) depth. Unfortunately… it is non-constructive and the constants are in the 1000s.
The Periodic Network
Network Depth: each Block[k] has depth log₂ k; we need log₂ k blocks; the grand total is (log₂ k)².
Bitonic[k] is not Linearizable
Bitonic[k] is not Linearizable: the problem is that red finished before yellow started, yet red took 2 and yellow took 0.
Sequential Theorem: if a balancing network counts sequentially (tokens traverse one at a time), then it also counts when tokens traverse concurrently.
Red First, Blue Second
Blue First, Red Second
Either Way: same balancer states.
Order Doesn't Matter: same balancer states, same output distribution.
Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work
  }
}
Performance (Simulated): throughput (higher is better) versus number of processors for the MCS queue lock and the spin lock, on the counting benchmark ignoring startup and wind-down times; the spin lock is best at low concurrency but drops off rapidly. (* All graphs taken from Herlihy, Lim, Shavit, copyright ACM.)
Performance (Simulated): the same throughput graph, now adding the 64-leaf combining tree and the 80-balancer counting network.
Performance (Simulated): combining and counting are pretty close.
Performance (Simulated): but they beat the hell out of the competition!
Saturation and Performance: undersaturated (P < w log w), saturated (P = w log w, optimal performance), oversaturated (P > w log w).
Throughput vs. Size: throughput versus number of processors for Bitonic[16]. The simulation was extended to 80 processors, one balancer per processor in the width-16 network, and as can be seen the counting network scales…
(c) 2003 M. Herlihy & N. Shavit