1
Shared Counters and Parallelism
Multiprocessor Synchronization, Nir Shavit, Spring 2003
2
A Shared Pool
public interface Pool {
  public void put(Object x);
  public Object remove();
}
Unordered set of objects.
Put: inserts an object; blocks if the pool is full.
Remove: removes and returns an object; blocks if the pool is empty.
3
Simple Locking Implementation
Problem: hot-spot contention. Solution: queue lock.
Problem: sequential bottleneck. Solution: ???
4
Counting Implementation
[Figure: a put counter and a remove counter, each handing out slot indices 19, 20, 21, …]
Only the counters are sequential.
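A minimal sketch (added here, not the lecture's code) of this idea: two fetch-and-increment counters hand out slot indices into an array of one-element blocking queues, so only the counters are sequential. The class name CountingPool and the use of java.util.concurrent building blocks are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class CountingPool {
    private final List<ArrayBlockingQueue<Object>> slots = new ArrayList<>();
    private final AtomicLong putTicket = new AtomicLong();     // shared "put" counter
    private final AtomicLong removeTicket = new AtomicLong();  // shared "remove" counter

    CountingPool(int capacity) {
        for (int i = 0; i < capacity; i++)
            slots.add(new ArrayBlockingQueue<>(1));   // each slot holds at most one object
    }

    public void put(Object x) throws InterruptedException {
        // fetch&inc on the put counter picks a slot; blocks if that slot is still full
        int i = (int) (putTicket.getAndIncrement() % slots.size());
        slots.get(i).put(x);
    }

    public Object remove() throws InterruptedException {
        // fetch&inc on the remove counter picks a slot; blocks if that slot is still empty
        int i = (int) (removeTicket.getAndIncrement() % slots.size());
        return slots.get(i).take();
    }
}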
5
Shared Counter
No duplication, no omission.
Not necessarily linearizable.
6
Shared Counters
Can we build a shared counter with low memory contention and real parallelism?
Locking: can use queue locks to reduce contention, but no help with the parallelism issue…
7
Software Combining Tree
Combine increment requests up the tree, wait for winners to propagate answers down.
Contention: only local spinning.
Parallelism: a high combining rate implies n/log n speed-up.
8
Combining Trees
[Figure: two threads with +2 and +3 meet at a node and combine into +5.]
Two threads meet, combine sums.
9
Combining Trees
Combined sum (+5) added to the root, which now reads 5.
10
Combining Trees
New value (5) returned to children.
11
Combining Trees
Prior values returned to threads.
12
Tree Node Status
FREE: nothing going on.
COMBINE: waiting to combine.
RESULT: result ready.
ROOT: stop here.
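A tiny sketch (added, not the lecture's code): the four states as a Java enum, so the phase slides that follow can be read as transitions between these values.

enum NodeStatus {
    FREE,     // nothing going on
    COMBINE,  // waiting to combine
    RESULT,   // result ready
    ROOT      // stop here
}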
13
Phase One: find path up to first COMBINE or ROOT node
Lock the node, examine its status.
14
Phase One: found FREE node
Set the node to COMBINE, unlock, move up.
15
Phase One: found RESULT node
Unlock and wait for the status to change.
16
Phase One: found COMBINE or ROOT node
Unlock, start phase two.
17
Phase Two: revisit path, combine if requested
18
Phase Three: stopped at COMBINE node
Lock the node, deposit my value, unlock, and wait for my value to be combined by the other thread.
19
Phase Three: stopped at ROOT
Add my value to the root and start back down.
20
Phase Four: propagate values down
21
Bad News: High Latency
Each request climbs a path of height log n.
22
Good News: Real Parallelism
[Figure: only 1 thread climbs with the combined +5, on behalf of the 2 threads that issued +2 and +3.]
23
Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetch&inc();                 // take a number
    Thread.sleep(random() % work);   // pretend to work (more work, less concurrency)
  }
}
24
Performance Benchmarks
Simulated Alewife DSM architecture.
Throughput: average number of inc operations in a 1-million-cycle period.
Latency: average number of simulator cycles per inc operation.
25
The Alewife Topology
* Posters courtesy of the MIT Architecture Group
26
Performance
Simulated 64-node Alewife, work = 0.
[Graph: throughput vs. number of processors for the 64-leaf combining tree, the MCS queue lock, and the spin lock.]
Counting benchmark: ignore startup and wind-down times. The spin lock is best at low concurrency, but drops off rapidly.
27
The Combining Paradigm
Implements any RMW operation.
When the tree is loaded, n requests take 2 log n steps.
Very sensitive to load fluctuations: if the arrival rates drop, the combining rates drop, and overall performance deteriorates!
28
Combining Load Sensitivity
work = 500. Notice the load fluctuations.
29
Combining Load Sensitivity
Combining rate (%) vs. concurrency, for work W:

Concurrency    W=100    W=1000    W=5000
     1           0.0
     2          10.3       2.0       0.3
     4          26.1       8.5       2.2
     8          40.3      19.9      10.6
    16          50.4      31.7
    32          55.3      39.5      18.5
    48          54.2      40.0      15.6
    64          65.5                18.7

As work increases, concurrency decreases, and combining levels drop significantly…
30
Better to Wait Longer
[Graph: combining-tree latency under three waiting policies: short wait, medium wait, and indefinite wait.]
The figure compares combining-tree latency when work is high using three waiting policies: wait 16 cycles, wait 256 cycles, and wait indefinitely. When the number of processors is larger than 64, indefinite waiting is by far the best policy. This follows because an un-combined token message blocks later token messages from progressing until it returns from traversing the root, so a large performance penalty is paid for each un-combined message. Because the chances of combining are good at higher arrival rates, we found that when work = 0, simulations using more than four processors justify indefinite waiting.
31
Can we do better?
Centralized: one slow process delays everybody!
Synchronized: wait for others to bring numbers.
Distributed: spread out work across the machine.
Coordinated: coordinate, but do not wait.
32
Counting Networks
How to coordinate access to counters?
[Figure: processors P1 … Pn feed a counting network whose output wires lead to counters handing out 0,4,…; 1,5,…; 2,6,…; 3,….]
No duplication, no omission.
33
A Balancer
[Figure: a balancer, with input wires on the left and output wires on the right.]
34
Tokens Traverse Balancers
Token i enters on any wire and leaves on wire i mod (fan-out).
39
Tokens Traverse Balancers
Arbitrary input distribution, balanced output distribution.
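A tiny runnable sketch (added, not from the slides) that simulates one balancer's toggle bit and checks the claim: however many tokens arrive, the two output wires receive counts that differ by at most one.

public class BalancerDemo {
    public static void main(String[] args) {
        boolean toggle = true;      // the balancer's toggle bit: true = top output wire next
        int[] out = new int[2];     // tokens seen so far on each output wire
        int tokens = 13;            // an arbitrary number of input tokens
        for (int t = 0; t < tokens; t++) {
            out[toggle ? 0 : 1]++;  // the token leaves on the wire the toggle points to
            toggle = !toggle;       // flip for the next token
        }
        System.out.println(out[0] + " " + out[1]);   // prints "7 6": balanced within one
    }
}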
40
Formally: A Balancer
A balancer has input wires x0, x1 and output wires y0, y1.
In any quiescent state, y0 = ⌈(x0 + x1)/2⌉ and y1 = ⌊(x0 + x1)/2⌋.
Def: a quiescent state is one in which all tokens input on the input wires have exited on the output wires.
41
Smoothing Network
k-smooth property
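The slide names the property without stating it; assuming the standard definition, the outputs y0, …, y(w−1) of a smoothing network are k-smooth when, in any quiescent state,

|y_i - y_j| \le k \qquad \text{for all } 0 \le i, j < w.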
42
Counting Network
step property
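Likewise, assuming the standard definition of the step property: in any quiescent state the outputs y0, …, y(w−1) satisfy

0 \le y_i - y_j \le 1 \qquad \text{for all } i < j.

A counting network is a balancing network whose outputs always have the step property; in particular it is 1-smooth.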
43
Counting Networks Count!
[Figure: output counters handing out 0,4,…; 1,5,…; 2,6,10,…; 3,….]
44
Bitonic[4]
[Figure: tokens traversing the Bitonic[4] network.]
50
Counting Networks
Good for counting the number of tokens:
low contention
no sequential bottleneck
high throughput
51
Shared Memory Implementation
[Figure: toggle-bit (0/1) balancers wired as Bitonic[4]; the four output wires hand out values 1+4k, 2+4k, 3+4k, 4+4k.]
52
Shared Memory Implementation
class Balancer {
  boolean toggle;
  Balancer[] next;

  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }
}
53
Shared Memory Implementation
Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;
}
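A minimal sketch (assumed glue code, not from the slides) of how traverse is used to count: each output wire keeps a local value, initialized to its wire index and stepped by the network width, so wire i hands out i, i+w, i+2w, …. The localValue map (java.util.Map) and its initialization are illustrative assumptions.

int getAndIncrement(Balancer root, java.util.Map<Balancer, Integer> localValue, int width) {
    Balancer leaf = traverse(root);            // the network spreads threads over the output wires
    synchronized (leaf) {                      // per-wire update; traffic is spread over w wires
        int value = localValue.get(leaf);      // assumed: the entry for wire i starts at i
        localValue.put(leaf, value + width);   // wire i hands out i, i+width, i+2*width, ...
        return value;
    }
}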
54
Message-Passing Implementation
55
The Bitonic Counting Network
Bitonic[2k] inductive construction:
[Figure: the inputs x feed two copies of Bitonic[k], whose outputs feed Merger[2k] to produce the outputs y.]
Need only show how to construct Merger[2k].
56
Merger[2k]
[Figure: Merger[2k] built from two Merger[k] networks; one merges the even wires of one input half with the odd wires of the other, and vice versa. Their outputs z feed a final layer of balancers producing y0 … y2k-1.]
57
Merger[4]
[Figure: Merger[4] fed by two Bitonic[2] networks, showing the even/odd wiring of its inputs.]
58
Merger[8]
[Figure: Merger[2k] fed by two Bitonic[k] networks; even and odd wires are routed to two Merger[k] networks whose outputs are z0 … z2k-1.]
Theorem: in any quiescent state, the outputs of Bitonic[w] have the step property.
59
Merger[2k]
60
Proof Outline
[Figure: the outputs from Bitonic[k], split into even and odd wires, become the inputs to Merger[k].]
One of the even and odd subsequences has at most one more token than the other.
61
Proof Outline
[Figure: the even and odd wires of the inputs to Merger[k], and the resulting outputs of Merger[k].]
62
Proof Outline
[Figure: from the outputs of Merger[k] to the outputs of the last layer.]
63
Depth of the Bitonic Network
[Figure: the recursive Bitonic[k] / Merger[k] structure; the network has width w.]
64
Unfolded Bitonic Network
[Figure: the unfolded network; the last stage is Merger[8].]
65
Unfolded Bitonic Network
[Figure: the preceding stage consists of two Merger[4] networks.]
66
Unfolded Bitonic Network
[Figure: the first stage consists of four Merger[2] networks.]
67
Bitonic[k] Depth
Width is w, so the depth is
log 2 + log 4 + … + log(w/2) + log w = (log w)(log w + 1)/2
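For reference, a short derivation of that closed form (added; it assumes the recursive construction above, in which Merger[w] contributes log w layers):

d(2) = 1, \qquad d(w) = d(w/2) + \log_2 w \;\Longrightarrow\; d(w) = \sum_{i=1}^{\log_2 w} i = \frac{(\log_2 w)(\log_2 w + 1)}{2}.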
68
Lower Bound on Depth
Theorem: the depth of any width-w counting network is at least Ω(log w).
Theorem: there exists a counting network of Θ(log w) depth.
Unfortunately… non-constructive, and the constants are in the 1000s.
69
The Periodic Network
70
Network Depth
Each Block[k] has depth log2 k.
Need log2 k blocks.
Grand total of (log2 k)^2.
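A quick worked comparison under the two depth formulas (added, not on the slide): for width 8,

\text{Periodic}[8]:\ (\log_2 8)^2 = 9 \text{ layers}, \qquad \text{Bitonic}[8]:\ \frac{3 \cdot 4}{2} = 6 \text{ layers}.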
71
Bitonic[k] is not Linearizable
75
Bitonic[k] is not Linearizable
Problem: Red finished before Yellow started, yet Red took 2 and Yellow took 0.
76
Sequential Theorem
If a balancing network counts sequentially, meaning that tokens traverse it one at a time, then it counts even when tokens traverse it concurrently.
77
Red First, Blue Second
78
Blue First, Red Second
79
Either Way
Same balancer states.
80
Order Doesn’t Matter
Same output distribution.
Same balancer states.
81
Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetch&inc();
    Thread.sleep(random() % work);
  }
}
82
Performance (Simulated)
Higher is better!
[Graph: throughput vs. number of processors for the MCS queue lock and the spin lock.]
Counting benchmark: ignore startup and wind-down times. The spin lock is best at low concurrency, but drops off rapidly.
* All graphs taken from Herlihy, Lim, Shavit, copyright ACM.
83
Performance (Simulated)
64-leaf combining tree, 80-balancer counting network.
[Graph: throughput (higher is better) vs. number of processors, adding the combining tree and the counting network to the previous plot.]
84
Performance (Simulated)
64-leaf combining tree, 80-balancer counting network.
Combining and counting are pretty close.
85
Performance (Simulated)
64-leaf combining tree, 80-balancer counting network.
But they beat the hell out of the competition!
86
Saturation and Performance
Undersaturated: P < w log w
Saturated: P = w log w (optimal performance)
Oversaturated: P > w log w
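A worked instance of this rule (added): for the 16-wide Bitonic network on the next slide,

w = 16:\quad w \log_2 w = 16 \cdot 4 = 64 \;\Rightarrow\; P < 64 \text{ undersaturated},\ P = 64 \text{ saturated},\ P > 64 \text{ oversaturated}.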
87
Throughput vs. Size
[Graph: throughput vs. number of processors for Bitonic[16].]
The simulation was extended to 80 processors, one balancer per processor in the 16-wide network, and as can be seen the counting network scales…