Download presentation
Presentation is loading. Please wait.
1
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
2
Art of Multiprocessor Programming 2 A Shared Pool Put –Inserts object –blocks if full Remove –Removes & returns an object –blocks if empty public interface Pool { public void put(Object x); public Object remove(); } Unordered set of objects
3
Art of Multiprocessor Programming 3 put Simple Locking Implementation put
4
Art of Multiprocessor Programming 4 put Simple Locking Implementation put Problem: hot- spot contention
5
Art of Multiprocessor Programming 5 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put
6
Art of Multiprocessor Programming 6 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put Solution: Queue Lock
7
Art of Multiprocessor Programming 7 put Simple Locking Implementation Problem: hot- spot contention Problem: sequential bottleneck put Solution: Queue Lock Solution ???
8
Art of Multiprocessor Programming 8 Counting Implementation 19 20 21 remove put 19 20 21
9
Art of Multiprocessor Programming 9 Counting Implementation 19 20 21 Only the counters are sequential remove put 19 20 21
10
Art of Multiprocessor Programming 10 Shared Counter 3 2 1 0 1 2 3
11
Art of Multiprocessor Programming 11 Shared Counter 3 2 1 0 1 2 3 No duplication
12
Art of Multiprocessor Programming 12 Shared Counter 3 2 1 0 1 2 3 No duplication No Omission
13
Art of Multiprocessor Programming 13 Shared Counter 3 2 1 0 1 2 3 Not necessarily linearizable No duplication No Omission
14
Art of Multiprocessor Programming 14 Shared Counters Can we build a shared counter with –Low memory contention, and –Real parallelism? Locking –Can use queue locks to reduce contention –No help with parallelism issue …
15
Art of Multiprocessor Programming 15 Software Combining Tree 4 Contention: All spinning local Parallelism: Potential n/log n speedup
16
Art of Multiprocessor Programming 16 Combining Trees 0
17
Art of Multiprocessor Programming 17 Combining Trees 0 +3
18
Art of Multiprocessor Programming 18 Combining Trees 0 +3 +2
19
Art of Multiprocessor Programming 19 Combining Trees 0 +3 +2 Two threads meet, combine sums
20
Art of Multiprocessor Programming 20 Combining Trees 0 +3 +2 Two threads meet, combine sums +5
21
Art of Multiprocessor Programming 21 Combining Trees 5 +3 +2 +5 Combined sum added to root
22
Art of Multiprocessor Programming 22 Combining Trees 5 +3 +2 0 Result returned to children
23
Art of Multiprocessor Programming 23 Combining Trees 5 0 0 3 0 Results returned to threads
24
Art of Multiprocessor Programming 24 Devil in the Details What if –threads don’t arrive at the same time? Wait for a partner to show up? –How long to wait? –Waiting times add up … Instead –Use multi-phase algorithm –Try to wait in parallel …
25
Art of Multiprocessor Programming 25 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT};
26
Art of Multiprocessor Programming 26 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; Nothing going on
27
Art of Multiprocessor Programming 27 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 1 st thread ISO partner for combining, will return soon to check for 2 nd thread
28
Art of Multiprocessor Programming 28 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 2 nd thread arrived with value for combining
29
Art of Multiprocessor Programming 29 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; 1 st thread has completed operation & deposited result for 2 nd thread
30
Art of Multiprocessor Programming 30 Combining Status enum CStatus{ IDLE, FIRST, SECOND, DONE, ROOT}; Special case: root node
31
Art of Multiprocessor Programming 31 Node Synchronization Short-term –Synchronized methods –Consistency during method call Long-term –Boolean locked field –Consistency across calls
32
Art of Multiprocessor Programming 32 Phases Precombining –Set up combining rendez-vous Combining –Collect and combine operations Operation –Hand off to higher thread Distribution –Distribute results to waiting threads
33
Art of Multiprocessor Programming 33 Precombining Phase 0 Examine status IDLE
34
Art of Multiprocessor Programming 34 Precombining Phase 0 0 If IDLE, promise to return to look for partner FIRST
35
Art of Multiprocessor Programming 35 Precombining Phase 0 At ROOT, turn back FIRST
36
Art of Multiprocessor Programming 36 Precombining Phase 0 FIRST
37
Art of Multiprocessor Programming 37 Precombining Phase 0 0 SECOND If FIRST, I’m willing to combine, but lock for now
38
Art of Multiprocessor Programming 38 Code Tree class –In charge of navigation Node class –Combining state –Synchronization state –Bookkeeping
39
Art of Multiprocessor Programming 39 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node;
40
Art of Multiprocessor Programming 40 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Start at leaf
41
Art of Multiprocessor Programming 41 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Move up while instructed to do so
42
Art of Multiprocessor Programming 42 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Remember where we stopped
43
Art of Multiprocessor Programming 43 Precombining Node synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() }
44
Art of Multiprocessor Programming 44 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Short-term synchronization
45
Art of Multiprocessor Programming 45 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Synchronization Wait while node is locked
46
Art of Multiprocessor Programming 46 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Check combining status
47
Art of Multiprocessor Programming 47 Node was IDLE synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } I will return to look for combining value
48
Art of Multiprocessor Programming 48 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Continue up the tree
49
Art of Multiprocessor Programming 49 I’m the 2 nd Thread synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If 1 st thread has promised to return, lock node so it won’t leave without me
50
Art of Multiprocessor Programming 50 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Prepare to deposit 2 nd value
51
Art of Multiprocessor Programming 51 Precombining Node synchronized boolean phase1() { while (sStatus==SStatus.BUSY) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } End of phase 1, don’t continue up tree
52
Art of Multiprocessor Programming 52 Node is the Root synchronized boolean phase1() { while (sStatus==SStatus.BUSY) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If root, phase 1 ends, don’t continue up tree
53
Art of Multiprocessor Programming 53 Precombining Node synchronized boolean phase1() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Always check for unexpected values!
54
Art of Multiprocessor Programming 54 Combining Phase 0 0 SECOND 1 st thread locked out until 2 nd provides value +3
55
Art of Multiprocessor Programming 55 Combining Phase 0 0 SECOND 2 nd thread deposits value to be combined, unlocks node, & waits … 2 +3 zzz
56
Art of Multiprocessor Programming 56 Combining Phase +3 +2 +5 SECOND 2 0 1 st thread moves up the tree with combined value … zzz
57
Art of Multiprocessor Programming 57 Combining (reloaded) 0 0 2 nd thread has not yet deposited value … FIRST
58
Art of Multiprocessor Programming 58 Combining (reloaded) 0 +3 FIRST 1 st thread is alone, locks out late partner
59
Art of Multiprocessor Programming 59 Combining (reloaded) 0 +3 FIRST Stop at root
60
Art of Multiprocessor Programming 60 Combining (reloaded) 0 +3 FIRST 2 nd thread’s phase 1 visit locked out
61
Art of Multiprocessor Programming 61 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; }
62
Art of Multiprocessor Programming 62 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Start at leaf
63
Art of Multiprocessor Programming 63 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Add 1
64
Art of Multiprocessor Programming 64 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Revisit nodes visited in phase 1
65
Art of Multiprocessor Programming 65 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Accumulate combined values, if any
66
Art of Multiprocessor Programming 66 node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Combining Navigation We will retraverse path in reverse order …
67
Art of Multiprocessor Programming 67 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Move up the tree
68
Art of Multiprocessor Programming 68 Combining Phase Node synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … }
69
Art of Multiprocessor Programming 69 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Wait until node is unlocked
70
Art of Multiprocessor Programming 70 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Lock out late attempts to combine
71
Art of Multiprocessor Programming 71 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Remember our contribution
72
Art of Multiprocessor Programming 72 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Check status
73
Art of Multiprocessor Programming 73 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node 1 st thread is alone
74
Art of Multiprocessor Programming 74 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Node Combine with 2 nd thread
75
Art of Multiprocessor Programming 75 Operation Phase 5 +3 +2 +5 Add combined value to root, start back down (phase 4) zzz
76
Art of Multiprocessor Programming 76 Operation Phase (reloaded) 5 Leave value to be combined … SECOND 2
77
Art of Multiprocessor Programming 77 Operation Phase (reloaded) 5 +2 Unlock, and wait … SECOND 2 zzz
78
Art of Multiprocessor Programming 78 Operation Phase Navigation prior = stop.op(combined);
79
Art of Multiprocessor Programming 79 Operation Phase Navigation prior = stop.op(combined); Get result of combining
80
Art of Multiprocessor Programming 80 Operation Phase Node synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: …
81
Art of Multiprocessor Programming 81 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … At Root Add sum to root, return prior value
82
Art of Multiprocessor Programming 82 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Deposit value for later combining …
83
Art of Multiprocessor Programming 83 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node, notify 1 st thread
84
Art of Multiprocessor Programming 84 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Wait for 1 st thread to deliver results
85
Art of Multiprocessor Programming 85 synchronized int op(int combined) { switch (cStatus) { case ROOT: int oldValue = result; result += combined; return oldValue; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.DONE) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node & return
86
Art of Multiprocessor Programming 86 Distribution Phase 5 0 zzz Move down with result SECOND
87
Art of Multiprocessor Programming 87 Distribution Phase 5 zzz Leave result for 2 nd thread & lock node SECOND 2
88
Art of Multiprocessor Programming 88 Distribution Phase 5 0 zzz Move result back down tree SECOND 2
89
Art of Multiprocessor Programming 89 Distribution Phase 5 2 nd thread awakens, unlocks, takes value IDLE 3
90
Art of Multiprocessor Programming 90 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior;
91
Art of Multiprocessor Programming 91 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Traverse path in reverse order
92
Art of Multiprocessor Programming 92 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Distribute results to waiting 2 nd threads
93
Art of Multiprocessor Programming 93 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Return result to caller
94
Art of Multiprocessor Programming 94 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: …
95
Art of Multiprocessor Programming 95 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: … No combining, unlock node & reset
96
Art of Multiprocessor Programming 96 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.DONE; notifyAll(); return; default: … Notify 2 nd thread that result is available
97
Art of Multiprocessor Programming 97 Bad News: High Latency +2 +3 +5 Log n
98
Art of Multiprocessor Programming 98 Good News: Real Parallelism +2 +3 +5 2 threads 1 thread
99
Art of Multiprocessor Programming 99 Throughput Puzzles Ideal circumstances –All n threads move together, combine –n increments in O(log n) time Worst circumstances –All n threads slightly skewed, locked out –n increments in O(n · log n) time
100
Art of Multiprocessor Programming 100 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }}
101
Art of Multiprocessor Programming 101 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} How many iterations
102
Art of Multiprocessor Programming 102 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Expected time between incrementing counter
103
Art of Multiprocessor Programming 103 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Take a number
104
Art of Multiprocessor Programming 104 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Pretend to work (more work, less concurrency)
105
Art of Multiprocessor Programming 105 Performance Benchmarks Alewife –NUMA architecture –Simulated Throughput: –average number of inc operations in 1 million cycle period. Latency: –average number of simulator cycles per inc operation.
106
Art of Multiprocessor Programming 106 Performance Latency: Throughput: 90000 80000 60000 50000 30000 20000 0 0 50100150 200250 300 Splock Ctree[n] num of processors cycles per operation 100150 200250 300 90000 80000 70000 60000 50000 40000 30000 20000 Splock Ctree[n] operations per million cycles 50 10000 0 0 num of processors work = 0
107
Art of Multiprocessor Programming 107 Performance Latency: Throughput: 90000 80000 60000 50000 30000 20000 0 0 50100150 200250 300 Splock Ctree[n] num of processors cycles per operation 100150 200250 300 90000 80000 70000 60000 50000 40000 30000 20000 Splock Ctree[n] operations per million cycles 50 10000 0 0 num of processors work = 0
108
Art of Multiprocessor Programming 108 The Combining Paradigm Implements any RMW operation When tree is loaded –Takes 2 log n steps –for n requests Very sensitive to load fluctuations: –if the arrival rates drop –the combining rates drop –overall performance deteriorates!
109
Art of Multiprocessor Programming 109 Combining Load Sensitivity Notice Load Fluctuations Throughput processors
110
Art of Multiprocessor Programming 110 Combining Rate vs Work
111
Art of Multiprocessor Programming 111 Better to Wait Longer Short wait Indefinite wait Medium wait Throughput processors
112
Art of Multiprocessor Programming 112 Conclusions Combining Trees –Work well under high contention –Sensitive to load fluctuations –Can be used for getAndMumble() ops Next –Counting networks –A different approach …
113
Art of Multiprocessor Programming 113 A Balancer Input wires Output wires
114
Art of Multiprocessor Programming 114 Tokens Traverse Balancers Token i enters on any wire leaves on wire i mod (fan-out)
115
Art of Multiprocessor Programming 115 Tokens Traverse Balancers
116
Art of Multiprocessor Programming 116 Tokens Traverse Balancers
117
Art of Multiprocessor Programming 117 Tokens Traverse Balancers
118
Art of Multiprocessor Programming 118 Tokens Traverse Balancers
119
Art of Multiprocessor Programming 119 Tokens Traverse Balancers Arbitrary input distribution Balanced output distribution
120
Art of Multiprocessor Programming 120 Smoothing Network k-smooth property
121
Art of Multiprocessor Programming 121 Counting Network step property
122
Art of Multiprocessor Programming 122 Counting Networks Count! 0, 4, 8..... 1, 5, 9..... 2, 6,10.... 3, 7........
123
Art of Multiprocessor Programming 123 Bitonic[4]
124
Art of Multiprocessor Programming 124 Bitonic[4]
125
Art of Multiprocessor Programming 125 Bitonic[4]
126
Art of Multiprocessor Programming 126 Bitonic[4]
127
Art of Multiprocessor Programming 127 Bitonic[4]
128
Art of Multiprocessor Programming 128 Bitonic[4]
129
Art of Multiprocessor Programming 129 Counting Networks Good for counting number of tokens low contention no sequential bottleneck high throughput practical networks depth
130
Art of Multiprocessor Programming 130 Bitonic[k] is not Linearizable
131
Art of Multiprocessor Programming 131 Bitonic[k] is not Linearizable
132
Art of Multiprocessor Programming 132 Bitonic[k] is not Linearizable 2
133
Art of Multiprocessor Programming 133 Bitonic[k] is not Linearizable 2 0
134
Art of Multiprocessor Programming 134 Bitonic[k] is not Linearizable 2 0 Problem is: Red finished before Yellow started Red took 2 Yellow took 0
135
Art of Multiprocessor Programming 135 Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; }
136
Art of Multiprocessor Programming 136 Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } state
137
Art of Multiprocessor Programming 137 Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } Output connections to balancers
138
Art of Multiprocessor Programming 138 Shared Memory Implementation class balancer { boolean toggle; balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } Get-and-complement
139
Art of Multiprocessor Programming 139 Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; }
140
Art of Multiprocessor Programming 140 Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Stop when we get to the end
141
Art of Multiprocessor Programming 141 Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Flip state
142
Art of Multiprocessor Programming 142 Shared Memory Implementation Balancer traverse (Balancer b) { while(!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0] else b = b.next[1] return b; } Exit on wire
143
Art of Multiprocessor Programming 143 Alternative Implementation: Message-Passing
144
Art of Multiprocessor Programming 144 Bitonic[2k] Schematic Bitonic[k] Merger[2k]
145
Art of Multiprocessor Programming 145 Bitonic[2k] Layout
146
Art of Multiprocessor Programming 146 Unfolded Bitonic Network
147
Art of Multiprocessor Programming 147 Unfolded Bitonic Network Merger[8]
148
Art of Multiprocessor Programming 148 Unfolded Bitonic Network
149
Art of Multiprocessor Programming 149 Unfolded Bitonic Network Merger[4]
150
Art of Multiprocessor Programming 150 Unfolded Bitonic Network
151
Art of Multiprocessor Programming 151 Unfolded Bitonic Network Merger[2]
152
Art of Multiprocessor Programming 152 Bitonic[k] Depth Width k Depth is (log 2 k)(log 2 k + 1)/2
153
Art of Multiprocessor Programming 153 Merger[2k]
154
Art of Multiprocessor Programming 154 Merger[2k] Schematic Merger[k]
155
Art of Multiprocessor Programming 155 Merger[2k] Layout
156
Art of Multiprocessor Programming 156 Lemma If a sequence has the step property …
157
Art of Multiprocessor Programming 157 Lemma So does its even subsequence
158
Art of Multiprocessor Programming 158 Lemma And its odd subsequence
159
Art of Multiprocessor Programming 159 Merger[2k] Schematic Merger[k] Bitonic[k] even odd even
160
Art of Multiprocessor Programming 160 Proof Outline Outputs from Bitonic[k] Inputs to Merger[k] even odd even
161
Art of Multiprocessor Programming 161 Proof Outline Inputs to Merger[k] even odd even Outputs of Merger[k]
162
Art of Multiprocessor Programming 162 Proof Outline Outputs of Merger[k]Outputs of last layer
163
Art of Multiprocessor Programming 163 Periodic Network Block
164
Art of Multiprocessor Programming 164 Periodic Network Block
165
Art of Multiprocessor Programming 165 Periodic Network Block
166
Art of Multiprocessor Programming 166 Periodic Network Block
167
Art of Multiprocessor Programming 167 Block[2k] Schematic Block[k]
168
Art of Multiprocessor Programming 168 Block[2k] Layout
169
Art of Multiprocessor Programming 169 Periodic[8]
170
Art of Multiprocessor Programming 170 Network Depth Each block[k] has depth log 2 k Need log 2 k blocks Grand total of (log 2 k) 2
171
Art of Multiprocessor Programming 171 Lower Bound on Depth Theorem: The depth of any width w counting network is at least Ω(log w). Theorem: there exists a counting network of θ(log w) depth. Unfortunately, proof is non-constructive and constants in the 1000s.
172
Art of Multiprocessor Programming 172 Sequential Theorem If a balancing network counts –Sequentially, meaning that –Tokens traverse one at a time Then it counts –Even if tokens traverse concurrently
173
Art of Multiprocessor Programming 173 Red First, Blue Second (2)
174
Art of Multiprocessor Programming 174 Blue First, Red Second (2)
175
Art of Multiprocessor Programming 175 Either Way Same balancer states
176
Art of Multiprocessor Programming 176 Order Doesn’t Matter Same balancer states Same output distribution
177
Art of Multiprocessor Programming 177 Index Distribution Benchmark void indexBench(int iters, int work) { while (int i = 0 < iters) { i = fetch&inc(); Thread.sleep(random() % work); }
178
Art of Multiprocessor Programming 178 Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput Higher is better!
179
Art of Multiprocessor Programming 179 Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Higher is better!
180
Art of Multiprocessor Programming 180 Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Combining and counting are pretty close
181
Art of Multiprocessor Programming 181 Performance (Simulated) * All graphs taken from Herlihy,Lim,Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network But they beat the hell out of the competition!
182
Art of Multiprocessor Programming 182 Saturation and Performance Undersaturated P < w log w Saturated P = w log w Oversaturated P > w log w Optimal performance
183
Art of Multiprocessor Programming 183 Throughput vs. Size Bitonic[16] Bitonic[4] Bitonic[8] Number processors Throughput
184
Art of Multiprocessor Programming 184 Shared Pool Put counting network Remove counting network
185
Art of Multiprocessor Programming 185 Put/Remove Network Guarantees never: –Put waiting for item, while –Get has deposited item Otherwise OK to wait –Put delayed while pool slot is full –Get delayed while pool slot is empty
186
Art of Multiprocessor Programming 186 What About Decrements Adding arbitrary values Other operations –Multiplication –Vector addition –Horoscope casting …
187
Art of Multiprocessor Programming 187 First Step Can we decrement as well as increment? What goes up, must come down …
188
Art of Multiprocessor Programming 188 Anti-Tokens
189
Art of Multiprocessor Programming 189 Tokens & Anti-Tokens Cancel
190
Art of Multiprocessor Programming 190 Tokens & Anti-Tokens Cancel
191
Art of Multiprocessor Programming 191 Tokens & Anti-Tokens Cancel
192
Art of Multiprocessor Programming 192 Tokens & Anti-Tokens Cancel As if nothing happened
193
Art of Multiprocessor Programming 193 Tokens vs Antitokens Tokens –read balancer –flip –proceed Antitokens –flip balancer –read –proceed
194
Art of Multiprocessor Programming 194 Pumping Lemma Keep pumping tokens through one wire Eventually, after Ω tokens, network repeats a state
195
Art of Multiprocessor Programming 195 Anti-Token Effect token anti-token
196
Art of Multiprocessor Programming 196 Observation Each anti-token on wire i –Has same effect as Ω-1 tokens on wire i –So network still in legal state Moreover, network width w divides Ω –So Ω-1 tokens
197
Art of Multiprocessor Programming 197 Before Antitoken
198
Art of Multiprocessor Programming 198 Balancer states as if … Ω-1 Ω-1 is one brick shy of a load
199
Art of Multiprocessor Programming 199 Post Antitoken Next token shows up here
200
Art of Multiprocessor Programming 200 Implication Counting networks with –Tokens (+1) –Anti-tokens (-1) Give –Highly concurrent –Low contention getAndIncrement + getAndDecrement methods QED
201
Art of Multiprocessor Programming 201 Adding Networks Combining trees implement –Fetch&add –Add any number, not just 1 What about counting networks?
202
Art of Multiprocessor Programming 202 Fetch-and-add Beyond getAndIncrement + getAndDecrement What about getAndAdd(x)? –Atomically returns prior value –And adds x to value? Not to mention –getAndMultiply –getAndFourierTransform?
203
Art of Multiprocessor Programming 203 Bad News If an adding network –Supports n concurrent tokens Then every token must traverse –At least n-1 balancers –In sequential executions
204
Art of Multiprocessor Programming 204 Uh-Oh Adding network size depends on n –Like combining trees –Unlike counting networks High latency –Depth linear in n –Not logarithmic in w
205
Art of Multiprocessor Programming 205 Generic Counting Network +1 +2 2 2
206
Art of Multiprocessor Programming 206 First Token +1 +2 First token would visit green balancers if it runs solo
207
Art of Multiprocessor Programming 207 Claim Look at path of +1 token All other +2 tokens must visit some balancer on +1 token’s path
208
Art of Multiprocessor Programming 208 Second Token +1 Takes 0 +2
209
Art of Multiprocessor Programming 209 Second Token Takes 0 +2 +1 Takes 0 They can’t both take zero! +2
210
Art of Multiprocessor Programming 210 If Second avoids First’s Path Second token –Doesn’t observe first –First hasn’t run –Chooses 0 First token –Doesn’t observe second –Disjoint paths –Chooses 0
211
Art of Multiprocessor Programming 211 If Second avoids First’s Path Because +1 token chooses 0 –It must be ordered first –So +2 token ordered second –So +2 token should return 1 Something’s wrong!
212
Art of Multiprocessor Programming 212 Second Token +1 +2 Halt blue token before first green balancer +2
213
Art of Multiprocessor Programming 213 Third Token +1 Takes 0 or 2 +2
214
Art of Multiprocessor Programming 214 Third Token +2 +1 Takes 0 +2 Takes 0 or 2 They can’t both take zero, and they can’t take 0 and 2!
215
Art of Multiprocessor Programming 215 First,Second, & Third Tokens must be Ordered Third (+2) token –Did not observe +1 token –May have observed earlier +2 token –Takes an even number
216
Art of Multiprocessor Programming 216 First,Second, & Third Tokens must be Ordered Because +1 token’s path is disjoint –It chooses 0 –Ordered first –Rest take odd numbers But last token takes an even number Something’s wrong!
217
Art of Multiprocessor Programming 217 Third Token +1 +2 Halt blue token before first green balancer
218
Art of Multiprocessor Programming 218 Continuing in this way We can “park” a token –In front of a balancer –That token #1 will visit There are n-1 other tokens –Two wires per balancer –Path includes n-1 balancers!
219
Art of Multiprocessor Programming 219 Theorem In any adding network –In sequential executions –Tokens traverse at least n-1 balancers Same arguments apply to –Linearizable counting networks –Multiplying networks –And others
220
Art of Multiprocessor Programming 220 Clip Art
221
Art of Multiprocessor Programming 221 This work is licensed under a Creative Commons Attribution- ShareAlike 2.5 License.Creative Commons Attribution- ShareAlike 2.5 License You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to –http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.