
1 Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

2 Art of Multiprocessor Programming 2 A Shared Pool Put –Insert item –block if full Remove –Remove & return item –block if empty public interface Pool<T> { void put(T x); T remove(); }
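Below the interface, a minimal sketch (ours, not from the slides) of the single-lock pool the next slides critique: every put and remove serializes on one monitor, which is exactly the hot spot and sequential bottleneck discussed next. Class and field names are assumptions.

import java.util.ArrayDeque;
import java.util.Queue;

// Minimal single-lock Pool: correct, but every call contends for "this".
public class LockedPool<T> implements Pool<T> {
  private final Queue<T> items = new ArrayDeque<>();
  private final int capacity;

  public LockedPool(int capacity) { this.capacity = capacity; }

  public synchronized void put(T x) {
    while (items.size() == capacity) {        // block if full
      try { wait(); } catch (InterruptedException ignored) {}
    }
    items.add(x);
    notifyAll();                              // wake blocked removers
  }

  public synchronized T remove() {
    while (items.isEmpty()) {                 // block if empty
      try { wait(); } catch (InterruptedException ignored) {}
    }
    T x = items.remove();
    notifyAll();                              // wake blocked putters
    return x;
  }
}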

3 Art of Multiprocessor Programming 3 put Simple Locking Implementation put

4 4 Simple Locking Implementation put Problem: hot-spot contention

5 5 put Simple Locking Implementation put Problem: hot-spot contention Problem: sequential bottleneck

6 Art of Multiprocessor Programming 6 put Simple Locking Implementation put Problem: hot-spot contention Problem: sequential bottleneck Solution: Queue Lock

7 Art of Multiprocessor Programming 7 put Simple Locking Implementation put Problem: hot-spot contention Problem: sequential bottleneck Solution: Queue Lock Solution: ?

8 Art of Multiprocessor Programming 8 Counting Implementation 19 20 21 remove put 19 20 21

9 Art of Multiprocessor Programming 9 Counting Implementation 19 20 21 Only the counters are sequential remove put 19 20 21

10 Art of Multiprocessor Programming 10 Shared Counter 3 2 1 0 1 2 3

11 Art of Multiprocessor Programming 11 Shared Counter 3 2 1 0 1 2 3 No duplication

12 Art of Multiprocessor Programming 12 Shared Counter 3 2 1 0 1 2 3 No duplication No Omission

13 Art of Multiprocessor Programming 13 Shared Counter 3 2 1 0 1 2 3 Not necessarily linearizable No duplication No Omission

14 Art of Multiprocessor Programming 14 Shared Counters Can we build a shared counter with –Low memory contention, and –Real parallelism? Locking –Can use queue locks to reduce contention –No help with parallelism issue …

15 Art of Multiprocessor Programming 15 Software Combining Tree 4 Contention: All spinning local Parallelism: Potential n/log n speedup

16 Art of Multiprocessor Programming 16 Combining Trees 0

17 Art of Multiprocessor Programming 17 Combining Trees 0 +3

18 Art of Multiprocessor Programming 18 Combining Trees 0 +3 +2

19 Art of Multiprocessor Programming 19 Combining Trees 0 +3 +2 Two threads meet, combine sums

20 Art of Multiprocessor Programming 20 Combining Trees 0 +3 +2 Two threads meet, combine sums +5

21 Art of Multiprocessor Programming 21 Combining Trees 5 +3 +2 +5 Combined sum added to root

22 Art of Multiprocessor Programming 22 Combining Trees 5 +3 +2 0 Result returned to children

23 Art of Multiprocessor Programming 23 Combining Trees 5 0 0 3 0 Results returned to threads

24 Art of Multiprocessor Programming 24 What if? Threads don’t arrive together? –Should I stay or should I go? How long to wait? –Waiting times add up … Idea: –Use multi-phase algorithm –Where threads wait in parallel …

25 Art of Multiprocessor Programming 25 Combining Status enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

26 enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT }; Art of Multiprocessor Programming 26 Combining Status Nothing going on

27 enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT }; Art of Multiprocessor Programming 27 Combining Status 1st thread is a partner for combining, will return to check for 2nd thread

28 enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT }; Art of Multiprocessor Programming 28 Combining Status 2nd thread has arrived with value for combining

29 enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT }; Art of Multiprocessor Programming 29 Combining Status 1st thread has deposited result for 2nd thread

30 enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT }; Art of Multiprocessor Programming 30 Combining Status Special case: root node

31 Art of Multiprocessor Programming 31 Node Synchronization Use “Meta Locking:” Short-term –Synchronized methods –Consistency during method call Long-term –Boolean locked field –Consistency across calls

32 Art of Multiprocessor Programming 32 Phases Precombining –Set up combining rendez-vous

33 Art of Multiprocessor Programming 33 Phases Precombining –Set up combining rendez-vous Combining –Collect and combine operations

34 Art of Multiprocessor Programming 34 Phases Precombining –Set up combining rendez-vous Combining –Collect and combine operations Operation –Hand off to higher thread

35 Art of Multiprocessor Programming 35 Phases Precombining –Set up combining rendez-vous Combining –Collect and combine operations Operation –Hand off to higher thread Distribution –Distribute results to waiting threads

36 Art of Multiprocessor Programming 36 Precombining Phase 0 Examine status IDLE

37 Art of Multiprocessor Programming 37 Precombining Phase 0 0 If IDLE, promise to return to look for partner FIRST

38 Art of Multiprocessor Programming 38 Precombining Phase At ROOT, turn back FIRST 0

39 Art of Multiprocessor Programming 39 Precombining Phase 0 FIRST

40 Art of Multiprocessor Programming 40 Precombining Phase 0 0 SECOND If FIRST, I’m willing to combine, but lock for now

41 Art of Multiprocessor Programming 41 Code Tree class –In charge of navigation Node class –Combining state –Synchronization state –Bookkeeping
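To make the walkthrough below easier to follow, here is a sketch (ours, not the book’s exact class) of the Node fields the slide code manipulates; the names cStatus, locked, firstValue, secondValue, result, and parent come from the slides, the constructors are assumptions.

class Node {
  enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT }
  CStatus cStatus;    // combining state (slide 25)
  boolean locked;     // long-term lock: consistency across calls
  int firstValue;     // 1st thread's (possibly combined) contribution
  int secondValue;    // 2nd thread's deposited contribution
  int result;         // at the root: the counter; elsewhere: result for the 2nd thread
  Node parent;        // null at the root
  Node() { cStatus = CStatus.ROOT; }                                  // root node
  Node(Node parent) { this.parent = parent; cStatus = CStatus.IDLE; } // inner node or leaf
}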

42 Art of Multiprocessor Programming 42 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node;

43 Art of Multiprocessor Programming 43 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Start at leaf

44 Art of Multiprocessor Programming 44 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Move up while instructed to do so

45 Art of Multiprocessor Programming 45 Precombining Navigation Node node = myLeaf; while (node.precombine()) { node = node.parent; } Node stop = node; Remember where we stopped

46 Art of Multiprocessor Programming 46 Precombining Node synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException(); } }

47 Art of Multiprocessor Programming 47 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Short-term synchronization

48 Art of Multiprocessor Programming 48 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Synchronization Wait while node is locked (in use by earlier combining phase)

49 Art of Multiprocessor Programming 49 synchronized boolean precombine() { while (locked) wait(); switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Precombining Node Check combining status

50 Art of Multiprocessor Programming 50 Node was IDLE synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } I will return to look for 2nd thread’s input value

51 Art of Multiprocessor Programming 51 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Continue up the tree

52 Art of Multiprocessor Programming 52 I’m the 2nd Thread synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If 1st thread has promised to return, lock node so it won’t leave without me

53 Art of Multiprocessor Programming 53 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Prepare to deposit 2nd thread’s input value

54 Art of Multiprocessor Programming 54 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } End of precombining phase, don’t continue up tree

55 Art of Multiprocessor Programming 55 Node is the Root synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } If root, precombining phase ends, don’t continue up tree

56 Art of Multiprocessor Programming 56 Precombining Node synchronized boolean precombine() { while (locked) {wait();} switch (cStatus) { case IDLE: cStatus = CStatus.FIRST; return true; case FIRST: locked = true; cStatus = CStatus.SECOND; return false; case ROOT: return false; default: throw new PanicException() } Always check for unexpected values!

57 Art of Multiprocessor Programming 57 Combining Phase 0 0 SECOND 2nd thread locks out 1st until 2nd returns with value +3

58 Art of Multiprocessor Programming 58 Combining Phase 0 0 SECOND 2nd thread deposits value to be combined, unlocks node, & waits … 2 +3 zzz

59 Art of Multiprocessor Programming 59 Combining Phase +3 +2 +5 SECOND 2 0 1st thread moves up the tree with combined value … zzz

60 Art of Multiprocessor Programming 60 Combining (reloaded) 0 0 2nd thread has not yet deposited value … FIRST

61 Art of Multiprocessor Programming 61 Combining (reloaded) 0 +3 FIRST 1st thread is alone, locks out late partner

62 Art of Multiprocessor Programming 62 Combining (reloaded) 0 +3 FIRST Stop at root

63 Art of Multiprocessor Programming 63 Combining (reloaded) 0 +3 FIRST 2nd thread’s late precombining phase visit locked out

64 Art of Multiprocessor Programming 64 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; }

65 Art of Multiprocessor Programming 65 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Start at leaf

66 Art of Multiprocessor Programming 66 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Add 1

67 Art of Multiprocessor Programming 67 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Revisit nodes visited in precombining

68 Art of Multiprocessor Programming 68 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Accumulate combined values, if any

69 Art of Multiprocessor Programming 69 node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Combining Navigation We will retraverse path in reverse order …

70 Art of Multiprocessor Programming 70 Combining Navigation node = myLeaf; int combined = 1; while (node != stop) { combined = node.combine(combined); stack.push(node); node = node.parent; } Move up the tree

71 Art of Multiprocessor Programming 71 Combining Phase Node synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } }

72 Art of Multiprocessor Programming 72 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node If node is locked by the 2nd thread, wait until it deposits its value

73 Art of Multiprocessor Programming 73 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node How do we know that no thread acquires the lock between the two lines? Because the methods are synchronized

74 Art of Multiprocessor Programming 74 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Lock out late attempts to combine (by threads still in precombining)

75 Art of Multiprocessor Programming 75 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Remember my (1st thread) contribution

76 Art of Multiprocessor Programming 76 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node Check status

77 Art of Multiprocessor Programming 77 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Phase Node I (1st thread) am alone

78 Art of Multiprocessor Programming 78 synchronized int combine(int combined) { while (locked) wait(); locked = true; firstValue = combined; switch (cStatus) { case FIRST: return firstValue; case SECOND: return firstValue + secondValue; default: … } Combining Node Not alone: combine with 2nd thread

79 Art of Multiprocessor Programming 79 Operation Phase 5 +3 +2 +5 Add combined value to root, start back down zzz

80 Art of Multiprocessor Programming 80 Operation Phase (reloaded) 5 Leave value to be combined … SECOND 2

81 Art of Multiprocessor Programming 81 Operation Phase (reloaded) 5 +2 Unlock, and wait … SECOND 2 zzz

82 Art of Multiprocessor Programming 82 Operation Phase Navigation prior = stop.op(combined);

83 Art of Multiprocessor Programming 83 Operation Phase Navigation prior = stop.op(combined); The node where we stopped. Provide collected sum and wait for combining result

84 Art of Multiprocessor Programming 84 Operation on Stopped Node synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: …

85 Art of Multiprocessor Programming 85 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Op States of Stop Node Only ROOT and SECOND possible. Why?

86 Art of Multiprocessor Programming 86 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … At Root Add sum to root, return prior value

87 Art of Multiprocessor Programming 87 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Deposit value for later combining …

88 Art of Multiprocessor Programming 88 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node (locked in precombining), then notify 1st thread

89 Art of Multiprocessor Programming 89 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Wait for 1st thread to deliver results

90 Art of Multiprocessor Programming 90 synchronized int op(int combined) { switch (cStatus) { case ROOT: int prior = result; result += combined; return prior; case SECOND: secondValue = combined; locked = false; notifyAll(); while (cStatus != CStatus.RESULT) wait(); locked = false; notifyAll(); cStatus = CStatus.IDLE; return result; default: … Intermediate Node Unlock node (locked by 1st thread in combining phase) & return

91 Art of Multiprocessor Programming 91 Distribution Phase 5 0 zzz Move down with result SECOND

92 Art of Multiprocessor Programming 92 Distribution Phase 5 zzz Leave result for 2nd thread & lock node SECOND 2

93 Art of Multiprocessor Programming 93 Distribution Phase 5 0 zzz Push result down tree SECOND 2

94 Art of Multiprocessor Programming 94 Distribution Phase 5 2nd thread awakens, unlocks, takes value IDLE 3

95 Art of Multiprocessor Programming 95 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior;

96 Art of Multiprocessor Programming 96 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Traverse path in reverse order

97 Art of Multiprocessor Programming 97 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Distribute results to waiting 2nd threads

98 Art of Multiprocessor Programming 98 Distribution Phase Navigation while (!stack.empty()) { node = stack.pop(); node.distribute(prior); } return prior; Return result to caller

99 Art of Multiprocessor Programming 99 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.RESULT; notifyAll(); return; default: …

100 Art of Multiprocessor Programming 100 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.RESULT; notifyAll(); return; default: … No 2nd thread to combine with me, unlock node & reset

101 Art of Multiprocessor Programming 101 Distribution Phase synchronized void distribute(int prior) { switch (cStatus) { case FIRST: cStatus = CStatus.IDLE; locked = false; notifyAll(); return; case SECOND: result = prior + firstValue; cStatus = CStatus.RESULT; notifyAll(); return; default: … Notify 2nd thread that result is available (2nd thread will release lock)
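Tying the four phases together, a sketch (ours) of the per-thread driver assembled from the navigation fragments above; the Node methods are the slide versions, myLeaf is this thread’s assigned leaf, and using a local Deque for the stack is our choice.

// assumes: import java.util.ArrayDeque; import java.util.Deque;
public int getAndIncrement() {
  Deque<Node> stack = new ArrayDeque<>();
  Node node = myLeaf;                           // phase 1: precombining
  while (node.precombine()) node = node.parent;
  Node stop = node;                             // where precombining ended
  int combined = 1;                             // phase 2: combining
  for (node = myLeaf; node != stop; node = node.parent) {
    combined = node.combine(combined);          // pick up partners' values
    stack.push(node);
  }
  int prior = stop.op(combined);                // phase 3: operation
  while (!stack.isEmpty()) {                    // phase 4: distribution
    node = stack.pop();
    node.distribute(prior);                     // hand results back down
  }
  return prior;
}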

102 Art of Multiprocessor Programming 102 Bad News: High Latency +2 +3 +5 Log n

103 Art of Multiprocessor Programming 103 Good News: Real Parallelism +2 +3 +5 2 threads 1 thread

104 Art of Multiprocessor Programming 104 Throughput Puzzles Ideal circumstances –All n threads move together, combine –n increments in O(log n) time Worst circumstances –All n threads slightly skewed, locked out –n increments in O(n · log n) time

105 Art of Multiprocessor Programming 105 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }}

106 Art of Multiprocessor Programming 106 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} How many iterations

107 Art of Multiprocessor Programming 107 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Expected time between incrementing counter

108 Art of Multiprocessor Programming 108 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Take a number

109 Art of Multiprocessor Programming 109 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = r.getAndIncrement(); Thread.sleep(random() % work); }} Pretend to work (more work, less concurrency)
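For concreteness, a hypothetical harness (ours; the thread count, iteration count, and the AtomicInteger standing in for the shared counter r are all assumptions):

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class IndexBench {
  static final AtomicInteger r = new AtomicInteger(); // stand-in for the counter under test

  static void indexBench(int iters, int work) {
    int i = 0;
    while (i < iters) {
      i = r.getAndIncrement();                        // take a number
      try { Thread.sleep(ThreadLocalRandom.current().nextInt(1, work)); }
      catch (InterruptedException e) { return; }      // pretend to work
    }
  }

  public static void main(String[] args) throws InterruptedException {
    int n = 8;                                        // arbitrary thread count
    ExecutorService pool = Executors.newFixedThreadPool(n);
    for (int t = 0; t < n; t++) pool.execute(() -> indexBench(100_000, 10));
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}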

110 Performance Here are some graphs Throughput –Average increments in 1 million cycles Latency –Average cycles per inc

111 Art of Multiprocessor Programming 111 Performance (Simulated)

112 112 Load Fluctuations Combining is sensitive: –if arrival rates drop … –So do combining rates … –& performance deteriorates! Test –Vary “work” –Duration between accesses …

113 113 Combining Rate vs Work

114 114 Better to Wait Longer (graph: latency vs. processors for short, medium, and indefinite waiting policies)

115 Art of Multiprocessor Programming 115 Conclusions Combining Trees –Linearizable Counters –Work well under high contention –Sensitive to load fluctuations –Can be used for getAndMumble() ops

116 Parallel Counter Approach Four counters hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; and 3, 7, … How to coordinate access to the counters?

117 Art of Multiprocessor Programming 117 A Balancer Input wires Output wires

118 Art of Multiprocessor Programming 118 Tokens Traverse Balancers Token i enters on any wire, leaves on wire i (mod 2)

119 Art of Multiprocessor Programming 119 Tokens Traverse Balancers

120 Art of Multiprocessor Programming 120 Tokens Traverse Balancers

121 Art of Multiprocessor Programming 121 Tokens Traverse Balancers

122 Art of Multiprocessor Programming 122 Tokens Traverse Balancers

123 Art of Multiprocessor Programming 123 Tokens Traverse Balancers Arbitrary input distribution Balanced output distribution Quiescent State: all tokens have exited

124 Art of Multiprocessor Programming 124 1-smooth property Smoothing Network

125 Art of Multiprocessor Programming 125 step property Counting Network

126 Art of Multiprocessor Programming 126 Counting Networks Count! 0, 4, 8.... 1, 5, 9..... 2, 6,... 3, 7... counters Multiple counters distribute load Step property guarantees no duplication or omissions, how?

127 127 Counting Networks Count! 0 1, 5, 9..... 2, 6,... 3, 7... If 5 and 9 are taken before 4 and 8 Step property guarantees that in-flight tokens will take missing values

128 Art of Multiprocessor Programming 128 Counting Networks Good for counting number of tokens: low contention, no sequential bottleneck, high throughput, practical network depth

129 Art of Multiprocessor Programming 129 Counting Network 1

130 Art of Multiprocessor Programming 130 Counting Network 2 1

131 Art of Multiprocessor Programming 131 Counting Network 3 2 1

132 Art of Multiprocessor Programming 132 Counting Network 3 2 1 4

133 Art of Multiprocessor Programming 133 Counting Network 3 2 1 4 5

134 Art of Multiprocessor Programming 134 Counting Network 3 2 1 4 5

135 Art of Multiprocessor Programming 135 Bitonic[k] Counting Network

136 136 Bitonic[k] Counting Network

137 Art of Multiprocessor Programming 137 Bitonic[k] is not Linearizable

138 Art of Multiprocessor Programming 138 Bitonic[k] is not Linearizable

139 Art of Multiprocessor Programming 139 Bitonic[k] is not Linearizable 2

140 Art of Multiprocessor Programming 140 Bitonic[k] is not Linearizable 2 0

141 Art of Multiprocessor Programming 141 Bitonic[k] is not Linearizable 2 0 Problem is: Red finished before Yellow started Red took 2 Yellow took 0

142 Art of Multiprocessor Programming 142 But it is “Quiescently Consistent” Has Step Property in any quiescent State (one in which all tokens have exited)

143 Art of Multiprocessor Programming 143 Shared Memory Implementation class Balancer { boolean toggle; Balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } }

144 Art of Multiprocessor Programming 144 Shared Memory Implementation class Balancer { boolean toggle; Balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } } state

145 Art of Multiprocessor Programming 145 Shared Memory Implementation class Balancer { boolean toggle; Balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } } Output connections to balancers

146 Art of Multiprocessor Programming 146 Shared Memory Implementation class Balancer { boolean toggle; Balancer[] next; synchronized boolean flip() { boolean oldValue = this.toggle; this.toggle = !this.toggle; return oldValue; } } getAndComplement

147 Art of Multiprocessor Programming 147 Shared Memory Implementation Balancer traverse(Balancer b) { while (!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0]; else b = b.next[1]; } return b; }

148 Art of Multiprocessor Programming 148 Shared Memory Implementation Balancer traverse(Balancer b) { while (!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0]; else b = b.next[1]; } return b; } Stop when we exit the network

149 Art of Multiprocessor Programming 149 Shared Memory Implementation Balancer traverse(Balancer b) { while (!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0]; else b = b.next[1]; } return b; } Flip state

150 Art of Multiprocessor Programming 150 Shared Memory Implementation Balancer traverse(Balancer b) { while (!b.isLeaf()) { boolean toggle = b.flip(); if (toggle) b = b.next[0]; else b = b.next[1]; } return b; } Exit on wire
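A self-contained sketch (ours, not the book’s classes) wiring one balancer to two per-wire counters, the smallest possible counting network (width w = 2): output wire i hands out i, i + w, i + 2w, …, so together the wires count 0, 1, 2, …

import java.util.concurrent.atomic.AtomicInteger;

// Width-2 "network": one balancer feeding two local counters.
public class TinyCountingNetwork {
  static class Balancer {
    private boolean toggle = true;
    synchronized int flip() {            // getAndComplement, returning the exit wire
      boolean old = toggle;
      toggle = !toggle;
      return old ? 0 : 1;
    }
  }

  private final Balancer b = new Balancer();
  private final AtomicInteger[] counters = {
      new AtomicInteger(0), new AtomicInteger(1) };  // wire i starts at i

  public int getAndIncrement() {
    int wire = b.flip();                 // traverse the (one-balancer) network
    return counters[wire].getAndAdd(2);  // take i, then i + w, i + 2w, ...
  }
}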

151 Art of Multiprocessor Programming 151 Alternative Implementation: Message-Passing

152 Art of Multiprocessor Programming 152 Bitonic[2k] Inductive Structure Bitonic[k] Merger[2k]

153 Art of Multiprocessor Programming 153 Bitonic[4] Counting Network Merger[4] Bitonic[2]

154 Art of Multiprocessor Programming 154 Bitonic[8] Layout Merger[8] Bitonic[4]

155 Art of Multiprocessor Programming 155 Unfolded Bitonic[8] Network Merger[8]

156 Art of Multiprocessor Programming 156 Unfolded Bitonic[8] Network Merger[4]

157 Art of Multiprocessor Programming 157 Unfolded Bitonic[8] Network Merger[2]

158 Art of Multiprocessor Programming 158 Bitonic[k] Depth Width k Depth is (log2 k)(log2 k + 1)/2
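For example, Bitonic[8] has width 8 and depth (log2 8)(log2 8 + 1)/2 = 3 · 4 / 2 = 6 layers of balancers, while Bitonic[16] already needs 4 · 5 / 2 = 10.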

159 Proof by Induction Base: –Bitonic[2] is single balancer –has step property by definition Step: –If Bitonic[k] has step property … –So does Bitonic[2k]

160 Art of Multiprocessor Programming 160 Bitonic[2k] Schematic Bitonic[k] Merger[2k]

161 161 Need to Prove only Merger[2k] (diagram: the two Bitonic[k] networks are the induction hypothesis; the Merger[2k] stage is what we need to prove)

162 Art of Multiprocessor Programming 162 Merger[2k] Schematic Merger[k]

163 Art of Multiprocessor Programming 163 Merger[2k] Layout

164 Induction Step –Bitonic[k] has step property … –Assume Merger[k] has step property when it gets two inputs of size k/2, each with the step property, and –prove Merger[2k] has step property

165 165 Assume Bitonic[k] and Merger[k] and Prove Merger[2k] (diagram: induction hypothesis on the left, Merger[2k] to prove on the right)

166 Art of Multiprocessor Programming 166 Proof: Lemma 1 If a sequence has the step property …

167 Art of Multiprocessor Programming 167 Lemma 1 So does its even subsequence

168 Art of Multiprocessor Programming 168 Lemma 1 Also its odd subsequence

169 Art of Multiprocessor Programming 169 Lemma 2 (diagram: even and odd subsequences) The sums of the even and odd subsequences of a step sequence differ by at most 1

170 Art of Multiprocessor Programming 170 Bitonic[2k] Layout Details Merger[k] Bitonic[k] even odd even Merger[2k]

171 Art of Multiprocessor Programming 171 By induction hypothesis Merger[k] Bitonic[k] Outputs have step property

172 even odd even Merger[k] Art of Multiprocessor Programming 172 By Lemma 1 Merger[k] All subsequences have step property

173 even odd even Merger[k] Art of Multiprocessor Programming 173 By Lemma 2 Merger[k] Diff at most 1

174 Merger[k] Art of Multiprocessor Programming 174 By Induction Hypothesis Merger[k] Outputs have step property

175 Merger[k] Art of Multiprocessor Programming 175 By Lemma 2 Merger[k] At most one diff

176 Art of Multiprocessor Programming 176 Last Row of Balancers Outputs of Merger[k] Outputs of last layer Merger[k]

177 Art of Multiprocessor Programming 177 Last Row of Balancers Merger[k] Wire i from one merger Wire i from other merger

178 Art of Multiprocessor Programming 178 Last Row of Balancers Outputs of Merger[k] Outputs of last layer Merger[k]

179 Art of Multiprocessor Programming 179 Last Row of Balancers Merger[k]

180 Art of Multiprocessor Programming 180 So Counting Networks Count QED Merger[k]

181 Art of Multiprocessor Programming 181 Periodic Network Block

182 Art of Multiprocessor Programming 182 Periodic Network Block

183 Art of Multiprocessor Programming 183 Periodic Network Block

184 Art of Multiprocessor Programming 184 Periodic Network Block

185 Art of Multiprocessor Programming 185 Block[2k] Schematic Block[k]

186 Art of Multiprocessor Programming 186 Block[2k] Layout

187 Art of Multiprocessor Programming 187 Periodic[8]

188 Art of Multiprocessor Programming 188 Network Depth Each Block[k] has depth log2 k Need log2 k blocks Grand total of (log2 k)^2
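For example, Periodic[8] uses blocks of depth log2 8 = 3 and needs 3 of them, for a total depth of 9, versus 6 for Bitonic[8].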

189 Art of Multiprocessor Programming 189 Lower Bound on Depth Theorem: the depth of any width-w counting network is Ω(log w). Theorem: there exists a counting network of Θ(log w) depth. Unfortunately, the proof is non-constructive and the constants are in the 1000s.

190 Art of Multiprocessor Programming 190 Sequential Theorem If a balancing network counts –Sequentially, meaning that –Tokens traverse one at a time Then it counts –Even if tokens traverse concurrently

191 Art of Multiprocessor Programming 191 Red First, Blue Second (2)

192 Art of Multiprocessor Programming 192 Blue First, Red Second (2)

193 Art of Multiprocessor Programming 193 Either Way Same balancer states

194 Art of Multiprocessor Programming 194 Order Doesn’t Matter Same balancer states Same output distribution

195 Art of Multiprocessor Programming 195 Index Distribution Benchmark void indexBench(int iters, int work) { int i = 0; while (i < iters) { i = fetch&inc(); Thread.sleep(random() % work); } }

196 Art of Multiprocessor Programming 196 Performance (Simulated) * All graphs taken from Herlihy, Lim, Shavit, copyright ACM. MCS queue lock Spin lock Number processors Throughput Higher is better!

197 Art of Multiprocessor Programming 197 Performance (Simulated) MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Higher is better!

198 Art of Multiprocessor Programming 198 Performance (Simulated) MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network Combining and counting are pretty close

199 Art of Multiprocessor Programming 199 Performance (Simulated) MCS queue lock Spin lock Number processors Throughput 64-leaf combining tree 80-balancer counting network But they beat the hell out of the competition!

200 Art of Multiprocessor Programming 200 Saturation and Performance Undersaturated P < w log w Saturated P = w log w Oversaturated P > w log w Optimal performance

201 Art of Multiprocessor Programming 201 Throughput vs. Size Bitonic[16] Bitonic[4] Bitonic[8] Number processors Throughput

202 Art of Multiprocessor Programming 202 Shared Pool 19 20 21 remove put 19 20 21

203 Art of Multiprocessor Programming 203 What About Decrements Adding arbitrary values Other operations –Multiplication –Vector addition –Horoscope casting …

204 Art of Multiprocessor Programming 204 First Step Can we decrement as well as increment? What goes up, must come down …

205 Art of Multiprocessor Programming 205 Anti-Tokens

206 Art of Multiprocessor Programming 206 Tokens & Anti-Tokens Cancel

207 Art of Multiprocessor Programming 207 Tokens & Anti-Tokens Cancel

208 Art of Multiprocessor Programming 208 Tokens & Anti-Tokens Cancel

209 Art of Multiprocessor Programming 209 Tokens & Anti-Tokens Cancel As if nothing happened

210 Art of Multiprocessor Programming 210 Tokens vs Antitokens Tokens –read balancer –flip –proceed Antitokens –flip balancer –read –proceed
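A sketch (ours) of both traversal rules on the toggle-based balancer from earlier: a token exits on the wire the toggle pointed to and then flips it; an antitoken flips first and exits on the wire the toggle now points to. An antitoken arriving right after a token therefore chases it and restores the toggle, which is the cancellation shown above.

class ToggleBalancer {
  private boolean toggle = true;     // true: next token exits on wire 0

  synchronized int tokenWire() {     // token: read, flip, proceed
    boolean old = toggle;
    toggle = !toggle;
    return old ? 0 : 1;
  }

  synchronized int antiTokenWire() { // antitoken: flip, read, proceed
    toggle = !toggle;
    return toggle ? 0 : 1;
  }
}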

211 Art of Multiprocessor Programming 211 Pumping Lemma Keep pumping tokens through one wire Eventually, after Ω tokens, network repeats a state

212 Art of Multiprocessor Programming 212 Anti-Token Effect token anti-token

213 Art of Multiprocessor Programming 213 Observation Each anti-token on wire i –Has same effect as Ω-1 tokens on wire i –So network still in legal state Moreover, network width w divides Ω –So Ω-1 tokens

214 Art of Multiprocessor Programming 214 Before Antitoken

215 Art of Multiprocessor Programming 215 Balancer states as if … Ω-1 Ω-1 is one brick shy of a load

216 Art of Multiprocessor Programming 216 Post Antitoken Next token shows up here

217 Art of Multiprocessor Programming 217 Implication Counting networks with –Tokens (+1) –Anti-tokens (-1) Give –Highly concurrent –Low contention getAndIncrement + getAndDecrement methods QED

218 Art of Multiprocessor Programming 218 Adding Networks Combining trees implement –Fetch&add –Add any number, not just 1 What about counting networks?

219 Art of Multiprocessor Programming 219 Fetch-and-add Beyond getAndIncrement + getAndDecrement What about getAndAdd(x)? –Atomically returns prior value –And adds x to value? Not to mention –getAndMultiply –getAndFourierTransform?

220 Art of Multiprocessor Programming 220 Bad News If an adding network –Supports n concurrent tokens Then every token must traverse –At least n-1 balancers –In sequential executions

221 Art of Multiprocessor Programming 221 Uh-Oh Adding network size depends on n –Like combining trees –Unlike counting networks High latency –Depth linear in n –Not logarithmic in w

222 Art of Multiprocessor Programming 222 Generic Counting Network +1 +2 2 2

223 Art of Multiprocessor Programming 223 First Token +1 +2 First token would visit green balancers if it runs solo

224 Art of Multiprocessor Programming 224 Claim Look at path of +1 token All other +2 tokens must visit some balancer on +1 token’s path

225 Art of Multiprocessor Programming 225 Second Token +1 Takes 0 +2

226 Art of Multiprocessor Programming 226 Second Token Takes 0 +2 +1 Takes 0 They can’t both take zero! +2

227 Art of Multiprocessor Programming 227 If Second avoids First’s Path Second token –Doesn’t observe first –First hasn’t run –Chooses 0 First token –Doesn’t observe second –Disjoint paths –Chooses 0

228 Art of Multiprocessor Programming 228 If Second avoids First’s Path Because +1 token chooses 0 –It must be ordered first –So +2 token ordered second –So +2 token should return 1 Something’s wrong!

229 Art of Multiprocessor Programming 229 Second Token +1 +2 Halt blue token before first green balancer +2

230 Art of Multiprocessor Programming 230 Third Token +1 Takes 0 or 2 +2

231 Art of Multiprocessor Programming 231 Third Token +2 +1 Takes 0 +2 Takes 0 or 2 They can’t both take zero, and they can’t take 0 and 2!

232 Art of Multiprocessor Programming 232 First, Second, & Third Tokens must be Ordered Third (+2) token –Did not observe +1 token –May have observed earlier +2 token –Takes an even number

233 Art of Multiprocessor Programming 233 First, Second, & Third Tokens must be Ordered Because +1 token’s path is disjoint –It chooses 0 –Ordered first –Rest take odd numbers But last token takes an even number Something’s wrong!

234 Art of Multiprocessor Programming 234 Third Token +1 +2 Halt blue token before first green balancer

235 Art of Multiprocessor Programming 235 Continuing in this way We can “park” a token –In front of a balancer –That token #1 will visit There are n-1 other tokens –Two wires per balancer –Path includes n-1 balancers!

236 Art of Multiprocessor Programming 236 Theorem In any adding network –In sequential executions –Tokens traverse at least n-1 balancers Same arguments apply to –Linearizable counting networks –Multiplying networks –And others

237 Art of Multiprocessor Programming 237 Shared Pool remove put Depth (log2 w)^2 Can we do better?

238 Art of Multiprocessor Programming 238 Counting Trees Single input wire

239 Art of Multiprocessor Programming 239 Counting Trees

240 Art of Multiprocessor Programming 240 Counting Trees

241 Art of Multiprocessor Programming 241 Counting Trees

242 Art of Multiprocessor Programming 242 Counting Trees Step property in quiescent state

243 Counting Trees Interleaved output wires

244 Inductive Construction Tree[2k] = a balancer feeding Tree0[k] (k even outputs) and Tree1[k] (k odd outputs) At most 1 more token in top wire Tree[2k] has step property in quiescent state.

245 Inductive Construction Tree[2k] = Tree0[k] (k even outputs) and Tree1[k] (k odd outputs) Top step sequence has at most one extra on last wire of step Tree[2k] has step property in quiescent state.

246 Implementing Counting Trees

247 Example

248

249 Implementing Counting Trees Contention Sequential bottleneck

250 Diffraction Balancing If an even number of tokens visit a balancer, the toggle bit remains unchanged! balancer Prism Array

251 Diffracting Tree Diffracting balancer same as balancer (diagram: k input wires feed a prism array in front of each toggle; two outputs of width k/2 feed the next level’s diffracting balancers)

252 Diffracting Tree High load: lots of diffraction + few toggles. Low load: low diffraction + few toggles. High throughput with low contention
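A toy sketch (ours, far simpler than a real prism array) of the diffraction idea: a token first tries to meet a partner at an exchanger; a matched pair splits one-per-output without touching the toggle, and only an unmatched token falls through to it. The ticket counter used to break ties is itself a hot spot here; real prisms pick random exchanger slots instead.

import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

class ToyDiffractingBalancer {
  private final Exchanger<Long> prism = new Exchanger<>();
  private final AtomicLong tickets = new AtomicLong(); // tie-breaker (simplification)
  private boolean toggle = true;

  public int traverse() {
    long mine = tickets.getAndIncrement();
    try {
      // Diffract: a matched pair splits deterministically, one per wire.
      Long partner = prism.exchange(mine, 100, TimeUnit.MICROSECONDS);
      return mine < partner ? 0 : 1;
    } catch (InterruptedException | TimeoutException e) {
      synchronized (this) {                            // no partner: use the toggle
        boolean old = toggle;
        toggle = !toggle;
        return old ? 0 : 1;
      }
    }
  }
}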

253 Performance (graphs: throughput and latency vs. concurrency P, comparing the diffracting tree (Dtree), combining tree (Ctree), and MCS lock)

254 Amdahl’s Law Works (diagram: 75% unshared / 25% shared work, coarse-grained vs. fine-grained locking) Fine-grained parallelism gives great performance benefit

255 But… Can we always draw the right conclusions from Amdahl’s law? Claim: sometimes the overhead of fine-grained synchronization is so high… that it is better to have a single thread do all the work sequentially in order to avoid it

256 256 Software Combining Tree n requests in log n time, but the tree requires a major coordination effort: multiple CAS operations, cache misses, etc.

257 Oyama et al. (diagram: requests a, b, c, d queue up on a lock-protected object; the lock holder CASes the list, applies a, b, c, and d to the object, returns responses, releases the lock) Every request involves a CAS

258 Flat Combining Have single lock holder collect and perform requests of all others –Without using CAS operations to coordinate requests –With combining of requests (if the cost of k batched operations is less than that of k operations in sequence, we win)
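A compact sketch (ours; all names are assumptions) of that loop: threads publish request records, and whoever wins the lock becomes the combiner and serves everyone. A ConcurrentLinkedQueue stands in for the publication list; a real pub-list gives each thread a permanent record, so steady-state requests avoid even the enqueue CAS used here.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

class FlatCombining<S> {
  static final class Request<S> {
    final Function<S, Object> op;  // the operation to apply
    volatile Object result;
    volatile boolean done;
    Request(Function<S, Object> op) { this.op = op; }
  }

  private final S sequential;                       // the protected structure
  private final ReentrantLock lock = new ReentrantLock();
  private final ConcurrentLinkedQueue<Request<S>> publication =
      new ConcurrentLinkedQueue<>();

  FlatCombining(S sequential) { this.sequential = sequential; }

  Object apply(Function<S, Object> op) {
    Request<S> r = new Request<>(op);
    publication.add(r);                             // publish my request
    while (!r.done) {
      if (lock.tryLock()) {                         // become the combiner
        try {
          Request<S> q;
          while ((q = publication.poll()) != null) {
            q.result = q.op.apply(sequential);      // serve everyone's request
            q.done = true;                          // volatile write releases result
          }
        } finally { lock.unlock(); }
      }
      // else: spin until some combiner marks my record done
    }
    return r.result;
  }
}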

259 Flat-Combining (diagram: publication list of Enq(d)/Deq() records with a timestamp counter; the combiner acquires the lock, collects requests, applies them to the object, then tries to collect again) Most requests do not involve a CAS, in fact, not even a memory barrier

260 Flat-Combining Pub-List Cleanup Every combiner increments the counter and updates a record’s time stamp when returning its response. Traverse and remove from the list records with old time stamps; if a thread reappears it must add itself back to the pub list. Cleanup requires no CAS, only reads and writes

261 Fine-Grained Lock-free FIFO Queue (diagram: nodes a, b, c, d between Head and Tail) P: Dequeue() => a Q: Enqueue(d), each applied with CAS()

262 Flat-Combining FIFO Queue (diagram: publication list of Enq(b)/Deq() requests feeding a sequential FIFO queue) OK, but can do better… combining: collect all items into a “fat node”, enqueue in one step

263 Flat-Combining FIFO Queue (diagram: sequential “fat node” FIFO queue) A “fat node” is easy sequentially but cannot be done in a concurrent algorithm without CAS

264 Linearizable FIFO Queue (graph: Flat Combining vs. combining tree vs. MS queue, Oyama, and Log-Synch)

265 Benefits of Flat Combining Flat Combining in Red

266 Linearizable Stack (graph: Flat Combining vs. Elimination Stack vs. Treiber lock-free stack)

267 Concurrent Priority Queue (Chapter 15) k deleteMin operations take O(k · log n): each deleteMin() traverses, CASing, until it manages to mark a node, then uses the skiplist remove on its marked node

268 Flat-Combining Priority Queue (diagram: publication list of requests feeding a lock-protected priority queue)

269 Flat Combining Priority Queue Collect k deleteMin requests; traverse the skiplist towards the kth key, removing all nodes below your path and collecting the values to be returned. k deleteMin operations take O(k + log n)

270 Priority Queue (graph: Flat Combining skiplist-based queue vs. lock-based SkipQueue vs. lock-free SkipQueue) What’s this?

271 Priority Queue Flat combining with sequential pairing heap plugged in…

272 Priority Queue on Intel Flat combining with sequential pairing heap plugged in…

273 Don’t be Afraid of the Big Bad Lock Art of Multiprocessor Programming 273 Fine-grained parallelism comes with an overhead… not always worth the effort. Sometimes using a single global lock is a win.

274 Art of Multiprocessor Programming 274 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.

