
1 Shared Counters and Parallelism
Multiprocessor Synchronization, Nir Shavit, Spring 2003

2 A Shared Pool
An unordered set of objects.

public interface Pool {
  public void put(Object x);
  public Object remove();
}

put: inserts an object; blocks if the pool is full.
remove: removes and returns an object; blocks if the pool is empty.
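Before looking at smarter structures, here is a minimal sketch of the obvious single-lock implementation (the class name, capacity bound, and wait/notify discipline are illustrative assumptions, not from the slides); it is exactly the kind of sequential bottleneck the following slides attack:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-lock Pool: correct, but fully sequential.
class LockedPool implements Pool {
  private final Deque<Object> items = new ArrayDeque<>();
  private final int capacity = 1024;        // assumed capacity bound

  public synchronized void put(Object x) {
    while (items.size() == capacity) {      // blocks if full
      try { wait(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }
    items.addLast(x);
    notifyAll();                            // wake any blocked remover
  }

  public synchronized Object remove() {
    while (items.isEmpty()) {               // blocks if empty
      try { wait(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }
    Object x = items.removeFirst();
    notifyAll();                            // wake any blocked putter
    return x;
  }
}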

3 Simple Locking Implementation
Problem: hot-spot contention. Solution: a queue lock.
Problem: sequential bottleneck. Solution???

4 Counting Implementation
[Figure: a put counter and a remove counter, each handing out 19, 20, 21, …]
Only the counters are sequential.

5 Shared Counter
No duplication, no omission.
Not necessarily linearizable.

6 Shared Counters
Can we build a shared counter with low memory contention and real parallelism?
Locking: queue locks can reduce contention, but they do not help with the parallelism issue…

7 Software Combining Tree
[Figure: processors P1 … Pn at the leaves of a binary tree of counters]
Combine increment requests up the tree; wait for winners to propagate answers down.
Contention: only local spinning.
Parallelism: a high combining rate implies n / log n speed-up.

8 Combining Trees
Two threads meet and combine their sums: +2 and +3 become +5.

9 Combining Trees
The combined sum (+5) is added to the root.

10 Combining Trees
The new value (5) is returned to the children.

11 Combining Trees
Prior values are returned to the waiting threads.

12 Tree Node Status
FREE: nothing going on.
COMBINE: waiting to combine.
RESULT: results ready.
ROOT: stop here.
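A hedged sketch of the per-node state these statuses suggest (a sketch only; the field and class names are assumptions, not the original code):

enum CStatus { FREE, COMBINE, RESULT, ROOT }

class CNode {
  CStatus status = CStatus.FREE;
  boolean locked;      // nodes are locked while being examined or updated
  int firstValue;      // increment deposited by the first arriving thread
  int secondValue;     // increment deposited by a combining partner
  int result;          // prior counter value handed back down
  CNode parent;        // path toward the root
}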

13 Phase One: find path up to first COMBINE or ROOT node
Lock each node on the way up and examine its status.

14 Phase One: found FREE node
Set the node to COMBINE, unlock it, and move up.

15 Phase One: found RESULT node
Unlock the node and wait for its status to change.

16 Phase One: found COMBINE or ROOT node
Unlock the node and start phase two.

17 Phase Two: revisit path, combine if requested
[Figure: +2 and +3 are combined into +5 along the path]

18 Phase Three: stopped at COMBINE node
Lock the node, deposit my value, unlock, and wait for my value to be combined by the other thread.

19 Phase Three: stopped at ROOT
Add my value to the root and start back down.

20 Phase Four: propagate values down
[Figure: values propagate back down through the COMBINE nodes]

21 Bad News: High Latency
Each operation traverses a path of log n nodes.

22 Good News: Real Parallelism
[Figure: requests +2 and +3 from two threads are served by a single combined root update, +5]

23 Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work
  }
}

Take a number, then pretend to work (more work means less concurrency).
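A runnable version of the benchmark, with fetch&inc modeled by AtomicInteger.getAndIncrement() (an assumption; in the experiments the counter is whichever shared object is under test: a lock-based counter, a combining tree, or a counting network):

import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

class IndexBench {
  static final AtomicInteger counter = new AtomicInteger();

  static void indexBench(int iters, int work) throws InterruptedException {
    Random random = new Random();
    int i = 0;
    while (i < iters) {
      i = counter.getAndIncrement();          // take a number
      if (work > 0)
        Thread.sleep(random.nextInt(work));   // pretend to work
    }
  }

  public static void main(String[] args) throws InterruptedException {
    indexBench(1000, 10);                     // small illustrative run
  }
}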

24 Performance Benchmarks
Simulated Alewife DSM architecture.
Throughput: average number of inc operations in a 1-million-cycle period.
Latency: average number of simulator cycles per inc operation.

25 The Alewife Topology
* Posters courtesy of the MIT Architecture Group.

26 Performance
Simulated 64-node Alewife, work = 0: 64-leaf combining tree, MCS queue lock, spin lock.
[Plot: throughput vs. number of processors]
Counting benchmark, ignoring startup and wind-down times. The spin lock is best at low concurrency, then drops off rapidly.

27 The Combining Paradigm
Implements any read-modify-write (RMW) operation.
When the tree is loaded, it takes 2 log n steps for n requests.
Very sensitive to load fluctuations: if arrival rates drop, combining rates drop, and overall performance deteriorates!

28 Combining Load Sensitivity
[Plot: throughput over time with work = 500; notice the load fluctuations]

29 Combining Load Sensitivity
Concurrency   W=100   W=1000   W=5000
     1          0.0     n/a      n/a
     2         10.3     2.0      0.3
     4         26.1     8.5      2.2
     8         40.3    19.9     10.6
    16         50.4    31.7      n/a
    32         55.3    39.5     18.5
    48         54.2    40.0     15.6
    64         65.5     n/a     18.7

(n/a marks cells missing from the source.)
As work increases, concurrency decreases, and combining levels drop significantly…

30 Better to Wait Longer
[Plot: latency under three waiting policies: short wait, medium wait, indefinite wait]
The figure compares combining-tree latency when work is high under three waiting policies: wait 16 cycles, wait 256 cycles, and wait indefinitely. When the number of processors is larger than 64, indefinite waiting is by far the best policy. This is because an un-combined token message blocks later token messages from progressing until it returns from traversing the root, so a large performance penalty is paid for each un-combined message. Because the chances of combining are good at higher arrival rates, we found that when work = 0, simulations using more than four processors justify indefinite waiting.

31 Can we do better?
Centralized: one slow process delays everybody!
Synchronized: waiting for others to bring numbers.
Distributed: spread the work across the machine.
Coordinated: coordinate, but do not wait.

32 Counting Networks
How to coordinate access to the counters?
[Figure: processors P1 … Pn pass through a counting network into four counters handing out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …]
No duplication, no omission.

33 A Balancer
[Figure: a balancer, with input wires on the left and output wires on the right]

34 Tokens Traverse Balancers
Token i enters on any wire and leaves on wire i mod (fan-out). For a two-output balancer, successive tokens leave on the top and bottom wires alternately.


39 Tokens Traverse Balancers
Arbitrary input distribution in; balanced output distribution out.

40 Formally: A Balancer
In any quiescent state, a balancer with input counts x0 and x1 satisfies

  y0 = ⌈(x0 + x1) / 2⌉
  y1 = ⌊(x0 + x1) / 2⌋

Def: a quiescent state is one in which all tokens input on the input wires have exited on the output wires.
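A worked instance of the equations above (illustrative token counts): with x0 = 7 and x1 = 5, we get y0 = ⌈12/2⌉ = 6 and y1 = ⌊12/2⌋ = 6; a 13th input token would exit on the top wire, giving y0 = 7 and y1 = 6, so the top output leads by at most one and never trails.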

41 Smoothing Network
Guarantees the k-smooth property.
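Stated as a formula (the standard definition, not spelled out on the slide): in any quiescent state, |yi - yj| ≤ k for every pair of output wires i and j.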

42 Counting Network
Guarantees the step property.
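Stated as a formula (standard definition): in any quiescent state, 0 ≤ yi - yj ≤ 1 for any i < j. Equivalently, after N tokens have exited a width-w network, output wire i has seen exactly ⌈(N - i) / w⌉ of them.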

43 Counting Networks Count!
[Figure: a width-4 network whose output wires hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …]

44 Bitonic[4]
[Figure: animation of tokens traversing the Bitonic[4] network, one balancer step at a time]

50 Counting Networks
Good for counting the number of tokens:
low contention
no sequential bottleneck
high throughput

51 Shared Memory Implementation
[Figure: each balancer is a toggle bit (0/1); the four output wires hand out values 1+4k, 2+4k, 3+4k, and 4+4k]

52 Shared Memory Implementation
class Balancer {
  boolean toggle;
  Balancer[] next;

  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }

  boolean isLeaf() {       // assumed helper so traverse() below compiles:
    return next == null;   // a leaf balancer has no successors
  }
}

53 Shared Memory Implementation
Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;
}
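A hedged sketch of how a thread turns a traversal into a counter value, assuming the Balancer class above and a width-w network whose output wire i hands out i, i + w, i + 2w, … (the Leaf and NetworkCounter names are illustrative, not from the slides):

// Illustrative leaf: hands out leafIndex + width * k for k = 0, 1, 2, ...
class Leaf extends Balancer {
  final int leafIndex;     // output wire, 0 .. width-1
  final int width;
  int count = 0;           // tokens already served at this leaf

  Leaf(int leafIndex, int width) {
    this.leafIndex = leafIndex;
    this.width = width;
  }

  synchronized int nextValue() {
    return leafIndex + width * count++;
  }
}

class NetworkCounter {
  final Balancer root;
  NetworkCounter(Balancer root) { this.root = root; }

  int getAndIncrement() {
    Balancer b = root;
    while (!b.isLeaf())                 // same walk as traverse() above
      b = b.flip() ? b.next[0] : b.next[1];
    return ((Leaf) b).nextValue();
  }
}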

54 Message-Passing Implementation

55 The Bitonic Counting Network
Bitonic[2k] inductive construction: two side-by-side Bitonic[k] networks feed a Merger[2k].
We need only show how to construct Merger[2k].

56 Merger[2k]
[Figure: Merger[2k] built from two Merger[k] networks, one fed the even outputs of the first half and the odd outputs of the second, the other fed the remaining wires, followed by a final layer of balancers producing y0 … y2k-1]

57 Merger[4]
[Figure: the base case, Merger[4], built from Bitonic[2] balancers with even and odd wires crossed between layers]

58 Merger[8]
[Figure: Merger[8] unfolded from two Merger[4] networks plus a final balancer layer]
Theorem: in any quiescent state, the outputs of Bitonic[w] have the step property.

59 Merger[2k]

60 Proof Outline
The outputs from Bitonic[k] become the inputs to Merger[k]; of the even and odd subsequences, one has at most one more token than the other.

61 Proof Outline
[Figure: from the inputs of Merger[k] to the outputs of Merger[k]]

62 Proof Outline
[Figure: from the outputs of Merger[k] to the outputs of the last balancer layer]

63 Depth of the Bitonic Network
[Figure: a width-w Bitonic[2k] as two Bitonic[k] networks followed by Merger[2k]]

64 Unfolded Bitonic Network
[Figure: the unfolded network with the final Merger[8] stage highlighted]

65 Unfolded Bitonic Network
[Figure: the two Merger[4] stages highlighted]

66 Unfolded Bitonic Network
[Figure: the four Merger[2] stages highlighted]

67 Bitonic[k] Depth
For width w, the depth is
  log 2 + log 4 + … + log (w/2) + log w = (log w)(log w + 1) / 2
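As a quick check, for w = 8 (logs base 2): log 2 + log 4 + log 8 = 1 + 2 + 3 = 6 = (3 · 4) / 2.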

68 Lower Bound on Depth
Theorem: the depth of any width-w counting network is at least Ω(log w).
Theorem: there exists a counting network of Θ(log w) depth.
Unfortunately, the proof is non-constructive and the constants are in the thousands.

69 The Periodic Network

70 Network Depth
Each Block[k] has depth log₂ k.
We need log₂ k blocks, for a grand total of (log₂ k)² layers.
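As a check, for k = 8: (log₂ 8)² = 9 layers for the periodic network, versus (log₂ 8)(log₂ 8 + 1) / 2 = 6 for Bitonic[8].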

71 Bitonic[k] is not Linearizable
[Figure: animation of an execution in which one token takes the value 2 before another token takes the value 0]

75 Bitonic[k] is not Linearizable
The problem: red finished before yellow started, yet red took 2 and yellow took 0.

76 Sequential Theorem
If a balancing network counts sequentially, meaning when tokens traverse it one at a time, then it also counts when tokens traverse it concurrently.

77 Red First, Blue Second

78 Blue First, Red Second

79 Either Way
Same balancer states.

80 Order Doesn't Matter
Same balancer states, same output distribution.

81 Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work
  }
}

82 Performance (Simulated)
[Plot: throughput vs. number of processors; higher is better. MCS queue lock and spin lock shown]
Counting benchmark, ignoring startup and wind-down times. The spin lock is best at low concurrency, then drops off rapidly.
* All graphs taken from Herlihy, Lim, Shavit, copyright ACM.

83 Performance (Simulated)
[Plot: the same axes, adding a 64-leaf combining tree and an 80-balancer counting network]

84 Performance (Simulated)
Combining and counting are pretty close.

85 Performance (Simulated)
But they beat the hell out of the competition!

86 Saturation and Performance
Undersaturated: P < w log w.
Saturated (optimal performance): P = w log w.
Oversaturated: P > w log w.
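For example (arithmetic, not from the slides): a Bitonic[16] network has w = 16 and log w = 4, so saturation, and with it peak throughput, is expected around P = 16 · 4 = 64 processors.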

87 Throughput vs. Size
[Plot: throughput vs. number of processors for Bitonic[16]]
The simulation was extended to 80 processors, one balancer per processor in the 16-wide network, and as can be seen the counting network scales…

