
1 Shared Counters and Parallelism
Multiprocessor Synchronization, Nir Shavit, Spring 2003

2 A Shared Pool
An unordered set of objects.

public interface Pool {
  public void put(Object x);
  public Object remove();
}

put: inserts an object; blocks if the pool is full.
remove: removes and returns an object; blocks if the pool is empty.
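Before looking at smarter structures, here is a minimal sketch of the obvious single-lock implementation (the class name, capacity bound, and wait/notify discipline are illustrative assumptions, not from the slides); it is exactly the kind of sequential bottleneck the following slides attack:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-lock Pool: correct, but fully sequential.
class LockedPool implements Pool {
  private final Deque<Object> items = new ArrayDeque<>();
  private final int capacity = 1024;        // assumed capacity bound

  public synchronized void put(Object x) {
    while (items.size() == capacity) {      // blocks if full
      try { wait(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }
    items.addLast(x);
    notifyAll();                            // wake any blocked remover
  }

  public synchronized Object remove() {
    while (items.isEmpty()) {               // blocks if empty
      try { wait(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }
    Object x = items.removeFirst();
    notifyAll();                            // wake any blocked putter
    return x;
  }
}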

3 Simple Locking Implementation
Problem: hot-spot contention. Solution: a queue lock.
Problem: sequential bottleneck. Solution???

4 Counting Implementation
[Figure: a put counter and a remove counter, each handing out 19, 20, 21, …]
Only the counters are sequential.

5 Shared Counter
No duplication, no omission.
Not necessarily linearizable.

6 Shared Counters
Can we build a shared counter with low memory contention and real parallelism?
Locking: queue locks can reduce contention, but they do not help with the parallelism issue…

7 Software Combining Tree
[Figure: processors P1 … Pn at the leaves of a binary tree of counters]
Combine increment requests up the tree; wait for winners to propagate answers down.
Contention: only local spinning.
Parallelism: a high combining rate implies n / log n speed-up.

8 Combining Trees
Two threads meet and combine their sums: +2 and +3 become +5.

9 Combining Trees
The combined sum (+5) is added to the root.

10 Combining Trees
The new value (5) is returned to the children.

11 Combining Trees
Prior values are returned to the waiting threads.

12 Tree Node Status
FREE: nothing going on.
COMBINE: waiting to combine.
RESULT: results ready.
ROOT: stop here.
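A hedged sketch of the per-node state these statuses suggest (a sketch only; the field and class names are assumptions, not the original code):

enum CStatus { FREE, COMBINE, RESULT, ROOT }

class CNode {
  CStatus status = CStatus.FREE;
  boolean locked;      // nodes are locked while being examined or updated
  int firstValue;      // increment deposited by the first arriving thread
  int secondValue;     // increment deposited by a combining partner
  int result;          // prior counter value handed back down
  CNode parent;        // path toward the root
}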

13 Phase One: find path up to first COMBINE or ROOT node
Lock each node on the way up and examine its status.

14 Phase One: found FREE node
Set the node to COMBINE, unlock it, and move up.

15 Phase One: found RESULT node
Unlock the node and wait for its status to change.

16 Phase One: found COMBINE or ROOT node
Unlock the node and start phase two.

17 Phase Two: revisit path, combine if requested
[Figure: +2 and +3 are combined into +5 along the path]

18 Phase Three: stopped at COMBINE node
Lock the node, deposit my value, unlock, and wait for my value to be combined by the other thread.

19 Phase Three: stopped at ROOT
Add my value to the root and start back down.

20 Phase Four: propagate values down
[Figure: values propagate back down through the COMBINE nodes]

21 Bad News: High Latency
Each operation traverses a path of log n nodes.

22 Good News: Real Parallelism
[Figure: requests +2 and +3 from two threads are served by a single combined root update, +5]

23 Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work
  }
}

Take a number, then pretend to work (more work means less concurrency).
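A runnable version of the benchmark, with fetch&inc modeled by AtomicInteger.getAndIncrement() (an assumption; in the experiments the counter is whichever shared object is under test: a lock-based counter, a combining tree, or a counting network):

import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

class IndexBench {
  static final AtomicInteger counter = new AtomicInteger();

  static void indexBench(int iters, int work) throws InterruptedException {
    Random random = new Random();
    int i = 0;
    while (i < iters) {
      i = counter.getAndIncrement();          // take a number
      if (work > 0)
        Thread.sleep(random.nextInt(work));   // pretend to work
    }
  }

  public static void main(String[] args) throws InterruptedException {
    indexBench(1000, 10);                     // small illustrative run
  }
}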

24 Performance Benchmarks
Simulated Alewife DSM architecture.
Throughput: average number of inc operations in a 1-million-cycle period.
Latency: average number of simulator cycles per inc operation.

25 The Alewife Topology
* Posters courtesy of the MIT Architecture Group.

26 Performance
Simulated 64-node Alewife, work = 0: 64-leaf combining tree, MCS queue lock, spin lock.
[Plot: throughput vs. number of processors]
Counting benchmark, ignoring startup and wind-down times. The spin lock is best at low concurrency, then drops off rapidly.

27 The Combining Paradigm
Implements any read-modify-write (RMW) operation.
When the tree is loaded, it takes 2 log n steps for n requests.
Very sensitive to load fluctuations: if arrival rates drop, combining rates drop, and overall performance deteriorates!

28 Combining Load Sensitivity
[Plot: throughput over time with work = 500; notice the load fluctuations]

29 Combining Load Sensitivity
Concurrency   W=100   W=1000   W=5000
     1          0.0     n/a      n/a
     2         10.3     2.0      0.3
     4         26.1     8.5      2.2
     8         40.3    19.9     10.6
    16         50.4    31.7      n/a
    32         55.3    39.5     18.5
    48         54.2    40.0     15.6
    64         65.5     n/a     18.7

(n/a marks cells missing from the source.)
As work increases, concurrency decreases, and combining levels drop significantly…

30 Better to Wait Longer
[Plot: latency under three waiting policies: short wait, medium wait, indefinite wait]
The figure compares combining-tree latency when work is high under three waiting policies: wait 16 cycles, wait 256 cycles, and wait indefinitely. When the number of processors is larger than 64, indefinite waiting is by far the best policy. This is because an un-combined token message blocks later token messages from progressing until it returns from traversing the root, so a large performance penalty is paid for each un-combined message. Because the chances of combining are good at higher arrival rates, we found that when work = 0, simulations using more than four processors justify indefinite waiting.

31 Can we do better?
Centralized: one slow process delays everybody!
Synchronized: waiting for others to bring numbers.
Distributed: spread the work across the machine.
Coordinated: coordinate, but do not wait.

32 Counting Networks
How to coordinate access to the counters?
[Figure: processors P1 … Pn pass through a counting network into four counters handing out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …]
No duplication, no omission.

33 A Balancer
[Figure: a balancer, with input wires on the left and output wires on the right]

34 Tokens Traverse Balancers
Token i enters on any wire and leaves on wire i mod (fan-out). For a two-output balancer, successive tokens leave on the top and bottom wires alternately.


39 Tokens Traverse Balancers
Arbitrary input distribution in; balanced output distribution out.

40 Formally: A Balancer
In any quiescent state, a balancer with input counts x0 and x1 satisfies

  y0 = ⌈(x0 + x1) / 2⌉
  y1 = ⌊(x0 + x1) / 2⌋

Def: a quiescent state is one in which all tokens input on the input wires have exited on the output wires.
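A worked instance of the equations above (illustrative token counts): with x0 = 7 and x1 = 5, we get y0 = ⌈12/2⌉ = 6 and y1 = ⌊12/2⌋ = 6; a 13th input token would exit on the top wire, giving y0 = 7 and y1 = 6, so the top output leads by at most one and never trails.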

41 Smoothing Network
Guarantees the k-smooth property.
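Stated as a formula (the standard definition, not spelled out on the slide): in any quiescent state, |yi - yj| ≤ k for every pair of output wires i and j.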

42 Counting Network
Guarantees the step property.
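Stated as a formula (standard definition): in any quiescent state, 0 ≤ yi - yj ≤ 1 for any i < j. Equivalently, after N tokens have exited a width-w network, output wire i has seen exactly ⌈(N - i) / w⌉ of them.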

43 Counting Networks Count!
[Figure: a width-4 network whose output wires hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …]

44 Bitonic[4]
[Figure: animation of tokens traversing the Bitonic[4] network, one balancer step at a time]

50 Counting Networks
Good for counting the number of tokens:
low contention
no sequential bottleneck
high throughput

51 Shared Memory Implementation
[Figure: each balancer is a toggle bit (0/1); the four output wires hand out values 1+4k, 2+4k, 3+4k, and 4+4k]

52 Shared Memory Implementation
class Balancer {
  boolean toggle;
  Balancer[] next;

  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }

  boolean isLeaf() {       // assumed helper so traverse() below compiles:
    return next == null;   // a leaf balancer has no successors
  }
}

53 Shared Memory Implementation
Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;
}
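A hedged sketch of how a thread turns a traversal into a counter value, assuming the Balancer class above and a width-w network whose output wire i hands out i, i + w, i + 2w, … (the Leaf and NetworkCounter names are illustrative, not from the slides):

// Illustrative leaf: hands out leafIndex + width * k for k = 0, 1, 2, ...
class Leaf extends Balancer {
  final int leafIndex;     // output wire, 0 .. width-1
  final int width;
  int count = 0;           // tokens already served at this leaf

  Leaf(int leafIndex, int width) {
    this.leafIndex = leafIndex;
    this.width = width;
  }

  synchronized int nextValue() {
    return leafIndex + width * count++;
  }
}

class NetworkCounter {
  final Balancer root;
  NetworkCounter(Balancer root) { this.root = root; }

  int getAndIncrement() {
    Balancer b = root;
    while (!b.isLeaf())                 // same walk as traverse() above
      b = b.flip() ? b.next[0] : b.next[1];
    return ((Leaf) b).nextValue();
  }
}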

54 Message-Passing Implementation

55 The Bitonic Counting Network
Bitonic[2k] inductive construction: two side-by-side Bitonic[k] networks feed a Merger[2k].
We need only show how to construct Merger[2k].

56 Merger[2k]
[Figure: Merger[2k] built from two Merger[k] networks, one fed the even outputs of the first half and the odd outputs of the second, the other fed the remaining wires, followed by a final layer of balancers producing y0 … y2k-1]

57 Merger[4]
[Figure: the base case, Merger[4], built from Bitonic[2] balancers with even and odd wires crossed between layers]

58 Merger[8]
[Figure: Merger[8] unfolded from two Merger[4] networks plus a final balancer layer]
Theorem: in any quiescent state, the outputs of Bitonic[w] have the step property.

59 Merger[2k]

60 Proof Outline
The outputs from Bitonic[k] become the inputs to Merger[k]; of the even and odd subsequences, one has at most one more token than the other.

61 Proof Outline
[Figure: from the inputs of Merger[k] to the outputs of Merger[k]]

62 Proof Outline
[Figure: from the outputs of Merger[k] to the outputs of the last balancer layer]

63 Depth of the Bitonic Network
[Figure: a width-w Bitonic[2k] as two Bitonic[k] networks followed by Merger[2k]]

64 Unfolded Bitonic Network
[Figure: the unfolded network with the final Merger[8] stage highlighted]

65 Unfolded Bitonic Network
[Figure: the two Merger[4] stages highlighted]

66 Unfolded Bitonic Network
[Figure: the four Merger[2] stages highlighted]

67 Bitonic[k] Depth
For width w, the depth is
  log 2 + log 4 + … + log (w/2) + log w = (log w)(log w + 1) / 2
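As a quick check, for w = 8 (logs base 2): log 2 + log 4 + log 8 = 1 + 2 + 3 = 6 = (3 · 4) / 2.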

68 Lower Bound on Depth
Theorem: the depth of any width-w counting network is at least Ω(log w).
Theorem: there exists a counting network of Θ(log w) depth.
Unfortunately, the proof is non-constructive and the constants are in the thousands.

69 The Periodic Network

70 Network Depth
Each Block[k] has depth log₂ k.
We need log₂ k blocks, for a grand total of (log₂ k)² layers.
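As a check, for k = 8: (log₂ 8)² = 9 layers for the periodic network, versus (log₂ 8)(log₂ 8 + 1) / 2 = 6 for Bitonic[8].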

71 Bitonic[k] is not Linearizable
[Figure: animation of an execution in which one token takes the value 2 before another token takes the value 0]

75 Bitonic[k] is not Linearizable
The problem: red finished before yellow started, yet red took 2 and yellow took 0.

76 Sequential Theorem
If a balancing network counts sequentially, meaning when tokens traverse it one at a time, then it also counts when tokens traverse it concurrently.

77 Red First, Blue Second

78 Blue First, Red Second

79 Either Way
Same balancer states.

80 Order Doesn't Matter
Same balancer states, same output distribution.

81 Index Distribution Benchmark
void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();              // take a number
    Thread.sleep(random() % work);  // pretend to work
  }
}

82 Performance (Simulated)
[Plot: throughput vs. number of processors; higher is better. MCS queue lock and spin lock shown]
Counting benchmark, ignoring startup and wind-down times. The spin lock is best at low concurrency, then drops off rapidly.
* All graphs taken from Herlihy, Lim, Shavit, copyright ACM.

83 Performance (Simulated)
[Plot: the same axes, adding a 64-leaf combining tree and an 80-balancer counting network]

84 Performance (Simulated)
Combining and counting are pretty close.

85 Performance (Simulated)
But they beat the hell out of the competition!

86 Saturation and Performance
Undersaturated: P < w log w.
Saturated (optimal performance): P = w log w.
Oversaturated: P > w log w.
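For example (arithmetic, not from the slides): a Bitonic[16] network has w = 16 and log w = 4, so saturation, and with it peak throughput, is expected around P = 16 · 4 = 64 processors.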

87 Throughput vs. Size
[Plot: throughput vs. number of processors for Bitonic[16]]
The simulation was extended to 80 processors, one balancer per processor in the 16-wide network, and as can be seen the counting network scales…

