Shared Counters and Parallelism


Shared Counters and Parallelism. Multiprocessor Synchronization, Nir Shavit, Spring 2003.

A Shared Pool: an unordered set of objects. put inserts an object and blocks if the pool is full; remove removes and returns an object and blocks if the pool is empty.

public interface Pool {
  public void put(Object x);
  public Object remove();
}

Simple Locking Implementation. Problem: hot-spot contention. Solution: a queue lock. Problem: still a sequential bottleneck. Solution???
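For concreteness, here is a hedged sketch of the naive approach this slide criticizes: one lock (here a synchronized queue) serializes every put and remove on a single hot spot. The sketch is unbounded for brevity, so put never blocks, and the class name LockedPool is illustrative, not from the slides.

import java.util.ArrayDeque;

class LockedPool implements Pool {                 // Pool is the interface from the slide above
  private final ArrayDeque<Object> items = new ArrayDeque<>();

  public synchronized void put(Object x) {
    items.addLast(x);
    notifyAll();                                   // wake any thread blocked in remove()
  }

  public synchronized Object remove() {
    while (items.isEmpty()) {
      try { wait(); } catch (InterruptedException ignored) { }  // sketch: ignore interrupts
    }
    return items.removeFirst();
  }
}

Every operation acquires the same monitor, which is exactly the hot spot and sequential bottleneck the following slides set out to remove.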

Counting Implementation. put and remove each take a number from their own shared counter (..., 19, 20, 21, ...) that tells them which slot to use. Only the counters are sequential.
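A minimal sketch of this idea, assuming a cyclic array of single-slot blocking queues and fetch-and-increment counters; the class name CountedPool and the use of ArrayBlockingQueue and AtomicInteger are illustrative choices, not from the slides.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class CountedPool<T> {
  private final AtomicInteger putCounter = new AtomicInteger();     // only this is sequential
  private final AtomicInteger removeCounter = new AtomicInteger();  // ... and this
  private final ArrayBlockingQueue<T>[] slots;                      // each slot blocks on its own

  @SuppressWarnings("unchecked")
  CountedPool(int size) {
    slots = (ArrayBlockingQueue<T>[]) new ArrayBlockingQueue[size];
    for (int i = 0; i < size; i++) slots[i] = new ArrayBlockingQueue<>(1);
  }

  void put(T x) throws InterruptedException {
    int i = putCounter.getAndIncrement() % slots.length;   // take a number
    slots[i].put(x);                                        // blocks if that slot is still full
  }

  T remove() throws InterruptedException {
    int i = removeCounter.getAndIncrement() % slots.length; // take a number
    return slots[i].take();                                 // blocks if that slot is still empty
  }
}

The rest of the lecture is about making those two counters themselves scalable.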

Shared Counter. No duplication, no omission. Not necessarily linearizable.

Shared Counters. Can we build a shared counter with low memory contention and real parallelism? Locking: queue locks can reduce contention, but they do not help with the parallelism issue...

Software Combining Tree. Combine increment requests up the tree, and wait for winners to propagate answers down. Contention: only local spinning. Parallelism: a high combining rate implies n / log n speed-up.

Combining Trees. Two threads meet at a node and combine their sums: +2 and +3 become +5.

Combining Trees. The combined sum (+5) is added to the root, which becomes 5.

Combining Trees. The new value (5) is returned down to the children.

Combining Trees. The prior values are returned to the individual threads.

Tree Node Status. FREE: nothing going on. COMBINE: waiting to combine. RESULT: result ready. ROOT: stop here.
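A hedged sketch of the per-node bookkeeping this status machine implies, following the general structure of the combining-tree algorithm in the book; the field names are illustrative rather than the slides' exact code.

enum CStatus { FREE, COMBINE, RESULT, ROOT }

class Node {
  CStatus status = CStatus.FREE;
  boolean locked = false;       // guarded by synchronized/wait/notifyAll on the node
  int firstValue;               // increment deposited by the first thread to arrive
  int secondValue;              // increment deposited by a second, combining thread
  int result;                   // prior value handed back once status becomes RESULT
  Node parent;                  // null at the root

  Node() { status = CStatus.ROOT; }            // the root node
  Node(Node parent) { this.parent = parent; }  // an interior or leaf node
}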

Phase One: find the path up to the first COMBINE or ROOT node. Lock each node and examine its status.

Phase One: found a FREE node. Set it to COMBINE, unlock, and move up.

Phase One: found a RESULT node. Unlock and wait for the status to change.

Phase One: found a COMBINE or ROOT node. Unlock and start phase two.

Phase Two: revisit the path, combining values where requested.

Phase Three: stopped at a COMBINE node. Lock the node, deposit the value, then unlock and wait for the value to be combined by the other thread.

Phase Three: stopped at the ROOT. Add the combined value to the root and start back down.

Phase Four: propagate values back down the tree.

Bad News: high latency. Each operation traverses a path of length log n.

Good News: real parallelism. One thread carries the combined +5 up to the root on behalf of two threads (+2 and +3).

Index Distribution Benchmark.

void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();               // take a number
    Thread.sleep(random() % work);   // pretend to work (more work means less concurrency)
  }
}
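For reference, a runnable rendering of this loop, with AtomicInteger.getAndIncrement standing in for fetch&inc; in the experiments, fetch&inc would be whichever counter construction is under test, and the class name IndexBench is made up here.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

class IndexBench {
  static final AtomicInteger counter = new AtomicInteger();    // the counter under test

  static void indexBench(int iters, int work) throws InterruptedException {
    int i = 0;
    while (i < iters) {
      i = counter.getAndIncrement();                            // take a number
      Thread.sleep(ThreadLocalRandom.current().nextInt(work));  // pretend to work (work > 0)
    }
  }
}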

Performance Benchmarks. Simulated Alewife DSM architecture. Throughput: the average number of inc operations completed in a one-million-cycle period. Latency: the average number of simulator cycles per inc operation.

The Alewife Topology. (Posters courtesy of the MIT Architecture Group.)

Performance. Simulated 64-node Alewife, work = 0. Throughput versus number of processors for a spin lock, an MCS queue lock, and a 64-leaf combining tree (counting benchmark; startup and wind-down times ignored). The spin lock is best at low concurrency but drops off rapidly.

The Combining Paradigm. Implements any read-modify-write (RMW) operation. When the tree is loaded, it takes 2 log n steps for n requests. Very sensitive to load fluctuations: if arrival rates drop, combining rates drop, and overall performance deteriorates!

Combining Load Sensitivity (work = 500). Notice the load fluctuations.

Combining Load Sensitivity (combining levels, by concurrency and work; some cells did not survive the transcript):

Concurrency   W=100   W=1000   W=5000
     1          0.0
     2         10.3      2.0      0.3
     4         26.1      8.5      2.2
     8         40.3     19.9     10.6
    16         50.4     31.7
    32         55.3     39.5     18.5
    48         54.2     40.0     15.6
    64         65.5               18.7

As work increases, concurrency decreases, and combining levels drop significantly...

Better to Wait Longer. The figure compares combining-tree latency when work is high under three waiting policies: a short wait (16 cycles), a medium wait (256 cycles), and an indefinite wait. When the number of processors is larger than 64, indefinite waiting is by far the best policy. This is because an un-combined token message blocks later token messages from progressing until it returns from traversing the root, so a large performance penalty is paid for each un-combined message. Because the chances of combining are good at higher arrival rates, we found that when work = 0, simulations using more than four processors justify indefinite waiting.

Can we do better? Centralized: one slow process delays everybody! Synchronized: you wait for others to bring numbers. We want something distributed (spread the work across the machine) and coordinated (coordinate, but do not wait).

Counting Networks. How do we coordinate access to the counters? Processors P1 ... Pn send tokens through a counting network; the counters on its output wires hand out 0, 4, 8, ...; 1, 5, 9, ...; 2, 6, 10, ...; and 3, 7, .... No duplication, no omission.

A Balancer. A switching element with input wires and output wires.

Tokens Traverse Balancers. The i-th token to enter (on any input wire) leaves on output wire i mod (fan-out).


Tokens Traverse Balancers. An arbitrary input distribution yields a balanced output distribution.

Formally: A Balancer. If x0 and x1 tokens arrive on the two input wires, then in any quiescent state the output wires carry y0 = ⌈(x0 + x1)/2⌉ and y1 = ⌊(x0 + x1)/2⌋ tokens. Definition: a quiescent state is one in which all tokens input on the input wires have exited on the output wires.

Smoothing Network. Satisfies the k-smooth property: in any quiescent state, the numbers of tokens on any two output wires differ by at most k.

Counting Network. Satisfies the step property: in any quiescent state, 0 ≤ yi − yj ≤ 1 for any output wires i < j.
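The step property is easy to state directly in code. A minimal helper, assuming y[i] is the number of tokens that have exited on output wire i in a quiescent state; the class name StepProperty is illustrative.

class StepProperty {
  // 0 <= y[i] - y[j] <= 1 must hold for every pair of output wires i < j
  static boolean holds(int[] y) {
    for (int i = 0; i < y.length; i++)
      for (int j = i + 1; j < y.length; j++)
        if (y[i] - y[j] < 0 || y[i] - y[j] > 1)
          return false;
    return true;
  }
}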

Counting Networks Count! With w output wires, the counter on wire i hands out i, i + w, i + 2w, ... (here: 0, 4, 8, ...; 1, 5, 9, ...; 2, 6, 10, ...; 3, 7, ...).

Bitonic[4]: a step-by-step example execution of the four-wire bitonic counting network.

Counting Networks. Good for counting the number of tokens: low contention, no sequential bottleneck, high throughput.

Shared Memory Implementation. Each balancer is a toggle bit (0/1); the counters on the four output wires hand out 1+4k, 2+4k, 3+4k, and 4+4k.

Shared Memory Implementation.

class Balancer {
  boolean toggle;                    // which output wire the next token takes
  Balancer[] next;                   // the two balancers (or leaves) this one feeds

  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }

  boolean isLeaf() {
    return next == null;             // one simple convention: no successors marks an output wire
  }
}

Shared Memory Implementation.

Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {              // keep moving until we reach an output wire
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;                          // the leaf tells us which counter to use
}
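Putting the pieces together, a self-contained, hedged sketch of a width-4 counter: the six toggle-bit balancers of Bitonic[4] (the network shown earlier), hard-wired layer by layer, with a counter on output wire i handing out i, i+4, i+8, .... The class names Toggle and BitonicFour are illustrative; each balancer is re-declared as its own tiny Toggle so the sketch stands alone, and the first token through a balancer exits on its top (lower-numbered) wire. A real implementation would build the network recursively as in the inductive construction below.

import java.util.concurrent.atomic.AtomicInteger;

class Toggle {
  private boolean up = true;                    // true: the next token exits on the top wire
  synchronized boolean flip() {
    boolean old = up;
    up = !up;
    return old;
  }
}

class BitonicFour {
  private static final int WIDTH = 4;
  // The six balancers of Bitonic[4]; PAIRS[l][b] gives the (top, bottom) wires of balancer b in layer l.
  private static final int[][][] PAIRS = {
    { {0, 1}, {2, 3} },   // layer 0: two Bitonic[2] networks
    { {0, 3}, {1, 2} },   // layer 1: Merger[4], even and odd wires crossed
    { {0, 1}, {2, 3} },   // layer 2: Merger[4], final layer
  };
  private final Toggle[][] layer = new Toggle[PAIRS.length][2];
  private final AtomicInteger[] counters = new AtomicInteger[WIDTH];

  BitonicFour() {
    for (int l = 0; l < PAIRS.length; l++)
      for (int b = 0; b < 2; b++)
        layer[l][b] = new Toggle();
    for (int i = 0; i < WIDTH; i++)
      counters[i] = new AtomicInteger(i);       // output wire i hands out i, i+4, i+8, ...
  }

  int getAndIncrement(int inputWire) {
    int wire = inputWire;
    for (int l = 0; l < PAIRS.length; l++) {
      for (int b = 0; b < 2; b++) {
        int top = PAIRS[l][b][0], bottom = PAIRS[l][b][1];
        if (wire == top || wire == bottom) {            // this balancer handles my current wire
          wire = layer[l][b].flip() ? top : bottom;
          break;
        }
      }
    }
    return counters[wire].getAndAdd(WIDTH);
  }
}

Threads would typically enter on inputWire = threadId % 4 to spread the load across the input wires.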

Message-Passing Implementation.

The Bitonic Counting Network. Bitonic[2k] inductive construction: two Bitonic[k] networks feed a Merger[2k]. We need only show how to construct Merger[2k].

Merger[2k]. The even-indexed outputs of one Bitonic[k] and the odd-indexed outputs of the other feed one Merger[k]; the remaining wires feed a second Merger[k]. A final layer of balancers joins corresponding outputs of the two Merger[k] networks to produce y0, y1, ..., y2k-1.

Merger[4]. The smallest non-trivial instance: two Bitonic[2] networks (single balancers) feed Merger[4], with even and odd wires crossed between its two internal Merger[2] networks as in the general scheme.

Merger[8]. Two Bitonic[4] networks feed Merger[8], which again splits even and odd wires between two Merger[4] networks followed by a final layer of balancers. Theorem: in any quiescent state, the outputs of Bitonic[w] have the step property.


Proof Outline. The outputs of the two Bitonic[k] networks have the step property, so when their even-indexed and odd-indexed wires are split between the two Merger[k] networks, one Merger[k] receives at most one more token than the other. By induction each Merger[k] then produces outputs with the step property, and the final layer of balancers turns those outputs into outputs of the whole network that satisfy the step property.

Depth of the Bitonic Network (width w).

Unfolded Bitonic Network. Bitonic[8] unfolds into four Merger[2] networks (single balancers), then two Merger[4] networks, then a single Merger[8].

Bitonic[k] Depth. For width w, the depth is log 2 + log 4 + ... + log(w/2) + log w = (log w)(log w + 1)/2. For example, w = 4 gives 1 + 2 = 3 layers.

Lower Bound on Depth. Theorem: the depth of any width-w counting network is Ω(log w). Theorem: there exists a counting network of Θ(log w) depth. Unfortunately, the proof is non-constructive and the constants are in the thousands.

The Periodic Network.

Network Depth. Each block[k] has depth log₂ k. We need log₂ k blocks, for a grand total of (log₂ k)².


Bitonic[k] is not Linearizable. The problem: the red token finished before the yellow token started, yet red took 2 and yellow took 0.

Sequential Theorem. If a balancing network counts sequentially, meaning that tokens traverse it one at a time, then it counts even when tokens traverse it concurrently.

Red First, Blue Second.

Blue First, Red Second.

Either Way: same balancer states.

Order Doesn't Matter: same balancer states, same output distribution.
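One practical consequence of the Sequential Theorem: a balancing network can be sanity-checked by pushing tokens through one at a time and testing the step property after each token. A hedged sketch, reusing the illustrative StepProperty and BitonicFour classes from earlier (neither is from the slides); testing one random sequential schedule is only a sanity check, not a proof.

class SequentialCheck {
  public static void main(String[] args) {
    java.util.Random random = new java.util.Random(42);
    BitonicFour net = new BitonicFour();
    int[] exited = new int[4];                              // tokens seen so far on each output wire
    for (int t = 1; t <= 1000; t++) {
      int value = net.getAndIncrement(random.nextInt(4));   // one token at a time, random input wire
      exited[value % 4]++;                                  // value mod width identifies the exit wire
      if (!StepProperty.holds(exited))
        throw new AssertionError("step property violated after " + t + " tokens");
    }
    System.out.println("step property held for 1000 sequential tokens");
  }
}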

Index Distribution Benchmark (the same benchmark as before: take a number with fetch-and-increment, then pretend to work for a random time).

Performance (Simulated). Throughput versus number of processors (higher is better) for a spin lock and an MCS queue lock; counting benchmark, ignoring startup and wind-down times. The spin lock is best at low concurrency but drops off rapidly. (All graphs taken from Herlihy, Lim, and Shavit; copyright ACM.)

Performance (Simulated). The same graph with a 64-leaf combining tree and an 80-balancer counting network added.

Performance (Simulated). Combining and counting are pretty close.

Performance (Simulated). But they beat the hell out of the competition!

Saturation and Performance. Undersaturated: P < w log w. Saturated: P = w log w (optimal performance). Oversaturated: P > w log w. For example, a Bitonic[16] network is saturated at P = 16 · log 16 = 64 processors.

Throughput vs. Size. The simulation was extended to 80 processors, with one balancer per processor in the 16-wide network (Bitonic[16]); as can be seen, the counting network scales...