Counting and Distributed Coordination BASED ON CHAPTER 12 IN «THE ART OF MULTIPROCESSOR PROGRAMMING» LECTURE BY: SAMUEL AMAR

INTRO What we will discuss today: the shared counting problem, and three data structures that attack it: Combining Trees, Counting Networks, and Diffraction Trees.

HOW DO WE MEASURE PERFORMANCE? Latency – the time for an individual method call to complete. Throughput – the overall rate at which method calls complete.

EXAMPLE

POOLS A data structure with Put() and Get() methods. Problem – the lock protecting Get() and Put() is a bottleneck. Solution – a cyclic array and 2 counters!
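The cyclic-array idea can be sketched in a few lines of Java. This is a minimal illustration under my own naming (CyclicPool, putCount, getCount are not from the slides), and it deliberately keeps the two shared counters that the following slides identify as the real problem:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch, not production code: a cyclic array indexed by two shared
// counters. Put and Get each grab the next index; the counters themselves
// remain sequential bottlenecks, which is the problem this lecture attacks.
class CyclicPool<T> {
    private final T[] items;
    private final AtomicInteger putCount = new AtomicInteger(0);
    private final AtomicInteger getCount = new AtomicInteger(0);

    @SuppressWarnings("unchecked")
    CyclicPool(int capacity) {
        items = (T[]) new Object[capacity];
    }

    void put(T x) {
        int slot = putCount.getAndIncrement() % items.length;
        items[slot] = x; // a real pool must also wait until the slot is free
    }

    T get() {
        int slot = getCount.getAndIncrement() % items.length;
        return items[slot]; // a real pool must also wait until the slot is full
    }
}
```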

YET MORE PROBLEMS How do we prevent memory contention? How do we parallelize counter++? We need a way to build parallel counters that spread the indexes as evenly as possible!

COMBINING TREE A combining tree is a binary tree of nodes. The counter is in the root. Each thread is assigned a leaf, and at most two threads share a leaf.

ALGORITHM OVERVIEW Each thread that calls GetAndIncrement() climbs the tree toward the root, where count++ is performed and the prior count is returned. If two threads arrive at a node simultaneously: the Active thread continues up the tree and updates the counter with the combined value; the Passive thread waits for the active one to come back.
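As a structural sketch, the book's CombiningTree.getAndIncrement() follows exactly these phases. The per-node locking is hidden inside the Node methods (precombine, combine, op, distribute), and the Node type itself is detailed a few slides below:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Structural sketch following the book's CombiningTree; the Node methods hide
// the per-node synchronization, and ThreadID.get() is the book's helper that
// returns a unique per-thread index.
class CombiningTree {
    Node[] leaf; // leaf[i] serves threads 2i and 2i+1

    public int getAndIncrement() throws InterruptedException {
        Deque<Node> stack = new ArrayDeque<>();
        Node myLeaf = leaf[ThreadID.get() / 2];
        Node node = myLeaf;
        // Phase 1, precombining: climb while this thread is first to the node.
        while (node.precombine()) {
            node = node.parent;
        }
        Node stop = node;
        // Phase 2, combining: climb again, folding in any waiting second
        // thread's value along the way.
        int combined = 1;
        for (node = myLeaf; node != stop; node = node.parent) {
            combined = node.combine(combined);
            stack.push(node);
        }
        // Phase 3, operation: apply the combined increment at the stop node
        // (the root, or a node where this thread deposits its value and waits).
        int prior = stop.op(combined);
        // Phase 4, distribution: descend, handing passive threads their results.
        while (!stack.isEmpty()) {
            stack.pop().distribute(prior);
        }
        return prior;
    }
}
```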

EXAMPLE (figure sequence: threads A and B meet at a shared leaf while the counter holds 3; their increments are combined and carried up, the counter advances to 5, and the prior values are handed back down so that A returns 4 and B returns 3)

ADVANTAGES AND DISADVANTAGES Advantages – good throughput: p concurrent increments complete in O(log p) time on a combining tree, versus O(p) when serialized behind a single lock; and the root can apply any combining function, not just increment. Disadvantage – bad latency: an individual call takes O(log p) on a combining tree, versus O(1) with a single lock.

NODE IMPLEMENTATION (figure: a node with fields parent, CStatus, result, firstValue, secondValue, and locked)

NODE IMPLEMENTATION 6 properties:
- Locked – true while the node is in use.
- FirstValue – the value deposited by the Active thread.
- SecondValue – the value deposited by the Passive thread.
- Result – the final combined value.
- Parent – pointer to the node's parent.
- CStatus – the node's combining status (next slide).

CSTATUS
- IDLE: the node is not in use.
- FIRST: one thread has visited.
- SECOND: a second thread has visited and deposited its value.
- RESULT: both threads' operations have completed.
- ROOT: the root node.
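Putting the last two slides together, the node state can be written as a small Java skeleton (field names follow the slides; the method bodies from the earlier sketch are omitted):

```java
// Node state for the combining tree, per the two preceding slides.
enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT }

class Node {
    Node parent;        // null at the root
    CStatus cStatus;    // IDLE, FIRST, SECOND, RESULT, or ROOT
    boolean locked;     // true while a combining pair is using the node
    int firstValue;     // value deposited by the active thread
    int secondValue;    // value deposited by the passive thread
    int result;         // combined result handed back down during distribution
}
```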

ADVANCED EXAMPLE - INIT (figure: the root holds the count 3; all other nodes are IDLE and unlocked)

ADVANCED EXAMPLE - PRECOMBINING (figures: threads A, B, C, and D climb from their leaves; each node a thread reaches first becomes FIRST, and each node where a second thread catches up becomes SECOND and is locked)

ADVANCED EXAMPLE - COMBINING (figures: at each shared node the active thread stores its own value as firstValue and the waiting thread's as secondValue, carrying the combined total of 4 toward the root)

ADVANCED EXAMPLE - DISTRIBUTION (figures: the root advances from 3 to 7; the prior values are handed back down, the four threads return 3, 4, 5, and 6, and the nodes are reset to IDLE and unlocked)

ALL THAT FOR COUNT++ ?!

PERFORMANCE REVIEW Optimal when threads arrive at the leaves at the right time, maximizing combining. What happens when contention is low? How long do we wait for another thread to come and combine?

ROBUSTNESS An algorithm is robust if it performs well in the presence of large fluctuations in request arrival times. Is the combining tree a robust algorithm? NO!

MOTIVATION We need an algorithm that counts a stream of “tokens” regardless of their arrival times or order.

INDEX DISTRIBUTION Given a stream of incoming tokens and w shared counters, how would we like to distribute the tokens among the exits? Ideally, exit j hands out the values j, j + w, j + 2w, … – equivalently, in round i the w exits return the values i*w + 1 through i*w + w.

THE STEP PROPERTY “No matter how token arrivals are distributed among the input wires, the output distribution is balanced across the output wires, where the top output wires are filled first” A network with this property balances the tokens perfectly
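Stated a bit more formally (my paraphrase of the book's definition, not from the slide): in any quiescent state, if y_i denotes the number of tokens that have exited on output wire i, then 0 ≤ y_i − y_j ≤ 1 for every pair of wires i < j.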

BALANCER A component with 2 entries and 2 exits. Contains a “toggle” bit that alternates between up and down. Each token leaves on the exit indicated by the toggle, flipping it on the way. The balancer fulfills the step property for w = 2.
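A minimal lock-based sketch of such a balancer in Java (the book also considers non-blocking variants):

```java
// A blocking balancer: one toggle bit behind a lock.
class Balancer {
    private boolean toggle = true; // true = route the next token to the top wire

    // Returns 0 for the top output wire or 1 for the bottom one, then flips.
    public synchronized int traverse() {
        try {
            return toggle ? 0 : 1;
        } finally {
            toggle = !toggle;
        }
    }
}
```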

COUNTING NETWORK A counting network of width k fulfills: constructed only from balancers; k input and k output wires; the step property.

COUNTING NETWORK EXAMPLE

BITONIC[2K] COUNTING NETWORK A counting network of width 2K (its depth is analyzed below). Defined inductively, for any K = power of 2: K=2: a single balancer. K > 2: merge 2 Bitonic[K] networks into a Merger[2K] network.

MERGER[2K] Used to merge the outputs of 2 Bitonic[K] networks. Defined inductively, for any K = power of 2: K=2: a single balancer. K > 2: merge the odd and even outputs of 2 Merger[K] networks through a final layer of K balancers. Bitonic[2K] fulfils the step property!

BITONIC NETWORK IMPLEMENTATION A simple enough implementation in which the “tokens” are the threads. A Balancer contains a simple toggle switch plus 4 pointers (2 entries and 2 exits).

BITONIC NETWORK IMPLEMENTATION A Merger contains an array holding its two lower-order Mergers and an array holding its final layer of balancers.

BITONIC NETWORK IMPLEMENTATION A Bitonic contains an array holding its two lower-order Bitonics and a full-width Merger. All classes expose a traverse(i) method that carries a token from input wire i to its exit wire.
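To make the network count, a common arrangement attaches a local counter to each output wire. This is a hypothetical sketch under assumed names (NetworkCounter is mine; the Bitonic class and its traverse(i) method follow the slides):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// A token enters on a random wire; the exit wire selects a local counter.
// Counter i starts at i and hands out i, i + width, i + 2*width, ...
class NetworkCounter {
    private final Bitonic network;      // assumed: width-wide, with traverse(int)
    private final AtomicInteger[] count;
    private final int width;

    NetworkCounter(Bitonic network, int width) {
        this.network = network;
        this.width = width;
        count = new AtomicInteger[width];
        for (int i = 0; i < width; i++) count[i] = new AtomicInteger(i);
    }

    int getAndIncrement() {
        int exit = network.traverse(ThreadLocalRandom.current().nextInt(width));
        return count[exit].getAndAdd(width);
    }
}
```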

BITONIC NETWORK DEPTH
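The slide's figure is not reproduced here, but the standard depth calculation (a known result, stated for completeness) goes as follows: Merger[2k] has depth log(2k), so depth(Bitonic[2k]) = depth(Bitonic[k]) + log(2k), which solves to depth(Bitonic[w]) = (log w)(log w + 1) / 2 = O(log² w).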

PERFORMANCE REVIEW Throughput is optimal when #threads ≈ #balancers and all balancers are kept busy. Performance improves as threads are added, until it plateaus and then degrades. The network is wait-free or lock-free depending on the balancer implementation. But is this really a counting network?

PERIODIC COUNTING NETWORK

BLOCK[2K] Defined inductively, for any K = power of 2: K=2: a single balancer. K > 2: merge the corresponding outputs of 2 Block[K] networks through a final layer of K balancers. The Periodic[2K] network – log(2K) consecutive Block[2K] networks – fulfils the step property!
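For comparison with the bitonic network (a standard result, stated here for completeness): Block[w] has depth log w, and Periodic[w] chains log w such blocks back to back, so its total depth is log² w.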

DIFFERENCE

MOTIVATION

DIFFRACTION BALANCER Let's consider a new type of balancer with only one input. It works the same way, sending tokens to wires 0 and 1 alternately.

TREE[2K] A binary tree defined inductively, for any K = power of 2: K=2: a single diffraction balancer. K > 2: merge 2 Tree[K] with one new root diffraction balancer; the top tree's outputs become the even output wires and the bottom tree's the odd ones.

STEP PROPERTY But do we fulfil the “step property”? YES!

PARTIAL PROOF We prove inductively that the outputs are filled from top to bottom, mod w: For k=2: a diffraction balancer alternates, so the property holds. For k>2: assume each Tree[k] has the step property. The root balancer alternates, so the two subtrees receive token counts that differ by at most one; by the induction hypothesis each subtree's outputs form a step sequence, and the network's outputs are a perfect shuffle of the two subtrees' outputs, which is again a step sequence.

REVIEW SO FAR Advantages – the depth is now only O(log K)! Disadvantages – the root node is a bottleneck.

ATTEMPTED SOLUTION Observation – if an even number of tokens pass through a balancer, the outputs are evenly balanced between the top and bottom wires, and the balancer's state is left unchanged. How can we use this to our advantage?

EXCHANGER A data structure that allows 2 threads to exchange values. Contains a timeout.

PRISM Basically, an array of Exchangers. Threads access a random slot of the array through the visit() method. visit() returns a Boolean value determined by the exchange that was made, or throws a TimeoutException.

PRISMS Each thread calls visit() and proposes its thread ID. If an exchange was made, the thread with the higher ID takes the top wire. Otherwise, the thread falls back to toggling the balancer.
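A sketch of the prism in Java, using java.util.concurrent.Exchanger as a stand-in for the book's exchanger; the class shape and timeout value are illustrative assumptions, not the book's exact code:

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// An array of exchangers, visited at a random slot.
class Prism {
    private static final long TIMEOUT_MS = 2; // tunable; see the next slide
    private final Exchanger<Long>[] exchanger;

    @SuppressWarnings("unchecked")
    Prism(int capacity) {
        exchanger = (Exchanger<Long>[]) new Exchanger[capacity];
        for (int i = 0; i < capacity; i++) {
            exchanger[i] = new Exchanger<>();
        }
    }

    // Try to pair up ("diffract") with another thread at a random slot.
    // Returns true if this thread should take the top wire, false for the
    // bottom; throws TimeoutException if no partner showed up in time.
    public boolean visit() throws TimeoutException, InterruptedException {
        long me = Thread.currentThread().getId();
        int slot = ThreadLocalRandom.current().nextInt(exchanger.length);
        long other = exchanger[slot].exchange(me, TIMEOUT_MS, TimeUnit.MILLISECONDS);
        return me > other; // the thread with the higher id takes the top wire
    }
}
```

A diffraction balancer would then try visit() first and fall back to its ordinary toggle (for instance, the Balancer sketch earlier) only when the exchange times out.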

PERFORMANCE REVIEW Depends on two major factors: Timeout – too small = misses; too big = wasted time. Prism size – too small = missed opportunities; too big = misses. What are the best parameters? Set them dynamically according to contention! Under optimal parameters, Diffraction Trees are believed to outperform Counting Networks and Combining Trees.