INTRO What we will discuss today: The shared counting problem. Data structures: Combining Trees, Counting Networks, Diffraction Trees.
HOW DO WE MEASURE PERFORMANCE? Latency: the time it takes an individual method call to complete. Throughput: the overall rate at which method calls complete.
POOLS A data structure with Put() and Get() methods. Problem: a single lock around Get() and Put() becomes a bottleneck. Solution: a cyclic array and 2 counters!
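A minimal sketch of the cyclic-array idea, assuming one counter picks the slot for each Put() and a second counter picks the slot for each Get(). The names CyclicArrayPool and Counter are hypothetical, and the sketch raises an error where a real pool would wait for the slot to become free or full.

```python
import threading


class Counter:
    """Lock-protected counter; this is the part we later want to parallelize."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get_and_increment(self):
        with self._lock:
            prior = self._value
            self._value += 1
            return prior


class CyclicArrayPool:
    """Pool backed by a cyclic array and two counters (illustrative only)."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._put_counter = Counter()  # chooses the slot for the next put
        self._get_counter = Counter()  # chooses the slot for the next get

    def put(self, item):
        i = self._put_counter.get_and_increment() % self._capacity
        if self._slots[i] is not None:
            # Assumption: a real pool would wait here instead of failing.
            raise RuntimeError("slot is still full")
        self._slots[i] = item

    def get(self):
        i = self._get_counter.get_and_increment() % self._capacity
        item = self._slots[i]
        if item is None:
            # Assumption: a real pool would wait here instead of failing.
            raise RuntimeError("slot is still empty")
        self._slots[i] = None
        return item
```

With this layout, Put() and Get() only contend on the two counters, which is why the rest of the talk is about making those counters scale.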
YET MORE PROBLEMS How do we prevent memory contention? How do we parallelize the counter++? We need a way to build parallel counters that spread the indexes as evenly as possible!
COMBINING TREE A combining tree is a binary tree of nodes. The counter is in the root. Each thread is assigned a leaf. At most two threads share a leaf.
ALGORITHM OVERVIEW Each thread that calls GetAndIncrement() climbs from its leaf to the root, increments the counter there and returns the counter's prior value. If two threads arrive at a node at about the same time, they combine: the Active thread continues up the tree and updates the counter with the combined value, while the Passive thread waits for the Active thread to come back with its result.
[Figure: threads A and B combine at a shared node; the root counter goes from 3 to 5, and the threads return B=3 and A=4.]
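The outcome of one fully combined round can be sketched with a single-threaded simulation. This is not the concurrent algorithm: the function name combining_round and its list-of-leaves input are assumptions made for illustration, and it assumes every waiting request gets combined.

```python
def combining_round(root_count, leaf_requests):
    """Simulate one fully combined round on a complete binary tree.

    leaf_requests[i] lists the threads (at most two) waiting at leaf i;
    the number of leaves is assumed to be a power of two.
    Returns (new_root_count, {thread: value returned to that thread}).
    """
    totals = {}

    def combine(lo, hi):
        # Upward pass: sum the increments requested below this node.
        if hi - lo == 1:
            total = len(leaf_requests[lo])
        else:
            mid = (lo + hi) // 2
            total = combine(lo, mid) + combine(mid, hi)
        totals[(lo, hi)] = total
        return total

    results = {}

    def distribute(lo, hi, prior):
        # Downward pass: `prior` is the first counter value owned by this subtree.
        if hi - lo == 1:
            for offset, thread in enumerate(leaf_requests[lo]):
                results[thread] = prior + offset
            return
        mid = (lo + hi) // 2
        distribute(lo, mid, prior)
        distribute(mid, hi, prior + totals[(lo, mid)])

    total = combine(0, len(leaf_requests))
    distribute(0, len(leaf_requests), root_count)
    return root_count + total, results
```

Calling combining_round(3, [["B", "A"], []]) reproduces the small example: the root goes from 3 to 5, B receives 3 and A receives 4.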
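ADVANTAGES AND DISADVANTAGES Advantage: good throughput. p concurrent increments take O(p) time with a lock-based counter, but O(log p) with a combining tree under ideal conditions. The same scheme can apply other commutative functions at the root, not just increment. Disadvantage: bad latency. A single call takes O(1) with a lock-based counter, but O(log p) with a combining tree.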
NODE IMPLEMENTATION Each node has 6 fields: Locked: true if the node is locked. FirstValue: the value of the Active thread. SecondValue: the value of the Passive thread. Result: the final combined value. Parent: a pointer to the node's parent. Cstatus: the node's combining status.
CSTATUS IDLE: the node is not in use. FIRST: one thread has visited the node. SECOND: a second thread has visited the node. RESULT: both threads’ operations have been combined and completed. ROOT: the node is the root.
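The node fields and statuses can be sketched as follows. The Python names (Node, CStatus, precombine) are assumptions that mirror the slide labels. The precombine step shown is a single-threaded simplification: it raises an error on a locked node where a concurrent version would wait for the node to be unlocked.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class CStatus(Enum):
    IDLE = auto()
    FIRST = auto()
    SECOND = auto()
    RESULT = auto()
    ROOT = auto()


@dataclass
class Node:
    parent: Optional["Node"] = None
    locked: bool = False
    first_value: int = 0
    second_value: int = 0
    result: int = 0
    c_status: CStatus = CStatus.IDLE

    def __post_init__(self):
        # A node without a parent is the root of the tree.
        if self.parent is None:
            self.c_status = CStatus.ROOT

    def precombine(self):
        """Return True if the caller should keep climbing (simplified)."""
        if self.locked:
            # Assumption: a concurrent version waits here instead of failing.
            raise RuntimeError("node is locked")
        if self.c_status is CStatus.IDLE:
            self.c_status = CStatus.FIRST    # first visitor becomes Active
            return True
        if self.c_status is CStatus.FIRST:
            self.locked = True               # second visitor locks the node
            self.c_status = CStatus.SECOND
            return False
        if self.c_status is CStatus.ROOT:
            return False                     # nothing above the root
        raise RuntimeError(f"unexpected status {self.c_status}")
```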
ADVANCED EXAMPLE - INIT [Figure: a combining tree in its initial state. The root has Cstatus ROOT and result 3 and is unlocked; every other node is IDLE.]
ADVANCED EXAMPLE - PRECOMBINING [Figure: threads A, B, C and D climb from their leaves; each node they visit is marked FIRST (F) or SECOND (S), and nodes marked SECOND are locked.]
ADVANCED EXAMPLE - COMBINING [Figure: the threads combine their values on the way up (a fifth thread, E, appears in this step), and the root result goes from 3 to 7.]
ADVANCED EXAMPLE - DISTRIBUTION [Figure: results are passed back down the tree; the threads return 3, 4, 5 and 6, and finished nodes go back to IDLE and are unlocked.]
PERFORMANCE REVIEW Optimal when threads arrive at the leaves at the right time and combining is maximized. What happens when contention is low? How long do we wait for another thread to come and combine?
ROBUSTNESS An algorithm is robust if it performs well in the presence of large fluctuations in request arrival times. Is the Combining Tree a robust algorithm? No!
MOTIVATION We need an algorithm that can count “tokens” regardless of their arrival times or order.
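INDEX DISTRIBUTION Given a set of incoming tokens and w shared counters, how would we like to distribute the tokens among the exits? Ideally, the counter on exit j hands out the indexes i*w + j for i = 0, 1, 2, ...: exit 1 hands out i*w + 1, exit 2 hands out i*w + 2, and so on.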
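As a small illustration of the index formula, here is a sketch of the counter that sits on one exit wire. The class name ExitCounter is hypothetical, wires are numbered from 1 as on the slide, and the sketch is not thread-safe.

```python
class ExitCounter:
    """Counter attached to exit wire j (1-based) of a width-w network."""

    def __init__(self, wire, width):
        self.wire = wire
        self.width = width
        self.tokens_seen = 0  # the i in i*w + j

    def next_index(self):
        index = self.tokens_seen * self.width + self.wire
        self.tokens_seen += 1
        return index
```

If tokens are spread evenly over the exits, the w counters together hand out every index exactly once, with no single shared counter.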
THE STEP PROPERTY “No matter how token arrivals are distributed among the input wires, the output distribution is balanced across the output wires, where the top output wires are filled first” A network with this property balances the tokens perfectly
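Stated as a check: if counts[i] is the number of tokens that have left on output wire i (top wire first), the step property says that for any two wires i < j the difference counts[i] - counts[j] is 0 or 1. The helper name has_step_property is an assumption for illustration.

```python
def has_step_property(counts):
    """counts[i] = tokens that have left on output wire i, top wire first."""
    return all(
        0 <= counts[i] - counts[j] <= 1
        for i in range(len(counts))
        for j in range(i + 1, len(counts))
    )
```

For example, the distribution 3, 3, 2, 2 satisfies the property, while 2, 3, 2, 2 does not, because a lower wire got ahead of the top wire.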
BALANCER A component with 2 entries and 2 exits. It contains a “toggle” that points either up or down. Each token leaves through the exit the toggle points to and flips the toggle. The balancer fulfills the step property for w=2.
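A balancer can be sketched as a lock-protected toggle. The class and method names are assumptions; exit 0 stands for the top wire and exit 1 for the bottom wire, and the toggle is assumed to start pointing up.

```python
import threading


class Balancer:
    """Toggle that sends tokens alternately to exit 0 (top) and exit 1 (bottom)."""

    def __init__(self):
        self._toggle_up = True
        self._lock = threading.Lock()

    def traverse(self):
        with self._lock:
            exit_wire = 0 if self._toggle_up else 1
            self._toggle_up = not self._toggle_up  # flip for the next token
            return exit_wire
```

Because tokens alternate between the exits, the top exit has either the same number of tokens as the bottom exit or exactly one more, which is the step property for w=2.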
COUNTING NETWORK A counting network of width k: is constructed only from balancers, has k input lines and k output lines, and fulfills the step property.
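BITONIC[2K] COUNTING NETWORK A counting network of width 2K. Defined inductively, for any K that is a power of 2: K = 1: a single balancer. K > 1: feed the outputs of 2 Bitonic[K] networks into a Merger[2K] network.
MERGER[2K] Used to merge the outputs of 2 Bitonic[K] networks. Defined inductively, for any K that is a power of 2: K = 1: a single balancer. K > 1: 2 Merger[K] networks followed by a layer of K balancers. One Merger[K] takes the even wires of the first input and the odd wires of the second, the other Merger[K] takes the remaining wires, and balancer i joins output i of both Merger[K] networks. Bitonic fulfills the step property!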
BITONIC NETWORK IMPLEMENTATION A simple enough implementation in which the “tokens” are the threads. Balancer contains a toggle switch and 4 pointers (2 entries and 2 exits). Merger contains a 2-element array of half-width Mergers and an array holding its final layer of balancers. Bitonic contains a 2-element array of half-width Bitonics and a full-width Merger. All classes have a traverse(i) method.
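The class layout described above can be sketched as follows. This is one possible rendering, not necessarily the presenter's code: field names such as half, layer and merger are assumptions, the balancer here returns its exit instead of holding wire pointers, and each class is parameterized by its full width (2 for a single balancer).

```python
import threading


class Balancer:
    def __init__(self):
        self._up = True
        self._lock = threading.Lock()

    def traverse(self):
        with self._lock:
            wire = 0 if self._up else 1
            self._up = not self._up
            return wire


class Merger:
    def __init__(self, width):
        self.width = width
        self.layer = [Balancer() for _ in range(width // 2)]  # final layer
        self.half = [Merger(width // 2), Merger(width // 2)] if width > 2 else None

    def traverse(self, i):
        if self.width == 2:
            return self.layer[0].traverse()
        if i < self.width // 2:
            # First input: even wires go to half[0], odd wires to half[1].
            out = self.half[i % 2].traverse(i // 2)
        else:
            # Second input: odd wires go to half[0], even wires to half[1].
            out = self.half[1 - i % 2].traverse(i // 2)
        return 2 * out + self.layer[out].traverse()


class Bitonic:
    def __init__(self, width):
        self.width = width
        self.merger = Merger(width)
        self.half = [Bitonic(width // 2), Bitonic(width // 2)] if width > 2 else None

    def traverse(self, i):
        if self.width == 2:
            return self.merger.traverse(i)
        half_width = self.width // 2
        subnet = i // half_width                       # which half-width network
        out = self.half[subnet].traverse(i % half_width)
        return self.merger.traverse(subnet * half_width + out)
```

When tokens pass through one at a time, the t-th token should leave on wire t mod width no matter which input wire it entered on, which is a convenient sanity check for the wiring.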
PERFORMANCE REVIEW Throughput is optimal when #threads ≈ #balancers and all balancers are occupied. Performance improves as threads are added, until it plateaus and then declines. The network is wait-free or lock-free, depending on how the balancers are implemented. But is this really a counting network?
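BLOCK[2K] Defined inductively, for any K that is a power of 2: K = 1: a single balancer. K > 1: 2 Block[K] networks whose corresponding outputs are merged through K balancers. A chain of log(2K) Block[2K] networks (the Periodic network) fulfills the step property!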
DIFFRACTION BALANCER Let's consider a new type of balancer with only one input. This balancer works the same way and sends tokens to wires 0 and 1 alternately.
TREE[2K] A binary tree defined inductively, for any K that is a power of 2: K = 1: a single diffraction balancer. K > 1: 2 Tree[K] networks joined under one new root diffraction balancer. The outputs of the top tree become the even outputs and the outputs of the bottom tree become the odd outputs.
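A sketch of the tree construction, assuming a one-input balancer implemented as a lock-protected toggle. The class names and the width parameter (the number of output wires) are illustrative assumptions.

```python
import threading


class Balancer:
    """One-input balancer: sends tokens alternately to wire 0 and wire 1."""

    def __init__(self):
        self._up = True
        self._lock = threading.Lock()

    def traverse(self):
        with self._lock:
            wire = 0 if self._up else 1
            self._up = not self._up
            return wire


class Tree:
    def __init__(self, width):
        self.width = width
        self.root = Balancer()
        # children[0] is the top subtree, children[1] the bottom subtree.
        self.children = [Tree(width // 2), Tree(width // 2)] if width > 2 else None

    def traverse(self):
        half = self.root.traverse()
        if self.width == 2:
            return half
        # Top subtree feeds the even outputs, bottom subtree the odd outputs.
        return 2 * self.children[half].traverse() + half
```

Sending tokens through one at a time, the outputs are visited in the order 0, 1, 2, ..., width - 1 and then wrap around.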
STEP PROPERTY But do we fulfil the “step property”? Yes!
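PARTIAL PROOF We prove by induction that the outputs are filled from top to bottom, wrapping around mod w. For K = 1: a single diffraction balancer, which alternates between its 2 outputs. For K > 1: assume each Tree[K] has the step property. The root sends tokens alternately to the top and bottom trees, so the top tree has received either as many tokens as the bottom tree or exactly one more, and the outputs of Tree[2K] are a perfect shuffle of the outputs of the 2 Tree[K]’s.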
REVIEW SO FAR Advantage: the depth is now only O(log K)! Disadvantage: the root node is a bottleneck.
ATTEMPTED SOLUTION Observation: if an even number of tokens pass through a balancer, the outputs are evenly balanced on the top and bottom wires, but the balancer's state remains unchanged. How can we use this to our advantage?
EXCHANGER A data structure that allows two threads to exchange values. An exchange attempt gives up after a timeout.
PRISM Basically, an array of Exchangers. The array can only be accessed through the visit() method, which picks a slot at random. visit() returns a Boolean value that depends on the exchange that was made, or throws a TimeOutException.
PRISMS Each thread calls visit() and proposes its ThreadID. If an exchange was made, the thread with the higher ID goes to the top wire and the other thread goes to the bottom wire. Otherwise, the thread goes back to toggle the Balancer.
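The prism idea can be sketched as below, under several assumptions: the Exchanger is a small hand-written one built on threading.Condition (Python has no built-in exchanger), Python's TimeoutError stands in for the TimeOutException on the slide, the thread ID is passed in explicitly, and the class names are illustrative. A third thread that hits a slot in the middle of an exchange is simply treated as a miss.

```python
import random
import threading


class Exchanger:
    """Lets two threads swap values; a lone thread gives up after `timeout` seconds."""

    def __init__(self):
        self._cond = threading.Condition()
        self._offer = None      # value left by the waiting thread
        self._reply = None      # value left by its partner
        self._waiting = False
        self._replied = False

    def exchange(self, value, timeout):
        with self._cond:
            if self._waiting and not self._replied:
                # A partner is waiting: leave our value and take theirs.
                self._reply, self._replied = value, True
                self._cond.notify_all()
                return self._offer
            if self._waiting:
                raise TimeoutError  # slot busy finishing an earlier exchange
            self._offer, self._waiting = value, True
            paired = self._cond.wait_for(lambda: self._replied, timeout)
            reply = self._reply
            self._waiting = self._replied = False
            if not paired:
                raise TimeoutError
            return reply


class Prism:
    def __init__(self, size, timeout):
        self._slots = [Exchanger() for _ in range(size)]
        self._timeout = timeout

    def visit(self, thread_id):
        slot = random.choice(self._slots)          # random slot
        other_id = slot.exchange(thread_id, self._timeout)
        return thread_id > other_id                # True: we hold the higher ID


class DiffractingBalancer:
    def __init__(self, prism_size, timeout):
        self._prism = Prism(prism_size, timeout)
        self._up = True
        self._lock = threading.Lock()

    def traverse(self, thread_id):
        try:
            # Paired threads leave on opposite wires without touching the toggle.
            return 0 if self._prism.visit(thread_id) else 1
        except TimeoutError:
            with self._lock:                       # no partner: use the toggle
                wire = 0 if self._up else 1
                self._up = not self._up
                return wire
```

A lone thread always times out and falls back to the toggle, so it sees wires 0, 1, 0, 1. Two threads that meet in the prism leave on opposite wires and the toggle is not touched.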
PERFORMANCE REVIEW Depends on two major factors: Timeout: too small = missed exchanges, too big = wasted time. Prism size: too small = missed opportunities (many threads compete for few slots), too big = misses (threads rarely pick the same slot). What are the best parameters? Set them dynamically according to contention! Under optimal parameters, Diffraction Trees are believed to be better than Counting Networks and Combining Trees.