1
Sitting on a Fence: Complexity Implications of Memory Reordering
Based on a presentation by Hagit Attiya
2
Standard Shared-memory Model
Asynchronous processes with unique IDs apply primitive operations to shared memory. We consider the standard asynchronous shared-memory model, in which processes are fault-free but there is no bound on their relative speeds. Processes communicate by applying atomic read, write, or read-modify-write operations to shared memory variables.
3
Per-Thread Ordering is Crucial
Generic mutex algorithm (e.g., Dekker):
Process P: Write(X,1); Read(Y)
Process Q: Write(Y,1); Read(X)
4
Out-of-Order Execution
Compensate for slow writes. Most common: allow a read to be issued before an earlier write completes, if the two access different locations.
5
Read-After-Write (RAW) Reordering Leads to Inconsistency
Process P: Write(X,1); Read(Y)
Process Q: Write(Y,1); Read(X)
With RAW reordering, each read may be executed before the preceding write, so both reads can return the initial value.
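The per-thread ordering argument can be checked exhaustively. The sketch below (a plain Python enumeration of all interleavings of the four operations, my own illustration rather than anything from the slides) confirms that under program order the two reads can never both return 0, while hoisting each read above its preceding write makes the bad outcome reachable.

```python
from itertools import combinations

# Program order: each process writes its own flag, then reads the other's.
P = [('W', 'X'), ('R', 'Y')]
Q = [('W', 'Y'), ('R', 'X')]

def interleavings(p_ops, q_ops):
    """Yield every merge of the two op sequences that preserves
    each process's own operation order."""
    n, m = len(p_ops), len(q_ops)
    for p_slots in combinations(range(n + m), n):
        seq, pi, qi = [], 0, 0
        for i in range(n + m):
            if i in p_slots:
                seq.append(('P', p_ops[pi])); pi += 1
            else:
                seq.append(('Q', q_ops[qi])); qi += 1
        yield seq

def reads(seq):
    """Execute one interleaving over zero-initialized memory and
    return the values P and Q read."""
    mem = {'X': 0, 'Y': 0}
    out = {}
    for proc, (kind, var) in seq:
        if kind == 'W':
            mem[var] = 1
        else:
            out[proc] = mem[var]
    return out['P'], out['Q']

# With program order preserved, at least one read always sees 1,
# so at most one process can enter the critical section.
assert all(reads(s) != (0, 0) for s in interleavings(P, Q))

# If each read is hoisted above the preceding write (RAW reordering),
# both reads can return 0: mutual exclusion breaks.
P_r = [('R', 'Y'), ('W', 'X')]
Q_r = [('R', 'X'), ('W', 'Y')]
assert any(reads(s) == (0, 0) for s in interleavings(P_r, Q_r))
```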
6
Avoiding RAW Reordering: Fences
Process P: Write(X,1); FENCE; Read(Y)
Process Q: Write(Y,1); FENCE; Read(X)
7
Avoiding RAW Reordering: Atomic Operations
Atomic-write-after-read (AWAR), e.g., CAS, TAS, Fetch&Add, …
atomic {
  read(Y)
  …
  write(X,1)
}
RAW fences / AWAR are slow, but they cannot be avoided.
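To make the AWAR pattern concrete, here is a minimal Python sketch (names SimCell and TASLock are illustrative, not from the paper): a compare-and-swap whose read and write happen in one atomic step, simulated here with a lock, used to build a test-and-set spinlock.

```python
import threading

class SimCell:
    """A cell with an atomic compare-and-swap: the read and the write
    form one indivisible AWAR step (simulated with an internal lock)."""
    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def cas(self, expected, new):
        with self._guard:            # read + write happen atomically
            if self._value == expected:
                self._value = new
                return True
            return False

class TASLock:
    """Test-and-set spinlock built on the AWAR primitive."""
    def __init__(self):
        self._flag = SimCell(0)

    def acquire(self):
        while not self._flag.cas(0, 1):   # spin until we flip 0 -> 1
            pass

    def release(self):
        self._flag.cas(1, 0)

counter = 0
lock = TASLock()

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1        # non-atomic increment, protected by the lock
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 4000      # no increment was lost
```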
8
One fence is necessary; the result also holds for concurrent data types
Non-commutative operations (queues, counters, …), with linearizable, solo-terminating implementations. Any mutex algorithm must have a read-after-write, unless it has an atomic-write-after-read.
Attiya, Guerraoui, Hendler, Kuznetsov, Michael, Vechev: Laws of order: expensive synchronization in concurrent algorithms cannot be eliminated. POPL 2011
9
Proof: Mutex Entry Must Write
Otherwise, the entry section does not influence other entry sections: there is an execution in which both entries succeed, even though neither performs a shared write.
11
Proof: Must Also Read
Otherwise, the entry section is not influenced by other entry sections: there is an execution in which both entries succeed, even though neither performs a shared read.
12
Close-Up on Entry
Consider the first shared write in the entry section, say to variable X. Since there is no read-after-write, this write is not followed by a read from any Y ≠ X.
14
Covering…
Since there is no atomic write-after-read, the two entry sections' writes to X can be scheduled so that neither entry observes the other (neither reads any Y ≠ X), and both enter the critical section: a contradiction.
15
Is This Tight? Lamport's Bakery algorithm needs only O(1) fences (one for each write), but it performs many reads, and they are remote.
16
Not All Memory Accesses are Equal
Cache-coherent (CC) model: accesses served from the cache are "cheap"; Remote Memory References (RMRs) are "expensive". The distributed shared memory (DSM) model is similar. A RAW fence is still ~60× slower than an RMR.
17
Bakery algorithm
1: choosing[i] := true
2: number[i] := 1 + max {number[j] | 1 ≤ j ≤ n}
3: choosing[i] := false
4: for j := 1 to n do
5:   await choosing[j] = false
6:   await (number[j] = 0) ∨ ((number[j], j) > (number[i], i))
7: od
8: critical section
9: number[i] := 0
Fences may be required only after write instructions, and there are only 4 write operations.
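The Bakery code above translates almost line for line into Python. In this sketch CPython's global interpreter lock makes memory effectively sequentially consistent, so no fences are needed; on real hardware a fence would have to follow each of the four writes, which is exactly the O(1)-fence observation on this slide.

```python
import threading

N = 3
ITERS = 100
choosing = [False] * N
number = [0] * N
counter = 0                         # shared counter protected by the lock

def bakery_acquire(i):
    choosing[i] = True
    number[i] = 1 + max(number)     # take a ticket (the "doorway")
    choosing[i] = False
    for j in range(N):
        while choosing[j]:          # wait while j is choosing a ticket
            pass
        # wait until j is done, or holds a later (or larger-id) ticket
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def bakery_release(i):
    number[i] = 0

def work(i):
    global counter
    for _ in range(ITERS):
        bakery_acquire(i)
        counter += 1                # critical section
        bakery_release(i)

threads = [threading.Thread(target=work, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == N * ITERS         # mutual exclusion held
```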
18
Tournament-tree: entry section
Write, Fence, Read, at every level.
Similarly to a tournament-tree algorithm, our algorithm also uses a tree structure, but in a different manner. Recall how tournament-tree mutex works: a process is statically assigned to a leaf node and climbs up the path from its leaf to the root. On each level, it attempts to capture a 2-process lock by writing, performing a fence to ensure that the write is visible, and then reading to see if another process is around. If required, the process busy-waits on the node until it captures the lock. This is quite similar to Dekker's algorithm applied on each level.
21
Fences vs. RMRs
Ω(log n) RMRs lower bound for (deterministic) mutex [Attiya, Hendler and Woelfel, STOC 2008]. Can we get the best of both worlds?

Algorithm                      Fences      RMRs
Tournament [Yang, Anderson]    Θ(log n)    Θ(log n)
Bakery [Lamport]               O(1)*       Θ(n)
*without write reordering

Shared-memory mutual exclusion research over the last 20 years or so has focused on the RMR complexity metric. In this work, we take into consideration both the RMR and the fence complexity of shared-memory mutual exclusion. Recall two well-known mutual exclusion algorithms. Yang and Anderson's algorithm was the first read/write algorithm in which each passage through the critical section incurs only a logarithmic number of RMRs, where n is the number of processes. Essentially a tournament-tree algorithm, it provides this complexity not only under the cache-coherent model but also under the distributed shared memory model. Attiya et al. proved that this algorithm is optimal in terms of RMR complexity. What about fence complexity? To ensure correctness on TSO systems, the algorithm also requires Θ(log n) fences, since a fence is required at every tree level.
22
Total Store Ordering (TSO) Model
Just read-after-write reordering, e.g., Intel x86. A fence flushes the write buffer, in order. (Diagram: processor, write buffer, cache, main memory; reads may bypass buffered writes.)
Most shared-memory algorithms assume that shared memory is linearizable, or at least sequentially consistent. Sequential consistency, defined by Lamport, requires that in any execution the observed behavior is such that process steps appear to occur in some sequential order, in which the steps of each individual process appear in program order. However, modern multiprocessors are NOT sequentially consistent: they support more relaxed memory models that may execute process steps out of program order in order to optimize performance. Here is an example of such an optimization. A write buffer holds data being written back from the cache to main memory. Once data has been written to the write buffer, the processor may proceed to execute reads that follow the write in program order, even before the write actually completes. (The write requires establishing ownership and invalidating other caches.) This effectively reorders a write with a following read, hiding the latency of the write and improving performance. But such read/write reordering may, in some cases, violate program correctness, as in the Dekker example above.
Attiya, Hendler, Levy: An O(1)-fences optimal-RMRs mutual exclusion algorithm. PODC 2013
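The write-buffer mechanism can be modeled in a few lines. This toy simulation (my own sketch, not the paper's formal model) shows the Dekker handshake failing without fences and succeeding with them.

```python
class TSOMachine:
    """Toy TSO model: each process has a FIFO write buffer; a read
    checks the process's own buffer first (store forwarding), then
    main memory. A fence drains the process's buffer."""
    def __init__(self):
        self.mem = {}
        self.buf = {}                       # pid -> list of (var, value)

    def write(self, pid, var, val):
        self.buf.setdefault(pid, []).append((var, val))

    def read(self, pid, var):
        for v, val in reversed(self.buf.get(pid, [])):
            if v == var:
                return val                  # forwarded from own buffer
        return self.mem.get(var, 0)         # otherwise, main memory

    def fence(self, pid):
        for var, val in self.buf.pop(pid, []):
            self.mem[var] = val             # drain buffer, in order

# Dekker-style handshake without fences: both writes sit in buffers.
m = TSOMachine()
m.write('P', 'X', 1)
m.write('Q', 'Y', 1)
r_p = m.read('P', 'Y')      # 0: Q's write is still in Q's buffer
r_q = m.read('Q', 'X')      # 0: P's write is still in P's buffer
assert (r_p, r_q) == (0, 0)  # both would enter the critical section!

# With a fence after each write, the reads see the other's write.
m2 = TSOMachine()
m2.write('P', 'X', 1); m2.fence('P')
m2.write('Q', 'Y', 1); m2.fence('Q')
assert m2.read('P', 'Y') == 1 and m2.read('Q', 'X') == 1
```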
23
Our Algorithm: Entry Section
Write, Write, Write, …, then a single Fence, then CAS.
The entry-section synchronization pattern of our algorithm is quite different. Starting from its leaf, a process only WRITES to all the nodes along the path from its leaf to the root. Only after completing all these writes does the process perform a SINGLE fence; it then attempts to capture a global lock by performing a CAS operation. If the CAS fails, the process busy-waits until it may enter the critical section. The entry section is actually slightly more complicated, as we'll soon see.
26
Our Algorithm: Promoting Your Peers
When exiting the critical section, processes look around to see who's waiting and promote them: waiting processes are placed in a queue, which ensures that they are promoted and hence not starved.
What about the exit section? Similarly to some previous algorithms, processes that exit the critical section try to help waiting processes, making sure they will be able to enter the critical section even if they failed to capture the lock by themselves. We say that such processes are BEING PROMOTED. A process that exits the CS may promote processes whose identifiers it reads along the path from the root to its leaf. The promotion mechanism is what ensures starvation-freedom, as it guarantees that processes never wait indefinitely.
27
New Algorithm: Data Structures
lock ∈ P ∪ {⊥}; promotion queue p_i1, p_i2, …, p_ik; apply[1…n]; signal[1…n]; exits; inPromQ[1…n]
Before taking a closer look at the exit section, let me describe the data structures used by our algorithm. First, there is a global lock, which stores either the identifier of the process currently holding the lock or a special value (⊥) indicating that the lock is free. We also have a promotion queue, a queue of processes that have been promoted; the algorithm guarantees that processes in this queue enter the critical section one after the other, in queue order. Since operations on the queue are applied only by a process in its exit section, it is always accessed sequentially. A process is promoted only if it APPLIES for promotion, which it indicates by setting its flag in the apply array at the beginning of its entry section. A process that was promoted enters the critical section once it is SIGNALLED, by a write to that process's entry in the signal array. Finally, the exits variable counts the number of times processes have exited the CS since the execution started; the reason this variable is required will become apparent soon.
28
New Algorithm: Exit section
Here is how the exit section works. A process p in its exit section descends the path from the root to its leaf. For each node along the path, p reads the identifiers written at the node and its two children. Process p promotes any of these processes that has applied for promotion and is not already in the promotion queue; clearly, p does not promote itself.
29
New Algorithm: Exit section
In this example, assuming that both t and q applied for promotion, only t is promoted, since q is already in the promotion queue.
30
Exit section: Scenario 1
After performing the necessary promotions, process p checks the promotion queue. There are two scenarios. If the queue is not empty, p dequeues the first process in the queue, say process s.
31
Exit section: Scenario 1
Now, process p "hands" the lock to process s by simply writing s's identifier to the lock variable.
32
Exit section: Scenario 1
await (signal[s] = true)
Finally, p signals process s that it may now enter the critical section, by writing true to the appropriate entry of the signal array. When process s next reads this entry, it enters the critical section.
33
Exit section: Scenario 2
The other scenario is that the promotion queue is empty. In this case, p simply releases the lock by writing the special free value (⊥).
35–41
New Algorithm: Entry section pseudo-code
Entry section for process p:
1: signal[p] ← false                           // initialize signal entry
2: apply[p] ← true                             // "please promote me!"
3: for each node n on the path from p's leaf to the root: n ← p
4: Fence                                       // ensure visibility
5: if CAS(lock, ⊥, p) ≠ ⊥ then                 // attempt to capture the lock
6:   e ← exits
7:   await (exits − e ≥ 2) ∨ (lock ∈ {p, ⊥})   // wait before re-trying
8:   if CAS(lock, ⊥, p) ≠ ⊥ then               // retry capturing the lock
9:     await signal[p]                         // if failed again, await signal

First, a process initializes its signal entry and then indicates that it is eligible for promotion. Next, p writes its identifier to all nodes along the path from its leaf to the root and performs a single barrier. Now p attempts to capture the lock; if it succeeds, it enters the critical section.

Otherwise, p busy-waits until either the lock is handed to it (the lock would equal p in this case), or the lock becomes free, or p is assured that some process completed a full exit section after p's writes became visible. This is why the exits counter is required.

Once the condition of line 7 is satisfied, p performs another CAS operation. If it fails this time as well, p waits in line 9 until it is signaled; as proved in the paper, p is guaranteed to eventually be signaled.

But why is the check in line 7 required? Why couldn't a process that fails the CAS of line 5 simply await a signal? If lines 6–8 were removed, the following race condition would be possible: p fails the CAS because some other process q holds the lock, but q has already started its exit section, descended down its path from the root toward its leaf, and is already BELOW the node where its path intersects p's path. In this case, q does not observe p, and p may starve. Lines 6–8 prevent this race: if p fails to capture the lock on line 5, then either p succeeds in capturing the lock on line 8, or some other process has observed it and p will eventually be promoted.
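The fast path and the line-7 waiting condition can be traced sequentially. This is a hypothetical Python rendering of the control flow only (no real concurrency: the awaits are replaced by one-shot checks, and the path writes and fence are elided).

```python
FREE = None          # stands in for the free value (⊥)
lock = FREE
exits = 0            # global exit counter from the slides

def cas_lock(expected, new):
    """Simulated atomic CAS on the global lock; returns the old value."""
    global lock
    old = lock
    if old == expected:
        lock = new
    return old

def try_capture(p):
    # (path writes and the fence would precede this on real hardware)
    return cas_lock(FREE, p) == FREE      # line 5: CAS(lock, ⊥, p)

assert try_capture(0)        # lock was free: process 0 enters the CS
assert not try_capture(1)    # process 1 fails and takes the slow path
e = exits                    # line 6: snapshot the exits counter
# Line 7: process 1 would now spin on (exits - e >= 2) or lock in {1, ⊥}.
# Neither disjunct holds yet, so it must keep waiting:
assert not (exits - e >= 2 or lock in (1, FREE))
assert lock == 0             # process 0 still holds the lock
```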
42–55
New Algorithm: Exit section pseudo-code
Exit section for process p (Lp denotes p's leaf):
1: apply[p] ← false                            // not applying for help
2: exits ← exits + 1                           // increment exits counter
3: for each node n on the path from the root to Lp's parent:  // promote processes along path
4:   q1 ← n, q2 ← n.left, q3 ← n.right         // the node and both children
5:   for each q ∈ {q1, q2, q3}:                // do for each
6:     if apply[q] ∧ ¬inPromQ[q] then          // q applies for promotion and is not in the queue
7:       promQ.enqueue(q); inPromQ[q] ← true   // promote q and let it know it was promoted
8: if promQ.isEmpty() then
9:   lock ← ⊥                                  // release lock if the promotion queue is empty
10: else
11:   next ← promQ.dequeue()                   // otherwise, extract ID of next process to promote
12:   inPromQ[next] ← false                    // it is no longer in the promotion queue
13:   lock ← next                              // hand lock to promoted process
14:   signal[next] ← true                      // signal promoted process to enter CS
15: Fence                                      // ensure changes are visible to others
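As a sanity check, the promotion hand-off can be walked through sequentially. This is a hypothetical sketch: the tree traversal is abstracted into the list of IDs that p observes on its path, and the variable names follow the slides.

```python
from collections import deque

N = 4
FREE = None                      # stands in for ⊥
lock = FREE
apply_ = [False] * N             # apply[] from the slides
signal = [False] * N
inPromQ = [False] * N
promQ = deque()
exits = 0

def exit_section(p, observed):
    """Exit section for p; `observed` lists the process IDs that p
    reads at the nodes (and their children) along its path."""
    global lock, exits
    apply_[p] = False            # line 1: no longer applying for help
    exits += 1                   # line 2: bump the exits counter
    for q in observed:           # lines 3-7: promote waiting peers
        if q is not None and apply_[q] and not inPromQ[q]:
            promQ.append(q)
            inPromQ[q] = True
    if not promQ:
        lock = FREE              # line 9: nobody waiting, release
    else:
        nxt = promQ.popleft()    # line 11: next process to promote
        inPromQ[nxt] = False     # line 12
        lock = nxt               # line 13: hand the lock over
        signal[nxt] = True       # line 14: signal it to enter the CS
    # line 15: a fence would follow here on real hardware

# Scenario: p = 0 holds the lock; q = 1 applied, wrote its path,
# and failed its CAS, so it waits to be promoted.
lock = 0
apply_[1] = True
exit_section(0, observed=[1, None, None])
assert lock == 1 and signal[1] and not inPromQ[1]   # lock handed to q
```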
56
Argument for starvation-freedom
Why is it guaranteed that process p will be promoted?
It is easy to see that the algorithm is deadlock-free, but why can't a process starve? A process busy-waits either in line 7 or in line 9 of the entry section. A process cannot wait indefinitely in line 7, because eventually either the lock is handed to it or the exit section is performed at least twice, at which point the condition of line 7 is satisfied. The key point in the starvation-freedom argument is to understand why, if p waits in line 9, it is guaranteed to eventually be signaled. This is not immediately clear, since p may be overwritten, possibly by different processes, on all nodes on the path to the root except for its leaf.
57
Argument for starvation-freedom
Assume a set of processes starves in some execution. For a starved process p, let lp be the highest level at which p is visible, i.e., the level closest to the root at which p's ID is still written on p's path. In the example, lp = 1.
58
Argument for starvation-freedom
lp is non-decreasing, and stabilizes for all starved processes after some execution prefix E.
While p waits, p's ID may be overwritten as the execution unfolds, so lp can only move away from the root; it cannot move back toward it, and p always remains visible at its leaf. If there is an execution in which a set of processes starves, it is not difficult to see that there must be a prefix of this execution after which the values lp are stable for all starved processes p.
59
Argument for starvation-freedom
Prove by induction on lp (with respect to E) that processes do not starve.
We now consider this prefix and show by induction on lp that p does not starve. The base case, the root, is simple, so let us focus on the inductive step.
60
Argument for starvation-freedom
Assume the claim holds for all levels up to k, and let lp = k+1. TSO guarantees that p's writes become visible in bottom-up order.
61
Argument for starvation-freedom
Assume the claim holds for all levels up to k, and let lp = k+1. The fence ensures that p's writes become visible before its CAS. (In some models, the CAS itself includes a fence.)
62
Argument for starvation-freedom
Let q be the last process to overwrite p on level k.
From TSO, when q overwrites p's ID on level k, p is already visible on level k+1, and it remains visible until the end of the execution.
63
Argument for starvation-freedom
By the induction hypothesis, q cannot starve; hence q eventually enters the critical section, and the algorithm together with TSO guarantees that when q exits its critical section, it reads p's ID and promotes it.