Operating Systems Engineering Scalable Locks


Operating Systems Engineering Scalable Locks By Dan Tsafrir, 11/5/2011

Lecture topic
- A 20-year-old paper: Thomas Anderson, "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors", IEEE TPDS, Jan 1990
- Is it still useful? (yes…)
- Context
  - Nowadays it is hard to buy a single-core laptop
  - We want to use multiple CPUs to improve performance
  - Synchronization between CPUs is often expensive (performance-wise) and tricky
  - Scalability limitations of kernel-provided services are passed on to all applications
- Let's consider the x86 memory system to understand performance

Atomicity & Coherency
- Implementing atomicity: a simplistic view of a multiprocessor
  - CPUs and memory are connected via the bus
  - Atomicity of x86's xchg is achieved by locking the bus
- In reality
  - The bus & memory are slow compared to the CPU
  - Each CPU has a few levels of cache to compensate
  - An L1 access takes a few cycles, whereas RAM requires 100s
- How do we ensure caches aren't stale?
  - CPU1 writes x=1; CPU2 writes x=2; CPU1 reads x=?
  - Locks don't fix such problems (they occur even when accesses are not concurrent)
- We need
  - A "cache coherence protocol"
  - A "consistency model"

Hardware support for coherency / consistency Reminder

The cache coherency problem for a single memory location

Time | Event                 | CPU-1 cache | CPU-2 cache | Memory at X
-----|-----------------------|-------------|-------------|------------
  0  |                       |             |             |     1
  1  | CPU-1 reads X         |      1      |             |     1
  2  | CPU-2 reads X         |      1      |      1      |     1
  3  | CPU-1 stores 0 into X |      0      |      1      |     0

After time 3, CPU-2 holds a stale value, different from the corresponding memory location and from CPU-1's cache. (The next read by CPU-2 will yield "1".)

A memory system is coherent if…
- Informally, we could say (or would like to say): a memory system is coherent if any read of a data item returns the most recently written value of that data item
- This definition is intuitive, but overly simplistic
- More formally…

A memory system is coherent if…
1. Program order is preserved (needed even on a uniprocessor):
   - processor P writes to location X, and later
   - P reads from X, and
   - no other processor writes to X between the above write & read
   => The read must return the value previously written by P
2. There is a coherent view of memory:
   - P1 writes to X
   - some time T elapses
   - P2 reads from X
   => For a big enough T, P2 will read the value written by P1
   (If X were never updated regardless of the duration of T, the memory would not be coherent)
3. Writes to the same location are serialized:
   - two writes to the same location by any 2 processors are seen in the same order by all processors
   (If "1" and then "2" are written, no processor reads "2" and then "1"; serialization ensures every processor eventually sees P2's write — otherwise P1's value might be maintained indefinitely)

Memory Consistency
- The coherency definition alone is not enough to be able to write correct programs
- It must be supplemented by a consistency model, which is critical for program correctness
- Coherency & consistency are 2 different, complementary aspects of memory systems
  - Coherency = what values can be returned by a read; relates to the behavior of reads & writes to the same memory location
  - Consistency = when a written value will be returned by a subsequent read; relates to the behavior of reads & writes to different memory locations

Memory Consistency (cont.)
- "How consistent is the memory system?" is a nontrivial question
- Assume: locations A & B are originally cached by P1 & P2, with initial value = 0

    Processor P1:        Processor P2:
      A = 0;               B = 0;
      ...                  ...
      A = 1;               B = 1;
      if (B == 0) ...      if (A == 0) ...

- If writes are immediately seen by other processors
  - It is impossible for both "if" conditions to be true
  - Reaching an "if" means either A or B must already hold 1
- But suppose: (1) the "write invalidate" can be delayed, and (2) the processor is allowed to compute during this delay
  => It is possible that P1 & P2 haven't seen the invalidations of B & A until after the reads; thus, both "if" conditions are true
- Should this be allowed? => Depends on the consistency model we choose… (a runnable rendering of the example follows)
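The slides give no code for this example; below is a runnable rendering of it, assuming C11 atomics and POSIX threads. With C11's default sequentially consistent atomics the "both ifs true" outcome is forbidden; weakening the loads and stores to memory_order_relaxed would model the delayed-invalidation scenario just described.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int A, B;                 /* both start at 0 */
int p1_took_if, p2_took_if;

void *p1(void *arg) {
    atomic_store(&A, 1);         /* A = 1           */
    if (atomic_load(&B) == 0)    /* if ( B == 0 ) … */
        p1_took_if = 1;
    return NULL;
}

void *p2(void *arg) {
    atomic_store(&B, 1);         /* B = 1           */
    if (atomic_load(&A) == 0)    /* if ( A == 0 ) … */
        p2_took_if = 1;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    if (p1_took_if && p2_took_if)  /* impossible with seq_cst atomics */
        printf("both ifs were true\n");
    return 0;
}
```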

Consistency models
- From most strict to most relaxed:
  - Strict consistency
  - Sequential consistency
  - Weak consistency
  - Release consistency
  - […many more…]
- Stricter models are easier to understand, but they are harder to implement, slower, less scalable, involve more communication, and waste more energy

Strict consistency ("linearizability")
- Definition
  - All memory operations are ordered in time
  - Any read of location X returns the value of the most recent write to X
- Comments
  - This is the intuitive notion of memory consistency
  - But it is too restrictive, and thus unused in practice

Sequential consistency
- A relaxation of strict consistency (defined by Leslie Lamport, who also created LaTeX…)
- Requires that the result of any execution be the same as if the memory accesses were executed in some sequential order
  - Can be a different order upon each run
- [Timing diagrams: P1 performs W(x)1 and P4 performs W(x)2, while P2 reads x=2 and then x=1; the execution on the left is sequentially consistent, since it can be ordered as on the right: W(x)2, R(x)2, W(x)1, R(x)1]
- Q: What if we flip the order of P2's reads (on the left)?

Weak consistency
- Accesses to "synchronization variables" are sequentiallyconsistent
- No access to a synchronization variable is allowed to be performed until all previous writes have completed
- No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed
- In other words, the processor needn't broadcast values at all until a synchronization access happens; but then it broadcasts all values to all cores
- [Timing diagram: P1 performs W(x)1, W(x)2, then a synchronization access S; meanwhile P2 reads x=0 then x=2, and P3 reads x=1 — before the synchronization, different processors may observe different, stale values]
- See also:
  http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
  http://www.cc.gatech.edu/classes/AY2008/cs4210_spring/lectures/22-ConsistencyModels.pdf

Release consistency
- Before accessing a shared variable, an acquire op must be completed
- Before a release is allowed, all accesses must be completed
- Acquire/release calls are sequentially consistent
- The acquire/release pair serves as a "lock"
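As a minimal sketch of the idea (an analogy, not the formal model): C11's memory_order_acquire / memory_order_release give exactly this bracketing discipline, where only the synchronization ops need strong ordering. The names sync_var and shared_data are illustrative.

```c
#include <stdatomic.h>

atomic_int sync_var;   /* hypothetical synchronization ("lock") variable */
int shared_data;       /* ordinary data it protects                      */

void update(void) {
    /* Acquire: must complete before the shared access below. */
    while (atomic_exchange_explicit(&sync_var, 1, memory_order_acquire))
        ;
    shared_data = 42;  /* access to the shared variable */
    /* Release: allowed only once all prior accesses have completed. */
    atomic_store_explicit(&sync_var, 0, memory_order_release);
}
```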

Protocols to track sharing
- Directory-based
  - The status of each memory block is kept in just 1 location (= the directory)
  - Directory-based coherence has bigger overhead, but can scale to bigger core counts
- Snooping
  - Every cache holding a copy of the data also has a copy of its state; there is no centralized state
  - All caches are accessible via broadcast (bus or switch)
  - All cache controllers monitor (or "snoop") the broadcasts to determine if they have a copy of what's requested

MESI protocol
- Each cache line can be in one of 4 states (MESI = Modified, Exclusive, Shared, Invalid)
  - Invalid – the line's data is not valid (as in a simple cache)
  - Shared – the line is valid & not dirty; copies may exist in other caches
  - Exclusive – the line is valid & not dirty; other processors do not have the line in their local caches
  - Modified – the line is valid & dirty; other processors do not have the line in their local caches
- Snooping + MESI achieves sequential consistency

Multi-processor system: MESI example
- Setup: each processor has a private L1 cache; an L2 cache and memory are shared; the line at address [1000] initially holds the value 5
- P1 reads [1000]: a miss; the line is fetched and installed in P1's L1 in the E (Exclusive) state
- P1 writes [1000] (value 6): the line transitions E => M (Modified); memory still holds 5
- P2 reads [1000]: a miss; L2 snoops the request, P1 writes the dirty line back, and both caches end up holding [1000]: 6 in the S (Shared) state
- P2 writes [1000]: P2 requests ownership with write intent; P1's copy is invalidated (S => I) and P2's line becomes M

Returning to scalable locks Resuming

Locking, atomicity, & coherency
- Assume coherent memory, with a protocol similar to MESI (modern memory hierarchies are more complex)
- We implement locks out of atomic ops, where 'addr' is a shared memory location (sketched below):
  - Test-And-Set( addr )
  - Read-And-Increment( addr )
- How are atomic ops implemented? Reserve the bus + read memory + write memory + release the bus
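The slides name the two primitives but give no code; below is a hedged sketch of both using C11 atomics. On x86 a compiler typically emits lock-prefixed xchg and xadd instructions for these, which the hardware makes atomic.

```c
#include <stdatomic.h>

/* Test-And-Set(addr): atomically write 1 and return the previous value. */
int test_and_set(atomic_int *addr) {
    return atomic_exchange(addr, 1);
}

/* Read-And-Increment(addr): atomically add 1 and return the previous value. */
int read_and_increment(atomic_int *addr) {
    return atomic_fetch_add(addr, 1);
}
```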

Quantifying performance
- Where does sharing occur?
  - With many cores => no shared caches => the bus & memory are used for synchronization
  - The bus & memory are slow & have limited capacity => they dominate the performance
- When evaluating locks, what is their performance?
  - Assume N CPUs want the lock at the same time
  - Performance = how long until each CPU gets the lock once
- We can measure time in bus operations because, as noted, the bus is slow and constitutes a shared bottleneck

Test-and-set (t_s_acquire)
- Generates constant bus traffic while spinning (see the sketch below)
  - We don't care if waiting CPUs waste their own time
  - But we do care if waiting CPUs slow down the lock holder!
- Intuitively, if the bus is fair, a single lock acquire takes O(N) time, since the holder only gets 1/Nth of the bus capacity
  - Thus, the time for all N CPUs to get the lock is O(N^2)
  - Actually, it might be even worse (as we'll see next)
- Why is this bad? Poor scalability of locks; the kernel will become a bottleneck
- Consequently, our goal is to have less bus traffic from the waiters' TAS instructions
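A minimal sketch of t_s_acquire and its release, assuming the test_and_set helper from the earlier sketch (the slides themselves give no code):

```c
#include <stdatomic.h>

int test_and_set(atomic_int *addr);  /* from the sketch above */

void t_s_acquire(atomic_int *lock) {
    while (test_and_set(lock))  /* every failed probe is still a bus transaction */
        ;                       /* loop until the old value was 0 (= free)       */
}

void t_s_release(atomic_int *lock) {
    atomic_store(lock, 0);      /* mark the lock free */
}
```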

Test-&-test-&-set (t_t_s_acquire)
- Idea
  - Spin locally in the cache using an ordinary read, rather than Test-And-Set
  - This avoids hogging the bus and slowing the holder
- Implementation (sketched below)
  - While the holder holds the lock, waiters spin with the cache line in the "S" state
- What happens on release?
  - The cache line is invalidated for the waiters
  - They therefore re-read it, run TAS, and someone wins
- Should be much better than TAS… but is it?
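A sketch of t_t_s_acquire under the same C11-atomics assumptions:

```c
#include <stdatomic.h>

void t_t_s_acquire(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock))       /* spin on the cached copy (S state):  */
            ;                           /* no bus traffic while holder holds   */
        if (!atomic_exchange(lock, 1))  /* looked free: run the real TAS       */
            return;                     /* old value was 0 => we won           */
        /* old value was 1 => someone beat us; resume local spinning */
    }
}
```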

Test-&-test-&-set (t_t_s_acquire)
- What happens on release, in detail
  - Invalidate => everyone sees "unlocked" => everyone is going to run TAS
  - The first core to run TAS wins
  - The 2nd core runs TAS and loses
  - The 3rd core runs TAS and invalidates (a "set" is a write) => the 2nd core re-loads
  - The 4th core runs TAS and invalidates => the 2nd and 3rd cores re-load
  - The Nth core runs TAS and invalidates => N-2 cores re-load
- There are up to O(N^2) re-loads for a single release!
  => Up to O(N^3) for N CPUs to acquire

Test-&-test-&-set (t_t_s_acquire)
- Best-case scenario: the winner runs TAS before any core sees the unlock
  - The other N-1 cores just re-load (due to the invalidate), but see locked=1
  - So N^2 re-loads in total
- Can we do better? Possible optimizations:
  - A core can read without bus traffic if the line is already in S
  - A core can write without bus traffic if the line is already in M ("write-back")

Exponential backoff (t_s_exp_acquire)
- Goal: avoid everyone jumping in at once
  - Space out attempts to acquire the lock
  - Simultaneous attempts were the reason for the N^2 and N^3 behavior
  - Optimally, upon each trial, only one core will run TAS, and it will succeed
- Why not a constant delay, whereby each CPU re-tries after a random delay with a constant average?
  - It's hard to choose the delay time
  - Too large => wasted time; too small => all N CPUs probe all at once => N^2 – N^3

Exponential backoff (t_s_exp_acquire)
- Why exponential backoff, i.e., why start with a small delay and double it? (See the sketch below)
  - Try to get lucky at first (maybe only a few CPUs are attempting)
  - If unlucky, doubling means it takes only a few attempts until delay >= N * |critical section| => just one attempt per release
  - Total probes: O(N*logN)
- Problem: unlikely to be fair
  - Some CPUs will have much lower delays than others; they will win, come back, and win again
  - Some CPUs will have huge delays (they sit idle, doing no harm, but also no work)
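A sketch of t_s_exp_acquire under these assumptions; the delay cap and the busy-wait loop are illustrative choices, not taken from the paper:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_DELAY_CAP (64 * 1024)  /* illustrative cap on the backoff bound */

void t_s_exp_acquire(atomic_int *lock) {
    unsigned max_delay = 1;
    while (atomic_exchange(lock, 1)) {           /* TAS failed: back off       */
        unsigned delay = 1 + rand() % max_delay; /* random delay in [1, max]   */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                    /* busy-wait the delay out    */
        if (max_delay < MAX_DELAY_CAP)
            max_delay *= 2;                      /* double the bound per retry */
    }
}
```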

Ticket locks (ticket_acquire): Linux uses these
- Goal
  - Fairer than exponential backoff
  - Cheaper than test-&-test-&-set
- Idea
  - Assign numbers, and wake up one waiter at a time
  - Avoid a repeated TAS after every release
- Why is it fair? Perfect (FIFO) ordering
- Time analysis: what happens in acquire?
  - N Read-And-Increments (though spread out in the steady state)
  - Each is just one bus action (no re-loads after an invalidate)

Ticket locks (ticket_acquire): What happens after release? (Sketched below)
- Invalidate => N re-loads
- No atomic instructions, so no repeated re-loads as in test-&-test-&-set
- So N^2 (but NOT N^3)
- Problem: N^2 still grows too fast
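A sketch of the ticket lock; the field names next_ticket and now_serving are illustrative, not from the Linux source:

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket;   /* Read-And-Incremented by each arriving CPU */
    atomic_int now_serving;   /* advanced by release; waiters watch this   */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    int me = atomic_fetch_add(&l->next_ticket, 1);  /* one bus op; N total */
    while (atomic_load(&l->now_serving) != me)
        ;                        /* ordinary reads: spin in the S state    */
}

void ticket_release(ticket_lock_t *l) {
    /* Invalidates the waiters' cached copies: N re-loads, but no TAS storm. */
    atomic_fetch_add(&l->now_serving, 1);
}
```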

Anderson (the paper)
- Can we do better than N^2? Can we get rid of the N re-loads after a release?
- What if each core spins on a different cache line? (Sketched below)
- Acquire cost?
  - N Read-And-Increments, then a local spin on a per-CPU cache line
- Release cost?
  - Invalidate someone else's slot element; only they have to re-load
  - No other CPUs are involved => O(N) in total
- Downside: space cost?
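A sketch of Anderson's array-based queueing lock in the spirit of the paper; N_CPUS, the padding-to-a-cache-line trick, and the flag re-arming details are assumptions, not the paper's verbatim code. Note that it also answers the space question: one cache line per CPU per lock.

```c
#include <stdatomic.h>

#define N_CPUS     64   /* illustrative upper bound on concurrent waiters */
#define CACHE_LINE 64   /* assumed cache-line size in bytes               */

typedef struct {
    struct {
        atomic_int flag;                            /* 1 = "you have the lock" */
        char pad[CACHE_LINE - sizeof(atomic_int)];  /* one slot per cache line */
    } slot[N_CPUS];
    atomic_int next_slot;                           /* claimed by Read-And-Increment */
} anderson_lock_t;

void anderson_init(anderson_lock_t *l) {
    for (int i = 0; i < N_CPUS; i++)
        atomic_store(&l->slot[i].flag, i == 0);     /* only slot 0 starts runnable */
    atomic_store(&l->next_slot, 0);
}

int anderson_acquire(anderson_lock_t *l) {
    int me = atomic_fetch_add(&l->next_slot, 1) % N_CPUS;  /* claim my slot */
    while (!atomic_load(&l->slot[me].flag))
        ;                                           /* spin on my own line only */
    atomic_store(&l->slot[me].flag, 0);             /* re-arm my slot for reuse */
    return me;                                      /* caller passes to release */
}

void anderson_release(anderson_lock_t *l, int me) {
    atomic_store(&l->slot[(me + 1) % N_CPUS].flag, 1);  /* invalidate successor only */
}
```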

From the paper: make sure you understand Figures 1-3.