Practical Concerns for Scalable Synchronization


1 Practical Concerns for Scalable Synchronization
Jonathan Walpole (PSU) Paul McKenney (IBM) Tom Hart (University of Toronto)

2 The problem
“i++” is dangerous if “i” is global: on each CPU it compiles to load r1,i / inc r1 / store r1,i

3 The problem
Step 1: CPU 0 and CPU 1 both load the same value of i into their registers.

4 The problem
Step 2: both CPUs increment their private registers, each computing i+1.

5 The problem
Step 3: both CPUs store i+1 back to i, so one increment is lost: i ends up at i+1 instead of i+2.
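The same race as a minimal user-space C sketch (assuming POSIX threads; the loop count is illustrative). The two threads play the roles of CPU 0 and CPU 1:

    #include <pthread.h>
    #include <stdio.h>

    static int i = 0;                     /* the shared, global "i" */

    static void *incrementer(void *arg)
    {
        for (int n = 0; n < 1000000; n++)
            i++;                          /* load, inc, store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;                 /* two "CPUs" racing on i */

        pthread_create(&t0, NULL, incrementer, NULL);
        pthread_create(&t1, NULL, incrementer, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Usually prints less than 2000000: increments were lost. */
        printf("i = %d\n", i);
        return 0;
    }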

6 The solution – critical sections
Classic multiprocessor solution: spinlocks. CPU 1 waits for CPU 0 to release the lock. Counts are accurate, but locks have overhead! spin_lock(&mylock); i++; spin_unlock(&mylock);
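A minimal sketch of the fix, with pthread_spin_* standing in for the kernel's spin_lock()/spin_unlock() (function names are illustrative):

    #include <pthread.h>

    static int i = 0;
    static pthread_spinlock_t mylock;

    void counter_init(void)
    {
        pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
    }

    void counter_inc(void)
    {
        pthread_spin_lock(&mylock);    /* CPU 1 spins while CPU 0 holds it */
        i++;                           /* the critical section */
        pthread_spin_unlock(&mylock);
    }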

7 Critical-section efficiency
Lock acquisition (Ta), critical section (Tc), lock release (Tr). Critical-section efficiency = Tc / (Ta + Tc + Tr), ignoring lock contention and cache conflicts in the critical section.
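For example, using the POWER4 measurements from slide 19, where a local lock round trip (Ta + Tr) costs about 1057.5 normal instructions: a critical section of 100 normal instructions achieves an efficiency of roughly 100 / (100 + 1057.5) ≈ 8.6%, so over 90% of the time goes to synchronization overhead.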

8 Critical section efficiency
[Figure: critical-section efficiency vs. critical-section size] What happened to cause the drop, then the improvement, in critical-section efficiency?

9 Performance of normal instructions

10 Questions Have synchronization instructions got faster?
Relative to normal instructions? In absolute terms? What are the implications of this for the performance of operating systems? Can we fix this problem by adding more CPUs?

11 What’s going on? Taller memory hierarchies
Memory speeds have not kept up with CPU speeds 1984: no caches needed, since instructions were slower than memory accesses 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses

12 Why does this matter?

13 Why does this matter? Synchronization implies sharing data across CPUs
Normal instructions tend to hit in the top-level cache; synchronization operations tend to miss. Synchronization requires a consistent view of data, between cache and memory, across multiple CPUs, and that requires CPU-CPU communication. Synchronization instructions see memory latency!

14 … but that’s not all! Longer pipelines Out of order execution
1984: many clock cycles per instruction. 2005: many instructions per clock cycle, with 20-stage pipelines. Out-of-order execution keeps the pipelines full, but must not reorder the critical section before its lock! Synchronization instructions stall the pipeline!

15 Reordering means weak memory consistency
Memory barriers: additional synchronization instructions are needed to manage reordering.
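A sketch of the classic publish pattern these barriers enable, using C11 fences as stand-ins for the kernel's write and read memory barriers (structure and values are illustrative):

    #include <stdatomic.h>

    struct foo { int a; };
    static _Atomic(struct foo *) gp;    /* globally visible pointer */

    /* Writer: initialize the structure, then publish it. The release
     * fence plays the role of the slide's write memory barrier. */
    void publish(struct foo *p)
    {
        p->a = 42;
        atomic_thread_fence(memory_order_release);   /* "write barrier" */
        atomic_store_explicit(&gp, p, memory_order_relaxed);
    }

    /* Reader: fetch the pointer, then dereference. The acquire fence
     * plays the role of the slide's read memory barrier. */
    int read_foo(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_relaxed);

        if (p == NULL)
            return -1;
        atomic_thread_fence(memory_order_acquire);   /* "read barrier" */
        return p->a;
    }

Without the fences, either the writer's initialization or the reader's dereference may be reordered, and the reader can observe an uninitialized structure.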

16 What is the cost of all this?
Instruction cost (relative to a normal instruction):
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0

17 Atomic increment
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0
Atomic Increment                                 183.1                 402.3

18 Memory barriers
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0
Atomic Increment                                 183.1                 402.3
SMP Write Memory Barrier                         328.6                   0.0
Read Memory Barrier                              328.9                 402.3
Write Memory Barrier                             400.9                   0.0

19 Lock acquisition/release with LL/SC
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0
Atomic Increment                                 183.1                 402.3
SMP Write Memory Barrier                         328.6                   0.0
Read Memory Barrier                              328.9                 402.3
Write Memory Barrier                             400.9                   0.0
Local Lock Round Trip                           1057.5                1138.8

20 Compare & swap unknown values (NBS)
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0
Atomic Increment                                 183.1                 402.3
SMP Write Memory Barrier                         328.6                   0.0
Read Memory Barrier                              328.9                 402.3
Write Memory Barrier                             400.9                   0.0
Local Lock Round Trip                           1057.5                1138.8
CAS Cache Transfer & Invalidate                  247.1                 847.1

21 Compare & swap known values (spinlocks)
Instruction                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
Normal Instruction                                 1.0                   1.0
Atomic Increment                                 183.1                 402.3
SMP Write Memory Barrier                         328.6                   0.0
Read Memory Barrier                              328.9                 402.3
Write Memory Barrier                             400.9                   0.0
Local Lock Round Trip                           1057.5                1138.8
CAS Cache Transfer & Invalidate                  247.1                 847.1
CAS Blind Cache Transfer                         257.1                 993.9

22 The net result? 1984: Lock contention was the main issue
2005: Critical-section efficiency is a key issue. Even if the lock is always free when you try to acquire it, performance can still suck!

23 How has this affected OS design?
Multiprocessor OS designers search for “scalable” synchronization strategies: reader-writer locking instead of global locking; data locking and partitioning; per-CPU reader-writer locking; non-blocking synchronization. The “common case” is read-mostly access to linked lists and hash tables, so asymmetric strategies favouring readers are good.

24 Review - Global locking
A symmetric approach (also called “code locking”): a critical section of code is guarded by a lock, and only one thread at a time can hold the lock. Examples include monitors, Java “synchronized” on a global object, and Linux spin_lock() on a global spinlock_t. What is the problem with global locking?

25 Review - Global locking
A symmetric approach (also called “code locking”): a critical section of code is guarded by a lock, and only one thread at a time can hold the lock. Examples include monitors, Java “synchronized” on a global object, and Linux spin_lock() on a global spinlock_t. Global locking doesn’t scale due to lock contention!

26 Review - Reader-writer locking
Many readers can concurrently hold the lock; writers exclude readers and other writers. The result? No lock contention in read-mostly scenarios. So it should scale well, right?

27 Review - Reader-writer locking
Many readers can concurrently hold the lock; writers exclude readers and other writers. The result? No lock contention in read-mostly scenarios. So it should scale well, right? … wrong!
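For concreteness, the pattern under discussion as a user-space sketch (POSIX rwlock standing in for kernel reader-writer locks):

    #include <pthread.h>

    static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
    static int shared_value;

    int reader(void)
    {
        int v;

        pthread_rwlock_rdlock(&rwlock);   /* shared with other readers */
        v = shared_value;
        pthread_rwlock_unlock(&rwlock);
        return v;
    }

    void writer(int v)
    {
        pthread_rwlock_wrlock(&rwlock);   /* excludes readers and writers */
        shared_value = v;
        pthread_rwlock_unlock(&rwlock);
    }

Even with zero contention, every rdlock/unlock pair atomically updates the shared lock word, so each read-acquisition incurs the CPU-to-CPU cacheline transfer the next slide illustrates.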

28 Scalability of reader/writer locking
[Diagram: CPU 0 and CPU 1 alternately read-acquire the shared lock; each acquisition involves memory barriers and transfer of the lock's cacheline between CPUs, dwarfing the critical sections themselves] Reader/writer locking does not scale due to critical-section efficiency!

29 Review - Data locking
A lock per data item instead of one per collection, e.g. per-hash-bucket locking for hash tables (sketched below). CPUs acquire locks for different hash chains in parallel, and so incur memory-latency and pipeline-flush overheads in parallel. Data locking improves scalability by executing critical-section “overhead” in parallel.
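A sketch of per-hash-bucket locking (structure and function names are hypothetical):

    #include <pthread.h>

    #define NBUCKETS 1024

    struct node {
        unsigned long key;
        struct node *next;
    };

    struct bucket {
        pthread_spinlock_t lock;    /* guards only this hash chain */
        struct node *head;
    };

    static struct bucket table[NBUCKETS];

    int lookup(unsigned long key)
    {
        struct bucket *b = &table[key % NBUCKETS];
        struct node *n;
        int found = 0;

        pthread_spin_lock(&b->lock);     /* contends only within one chain */
        for (n = b->head; n != NULL; n = n->next) {
            if (n->key == key) {
                found = 1;
                break;
            }
        }
        pthread_spin_unlock(&b->lock);
        return found;
    }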

30 Review - Per-CPU reader-writer locking
One lock per CPU (called brlock in Linux): readers acquire their own CPU’s lock; writers acquire all CPUs’ locks. In read-only workloads CPUs never exchange locks, so no memory latency is incurred. Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition in read-mostly scenarios.
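A sketch of the idea (a per-CPU array of reader-writer locks; the CPU count and function names are illustrative):

    #include <pthread.h>

    #define NCPUS 8

    static pthread_rwlock_t percpu_lock[NCPUS];   /* one rwlock per CPU */

    void br_init(void)
    {
        for (int cpu = 0; cpu < NCPUS; cpu++)
            pthread_rwlock_init(&percpu_lock[cpu], NULL);
    }

    void br_read_lock(int cpu)
    {
        /* Local to this CPU: the lock's cacheline never moves. */
        pthread_rwlock_rdlock(&percpu_lock[cpu]);
    }

    void br_read_unlock(int cpu)
    {
        pthread_rwlock_unlock(&percpu_lock[cpu]);
    }

    void br_write_lock(void)
    {
        /* Writers pay: acquire every CPU's lock, in order. */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            pthread_rwlock_wrlock(&percpu_lock[cpu]);
    }

    void br_write_unlock(void)
    {
        for (int cpu = NCPUS - 1; cpu >= 0; cpu--)
            pthread_rwlock_unlock(&percpu_lock[cpu]);
    }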

31 Scalability comparison
Expected scalability on read-mostly workloads: global locking – poor due to lock contention; R/W locking – poor due to critical-section efficiency; data locking – better? R/W data locking – better still? Per-CPU R/W locking – the best we can do?

32 Actual scalability
Scalability of locking strategies using read-only workloads in a hash-table benchmark. Measurements taken on a 4-CPU 700 MHz P-III system. Similar results are obtained on more recent CPUs.

33 Scalability on 1.45 GHz POWER4 CPUs

34 Performance at different update fractions on 8 1.45 GHz POWER4 CPUs

35 What are the lessons so far?

36 What are the lessons so far?
Avoid lock contention! Avoid synchronization instructions! … especially in the read path!

37 How about non-blocking synchronization?
Basic idea – copy & flip pointer (no locks!): read a pointer to a data item; create a private copy of the item to update in place; swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer. The CAS fails if the current pointer is no longer equal to its initial value; retry on failure. NBS should enable fast reads … in theory!
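A user-space sketch of copy-and-flip using C11 atomics (type and function names are illustrative):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct item {
        int data;
    };

    static _Atomic(struct item *) current;    /* points to the live version */

    int read_item(void)
    {
        struct item *p = atomic_load(&current);   /* no locks for readers */
        return p->data;
    }

    void update_item(int new_data)
    {
        struct item *old;
        struct item *new = malloc(sizeof(*new));

        do {
            old = atomic_load(&current);    /* read current version */
            *new = *old;                    /* private copy, updated in place */
            new->data = new_data;
            /* CAS fails, and we retry, if another writer changed the
             * pointer between our load and the swap. */
        } while (!atomic_compare_exchange_strong(&current, &old, new));

        /* "old" cannot safely be freed here: a reader may still hold it.
         * That is exactly the reclamation problem the next slides cover. */
    }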

38 Problems with NBS in practice
Reusing memory causes problems: readers holding references can be hijacked during data-structure traversals when memory is reclaimed, and readers see inconsistent data structures when memory is reused. How and when should memory be reclaimed?

39 Immediate reclamation?

40 Immediate reclamation?
In practice, readers must either use LL/SC to test whether pointers have changed, or verify that version numbers associated with data structures have not changed (2 memory barriers). Synchronization instructions slow NBS readers!

41 Reader-friendly solutions
Never reclaim memory? Type-stable memory? Needs a free pool per data-structure type; readers can still be hijacked to the free pool; exposes the OS to denial-of-service attacks. Ideally, defer reclaiming memory until it’s safe! Defer reclamation of a data item until references to it are no longer held by any thread.

42 How should we defer reclamation?
Wait for a while then delete? … but how long should you wait? Maintain reference counts or per-CPU “hazard pointers” on data that is in use?

43 How should we defer reclamation?
Wait for a while then delete? … but how long should you wait? Maintain reference counts or per-CPU “hazard pointers” on data that is in use? That requires synchronization in the read path! Challenge – deferring destruction without using synchronization instructions in the read path.

44 Quiescent-state-based reclamation
Coding convention: don’t allow a quiescent state to occur in a read-side critical section. Reclamation strategy: only reclaim data after all CPUs in the system have passed through a quiescent state. Example quiescent states: context switch in a non-preemptive kernel; yield in a preemptive kernel; return from a system call; …
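A toy user-space sketch of this reclamation strategy (hypothetical; real implementations batch call-backs per CPU, as slide 51 describes): each CPU bumps a counter when it passes through a quiescent state, and the reclaimer waits until every counter has advanced.

    #include <stdatomic.h>

    #define NCPUS 4

    /* Incremented by CPU "cpu" each time it passes through a quiescent
     * state (e.g., a context switch). Readers touch nothing shared. */
    static atomic_ulong qs_count[NCPUS];

    void note_quiescent_state(int cpu)
    {
        atomic_fetch_add(&qs_count[cpu], 1);
    }

    /* After this returns, every CPU has passed through a quiescent state,
     * so no reader can still hold a reference taken before the call:
     * data unlinked before the call may now be reclaimed. */
    void wait_for_grace_period(void)
    {
        unsigned long snap[NCPUS];
        int cpu;

        for (cpu = 0; cpu < NCPUS; cpu++)
            snap[cpu] = atomic_load(&qs_count[cpu]);
        for (cpu = 0; cpu < NCPUS; cpu++)
            while (atomic_load(&qs_count[cpu]) == snap[cpu])
                ;   /* spin until CPU "cpu" reports a quiescent state */
    }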

45 Coding conventions for readers
Delineate the read-side critical section: the rcu_read_lock() and rcu_read_unlock() primitives may compile to nothing on most architectures. Don’t hold references outside critical sections: re-traverse the data structure to pick up a reference. Don’t yield the CPU during critical sections: don’t voluntarily yield; don’t block; don’t leave the kernel; …
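A kernel-style sketch of a reader following these conventions (Linux RCU list primitives; the structure and its fields are illustrative):

    #include <linux/rculist.h>    /* list_for_each_entry_rcu() */

    struct foo {
        int key;
        int data;
        struct list_head list;
    };

    int find_data(struct list_head *head, int key)
    {
        struct foo *p;
        int ret = -1;

        rcu_read_lock();                        /* may compile to nothing */
        list_for_each_entry_rcu(p, head, list) {
            if (p->key == key) {
                ret = p->data;                  /* copy out: no reference escapes */
                break;
            }
        }
        rcu_read_unlock();
        return ret;                             /* never touch p after this */
    }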

46 Overview of the basic idea
Writers create (and publish) new versions, using locking or NBS to synchronize with each other, and register call-backs to destroy old versions when safe: the call_rcu() primitive registers a call-back with a reclaimer, and call-backs are deferred and memory reclaimed in batches. Readers do not use synchronization: while they hold a reference to a version it will not be destroyed. Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states.

47 Overview of RCU API
[Diagram: writers use rcu_assign_pointer() and readers use rcu_dereference() to maintain memory consistency of the mutable pointers into a collection of versions of immutable objects; readers bracket traversals with rcu_read_lock(); writers pass old versions to the reclaimer with call_rcu() or wait themselves with synchronize_rcu()]
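Putting the API together: a kernel-style sketch of a writer that publishes a new version and defers destruction of the old one, plus the corresponding reader (everything except the RCU and locking primitives is illustrative):

    #include <linux/slab.h>        /* kmalloc(), kfree() */
    #include <linux/spinlock.h>
    #include <linux/rcupdate.h>

    struct foo {
        int data;
        struct rcu_head rcu;
    };

    static struct foo *gp;                     /* the published version */
    static DEFINE_SPINLOCK(gp_lock);           /* writers synchronize via lock */

    static void free_old(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }

    void update_foo(int data)
    {
        struct foo *new, *old;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
            return;
        new->data = data;

        spin_lock(&gp_lock);
        old = gp;
        rcu_assign_pointer(gp, new);           /* publish the new version */
        spin_unlock(&gp_lock);

        if (old)
            call_rcu(&old->rcu, free_old);     /* destroy after a grace period */
    }

    int read_foo(void)
    {
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);               /* pairs with rcu_assign_pointer() */
        if (p)
            ret = p->data;
        rcu_read_unlock();
        return ret;
    }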

48 Context switch as a quiescent state
[Timeline: CPU 0 removes an element while CPU 1 is inside an RCU read-side critical section. Before its context switch CPU 1 may hold a reference to the old version; after the context switch it can't, but until the switch occurs RCU can't tell the difference]

49 Grace periods
[Timeline: the grace period begins when an element is deleted and ends once every CPU has passed through a context switch, so it covers every RCU read-side critical section that might still hold a reference]

50 Quiescent states and grace periods
Example quiescent states: context switch (non-preemptive kernels); voluntary context switch (preemptive kernels); kernel entry/exit; blocking call. Grace period: a period during which every CPU has gone through a quiescent state.

51 Efficient implementation
Choosing good quiescent states: they should occur anyway; they should be easy to count; they should be neither too frequent nor too infrequent. Recording and dispatching call-backs: minimize inter-CPU communication; maintain per-CPU queues of call-backs; two queues – waiting for grace-period start and end.

52 RCU's data structures
[Diagram: a global CPU bitmask and a global grace-period number; per-CPU state comprising a counter, a counter snapshot, and a grace-period number; per-CPU ‘next’ RCU callbacks (queued by call_rcu() since the end of the previous grace period, if any) and ‘current’ RCU callbacks (waiting for the end of the current grace period)]

53 RCU implementations
DYNIX/ptx: RCU (data center). Linux: multiple implementations (in the 2.5 and 2.6 kernels), preemptible and non-preemptible. Tornado/K42: “generations”, in a preemptive kernel; helped generalize usage.

54 Experimental results
How do different combinations of RCU, SMR, NBS, and locking compare? Hash-table mini-benchmark running on a 1.45 GHz POWER4 system with 8 CPUs. Various workloads: read/update fraction, hash-table size, memory constraints, number of CPUs.

55 Scalability with working set in cache

56 Scalability with large working set

57 Performance at different update fractions (8 CPUs)

58 Performance at different update fractions (2 CPUs)

59 Performance in read-mostly scenarios

60 Impact of memory constraints

61 Performance and complexity
When should RCU be used? Instead of a simple spinlock? Instead of a per-CPU reader-writer lock? Under what environmental conditions? Memory-latency ratio; number of CPUs. Under what workloads? Fraction of accesses that are updates; number of updates per grace period.

62 Analytic results Compute breakeven update-fraction contours for RCU vs. locking performance, against: Number of CPUs (n) Updates per grace period () Memory-latency ratio (r) Look at computed memory-latency ratio at extreme values of  for n=4 CPUs

63 Breakevens for RCU worst case (f vs. r for small λ)

64 Breakeven for RCU best case (f vs. r for large λ)

65 Real-world performance and complexity
SysV IPC: >10x on a microbenchmark (8 CPUs); 5% for a database benchmark (2 CPUs); 151 net lines added to the kernel. Directory-entry cache: +20% in a multiuser benchmark (16 CPUs); +12% on SPECweb99 (8 CPUs); -10% time required to build the kernel (16 CPUs); 126 net lines added to the kernel.

66 Real-world performance and complexity
Task list: +10% in a multiuser benchmark (16 CPUs); 6 net lines added to the kernel (13 added, 7 deleted).

67 Summary and Conclusions (1)
RCU can provide order-of-magnitude speedups for read-mostly data structures. RCU is optimal when less than 10% of accesses are updates, over a wide range of CPU counts, and is projected to remain useful on future CPU architectures. In the Linux 2.6 kernel, RCU provided excellent performance with little added complexity; there are currently over 1000 uses of the RCU API in the Linux kernel.

68 Summary and Conclusions (2)
RCU introduces a new model and API for synchronization, and with it additional complexity. Visual inspection of kernel code has uncovered some subtle bugs in the use of RCU API primitives; tools to ensure correct use of the API primitives are needed.

69 A thought “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it!” – Brian Kernighan

70 Use the right tool for the job!!!

