Practical Concerns for Scalable Synchronization Jonathan Walpole (PSU) Paul McKenney (IBM) Tom Hart (University of Toronto)

Practical Concerns for Scalable Synchronization Jonathan Walpole (PSU) Paul McKenney (IBM) Tom Hart (University of Toronto)

2 Jonathan Walpole CS533 Winter 2006 The problem: "i++" is dangerous if "i" is global. On each CPU it compiles to three instructions: load r1,i; inc r1; store r1,i.

3 Jonathan Walpole CS533 Winter 2006 The problem: "i++" is dangerous if "i" is global. [Diagram: both CPUs execute load r1,i and each reads the same value of i.]

4 Jonathan Walpole CS533 Winter 2006 The problem: "i++" is dangerous if "i" is global. [Diagram: both CPUs execute inc r1, so each holds i+1 in its register.]

5 Jonathan Walpole CS533 Winter 2006 The problem: "i++" is dangerous if "i" is global. [Diagram: both CPUs execute store r1,i, so the final value is i+1 rather than i+2; one increment is lost.]
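A minimal userspace sketch of this lost-update race, using POSIX threads as stand-ins for the two CPUs (the counter name, iteration count, and thread count are illustrative, not from the slides):

    #include <pthread.h>
    #include <stdio.h>

    static long ctr = 0;                /* shared, unprotected counter */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int n = 0; n < 1000000; n++)
            ctr++;                      /* load, increment, store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Expect 2000000, but lost updates typically leave it smaller. */
        printf("ctr = %ld\n", ctr);
        return 0;
    }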

6 Jonathan Walpole CS533 Winter 2006 Question What is this problem called?

7 Jonathan Walpole CS533 Winter 2006 What is this problem called? What solution could we apply? Question

8 Jonathan Walpole CS533 Winter 2006 The solution – critical sections Classic multiprocessor solution: spinlocks – CPU 1 waits for CPU 0 to release the lock Counts are accurate, but locks have overhead! spin_lock(&mylock); i++; spin_unlock(&mylock);
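The slide uses the Linux kernel's spin_lock()/spin_unlock(); as a runnable userspace stand-in, here is a sketch of the same fix with a POSIX spinlock (pthread_spin_*), applied to the counter from the race example above:

    #include <pthread.h>

    static long ctr = 0;
    static pthread_spinlock_t mylock;

    /* Call once before any threads touch the counter. */
    void counter_init(void)
    {
        pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
    }

    void counter_inc(void)
    {
        pthread_spin_lock(&mylock);     /* other CPUs spin until the release */
        ctr++;                          /* the critical section */
        pthread_spin_unlock(&mylock);
    }

The counts now come out accurate, at the cost of the lock overhead the following slides quantify.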

9 Jonathan Walpole CS533 Winter 2006 Question What are spinlocks built from?

10 Jonathan Walpole CS533 Winter 2006 Critical-section efficiency. Each locked operation breaks down into lock acquisition (T_a), the critical section itself (T_c), and lock release (T_r). Critical-section efficiency = T_c / (T_a + T_c + T_r), ignoring lock contention and cache conflicts in the critical section.
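As a worked example with illustrative (not measured) costs: if T_a = 100 cycles, T_c = 50 cycles and T_r = 50 cycles, then efficiency = 50 / (100 + 50 + 50) = 25%, so three quarters of the time spent in the locked region is pure synchronization overhead.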

11 Jonathan Walpole CS533 Winter 2006 Critical-section efficiency. [Graph: efficiency as a function of critical-section size.]

12 Jonathan Walpole CS533 Winter 2006 Performance of normal instructions

13 Jonathan Walpole CS533 Winter 2006 Question Have synchronization instructions got faster? – Relative to normal instructions? – In absolute terms?

14 Jonathan Walpole CS533 Winter 2006 Have synchronization instructions got faster? – Relative to normal instructions? – In absolute terms? What are the implications of this for the performance of operating systems? Questions

15 Jonathan Walpole CS533 Winter 2006 Have synchronization instructions got faster? – Relative to normal instructions? – In absolute terms? What are the implications of this for the performance of operating systems? Can we fix this problem by adding more CPUs? Questions

16 Jonathan Walpole CS533 Winter 2006 What’s going on? Taller memory hierarchies – Memory speeds have not kept up with CPU speeds – 1984: no caches needed, since instructions were slower than memory accesses – 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses

17 Jonathan Walpole CS533 Winter 2006 Why does this matter?

18 Jonathan Walpole CS533 Winter 2006 Why does this matter? Synchronization implies sharing data across CPUs – normal instructions tend to hit in top-level cache – synchronization operations tend to miss Synchronization requires a consistent view of data – between cache and memory – across multiple CPUs – requires CPU-CPU communication Synchronization instructions see memory latency!

19 Jonathan Walpole CS533 Winter 2006 … but that’s not all! Longer pipelines – 1984: Many clock cycles per instruction – 2005: Many instructions per clock cycle ● 20-stage pipelines Out of order execution – Keeps the pipelines full – Must not reorder the critical section before its lock! Synchronization instructions stall the pipeline!

20 Jonathan Walpole CS533 Winter 2006 Reordering means weak memory consistency Memory barriers - Additional synchronization instructions are needed to manage reordering
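A small C11 sketch of why such barriers are needed: without the release/acquire ordering (which compiles to memory-barrier instructions on weakly ordered CPUs such as POWER), a reader could observe the published pointer before the data it points to. The structure and function names here are illustrative:

    #include <stdatomic.h>
    #include <stddef.h>

    struct msg { int payload; };

    static struct msg slot;
    static _Atomic(struct msg *) published = NULL;

    /* Writer: fill in the data, then publish the pointer. */
    void publish(int value)
    {
        slot.payload = value;
        /* Release ordering acts as a write memory barrier: the payload
           store cannot be reordered after the pointer store. */
        atomic_store_explicit(&published, &slot, memory_order_release);
    }

    /* Reader: dereference only after the acquire load has seen the pointer. */
    int try_read(int *out)
    {
        struct msg *m = atomic_load_explicit(&published, memory_order_acquire);
        if (m == NULL)
            return 0;
        *out = m->payload;      /* guaranteed to see the writer's payload store */
        return 1;
    }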

21 Jonathan Walpole CS533 Winter 2006 What is the cost of all this? Instruction costs, normalized to a normal instruction (1.0), measured on a 1.45 GHz IBM POWER4 and a 3.06 GHz Intel Xeon. First row: Normal Instruction = 1.0.

22 Jonathan Walpole CS533 Winter 2006 Atomic increment. The table adds a row: Atomic Increment.

23 Jonathan Walpole CS533 Winter 2006 Memory barriers. Further rows: SMP Write Memory Barrier, Read Memory Barrier, Write Memory Barrier.

24 Jonathan Walpole CS533 Winter 2006 Lock acquisition/release with LL/SC. Further row: Local Lock Round Trip.

25 Jonathan Walpole CS533 Winter 2006 Compare & swap unknown values (NBS). Further row: CAS Cache Transfer & Invalidate.

26 Jonathan Walpole CS533 Winter 2006 Compare & swap known values (spinlocks). Further row: CAS Blind Cache Transfer.

27 Jonathan Walpole CS533 Winter 2006 The net result? 1984: Lock contention was the main issue 2005: Critical section efficiency is a key issue Even if the lock is always free when you try to acquire it, performance can still suck!

28 Jonathan Walpole CS533 Winter 2006 How has this affected OS design? Multiprocessor OS designers search for “scalable” synchronization strategies – reader-writer locking instead of global locking – data locking and partitioning – Per-CPU reader-writer locking – Non-blocking synchronization The “common case” is read-mostly access to linked lists and hash-tables – asymmetric strategies favouring readers are good

29 Jonathan Walpole CS533 Winter 2006 Review - Global locking A symmetric approach (also called “code locking”) – A critical section of code is guarded by a lock – Only one thread at a time can hold the lock Examples include – Monitors – Java “synchronized” on global object – Linux spin_lock() on global spinlock_t What is the problem with global locking?

30 Jonathan Walpole CS533 Winter 2006 Review - Global locking A symmetric approach (also called “code locking”) – A critical section of code is guarded by a lock – Only one thread at a time can hold the lock Examples include – Monitors – Java “synchronized” on global object – Linux spin_lock() on global spinlock_t Global locking doesn’t scale due to lock contention!

31 Jonathan Walpole CS533 Winter 2006 Review - Reader-writer locking Many readers can concurrently hold the lock Writers exclude readers and other writers The result? – No lock contention in read-mostly scenarios – So it should scale well, right?

32 Jonathan Walpole CS533 Winter 2006 Review - Reader-writer locking Many readers can concurrently hold the lock Writers exclude readers and other writers The result? – No lock contention in read-mostly scenarios – So it should scale well, right? – … wrong!
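For comparison with what follows, a userspace sketch of reader-writer locking using POSIX pthread_rwlock_t (the kernel analogue is rwlock_t with read_lock()/write_lock(); the list structure is illustrative):

    #include <pthread.h>
    #include <stddef.h>

    struct item { int key; int value; struct item *next; };

    static struct item *head;
    static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* Many readers may hold the lock concurrently. */
    int lookup(int key, int *value_out)
    {
        int found = 0;

        pthread_rwlock_rdlock(&list_lock);
        for (struct item *p = head; p != NULL; p = p->next) {
            if (p->key == key) {
                *value_out = p->value;
                found = 1;
                break;
            }
        }
        pthread_rwlock_unlock(&list_lock);
        return found;
    }

    /* Writers exclude readers and other writers. */
    void insert(struct item *new_item)
    {
        pthread_rwlock_wrlock(&list_lock);
        new_item->next = head;
        head = new_item;
        pthread_rwlock_unlock(&list_lock);
    }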

33 Jonathan Walpole CS533 Winter 2006 Scalability of reader/writer locking. [Diagram: CPU 0 and CPU 1 each perform read-acquires on the shared lock, and each acquisition incurs memory barriers before a short critical section.] Reader/writer locking does not scale due to critical section efficiency!

34 Jonathan Walpole CS533 Winter 2006 Review - Data locking A lock per data item instead of one per collection – Per-hash-bucket locking for hash tables – CPUs acquire locks for different hash chains in parallel – CPUs incur memory-latency and pipeline-flush overheads in parallel Data locking improves scalability by executing critical section “overhead” in parallel
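A sketch of data locking as per-hash-bucket locks; the bucket count and structure names are illustrative:

    #include <pthread.h>
    #include <stddef.h>

    #define NBUCKETS 256

    struct node { int key; struct node *next; };

    struct bucket {
        pthread_spinlock_t lock;        /* one lock per bucket, not per table */
        struct node *chain;
    };

    static struct bucket table[NBUCKETS];

    void table_init(void)
    {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
    }

    /* CPUs hashing to different buckets take different locks, so their
       memory-latency and pipeline-flush overheads proceed in parallel. */
    void table_insert(struct node *n)
    {
        struct bucket *b = &table[(unsigned)n->key % NBUCKETS];

        pthread_spin_lock(&b->lock);
        n->next = b->chain;
        b->chain = n;
        pthread_spin_unlock(&b->lock);
    }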

35 Jonathan Walpole CS533 Winter 2006 Review - Per-CPU reader-writer locking One lock per CPU (called brlock in Linux) – Readers acquire their own CPU’s lock – Writers acquire all CPU’s locks In read-only workloads CPUs never exchange locks – no memory latency is incurred Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition for read-mostly scenarios
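A sketch of the brlock idea in userspace: one reader-writer lock per CPU, readers take only their own CPU's lock, writers take all of them. The fixed CPU count and the caller-supplied cpu index are illustrative simplifications:

    #include <pthread.h>

    #define NCPUS 4                     /* illustrative; sized at runtime in practice */

    static pthread_rwlock_t percpu_lock[NCPUS];

    void brlock_init(void)
    {
        for (int i = 0; i < NCPUS; i++)
            pthread_rwlock_init(&percpu_lock[i], NULL);
    }

    /* Readers touch only their own CPU's lock, so read-side acquisition
       does not bounce a shared cache line between CPUs. */
    void br_read_lock(int cpu)   { pthread_rwlock_rdlock(&percpu_lock[cpu]); }
    void br_read_unlock(int cpu) { pthread_rwlock_unlock(&percpu_lock[cpu]); }

    /* Writers must acquire every CPU's lock: expensive, but rare in
       read-mostly workloads. Acquiring in a fixed order avoids deadlock. */
    void br_write_lock(void)
    {
        for (int i = 0; i < NCPUS; i++)
            pthread_rwlock_wrlock(&percpu_lock[i]);
    }

    void br_write_unlock(void)
    {
        for (int i = 0; i < NCPUS; i++)
            pthread_rwlock_unlock(&percpu_lock[i]);
    }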

36 Jonathan Walpole CS533 Winter 2006 Scalability comparison Expected scalability on read-mostly workloads – Global locking – poor due to lock contention – R/W locking – poor due to critical section efficiency – Data locking – better? – R/W data locking – better still? – Per-CPU R/W locking – the best we can do?

37 Jonathan Walpole CS533 Winter 2006 Actual scalability Scalability of locking strategies using read-only workloads in a hash-table benchmark Measurements taken on a 4-CPU 700 MHz P-III system Similar results are obtained on more recent CPUs

38 Jonathan Walpole CS533 Winter 2006 Scalability on 1.45 GHz POWER4 CPUs

39 Jonathan Walpole CS533 Winter 2006 Performance at different update fractions on 1.45 GHz POWER4 CPUs

40 Jonathan Walpole CS533 Winter 2006 What are the lessons so far?

41 Jonathan Walpole CS533 Winter 2006 Avoid lock contention ! Avoid synchronization instructions ! – … especially in the read-path ! What are the lessons so far?

42 Jonathan Walpole CS533 Winter 2006 How about non-blocking synchronization? Basic idea – copy & flip pointer (no locks!) – Read a pointer to a data item – Create a private copy of the item to update in place – Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer – CAS fails if current pointer not equal to initial value – Retry on failure NBS should enable fast reads … in theory!
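A C11 sketch of this copy-and-swap pattern on a shared pointer. The config structure and helper names are illustrative, and reclamation of the old version is deliberately omitted, since that is exactly the problem the next slides take up:

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    struct config { int a; int b; };

    static _Atomic(struct config *) current_cfg;   /* assumed initialized elsewhere */

    int update_b(int new_b)
    {
        for (;;) {
            struct config *old = atomic_load(&current_cfg);
            struct config *new = malloc(sizeof(*new));

            if (new == NULL)
                return -1;
            memcpy(new, old, sizeof(*new));        /* private copy */
            new->b = new_b;                        /* update the copy in place */

            /* CAS fails if another writer changed the pointer meanwhile. */
            if (atomic_compare_exchange_strong(&current_cfg, &old, new))
                return 0;                          /* note: old version is NOT freed here */

            free(new);                             /* lost the race; retry */
        }
    }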

43 Jonathan Walpole CS533 Winter 2006 Problems with NBS in practice Reusing memory causes problems – Readers holding references can be hijacked during data structure traversals when memory is reclaimed – Readers see inconsistent data structures when memory is reused How and when should memory be reclaimed?

44 Jonathan Walpole CS533 Winter 2006 Immediate reclamation?

45 Jonathan Walpole CS533 Winter 2006 Immediate reclamation? In practice, readers must either – Use LL/SC to test if pointers have changed, or – Verify that version numbers associated with data structures have not changed (2 memory barriers) Synchronization instructions slow NBS readers!

46 Jonathan Walpole CS533 Winter 2006 Reader-friendly solutions Never reclaim memory? Type-stable memory? – Needs free pool per data structure type – Readers can still be hijacked to the free pool – Exposes OS to denial of service attacks Ideally, defer reclaiming memory until it's safe! – Defer reclamation of a data item until references to it are no longer held by any thread

47 Jonathan Walpole CS533 Winter 2006 Wait for a while then delete? – … but how long should you wait? Maintain reference counts or per-CPU “hazard pointers” on data that is in use? How should we defer reclamation?

48 Jonathan Walpole CS533 Winter 2006 How should we defer reclamation? Wait for a while then delete? – … but how long should you wait? Maintain reference counts or per-CPU “hazard pointers” on data that is in use? – Requires synchronization in read path! Challenge – deferring destruction without using synchronization instructions in the read path

49 Jonathan Walpole CS533 Winter 2006 Coding convention: – Don’t allow a quiescent state to occur in a read-side critical section Reclamation strategy: – Only reclaim data after all CPUs in the system have passed through a quiescent state Example quiescent states: – Context switch in non-preemptive kernel – Yield in preemptive kernel – Return from system call … Quiescent-state-based reclamation

50 Jonathan Walpole CS533 Winter 2006 Coding conventions for readers Delineate read-side critical section – rcu_read_lock() and rcu_read_unlock() primitives – may compile to nothing on most architectures Don’t hold references outside critical sections – Re-traverse data structure to pick up reference Don’t yield the CPU during critical sections – Don’t voluntarily yield – Don’t block, don’t leave the kernel …
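A reader-side sketch that follows these conventions, using the userspace RCU library (liburcu), whose primitives mirror the kernel API named on the slide; the list structure is illustrative, and a real liburcu program also calls rcu_register_thread() in each thread:

    #include <urcu.h>           /* liburcu: userspace RCU with a kernel-like API */
    #include <stddef.h>

    struct item { int key; int value; struct item *next; };

    static struct item *head;   /* published by writers with rcu_assign_pointer() */

    int lookup(int key, int *value_out)
    {
        int found = 0;

        rcu_read_lock();                            /* delineate the read-side critical section */
        for (struct item *p = rcu_dereference(head); p != NULL;
             p = rcu_dereference(p->next)) {
            if (p->key == key) {
                *value_out = p->value;              /* copy out; don't keep the reference */
                found = 1;
                break;
            }
        }
        rcu_read_unlock();                          /* no blocking or yielding in between */
        return found;
    }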

51 Jonathan Walpole CS533 Winter 2006 Overview of the basic idea Writers create new versions – Using locking or NBS to synchronize with each other – Register call-backs to destroy old versions when safe ● call_rcu() primitive registers a call back with a reclaimer – Call-backs are deferred and memory reclaimed in batches Readers do not use synchronization – While they hold a reference to a version it will not be destroyed – Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states

52 Jonathan Walpole CS533 Winter 2006 Overview of RCU API. [Diagram: the Writer uses rcu_assign_pointer(), the Reader uses rcu_read_lock() and rcu_dereference(), and the Reclaimer uses synchronize_rcu() and call_rcu(); memory consistency of mutable pointers ties together a collection of versions of immutable objects.]
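An update-side sketch, again using liburcu as a stand-in for the kernel API: the writer builds a new version, publishes it with rcu_assign_pointer(), and defers freeing the old version with call_rcu() until a grace period has elapsed. The config structure and writer lock are illustrative:

    #include <urcu.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    struct config { int a; int b; struct rcu_head rcu; };

    static struct config *current_cfg;              /* assumed initialized elsewhere */
    static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

    static void free_old_config(struct rcu_head *head)
    {
        /* Runs after a grace period: no reader can still hold a reference. */
        free(caa_container_of(head, struct config, rcu));
    }

    void config_set_b(int new_b)
    {
        pthread_mutex_lock(&writer_lock);           /* writers synchronize with each other */

        struct config *old = current_cfg;
        struct config *new = malloc(sizeof(*new));
        if (new != NULL) {
            memcpy(new, old, sizeof(*new));         /* create the new version */
            new->b = new_b;
            rcu_assign_pointer(current_cfg, new);   /* publish it to readers */
            call_rcu(&old->rcu, free_old_config);   /* reclaim the old version later */
        }

        pthread_mutex_unlock(&writer_lock);
    }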

53 Jonathan Walpole CS533 Winter 2006 Context switch as a quiescent state. [Timeline: CPU 0 and CPU 1 run RCU read-side critical sections interleaved with context switches while an element is removed. A critical section that began before the removal may still hold a reference to the old version; one that begins afterwards cannot, but RCU can only be sure once that CPU has passed through a context switch.]

54 Jonathan Walpole CS533 Winter 2006 Grace periods. [Timeline: after an element is deleted, the grace period extends until every CPU has passed through a context switch; only then can the element safely be reclaimed.]

55 Jonathan Walpole CS533 Winter 2006 Example quiescent states – Context switch (non-preemptive kernels) – Voluntary context switch (preemptive kernels) – Kernel entry/exit – Blocking call Grace periods – A period during which every CPU has gone through a quiescent state Quiescent states and grace periods

56 Jonathan Walpole CS533 Winter 2006 Efficient implementation Choosing good quiescent states – They should occur anyway – They should be easy to count – Not too frequent or infrequent Recording and dispatching call-backs – Minimize inter-CPU communication – Maintain per-CPU queues of call-backs – Two queues – waiting for grace period start and end

57 Jonathan Walpole CS533 Winter 2006 'Next' RCU Callbacks RCU's data structures 'Current' RCU Callback Grace-Period Number Global Grace-Period Number Global CPU Bitmask call_rcu() Counter Snapshot End of Previous Grace Period (If Any) End of Current Grace Period

58 Jonathan Walpole CS533 Winter 2006 RCU implementations DYNIX/ptx RCU (data center) Linux – Multiple implementations (in 2.5 and 2.6 kernels) – Preemptible and nonpreemptible Tornado/K42 “generations” – Preemptive kernel – Helped generalize usage

59 Jonathan Walpole CS533 Winter 2006 Experimental results How do different combinations of RCU, SMR, NBS and Locking compare? Hash table mini-benchmark running on 1.45 GHz POWER4 system with 8 CPUs Various workloads – Read/update fraction – Hash table size – Memory constraints – Number of CPUs

60 Jonathan Walpole CS533 Winter 2006 Scalability with working set in cache

61 Jonathan Walpole CS533 Winter 2006 Scalability with large working set

62 Jonathan Walpole CS533 Winter 2006 Performance at different update fractions (8 CPUs)

63 Jonathan Walpole CS533 Winter 2006 Performance at different update fractions (2 CPUs)

64 Jonathan Walpole CS533 Winter 2006 Performance in read-mostly scenarios

65 Jonathan Walpole CS533 Winter 2006 Impact of memory constraints

66 Jonathan Walpole CS533 Winter 2006 Performance and complexity When should RCU be used? – Instead of simple spinlock? – Instead of per-CPU reader-writer lock? Under what environmental conditions? – Memory-latency ratio – Number of CPUs Under what workloads? – Fraction of accesses that are updates – Number of updates per grace period

67 Jonathan Walpole CS533 Winter 2006 Analytic results Compute breakeven update-fraction contours for RCU vs. locking performance, against: – Number of CPUs (n) – Updates per grace period (λ) – Memory-latency ratio (r) Look at the computed memory-latency ratio at extreme values of λ for n=4 CPUs

68 Jonathan Walpole CS533 Winter 2006 Breakevens for RCU worst case (f vs. r for small λ)

69 Jonathan Walpole CS533 Winter 2006 Breakeven for RCU best case (f vs. r for large λ)

70 Jonathan Walpole CS533 Winter 2006 Real-world performance and complexity SysV IPC – >10x on microbenchmark (8 CPUs) – 5% for database benchmark (2 CPUs) – 151 net lines added to the kernel Directory-Entry Cache – +20% in multiuser benchmark (16 CPUs) – +12% on SPECweb99 (8 CPUs) – -10% time required to build kernel (16 CPUs) – 126 net lines added to the kernel

71 Jonathan Walpole CS533 Winter 2006 Real-world performance and complexity Task List – +10% in multiuser benchmark (16 CPUs) – 6 net lines added to the kernel ● 13 added ● 7 deleted

72 Jonathan Walpole CS533 Winter 2006 Summary and Conclusions (1) RCU can provide order-of-magnitude speedups for read-mostly data structures – RCU optimal when less than 10% of accesses are updates over wide range of CPUs – RCU projected to remain useful in future CPU architectures In Linux 2.6 kernel, RCU provided excellent performance with little added complexity – Currently over 700 uses of RCU API in Linux kernel

73 Jonathan Walpole CS533 Winter 2006 Summary and Conclusions (2) RCU introduces a new model and API for synchronization – There is additional complexity – Visual inspection of kernel code has uncovered some subtle bugs in use of RCU API primitives – Tools to ensure correct use of API primitives are needed

74 Jonathan Walpole CS533 Winter 2006 A thought “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it !” – Brian Kernighan

75 Jonathan Walpole CS533 Winter 2006 Use the right tool for the job!!!

76 Jonathan Walpole CS533 Winter 2006 Spare slides

77 Jonathan Walpole CS533 Winter 2006 How does RCU address overheads? Lock Contention – Readers need not acquire locks: no contention!!! – Writers can still suffer lock contention ● But only with each other, and writers are infrequent ● Very little contention!!! Memory Latency – Readers do not perform memory writes – No need to communicate data among CPUs for cache consistency ● Memory latency greatly reduced

78 Jonathan Walpole CS533 Winter 2006 How does RCU address overheads? Pipeline-Stall Overhead – On most CPUs, readers do not stall pipeline due to update ordering or atomic operations Instruction Overhead – No atomic instructions required for readers – Readers only need to execute fast instructions