1 Practical Concerns for Scalable Synchronization. Jonathan Walpole (PSU), Paul McKenney (IBM), Tom Hart (University of Toronto)
2 “Life is just one darned thing after another” – Elbert Hubbard
3 “Multiprocessing is just one darned thing before, after or simultaneously with another”
4 “Synchronization is about imposing order”
5-8 The problem – race conditions. “i++” is dangerous if “i” is global: it compiles to three instructions (load %1,i; inc %1; store %1,i). [Animation over slides 5-8: two CPUs execute the sequence concurrently. Both load the same value of i into a register, both increment their registers to i+1, and both store i+1 back, so one of the two increments is lost.]
9 The solution – critical sections. The classic multiprocessor solution is spinlocks
– CPU 1 waits for CPU 0 to release the lock
Counts are accurate, but locks are not free!
spin_lock(&mylock); i++; spin_unlock(&mylock);
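To make the race and its fix concrete, here is a minimal user-space sketch (not from the deck) assuming POSIX threads; the counter and lock names are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    static long i = 0;                  /* shared global counter */
    static pthread_spinlock_t mylock;

    static void *worker(void *arg)
    {
        for (int n = 0; n < 1000000; n++) {
            pthread_spin_lock(&mylock);   /* serialize load/inc/store */
            i++;
            pthread_spin_unlock(&mylock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;

        pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t0, NULL, worker, NULL);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* prints 2000000; without the lock, lost updates make it smaller */
        printf("%ld\n", i);
        return 0;
    }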
10 Critical-section efficiency. Lock acquisition (T_a), critical section (T_c), lock release (T_r).
Critical-section efficiency = T_c / (T_a + T_c + T_r)
(ignoring lock contention and cache conflicts in the critical section)
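A worked example using the instruction costs from slides 17-22: on the POWER4, a local lock round trip costs roughly 1057.5 normal-instruction times, so a critical section of 100 normal instructions runs at about 100 / (100 + 1057.5) ≈ 8.6% efficiency. Short critical sections are dominated by lock overhead.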
11 Critical-section efficiency. [Graph: efficiency as a function of critical-section size.]
12 Performance of normal instructions [graph]
13 What’s going on? Taller memory hierarchies
– Memory speeds have not kept up with CPU speeds
– 1984: no caches needed, since instructions were slower than memory accesses
– 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses
14 Why does this matter? Synchronization implies sharing data across CPUs
– normal instructions tend to hit in the top-level cache
– synchronization operations tend to miss
Synchronization requires a consistent view of data
– between cache and memory
– across multiple CPUs
– requires CPU-CPU communication
Synchronization instructions see memory latency!
15 … but that’s not all! Longer pipelines
– 1984: many clocks per instruction
– 2005: many instructions per clock, 20-stage pipelines
Out-of-order execution
– keeps the pipelines full
– must not reorder the critical section before its lock!
Synchronization instructions stall the pipeline!
16 Reordering means weak memory consistency. Memory barriers
– additional synchronization instructions are needed to manage reordering
17-22 What is the cost of all this? Instruction costs on a 1.45 GHz IBM POWER4 and a 3.06 GHz Intel Xeon, normalized to the cost of a normal instruction (slides 18-22 build the table up row by row: atomic increment, memory barriers, lock acquisition/release with LL/SC, compare & swap on unknown values as in NBS, and compare & swap on known values as in spinlocks):

    Instruction                         POWER4     Xeon
    Normal Instruction                     1.0      1.0
    Atomic Increment                     183.1    402.3
    SMP Write Memory Barrier             328.6      0.0
    Read Memory Barrier                  328.9    402.3
    Write Memory Barrier                 400.9      0.0
    Local Lock Round Trip               1057.5   1138.8
    CAS Cache Transfer & Invalidate      247.1    847.1
    CAS Blind Cache Transfer             257.1    993.9
23 The net result? 1984: lock contention was the main issue. 2005: critical-section efficiency is a key issue. Even if the lock is always free when you try to acquire it, performance can still suck!
24 How has this affected OS design? Multiprocessor OS designers search for “scalable” synchronization strategies
– reader-writer locking instead of global locking
– data locking and partitioning
– per-CPU reader-writer locking
– non-blocking synchronization
The “common case” is read-mostly access to linked lists and hash tables
– asymmetric strategies favouring readers are good
25 Review – global locking. A symmetric approach (also called “code locking”)
– a critical section of code is guarded by a lock
– only one thread at a time can hold the lock
Examples include
– monitors
– Java “synchronized” on a global object
– Linux spin_lock() on a global spinlock_t
Global locking doesn’t scale due to lock contention!
26 Review – reader-writer locking. Many readers can concurrently hold the lock; writers exclude readers and other writers. The result?
– no lock contention in read-mostly scenarios
– so it should scale well, right?
– … wrong!
27 Scalability of reader-writer locking. [Diagram: CPU 0 and CPU 1 serialize on the shared lock word; every read-acquire executes a memory barrier and bounces the lock’s cache line, so acquisitions proceed one at a time even when there are no writers.] Reader-writer locking does not scale due to critical-section efficiency!
28 Review – data locking. A lock per data item instead of one per collection
– per-hash-bucket locking for hash tables
– CPUs acquire locks for different hash chains in parallel
– CPUs incur memory-latency and pipeline-flush overheads in parallel
Data locking improves scalability by executing critical-section overhead in parallel
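As an illustration of per-bucket locking, here is a sketch assuming POSIX threads (names and sizes are illustrative, not the paper’s benchmark code):

    #include <pthread.h>
    #include <stddef.h>

    #define NBUCKETS 256

    struct node {
        unsigned long key;
        struct node *next;
    };

    struct bucket {
        pthread_mutex_t lock;   /* one lock per hash chain, not per table */
        struct node *head;
    };

    /* each bucket's lock is initialized with pthread_mutex_init() at
       startup (omitted); CPUs hashing to different chains acquire
       different locks, so their locking overheads proceed in parallel */
    static struct bucket table[NBUCKETS];

    static struct node *lookup(unsigned long key)
    {
        struct bucket *b = &table[key % NBUCKETS];
        struct node *n;

        pthread_mutex_lock(&b->lock);
        for (n = b->head; n != NULL; n = n->next)
            if (n->key == key)
                break;
        pthread_mutex_unlock(&b->lock);
        return n;   /* caller must not dereference n if nodes can be freed */
    }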
29 Review – per-CPU reader-writer locking. One lock per CPU (called brlock in Linux)
– readers acquire their own CPU’s lock
– writers acquire all CPUs’ locks
In read-only workloads CPUs never exchange locks
– no memory latency is incurred
Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition in read-mostly scenarios
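A simplified sketch of the brlock idea (illustrative only; the real Linux brlock API and its per-CPU layout differ):

    #include <pthread.h>

    #define NCPUS 8

    /* one reader-writer lock per CPU, each on its own cache line so
       readers on different CPUs never bounce a line between caches */
    struct percpu_rwlock {
        pthread_rwlock_t lock;
    } __attribute__((aligned(64)));

    static struct percpu_rwlock brlock[NCPUS];  /* init at startup (omitted) */

    /* readers touch only their own CPU's lock: no cross-CPU memory latency */
    static void read_lock_cpu(int cpu)
    {
        pthread_rwlock_rdlock(&brlock[cpu].lock);
    }

    static void read_unlock_cpu(int cpu)
    {
        pthread_rwlock_unlock(&brlock[cpu].lock);
    }

    /* writers sweep every CPU's lock in a fixed order: expensive, but rare */
    static void write_lock_all(void)
    {
        for (int i = 0; i < NCPUS; i++)
            pthread_rwlock_wrlock(&brlock[i].lock);
    }

    static void write_unlock_all(void)
    {
        for (int i = NCPUS - 1; i >= 0; i--)
            pthread_rwlock_unlock(&brlock[i].lock);
    }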
30 Scalability comparison. Expected scalability on read-mostly workloads
– global locking – poor due to lock contention
– R/W locking – poor due to critical-section efficiency
– data locking – better?
– R/W data locking – better still?
– per-CPU R/W locking – the best we can do?
31 Actual scalability. Scalability of locking strategies using read-only workloads in a hash-table benchmark. Measurements taken on a 4-CPU 700 MHz P-III system. Similar results are obtained on more recent CPUs. [Graph.]
32 Scalability on 1.45 GHz POWER4 CPUs [graph]
33 Performance at different update fractions on eight 1.45 GHz POWER4 CPUs [graph]
34 What are the lessons so far? Avoid lock contention! Avoid synchronization instructions!
– … especially in the read path!
35 How about non-blocking synchronization? Basic idea – copy & flip pointer (no locks!)
– read a pointer to a data item
– create a private copy of the item to update in place
– swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer
– CAS fails if the current pointer does not equal the initial value
– retry on failure
NBS should enable fast reads … in theory!
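A minimal sketch of copy & flip using C11 atomics (illustrative structure and names; allocation-failure handling omitted, and note that the old version is deliberately not freed, which is exactly the problem the next slides explore):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct item { int a, b; };

    static _Atomic(struct item *) head;

    static void update(void (*modify)(struct item *))
    {
        struct item *oldp = atomic_load(&head);
        struct item *newp = malloc(sizeof(*newp));

        do {
            *newp = *oldp;      /* private copy of the current version  */
            modify(newp);       /* update the copy in place             */
            /* CAS: fails (and refreshes oldp) if head changed under us */
        } while (!atomic_compare_exchange_weak(&head, &oldp, newp));
        /* oldp must NOT be freed here: readers may still reference it  */
    }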
36 Problems with NBS in practice. Reusing memory causes problems
– readers holding references can be hijacked during data-structure traversals when memory is reclaimed
– readers see inconsistent data structures when memory is reused
How and when should memory be reclaimed?
37 Immediate reclamation? In practice, readers must either
– use LL/SC to test if pointers have changed, or
– verify that version numbers associated with data structures have not changed (2 memory barriers)
Synchronization instructions slow NBS readers!
38 Reader-friendly solutions. Never reclaim memory? Type-stable memory?
– needs a free pool per data-structure type
– readers can still be hijacked to the free pool
– exposes the OS to denial-of-service attacks
Ideally, defer reclaiming memory until it’s safe!
– defer reclamation of a data item until references to it are no longer held by any thread
39 How should we defer reclamation? Wait for a while, then delete?
– … but how long should you wait?
Maintain reference counts or per-CPU hazard pointers on data?
– requires synchronization in the read path!
Challenge – deferring destruction without using synchronization instructions in the read path
40 Quiescent-state-based reclamation. Coding convention:
– don’t allow a quiescent state to occur in a read-side critical section
Reclamation strategy:
– only reclaim data after all CPUs in the system have passed through a quiescent state
Example quiescent states:
– context switch in a non-preemptive kernel
– yield in a preemptive kernel
– return from system call …
41 Coding conventions for readers. Delineate the read-side critical section
– compiles to nothing on most architectures
Don’t hold references outside critical sections
– re-traverse the data structure to pick up a reference
Don’t yield the CPU during critical sections
– don’t voluntarily yield
– don’t block, don’t leave the kernel …
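In the Linux kernel these conventions take roughly the following shape (a sketch of a typical RCU reader; struct foo and foo_list are illustrative):

    #include <linux/list.h>
    #include <linux/rcupdate.h>

    struct foo {
        struct list_head list;
        unsigned long key;
        int data;
    };

    static LIST_HEAD(foo_list);

    int foo_lookup(unsigned long key)
    {
        struct foo *p;
        int data = -1;

        rcu_read_lock();        /* delineates the read side; compiles to
                                   nothing on non-preemptible kernels   */
        list_for_each_entry_rcu(p, &foo_list, list) {
            if (p->key == key) {
                data = p->data; /* copy out; no reference escapes the
                                   critical section                     */
                break;
            }
        }
        rcu_read_unlock();      /* no yielding or blocking in between   */
        return data;
    }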
42 Overview of the basic idea. Writers create new versions
– using locking or NBS to synchronize with each other
– register call-backs to destroy old versions when safe
– call-backs are deferred and memory is reclaimed in batches
Readers do not use synchronization
– while they hold a reference to a version it will not be destroyed
– completion of read-side critical sections is inferred from observation of quiescent states
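The matching write side might look like this (a sketch using the present-day Linux API, whose call_rcu() signature differs slightly from the 2.6-era one; assume struct foo above has gained a struct rcu_head field named rcu, and foo_lock serializes writers):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(foo_lock);

    static void foo_free_cb(struct rcu_head *head)
    {
        /* recover the enclosing struct from its embedded rcu_head */
        kfree(container_of(head, struct foo, rcu));
    }

    void foo_del(struct foo *p)
    {
        spin_lock(&foo_lock);       /* writers still exclude each other  */
        list_del_rcu(&p->list);     /* unlink: new readers cannot find p */
        spin_unlock(&foo_lock);
        call_rcu(&p->rcu, foo_free_cb); /* register call-back; kfree is
                                           deferred past a grace period
                                           and dispatched in batches     */
    }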
43 Context switch as a quiescent state. [Timeline diagram: CPU 0 and CPU 1 run RCU read-side critical sections separated by context switches while an element is removed. A reader whose critical section began before the removal may hold a reference; once that CPU has context-switched it cannot hold a reference to the old version, although before the switch RCU cannot tell whether it does.]
44 Grace periods. [Timeline diagram: after an element is deleted, the grace period extends until every CPU has passed through a context switch; only then can the old version be safely reclaimed.]
45 Quiescent states and grace periods. Example quiescent states:
– context switch (non-preemptive kernels)
– voluntary context switch (preemptive kernels)
– kernel entry/exit
– blocking call
Grace periods:
– a period during which every CPU has gone through a quiescent state
46 Efficient implementation. Choosing good quiescent states
– occur anyway
– easy to count
– not too frequent or infrequent
Recording and dispatching call-backs
– minimize inter-CPU communication
– maintain per-CPU queues of call-backs
– two queues – waiting for grace-period start and end
47 RCU’s data structures. [Diagram: per-CPU ‘next’ and ‘current’ call-back queues fed by call_rcu(); a per-CPU grace-period number and counter snapshot; a global grace-period number and a global CPU bitmask. Call-backs move from the ‘next’ queue to the ‘current’ queue at the end of the previous grace period (if any) and are dispatched at the end of the current grace period.]
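A loose sketch of what such per-CPU bookkeeping might look like (modeled on the slide’s diagram, not on any actual implementation):

    struct rcu_callback {
        struct rcu_callback *next;
        void (*func)(void *arg);
        void *arg;
    };

    /* global state */
    static long global_grace_period;    /* global grace-period number   */
    static unsigned long cpu_bitmask;   /* CPUs not yet quiesced        */

    /* per-CPU state */
    struct rcu_percpu {
        struct rcu_callback *next_list; /* 'next': filled by call_rcu() */
        struct rcu_callback *curr_list; /* 'current': waiting for the
                                           current grace period to end  */
        long grace_period;              /* GP number curr_list awaits   */
        long counter_snapshot;          /* quiescent-state counter at
                                           the start of the GP          */
    };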
48 RCU implementations. DYNIX/ptx RCU (data center). Linux
– multiple implementations (in 2.5 and 2.6 kernels)
– preemptible and non-preemptible
Tornado/K42 “generations”
– preemptive kernel
– helped generalize usage
49 Experimental results. How do different combinations of RCU, SMR, NBS and locking compare? Hash-table mini-benchmark running on a 1.45 GHz POWER4 system with 8 CPUs. Various workloads:
– read/update fraction
– hash-table size
– memory constraints
– number of CPUs
50 Scalability with working set in cache [graph]
51 Scalability with large working set [graph]
52 Performance at different update fractions (8 CPUs) [graph]
53 Performance at different update fractions (2 CPUs) [graph]
54 Performance in read-mostly scenarios [graph]
55 Impact of memory constraints [graph]
56 ADDITIONAL SLIDES. The following slides relate to a different paper.
57 Performance and complexity. When should RCU be used?
– instead of a simple spinlock?
– instead of a per-CPU reader-writer lock?
Under what environmental conditions?
– memory-latency ratio
– number of CPUs
Under what workloads?
– fraction of accesses that are updates
– number of updates per grace period
58 Analytic results. Compute breakeven update-fraction contours for RCU vs. locking performance, against:
– number of CPUs (n)
– updates per grace period (λ)
– memory-latency ratio (r)
Look at the computed memory-latency ratio at extreme values of λ for n = 4 CPUs
59 Breakevens for RCU worst case (f vs. r for small λ) [graph]
60 Breakeven for RCU best case (f vs. r for large λ) [graph]
61 Validation of analytic results. 4-CPU 700 MHz P-III system (NUMA-Q quad). Read-only mini-benchmark
– for data structures that are almost never modified
● routing tables, HW/SW configuration, policies
Mixed-workload mini-benchmark
– vary the fraction of accesses that are updates
– see how things change as read-intensity varies
– expect a breakeven point for RCU and locking
62 Benchmark results (read-only) [graph]
63 Benchmark results for mixed workloads [graph]
64 Real-world performance and complexity. SysV IPC
– >10x on microbenchmark (8 CPUs)
– 5% for database benchmark (2 CPUs)
– 151 net lines added to the kernel
Directory-entry cache
– +20% in multiuser benchmark (16 CPUs)
– +12% on SPECweb99 (8 CPUs)
– -10% time required to build the kernel (16 CPUs)
– 126 net lines added to the kernel
65 Real-world performance and complexity. Task list
– +10% in multiuser benchmark (16 CPUs)
– 6 net lines added to the kernel (13 added, 7 deleted)
66 Summary and conclusions (1). RCU can provide order-of-magnitude speedups for read-mostly data structures
– RCU is optimal when less than 10% of accesses are updates, over a wide range of CPU counts
– RCU is projected to remain useful on future CPU architectures
In the Linux 2.6 kernel, RCU provided excellent performance with little added complexity
67 How does RCU address overheads? Lock contention
– readers need not acquire locks: no contention!!!
– writers can still suffer lock contention
● but only with each other, and writers are infrequent
● very little contention!!!
Memory latency
– readers do not perform memory writes
– no need to communicate data among CPUs for cache consistency
● memory latency greatly reduced
68 How does RCU address overheads? Pipeline-stall overhead
– on most CPUs, readers do not stall the pipeline due to update ordering or atomic operations
Instruction overhead
– no atomic instructions required for readers
– readers only need to execute fast instructions
69 Summary and conclusions (2). RCU is best when designed in from the start
– RCU was added late to the 2.5 kernel; limited changes were feasible after the Halloween feature freeze
– now doing more sweeping changes
Use of design patterns is key to RCU
– RCU’s consistency semantics require transformation of some algorithms
– transformational design patterns can be used
70 Future work
– formal model of RCU’s semantics
– formal model of consistency semantics for algorithms
– tools to automatically transform algorithms into a form consistent with RCU’s semantics
– tools to automatically generate RCU from code that uses locking
– apply RCU to other computational environments
71 Use the right tool for the job!!!
72 BACKUP
73 “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it!” – Brian Kernighan
74 RCU thesis publications
– McKenney et al., “Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications”, USENIX/UseLinux, 6/2004.
– McKenney, “Locking performance on different CPUs”, linux.conf.au, 1/2004.
– McKenney et al., “Scaling dcache with RCU”, Linux Journal, 1/2004.
– McKenney, “Using RCU in the Linux 2.6 kernel”, Linux Journal, 10/2003.
– Arcangeli et al., “Using read-copy update techniques for System V IPC in the Linux 2.5 kernel”, FREENIX, 6/2003.
– Appavoo et al., “Enabling autonomic behavior in systems software with hot swapping”, IBM Systems Journal, 1/2003.
– McKenney et al., “Read-copy update”, Ottawa Linux Symposium, 6/2002.
– McKenney et al., “Read-copy update”, Ottawa Linux Symposium, 7/2001.
24 additional publications, 14 patents, 22 patents pending.
75 Related work. Maintaining multiple versions [Kung, Herlihy]
– changes the problem from inconsistency to staleness
Deferring destruction [Kung, Hennessy]
– garbage collection for multiple versions reduces complexity
– batched destruction amortizes overhead
76 Double-compare-&-swap

    /* pseudocode: the whole body executes atomically in hardware */
    DCAS(addr1, addr2, old1, old2, new1, new2)
    {
        if ((*addr1 == old1) && (*addr2 == old2)) {
            /* both locations still hold the expected values:
               commit both new values together */
            *addr1 = new1;
            *addr2 = new2;
            return TRUE;
        } else {
            return FALSE;
        }
    }
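For example, an NBS writer might use DCAS to commit a new list head and bump a version number in one atomic step (a sketch; head, version and the old/new values are illustrative):

    if (DCAS(&head, &version, old_head, old_version,
             new_head, old_version + 1)) {
        /* update committed atomically */
    } else {
        /* a concurrent update intervened: retry */
    }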
77 Non-blocking synchronization. Basic idea
– data structures have version numbers
– updates are committed using a single atomic instruction that fails if there are other concurrent updates
– retry on failure
Simple implementations require a double-compare-&-swap (DCAS) instruction. No locks/deadlock + synchronization-free readers! … in theory
78 In practice … Correctness of DCAS in the failure case requires type-stable memory management
– how is the memory of old elements ever reclaimed?
The DCAS instruction is not available in most hardware
– software implementations are complex and costly
– they require readers to use a memory barrier!
– performance is worse than locking
79 So what now? How can we remove synchronization instructions from the common path in read-mostly scenarios? The good ideas:
– asymmetry between readers and writers
– data partitioning and per-CPU locking
– hiding complex updates behind atomic commit points
80 The problems with NBS in practice. Read-side memory barriers
– why not allow readers to see old versions?
– requires tolerance for a small window of inconsistency
Type-stable memory management
– why not reclaim memory safely by garbage collection?
– … but can this be done efficiently?
Dependence on obscure hardware
– why not use locking on the write side?
– … this also fixes the write-side performance problems of NBS
81 Instruction costs (measurements taken on a 4-CPU 700 MHz i386 P-III) [table]
82 Actual scalability. Scalability of locking strategies using read-only workloads in a hash-table benchmark. Measurements taken on a 1.45 GHz POWER machine. [Graph.]
83 Design patterns for RCU. “Design patterns capture the static and dynamic structures of solutions that occur repeatedly when producing applications in a particular context. Because they address fundamental challenges in software system development, design patterns are an important technique for improving the quality of software. Key challenges addressed by design patterns include communication of architectural knowledge among developers, accommodating a new design paradigm or architectural style, and avoiding development traps and pitfalls that are usually learned only by (painful) experience.” – Coplien and Schmidt, 1995
84 Two types of design patterns for RCU. For situations well-suited to RCU:
– patterns that describe direct use of RCU
For algorithms that do not tolerate RCU’s stale- and inconsistent-data properties:
– patterns that describe transformations of algorithms into forms that can tolerate stale and/or inconsistent data
85 Patterns for direct RCU use. Reader/Writer-Lock/RCU Analogy
– routing tables, Linux tasklist-lock patch, ...
RCU Readers With WFS Writers
– K42 hash tables
RCU Existence Locks
– ensure a data structure persists as needed
– Linux SysV IPC, dcache, IP route cache, ...
Pure RCU
– dynamic interrupt handlers ...
– Linux NMI handlers ...
86 Reader/Writer-Lock/RCU Analogy

    read_lock()     →  rcu_read_lock()
    read_unlock()   →  rcu_read_unlock()
    write_lock()    →  spin_lock()
    write_unlock()  →  spin_unlock()
    list_add()      →  list_add_rcu()
    list_del()      →  list_del_rcu()
    free(p)         →  call_rcu(free, p)
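Applied to a read path, the analogy makes conversion nearly mechanical (a sketch with illustrative names):

    /* with reader-writer locking */
    read_lock(&mylist_rwlock);
    p = list_search(&mylist, key);
    do_something_with(p);
    read_unlock(&mylist_rwlock);

    /* with RCU: same shape, but no lock-word traffic or memory
       barriers on the read side */
    rcu_read_lock();
    p = list_search(&mylist, key);
    do_something_with(p);
    rcu_read_unlock();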
87 Patterns for direct RCU use
– Reader/Writer-Lock/RCU Analogy (5)
– RCU Readers With WFS Writers (1)
– RCU Existence Locks (7)
– Pure RCU (4)
88 Stale and inconsistent data. RCU allows concurrent readers and writers
– RCU allows readers to access old versions
● newly arriving readers will get the most recent version
● existing readers will get an old version
– RCU allows multiple simultaneous versions
● a given reader can access different versions while traversing an RCU-protected data structure
● concurrent readers can be accessing different versions
Some algorithms tolerate this consistency model, but many do not
89 RCU transformational patterns
– Substitute Copy for Original *
– Impose Level of Indirection
– Mark Obsolete Objects
– Ordered Update With Ordered Read
– Global Version Number
– Stall Updates
90 Substitute Copy for Original. In its pure form, RCU relies on atomic updates of a single value
– most CPUs support this
If a data structure requires multiple updates that must appear atomic to readers
– the updates must be hidden behind a single atomic operation in order to apply RCU
To provide atomicity:
– make a copy, update the copy, then substitute the copy for the original
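A sketch of the pattern in Linux style (illustrative names; allocation-failure handling omitted; readers would access the structure via rcu_dereference(cfgp) inside rcu_read_lock()):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(cfg_lock);

    struct config {
        int threshold;
        int limit;
    };

    static struct config *cfgp;

    void cfg_update(int threshold, int limit)
    {
        struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
        struct config *oldc;

        spin_lock(&cfg_lock);            /* serialize writers            */
        oldc = cfgp;
        *newc = *oldc;                   /* copy ...                     */
        newc->threshold = threshold;     /* ... update the copy ...      */
        newc->limit = limit;
        rcu_assign_pointer(cfgp, newc);  /* ... substitute: the single
                                            atomic commit point          */
        spin_unlock(&cfg_lock);
        synchronize_rcu();               /* wait out pre-existing readers */
        kfree(oldc);
    }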
91-93 Substitute Copy animation. [Animation over slides 91-93: ipc_ids points to an 8-entry array (slots 0-7) holding Sem0, Sem4 and Sem6. A larger array is allocated, the existing entries plus a new Sem8 are copied in, the ipc_ids pointer is atomically switched to the new array, and the old array is then reclaimed.]
94 RCU transformational patterns
– Substitute Copy for Original (2)
– Impose Level of Indirection (~1)
– Mark Obsolete Objects (2)
– Ordered Update With Ordered Read (3)
– Global Version Number (2)
– Stall Updates (~1)