Multiprocessor Programming Transactional Memory The Art of Multiprocessor Programming
Multicore Architetures “Learn how the multi-core processor architecture plays a central role in Intel's platform approach. ….” “AMD is leading the industry to multi-core technology for the x86 based computing market …” “Sun's multicore strategy centers around multi-threaded software. ... “ Art of Multiprocessor Programming
Why do we care? Time no longer cures software bloat When you double your path length You can’t just wait 6 months Your software must somehow exploit twice as much concurrency © 2006 Herlihy & Shavit
The Problem Cannot exploit cheap threads Today’s Software Non-scalable methodologies Today’s Hardware Poor support for scalable synchronization © 2006 Herlihy & Shavit
Locking © 2006 Herlihy & Shavit
Coarse-Grained Locking Easily made correct … But not scalable. © 2006 Herlihy & Shavit
Fine-Grained Locking Here comes trouble … © 2006 Herlihy & Shavit
Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
No one else can make progress Locks are not Robust If a thread holding a lock is delayed … No one else can make progress © 2006 Herlihy & Shavit
Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
Locking Relies on Conventions Relation between Lock bit and object bits Exists only in programmer’s mind Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul) /* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */ © 2006 Herlihy & Shavit
Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
No interference if ends “far enough” apart Sadistic Homework enq(x) enq(y) Double-ended queue No interference if ends “far enough” apart © 2006 Herlihy & Shavit
Interference OK if ends “close enough” together Sadistic Homework enq(x) enq(y) Double-ended queue Interference OK if ends “close enough” together © 2006 Herlihy & Shavit
Make sure suspended dequeuers awake as needed Sadistic Homework deq() deq() Double-ended queue Make sure suspended dequeuers awake as needed © 2006 Herlihy & Shavit
You Try It … One lock? Locks at each end? Waking blocked dequeuers? Too Conservative Locks at each end? Deadlock, too complicated, etc Waking blocked dequeuers? Harder than it looks © 2006 Herlihy & Shavit
Actual Solution Clean solution would be a publishable result [Michael & Scott, PODC 96] High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers © 2006 Herlihy & Shavit
Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
Exposing lock internals breaks abstraction Locks do not compose Hash Table lock T1 Must lock T1 before adding item add(T1, item) item Move from T1 to T2 lock T1 lock T2 delete(T1, item) add(T2, item) Must lock T2 before deleting from T1 item item Exposing lock internals breaks abstraction © 2006 Herlihy & Shavit
Monitor Wait and Signal Empty zzz buffer Yes! If buffer is empty, wait for item to show up © 2006 Herlihy & Shavit
Wait and Signal do not Compose empty Wait for either? empty zzz… © 2006 Herlihy & Shavit
The Transactional Manifesto What we do now is inadequate to meet the multicore challenge Research Agenda Replace locking with a transactional API Design languages to support this model Implement the run-time to be fast enough © 2006 Herlihy & Shavit
Transactions Atomic Linearizable Commit: takes effect Abort: effects rolled back Usually retried Linearizable Appear to happen in one-at-a-time order © 2006 Herlihy & Shavit
Sadistic Homework Revisited Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Write sequential Code (1) © 2006 Herlihy & Shavit
Sadistic Homework Revisited Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } (1) © 2006 Herlihy & Shavit
Sadistic Homework Revisited Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Enclose in atomic block (1) © 2006 Herlihy & Shavit
Warning Not always this simple But often it is Conditional waits Enhanced concurrency Overlapping locks But often it is Works for sadistic homework © 2006 Herlihy & Shavit
Composition Trivial or what? Public void Transfer(Queue q1, q2) { atomic { Item x = q1.deq(); q2.enq(x); } Trivial or what? (1) © 2006 Herlihy & Shavit
Wake-ups: lost and found Public item LeftDeq() { atomic { if (this.left == null) retry; … } Roll back transaction and restart when something changes (1) © 2006 Herlihy & Shavit
OrElse Composition Run 1st method. If it retries … atomic { x = q1.deq(); } orElse { x = q2.deq(); } Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries © 2006 Herlihy & Shavit
Transactional Memory [HerlihyMoss93] © 2006 Herlihy & Shavit
Hardware Overview Exploit Cache coherence protocols Already do almost what we need Invalidation Consistency checking Exploit Speculative execution Branch prediction = optimistic synch! Original TM proposal by Herlihy and Moss 1993… © 2006 Herlihy & Shavit
HW Transactional Memory read active T caches Interconnect memory © 2006 Herlihy & Shavit
Transactional Memory read active active T T caches memory © 2006 Herlihy & Shavit
Transactional Memory active active committed T T caches memory © 2006 Herlihy & Shavit
Transactional Memory write active committed T D caches memory © 2006 Herlihy & Shavit
Rewind write aborted active active T T D caches memory © 2006 Herlihy & Shavit
Transaction Commit At commit point Mark transactional entries If no cache conflicts, we win. Mark transactional entries Read-only: valid Modified: dirty (eventually written back) That’s all, folks! Except for a few details … © 2006 Herlihy & Shavit
Not all Skittles and Beer Limits to Transactional cache size Scheduling quantum Transaction cannot commit if it is Too big Too slow Actual limits platform-dependent © 2006 Herlihy & Shavit
Some Approaches Trap to software if hardware fails “Hybrid” approach Use locks if speculation fails Lock elision “Virtualize” transactional memory VTM, UTM, etc… © 2006 Herlihy & Shavit
Hardware versus Software Do we need hardware at all? Like virtual memory, probably need HW for performance Do we need software? Policy issues don’t make sense for hardware © 2006 Herlihy & Shavit
Transactional Locking Build STM based on locks Atomic{} simple as course grained locking and it composes Under the covers (transparent to user) we deploy fine grained locks © 2006 Herlihy & Shavit
Locking STM Design Choices Map Array of Versioned- Write-Locks Application Memory V# PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object) © 2006 Herlihy & Shavit
Lock-based STM Mem Locks V# 0 V#+1 0 X Y V# 0 V# 0 V# 0 To Read: load lock + location Location in write-set? (Bloom Filter) Check unlocked add to Read-Set To Write: add value to write set Acquire Locks Validate read/write v#’s unchanged Release each lock with v#+1 X V# 0 V# 1 V# 1 V# 0 V#+1 0 V# 0 Y V# 1 V# 0 V# 0 V# 0 V#+1 0 V#+1 0 V# 1 V# 0 V# 0 V# 0 V# 0 V# 0 © 2006 Herlihy & Shavit
Transactional Consistency Invariant x = 2y 8 4 4 X Transaction A: Write x Write y 2 Y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2 Application Memory
Transactional Inconsistency Invariant x = 2y 4 8 4 X Transaction B: Read x = 4 2 Y Transaction A: Write x Write y Zombie transactiosn are ones that are doomed to fail (here trans will fail validatuon at the end) but go on Computing in a way that can cause problems Transaction B: Read y = 4 {trans is zombie} Compute z = 1/(x-y) Application Memory DIV by 0 ERROR
Solution: The “Global Clock” Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent
Solution: “Version Clock” Have one shared global version clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent Later: how we learned not to worry about contention and love the clock © 2006 Herlihy & Shavit
Version Clock: Read-OnlyTrans Mem Locks 100 VClock 87 0 34 0 88 0 V# 0 44 0 99 0 50 0 87 0 87 0 87 0 RV VClock On Read: read lock, read mem, read lock: check unlocked, unchanged, and v# <= RV Commit. 34 0 34 0 34 0 88 0 99 0 99 0 V# 0 99 0 44 0 Reads form a snapshot of memory. No read set! 50 0 V# 0 50 0 50 0 100 RV © 2006 Herlihy & Shavit
Version Clock 100 120 100 121 VClock Mem Locks 87 0 121 0 88 0 V# 0 87 0 121 0 88 0 V# 0 44 0 50 0 87 0 87 0 87 0 87 0 RV VClock On Read/Write: check unlocked and v# <= RV then add to Read/Write-Set Acquire Locks WV = F&I(VClock) Validate each v# <= RV Release locks with v# WV X X 34 0 34 1 34 0 34 0 121 0 88 0 Y Y 121 0 99 1 99 0 99 0 99 0 44 0 50 0 V# 0 50 0 50 0 50 0 Reads+Inc+Writes =Linearizable 100 RV Commit © 2006 Herlihy & Shavit
Performance Benchmarks Mechanically Transformed Sequential Red-Black Tree using TL2 Compare to STMs and hand-crafted fine-grained Red-Black implementation On a 16–way Sun Fire™ running Solaris™ 10 © 2006 Herlihy & Shavit
Uncontended Large Red-Black Tree Hand-crafted 5% Delete 5% Update 90% Lookup TL/PO TL2/P0 Ennals TL/PS TL2/PS FarserHarris Lock-free © 2006 Herlihy & Shavit
Uncontended Small RB-Tree 5% Delete 5% Update 90% Lookup TL/P0 TL2/P0 © 2006 Herlihy & Shavit
Contended Small RB-Tree TL/P0 30% Delete 30% Update 40% Lookup TL2/P0 Ennals © 2006 Herlihy & Shavit
What Next? On multicores like Niagara Global Clock not a big issue. But there is a lot of TM research work ahead Do you want to help…? © 2006 Herlihy & Shavit