Multiprocessor Programming

Multiprocessor Programming
Transactional Memory The Art of Multiprocessor Programming

Multicore Architetures
“Learn how the multi-core processor architecture plays a central role in Intel's platform approach. ….” “AMD is leading the industry to multi-core technology for the x86 based computing market …” “Sun's multicore strategy centers around multi-threaded software. ... “ Art of Multiprocessor Programming

Why do we care? Time no longer cures software bloat
When you double your path length You can’t just wait 6 months Your software must somehow exploit twice as much concurrency © 2006 Herlihy & Shavit

The Problem Cannot exploit cheap threads Today’s Software
Non-scalable methodologies Today’s Hardware Poor support for scalable synchronization © 2006 Herlihy & Shavit

Coarse-Grained Locking
Easily made correct … But not scalable. © 2006 Herlihy & Shavit

Why Locking Doesn’t Scale
Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit

No one else can make progress
Locks are not Robust If a thread holding a lock is delayed … No one else can make progress © 2006 Herlihy & Shavit

Locking Relies on Conventions
Relation between Lock bit and object bits Exists only in programmer’s mind Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul) /* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */ © 2006 Herlihy & Shavit

No interference if ends “far enough” apart
Sadistic Homework enq(x) enq(y) Double-ended queue No interference if ends “far enough” apart © 2006 Herlihy & Shavit

Interference OK if ends “close enough” together
Sadistic Homework enq(x) enq(y) Double-ended queue Interference OK if ends “close enough” together © 2006 Herlihy & Shavit

Make sure suspended dequeuers awake as needed
Sadistic Homework deq() deq() Double-ended queue  Make sure suspended dequeuers awake as needed © 2006 Herlihy & Shavit

You Try It … One lock? Locks at each end? Waking blocked dequeuers?
Too Conservative Locks at each end? Deadlock, too complicated, etc Waking blocked dequeuers? Harder than it looks © 2006 Herlihy & Shavit

Actual Solution Clean solution would be a publishable result
[Michael & Scott, PODC 96] High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers © 2006 Herlihy & Shavit

Exposing lock internals breaks abstraction
Locks do not compose Hash Table lock T1 Must lock T1 before adding item add(T1, item) item Move from T1 to T2 lock T1 lock T2 delete(T1, item) add(T2, item) Must lock T2 before deleting from T1 item item Exposing lock internals breaks abstraction © 2006 Herlihy & Shavit

Monitor Wait and Signal
Empty zzz buffer Yes! If buffer is empty, wait for item to show up © 2006 Herlihy & Shavit

Wait and Signal do not Compose
empty Wait for either? empty zzz… © 2006 Herlihy & Shavit

The Transactional Manifesto
What we do now is inadequate to meet the multicore challenge Research Agenda Replace locking with a transactional API Design languages to support this model Implement the run-time to be fast enough © 2006 Herlihy & Shavit

Transactions Atomic Linearizable Commit: takes effect
Abort: effects rolled back Usually retried Linearizable Appear to happen in one-at-a-time order © 2006 Herlihy & Shavit

Sadistic Homework Revisited
Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Write sequential Code (1) © 2006 Herlihy & Shavit

Warning Not always this simple But often it is Conditional waits
Enhanced concurrency Overlapping locks But often it is Works for sadistic homework © 2006 Herlihy & Shavit

Composition Trivial or what? Public void Transfer(Queue q1, q2) {
atomic { Item x = q1.deq(); q2.enq(x); } Trivial or what? (1) © 2006 Herlihy & Shavit

Wake-ups: lost and found
Public item LeftDeq() { atomic { if (this.left == null) retry; … } Roll back transaction and restart when something changes (1) © 2006 Herlihy & Shavit

OrElse Composition Run 1st method. If it retries …
atomic { x = q1.deq(); } orElse { x = q2.deq(); } Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries © 2006 Herlihy & Shavit

Transactional Memory [HerlihyMoss93]
© 2006 Herlihy & Shavit

Hardware Overview Exploit Cache coherence protocols
Already do almost what we need Invalidation Consistency checking Exploit Speculative execution Branch prediction = optimistic synch! Original TM proposal by Herlihy and Moss 1993… © 2006 Herlihy & Shavit

HW Transactional Memory
read active T caches Interconnect memory © 2006 Herlihy & Shavit

Transactional Memory read active active T T caches memory

Transactional Memory active active committed T T caches memory

Transactional Memory write active committed T D caches memory

Rewind write aborted active active T T D caches memory

Transaction Commit At commit point Mark transactional entries
If no cache conflicts, we win. Mark transactional entries Read-only: valid Modified: dirty (eventually written back) That’s all, folks! Except for a few details … © 2006 Herlihy & Shavit

Not all Skittles and Beer
Limits to Transactional cache size Scheduling quantum Transaction cannot commit if it is Too big Too slow Actual limits platform-dependent © 2006 Herlihy & Shavit

Some Approaches Trap to software if hardware fails
“Hybrid” approach Use locks if speculation fails Lock elision “Virtualize” transactional memory VTM, UTM, etc… © 2006 Herlihy & Shavit

Hardware versus Software
Do we need hardware at all? Like virtual memory, probably need HW for performance Do we need software? Policy issues don’t make sense for hardware © 2006 Herlihy & Shavit

Transactional Locking
Build STM based on locks Atomic{} simple as course grained locking and it composes Under the covers (transparent to user) we deploy fine grained locks © 2006 Herlihy & Shavit

Locking STM Design Choices
Map Array of Versioned- Write-Locks Application Memory V# PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object) © 2006 Herlihy & Shavit

Lock-based STM Mem Locks V# 0 V#+1 0 X Y V# 0 V# 0 V# 0
To Read: load lock + location Location in write-set? (Bloom Filter) Check unlocked add to Read-Set To Write: add value to write set Acquire Locks Validate read/write v#’s unchanged Release each lock with v#+1 X V# V# V# V# V#+1 0 V# Y V# V# V# V# V#+1 0 V#+1 0 V# V# V# V# V# 0 V# © 2006 Herlihy & Shavit

Transactional Consistency
Invariant x = 2y 8 4 4 X Transaction A: Write x Write y 2 Y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2 Application Memory

Transactional Inconsistency
Invariant x = 2y 4 8 4 X Transaction B: Read x = 4 2 Y Transaction A: Write x Write y Zombie transactiosn are ones that are doomed to fail (here trans will fail validatuon at the end) but go on Computing in a way that can cause problems Transaction B: Read y = 4 {trans is zombie} Compute z = 1/(x-y) Application Memory DIV by 0 ERROR

Solution: The “Global Clock”
Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent

Solution: “Version Clock”
Have one shared global version clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent Later: how we learned not to worry about contention and love the clock © 2006 Herlihy & Shavit

Version Clock: Read-OnlyTrans
Mem Locks 100 VClock V# RV  VClock On Read: read lock, read mem, read lock: check unlocked, unchanged, and v# <= RV Commit. V# Reads form a snapshot of memory. No read set! V# 100 RV © 2006 Herlihy & Shavit

Version Clock 100 120 100 121 VClock Mem Locks 87 0 121 0 88 0 V# 0
V# RV  VClock On Read/Write: check unlocked and v# <= RV then add to Read/Write-Set Acquire Locks WV = F&I(VClock) Validate each v# <= RV Release locks with v#  WV X X Y Y V# Reads+Inc+Writes =Linearizable 100 RV Commit © 2006 Herlihy & Shavit

Performance Benchmarks
Mechanically Transformed Sequential Red-Black Tree using TL2 Compare to STMs and hand-crafted fine-grained Red-Black implementation On a 16–way Sun Fire™ running Solaris™ 10 © 2006 Herlihy & Shavit

Uncontended Large Red-Black Tree
Hand-crafted 5% Delete 5% Update 90% Lookup TL/PO TL2/P0 Ennals TL/PS TL2/PS FarserHarris Lock-free © 2006 Herlihy & Shavit

What Next? On multicores like Niagara Global Clock not a big issue.
But there is a lot of TM research work ahead Do you want to help…? © 2006 Herlihy & Shavit

Multiprocessor Programming

Similar presentations

Presentation on theme: "Multiprocessor Programming"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Multiprocessor Programming

Similar presentations

Presentation on theme: "Multiprocessor Programming"— Presentation transcript:

Similar presentations

About project

Feedback