Download presentation
Presentation is loading. Please wait.
1
Multiprocessor Programming
Transactional Memory The Art of Multiprocessor Programming
2
Multicore Architetures
“Learn how the multi-core processor architecture plays a central role in Intel's platform approach. ….” “AMD is leading the industry to multi-core technology for the x86 based computing market …” “Sun's multicore strategy centers around multi-threaded software. ... “ Art of Multiprocessor Programming
3
Why do we care? Time no longer cures software bloat
When you double your path length You can’t just wait 6 months Your software must somehow exploit twice as much concurrency © 2006 Herlihy & Shavit
4
The Problem Cannot exploit cheap threads Today’s Software
Non-scalable methodologies Today’s Hardware Poor support for scalable synchronization © 2006 Herlihy & Shavit
5
Locking © 2006 Herlihy & Shavit
6
Coarse-Grained Locking
Easily made correct … But not scalable. © 2006 Herlihy & Shavit
7
Fine-Grained Locking Here comes trouble … © 2006 Herlihy & Shavit
8
Why Locking Doesn’t Scale
Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
9
No one else can make progress
Locks are not Robust If a thread holding a lock is delayed … No one else can make progress © 2006 Herlihy & Shavit
10
Why Locking Doesn’t Scale
Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
11
Locking Relies on Conventions
Relation between Lock bit and object bits Exists only in programmer’s mind Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul) /* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */ © 2006 Herlihy & Shavit
12
Why Locking Doesn’t Scale
Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
13
No interference if ends “far enough” apart
Sadistic Homework enq(x) enq(y) Double-ended queue No interference if ends “far enough” apart © 2006 Herlihy & Shavit
14
Interference OK if ends “close enough” together
Sadistic Homework enq(x) enq(y) Double-ended queue Interference OK if ends “close enough” together © 2006 Herlihy & Shavit
15
Make sure suspended dequeuers awake as needed
Sadistic Homework deq() deq() Double-ended queue Make sure suspended dequeuers awake as needed © 2006 Herlihy & Shavit
16
You Try It … One lock? Locks at each end? Waking blocked dequeuers?
Too Conservative Locks at each end? Deadlock, too complicated, etc Waking blocked dequeuers? Harder than it looks © 2006 Herlihy & Shavit
17
Actual Solution Clean solution would be a publishable result
[Michael & Scott, PODC 96] High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers © 2006 Herlihy & Shavit
18
Why Locking Doesn’t Scale
Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit
19
Exposing lock internals breaks abstraction
Locks do not compose Hash Table lock T1 Must lock T1 before adding item add(T1, item) item Move from T1 to T2 lock T1 lock T2 delete(T1, item) add(T2, item) Must lock T2 before deleting from T1 item item Exposing lock internals breaks abstraction © 2006 Herlihy & Shavit
20
Monitor Wait and Signal
Empty zzz buffer Yes! If buffer is empty, wait for item to show up © 2006 Herlihy & Shavit
21
Wait and Signal do not Compose
empty Wait for either? empty zzz… © 2006 Herlihy & Shavit
22
The Transactional Manifesto
What we do now is inadequate to meet the multicore challenge Research Agenda Replace locking with a transactional API Design languages to support this model Implement the run-time to be fast enough © 2006 Herlihy & Shavit
23
Transactions Atomic Linearizable Commit: takes effect
Abort: effects rolled back Usually retried Linearizable Appear to happen in one-at-a-time order © 2006 Herlihy & Shavit
24
Sadistic Homework Revisited
Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Write sequential Code (1) © 2006 Herlihy & Shavit
25
Sadistic Homework Revisited
Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } (1) © 2006 Herlihy & Shavit
26
Sadistic Homework Revisited
Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Enclose in atomic block (1) © 2006 Herlihy & Shavit
27
Warning Not always this simple But often it is Conditional waits
Enhanced concurrency Overlapping locks But often it is Works for sadistic homework © 2006 Herlihy & Shavit
28
Composition Trivial or what? Public void Transfer(Queue q1, q2) {
atomic { Item x = q1.deq(); q2.enq(x); } Trivial or what? (1) © 2006 Herlihy & Shavit
29
Wake-ups: lost and found
Public item LeftDeq() { atomic { if (this.left == null) retry; … } Roll back transaction and restart when something changes (1) © 2006 Herlihy & Shavit
30
OrElse Composition Run 1st method. If it retries …
atomic { x = q1.deq(); } orElse { x = q2.deq(); } Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries © 2006 Herlihy & Shavit
31
Transactional Memory [HerlihyMoss93]
© 2006 Herlihy & Shavit
32
Hardware Overview Exploit Cache coherence protocols
Already do almost what we need Invalidation Consistency checking Exploit Speculative execution Branch prediction = optimistic synch! Original TM proposal by Herlihy and Moss 1993… © 2006 Herlihy & Shavit
33
HW Transactional Memory
read active T caches Interconnect memory © 2006 Herlihy & Shavit
34
Transactional Memory read active active T T caches memory
© 2006 Herlihy & Shavit
35
Transactional Memory active active committed T T caches memory
© 2006 Herlihy & Shavit
36
Transactional Memory write active committed T D caches memory
© 2006 Herlihy & Shavit
37
Rewind write aborted active active T T D caches memory
© 2006 Herlihy & Shavit
38
Transaction Commit At commit point Mark transactional entries
If no cache conflicts, we win. Mark transactional entries Read-only: valid Modified: dirty (eventually written back) That’s all, folks! Except for a few details … © 2006 Herlihy & Shavit
39
Not all Skittles and Beer
Limits to Transactional cache size Scheduling quantum Transaction cannot commit if it is Too big Too slow Actual limits platform-dependent © 2006 Herlihy & Shavit
40
Some Approaches Trap to software if hardware fails
“Hybrid” approach Use locks if speculation fails Lock elision “Virtualize” transactional memory VTM, UTM, etc… © 2006 Herlihy & Shavit
41
Hardware versus Software
Do we need hardware at all? Like virtual memory, probably need HW for performance Do we need software? Policy issues don’t make sense for hardware © 2006 Herlihy & Shavit
42
Transactional Locking
Build STM based on locks Atomic{} simple as course grained locking and it composes Under the covers (transparent to user) we deploy fine grained locks © 2006 Herlihy & Shavit
43
Locking STM Design Choices
Map Array of Versioned- Write-Locks Application Memory V# PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object) © 2006 Herlihy & Shavit
44
Lock-based STM Mem Locks V# 0 V#+1 0 X Y V# 0 V# 0 V# 0
To Read: load lock + location Location in write-set? (Bloom Filter) Check unlocked add to Read-Set To Write: add value to write set Acquire Locks Validate read/write v#’s unchanged Release each lock with v#+1 X V# V# V# V# V#+1 0 V# Y V# V# V# V# V#+1 0 V#+1 0 V# V# V# V# V# 0 V# © 2006 Herlihy & Shavit
45
Transactional Consistency
Invariant x = 2y 8 4 4 X Transaction A: Write x Write y 2 Y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2 Application Memory
46
Transactional Inconsistency
Invariant x = 2y 4 8 4 X Transaction B: Read x = 4 2 Y Transaction A: Write x Write y Zombie transactiosn are ones that are doomed to fail (here trans will fail validatuon at the end) but go on Computing in a way that can cause problems Transaction B: Read y = 4 {trans is zombie} Compute z = 1/(x-y) Application Memory DIV by 0 ERROR
47
Solution: The “Global Clock”
Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent
48
Solution: “Version Clock”
Have one shared global version clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent Later: how we learned not to worry about contention and love the clock © 2006 Herlihy & Shavit
49
Version Clock: Read-OnlyTrans
Mem Locks 100 VClock V# RV VClock On Read: read lock, read mem, read lock: check unlocked, unchanged, and v# <= RV Commit. V# Reads form a snapshot of memory. No read set! V# 100 RV © 2006 Herlihy & Shavit
50
Version Clock 100 120 100 121 VClock Mem Locks 87 0 121 0 88 0 V# 0
V# RV VClock On Read/Write: check unlocked and v# <= RV then add to Read/Write-Set Acquire Locks WV = F&I(VClock) Validate each v# <= RV Release locks with v# WV X X Y Y V# Reads+Inc+Writes =Linearizable 100 RV Commit © 2006 Herlihy & Shavit
51
Performance Benchmarks
Mechanically Transformed Sequential Red-Black Tree using TL2 Compare to STMs and hand-crafted fine-grained Red-Black implementation On a 16–way Sun Fire™ running Solaris™ 10 © 2006 Herlihy & Shavit
52
Uncontended Large Red-Black Tree
Hand-crafted 5% Delete 5% Update 90% Lookup TL/PO TL2/P0 Ennals TL/PS TL2/PS FarserHarris Lock-free © 2006 Herlihy & Shavit
53
Uncontended Small RB-Tree
5% Delete 5% Update 90% Lookup TL/P0 TL2/P0 © 2006 Herlihy & Shavit
54
Contended Small RB-Tree
TL/P0 30% Delete 30% Update 40% Lookup TL2/P0 Ennals © 2006 Herlihy & Shavit
55
What Next? On multicores like Niagara Global Clock not a big issue.
But there is a lot of TM research work ahead Do you want to help…? © 2006 Herlihy & Shavit
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.