Software Transactional Memory Nir Shavit Tel-Aviv University and Sun Labs “Where Do We Come From? What Are We? Where Are We Going?”

Traditional Software Scaling 2 User code Traditional Uniprocessor Speedup 1.8x 7x 3.6x Time: Moore’s law

Multicore Software Scaling 3 User code Multicore Speedup 1.8x7x3.6x Unfortunately, not so simple…

Real-World Multicore Scaling 4 1.8x 2x 2.9x User code Multicore Speedup Parallelization and Synchronization require great care…

Amdahl’s Law: Speedup = 1/(ParallelPart/N + SequentialPart) Pay for N = 8 cores SequentialPart = 25% Speedup = only 2.9 times! Why? Effect of 25% becomes more accute as num of cores grows 2.3/4, 2.9/8, 3.4/16, 3.7/32………4/ ∞

Shared Data Structures 75% Unshared 25% Shared cc cc cc cc Coarse Grained c c c c c c c c cc cc cc cc Fine Grained c c c c c c c c The reason we get only 2.9 speedup 75% Unshared 25% Shared Fine grained parallelism has huge performance benefit

object Shared Memory Concurrent Programming How do we lower the granularity of synchronization without making the concurrent programmer’s life unbearable?!

A FIFO Queue bcd TailHead a Enqueue(d)Dequeue() => a

A Concurrent FIFO Queue Object lock bcd TailHead a P: Dequeue() => a Q: Enqueue(d) Simple Code, easy to prove correct Contention and sequential bottleneck

Fine Grain Locks bcd TailHead a P: Dequeue() => a Q: Enqueue(d) Finer Granularity, More Complex Code Verification nightmare: worry about deadlock, livelock…

Fine Grain Locks bcd TailHead a P: Dequeue() => a Q: Enqueue(b) b TailHead a Worry how to acquire multiple locks Complex boundary cases: empty queue, last item

Lock-Free (JDK 1.5+) bcd TailHead a P: Dequeue() => a Q: Enqueue(d) Even Finer Granularity, Even More Complex Code Worry about starvation, subtle bugs, hardness to modify…

Real Applications Complex: Move data atomically between structures More than twice the worry… bcd TailHead a P: Dequeue(Q1,a) cda TailHead b Enqueue(Q2,a)

Transactional Memory [HerlihyMoss93]

Promise of Transactional Memory bcd TailHead a P: Dequeue() => a Q: Enqueue(d) Don’t worry about deadlock, livelock, subtle bugs, etc… Great Performance, Simple Code

Promise of Transactional Memory bcd TailHead a P: Dequeue() => a Q: Enqueue(d) b TailHead a TM deals with boundary cases under the hood Don’t worry which locks need to cover which variables when…

For Real Applications Will be easy to modify multiple structures atomically Provide Serializability… bcd TailHead a P: Dequeue(Q1,a) cda TailHead b Enqueue(Q2,a)

Using Transactional Memory enqueue (Q, newnode) { Q.tail-> next = newnode Q.tail = newnode }

Using Transactional Memory enqueue (Q, newnode) { atomic{ Q.tail-> next = newnode Q.tail = newnode }

Transactions Will Solve Many of Locks’ Problems No need to think what needs to be locked, what not, and at what granularity No worry about deadlocks and livelocks No need to think about read-sharing Can compose concurrent objects in a way that is safe and scalable But there are problems!

Hardware TM [HerlihyMoss93] Hardware Transactions 20-30…but not ~1000 instructions long Diff Machines… expect different hardware support Hardware is not flexible…abort policies, retry policies, all application dependent…

Software Transactional Memory [ShavitTouitou94] The semantics of hardware transactions…today Tomorrow: serve as a standard interface to hardware Allow to extend hardware features when they arrive Still, we need to have reasonable performance… Today’s focus…

The Brief History of STM 1994 STM (Shavit,Touitou) 2003 DSTM (Herlihy et al) 2003 WSTM (Fraser, Harris) Lock-free 2003 OSTM (Fraser, Harris) 2004 ASTM (Marathe et al) 2004 T-Monitor (Jagannathan…) Obstruction-freeLock-based 2005 Lock-OSTM (Ennals) 2004 HybridTM (Moir) 2004 Meta Trans (Herlihy, Shavit) 2005 McTM (Saha et al) 2006 AtomJava (Hindman…) 1997 Trans Support TM (Moir) 2005 TL1/2 (Dice, Shavit)) 2004 Soft Trans (Ananian, Rinard) 2007-9…New lock based STMs from IBM, Intel, Sun, Microsoft

As Good As Fine Grained Locking Postulate (i.e. take it or leave it): If we could implement fine-grained locking with the same simplicity of course grained, we would never think of building a transactional memory. Implication: Lets try to provide STMs that get as close as possible to hand-crafted fine-grained locking.

Subliminal Cut 25

Transactional Consistency Memory Transactions are collections of reads and writes executed atomically Tranactions should maintain internal and external consistency –External: with respect to the interleavings of other transactions. –Internal: the transaction itself should operate on a consistent state.

External Consistency Application Memory X Y 4 2 8 4 Invariant x = 2y Transaction A: Write x Write y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/4

Locking STM Design Choices PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object) Map Array of Versioned- Write-Locks Application Memory V#

Encounter Order Locking (Undo Log) 1.To Read: load lock + location 2.Check unlocked add to Read-Set 3.To Write: lock location, store value 4.Add old value to undo-set 5.Validate read-set v#’s unchanged 6.Release each lock with v#+1 V# 0 X V# 1 V# 0 Y V# 1 V# 0 Mem Locks V#+1 0 V# 0 V#+1 0 V# 0 V#+1 0 V# 0 X Y Quick read of values freshly written by the reading transaction [Ennals,Saha,Harris,TinySTM,SwissTM…] Blue code does not change memory, red does

Commit Time Locking (Write Log) 1.To Read: load lock + location 2.Location in write-set? (Bloom Filter) 3.Check unlocked add to Read-Set 4.To Write: add value to write set 5.Acquire Locks 6.Validate read/write v#’s unchanged 7.Release each lock with v#+1 V# 0 Mem Locks V#+1 0 V# 0 Hold locks for very short duration V# 1 X Y V#+1 0 V# 1 V#+1 0 V# 0 V#+1 0 V# 0 V#+1 0 V# 0 X Y [TL,TL2,SkySTM…]

COM vs. ENC High Load ENC Hand Lock COM Red-Black Tree 20% Delete 20% Update 60% Lookup

COM vs. ENC Low Load COM ENC Hand Lock Red-Black Tree 5% Delete 5% Update 90% Lookup

Problem: Internal Inconsistency A Zombie is a currently active transaction that is destined to abort because it saw an inconsistent state If Zombies see inconsistent states errors can occur and the fact that the transaction will eventually abort does not save us

Internal Inconsistency Application Memory X Y 4 2 8 4 Invariant x = 2y Transaction A: Write x Write y Transaction B: Read x = 4 Transaction B: Read y = 4 {trans is zombie} Compute z = 1/(x-y) DIV by 0 ERROR

Past Approaches 1.Design STMs that allow internal inconsistency. 2.To detect zombies introduce validation into user code at fixed intervals or loops, used traps, OS support 3.Still there are cases where zombie’s cannot be detected  infinite loops in user code…

Global Clock [TL2/SnapIsolation] Have a shared global version clock Incremented by writing transactions (as infrequently as possible) Read by all transactions Used to validate that the state viewed by a transaction is always consistent [DiceShalevShavit06/ReigelFelberFetzer06]

TL2 Version Clock: Read-Only Trans 1.RV  VClock 2.To Read: read lock, read mem, read lock, check unlocked, unchanged, and v# <= RV 3.Commit. 87 0 34 0 88 0 V# 0 44 0 V# 0 34 0 99 0 50 0 Mem Locks Reads form a snapshot of memory. No read set! 100 Vclock (shared) 87 0 34 0 99 0 50 0 87 0 34 0 88 0 V# 0 44 0 V# 0 99 0 50 0 100 RV (private)

TL2 Version Clock: Writing Trans 1.RV  VClock 2.To Read/Write: check unlocked and v# <= RV then add to Read/Write-Set 3.Acquire Locks 4.WV = F&I(VClock) 5.Validate each v# <= RV 6.Release locks with v#  WV Reads+Inc+Writes =serializable 100 VClock 87 0 34 0 88 0 44 0 V# 0 34 0 99 0 50 0 Mem Locks 87 0 34 0 99 0 50 0 34 1 99 1 87 0 X Y Commit 121 0 50 0 87 0 121 0 88 0 V# 0 44 0 V# 0 121 0 50 0 100 RV 100120121 X Y

How we learned to stop worrying and love the clock Version clock rate is a progress concern, not a safety concern, so.. –(GV4) if failed to increment VClock using CAS use VClock set by winner –(GV5) use WV = VClock + 2; inc VClock on abort –(GV7) localized clocks… [AvniShavit08]

Uncontended Large Red-Black Tree 5% Delete 5% Update 90% Lookup Hand- crafted TL/PS TL2/PS Ennals Fraser Harris Lock- free TL/PO TL2/P0

Contended Small RB-Tree 30% Delete 30% Update 40% Lookup Ennals TL/P0 TL2/P0

Locking performs well > #cores… TL/PS TL/P0 16 Processors Farser Harris Lock- free

Implicit Privatization [Menon et al] In real apps: often want to “privatize” data Then operate on it non-transactionally Many STMs (like TL2) based on “Invisible Readers” Invisible Readers/Writers are a problem if we want implicit privatization…

Privatization Pathology b cd a P: atomically{ a.next = c; } // b is private b.value = 0; P P: atomically{ a.next = c; } // b is private b.value = 0; 0 P: atomically{ a.next = c; } // b is private b.value = 0; P privatizes node b then modifies it non-transactionally

Privatization Pathology b cd a P: atomically{ a.next = c; } // b is private b.value = 0; Q: atomically{ tmp = a.next; foo = (1/tmp.value) } P P: atomically{ a.next = c; } // b is private b.value = 0; Q: atomically{ tmp = a.next; foo = (1/tmp.value) } 0 P: atomically{ a.next = c; } // b is private b.value = 0; Q Invisible reader Q cannot detect non-transactional modification to node b Q: atomically{ tmp = a.next; foo = (1/tmp.value) } Q: divide by 0 error

Solving the Privatization Problem Visible Writers –Reads are made aware of overlapping writes [DiceShavit07/M4,Gottschlich Connors07, SpearMichaelVonPraun08 …] Visible Readers –Writes are made aware of overlapping reads [EllenLevLuchangcoMoir07/SNZI, DiceShavit09/Bytelocks…] b P Q

Visible Readers Use read-write locks. Transactions also lock to read. Privatization is immediate… But RW-locks will make us burn in coherence traffic hell: CAS to increment/decrement reader-count Which is why we had invisible readers in the first place  Or is it ?

A new read-write lock for multicores Common case: no CAS, only store + membar to read Claim: on modern multicores cost of coherent stores not too bad… Read-Write Bytelocks [DiceShavit09] Map Array of read-write byte-locks a bytelock Application Memory

49 The ByteLock Lock Record Writer ID Visible readers : –Reader count for unslotted threads CAS to increment and decrement –Reader array for slotted threads Array of atomically addressable bytes 48 or 112 slots, Write + Membar to Modify wrtidrdcnt Single $ line 0 1 0 0 1 Slots 1 23 4 5 a byte per slot traditional Writer idcounter for unslotted

50 ByteLock Write 030 1 0 0 1 Mem 1 23 4 5 wrtidrdcnt Writer i i CAS Intel, AMD, Sun read 8 or 16 bytes at a time Spin until all 0 Writers wait till readers drain out X

51 Slotted Read 030 1 0 0 1 Mem 1 23 4 5 wrtidrdcnt Slotted Reader i Readers give pref to writers 1 No Writer? 0 Read Mem On Intel, AMD, Sun store to byte + membar is very fast Release: simple store

52 Slotted Read Slow-Path i30 1 0 0 1 Mem 1 23 4 5 wrtidrdcnt Slotted Reader i Intel, AMD, Sun store to byte + membar is very fast Readers give pref to writers 1 If non-0 retry 0 Spin until non- 0 then retry

53 Unslotted Read i30 1 0 1 Mem 1 23 4 5 wrtidrdcnt Unslotted Reader i 4 CAS If non-0 Unslotted readers like in traditional RW 3 Decrement using CAS and wait for writer to go away

54 Transact 2009 Mutex TLRW Bytelock 128 slot ByteLock Performance TLRW Inc/dec read counters TLRW Bytelock 48 slot TL2- GV6 PS

Where we are heading… A lot more work on performance –Visible writers, visible readers Think GC, game just begun –Improve single threaded –Amazing possibilities for compiler optimization –OS support Explosion of new STMs –~100 TM papers in last couple of years

A bit further down the road… Transactional Languages –No Implicit Privatization Problem… –Composability And when hardware TM arrives… –Contention management –New possibilities for extending and interfacing…

Need Experience with Apps Today –MSF, Quake, Apache, FenixEDU (Large Distributed App),Student Trials in Germany and US… Need a lot more transactification of applications –Not just rewriting of concurrent apps –But actually applications that are parallelized from scratch using TM

Thanks!

Visible Global Writing 1.To Write Lock clock then release and increment clock 2.To Read check if clock has changed, if so abort Reads+Inc+Writes =serializable 100 VClock Mem X 100 101 X

Software Transactional Memory Nir Shavit Tel-Aviv University and Sun Labs “Where Do We Come From? What Are We? Where Are We Going?”

Similar presentations

Presentation on theme: "Software Transactional Memory Nir Shavit Tel-Aviv University and Sun Labs “Where Do We Come From? What Are We? Where Are We Going?”"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Software Transactional Memory Nir Shavit Tel-Aviv University and Sun Labs “Where Do We Come From? What Are We? Where Are We Going?”

Similar presentations

Presentation on theme: "Software Transactional Memory Nir Shavit Tel-Aviv University and Sun Labs “Where Do We Come From? What Are We? Where Are We Going?”"— Presentation transcript:

Similar presentations

About project

Feedback