Transactional Memory Companion slides for

Slides:



Advertisements
Similar presentations
Copyright 2008 Sun Microsystems, Inc Better Expressiveness for HTM using Split Hardware Transactions Yossi Lev Brown University & Sun Microsystems Laboratories.
Advertisements

Transactional Memory Parag Dixit Bruno Vavala Computer Architecture Course, 2012.
Transactional Locking Nir Shavit Tel Aviv University (Joint work with Dave Dice and Ori Shalev)
Concurrency The need for speed. Why concurrency? Moore’s law: 1. The number of components on a chip doubles about every 18 months 2. The speed of computation.
MACHINE-INDEPENDENT VIRTUAL MEMORY MANAGEMENT FOR PAGED UNIPROCESSOR AND MULTIPROCESSOR ARCHITECTURES R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron,
Pessimistic Software Lock-Elision Nir Shavit (Joint work with Yehuda Afek Alexander Matveev)
Hybrid Transactional Memory Nir Shavit MIT and Tel-Aviv University Joint work with Alex Matveev (and describing the work of many in this summer school)
Transactional Memory Overview Olatunji Ruwase Fall 2007 Oct
Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.
The Future of Concurrency Theory Renaissance or Reformation? (Met dank aan Maurice Herlihy) Frits Vaandrager.
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 Lecture 24: Transactional Memory Topics: transactional memory implementations.
Supporting Nested Transactional Memory in LogTM Authors Michelle J Moravan Mark Hill Jayaram Bobba Ben Liblit Kevin Moore Michael Swift Luke Yen David.
The Future of Distributed Computing Renaissance or Reformation? Maurice Herlihy Brown University.
Department of Computer Science Presenters Dennis Gove Matthew Marzilli The ATOMO ∑ Transactional Programming Language.
Transaction. A transaction is an event which occurs on the database. Generally a transaction reads a value from the database or writes a value to the.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved. Who’s Afraid of a Big Bad Lock Nir Shavit Sun Labs at Oracle Joint work with Danny.
An Introduction to Software Transactional Memory
Art of Multiprocessor Programming 1 Transactional Memory Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Transactional Memory Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit TexPoint fonts used in EMF. Read the TexPoint.
Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory Alexander Matveev Nir Shavit MIT.
Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.
CS5204 – Operating Systems Transactional Memory Part 2: Software-Based Approaches.
WG5: Applications & Performance Evaluation Pascal Felber
Hybrid Transactional Memory Sanjeev Kumar, Michael Chu, Christopher Hughes, Partha Kundu, Anthony Nguyen, Intel Labs University of Michigan Intel Labs.
CS162 Week 5 Kyle Dewey. Overview Announcements Reactive Imperative Programming Parallelism Software transactional memory.
Software Transactional Memory Should Not Be Obstruction-Free Robert Ennals Presented by Abdulai Sei.
Lecture 27 Multiprocessor Scheduling. Last lecture: VMM Two old problems: CPU virtualization and memory virtualization I/O virtualization Today Issues.
4 November 2005 CS 838 Presentation 1 Nested Transactional Memory: Model and Preliminary Sketches J. Eliot B. Moss and Antony L. Hosking Presented by:
Concurrent Revisions: A deterministic concurrency model. Daan Leijen & Sebastian Burckhardt Microsoft Research (OOPSLA 2010, ESOP 2011)
Transactional Memory Companion slides for
Lecture 20: Consistency Models, TM
Maurice Herlihy and J. Eliot B. Moss,  ISCA '93
Multiprocessor Programming
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Transactional Memory : Hardware Proposals Overview
Part 2: Software-Based Approaches
PHyTM: Persistent Hybrid Transactional Memory
Transactional Memory Companion slides for
Transactional Memory TexPoint fonts used in EMF.
Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory By McKenney, Michael, Triplett and Walpole.
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Faster Data Structures in Transactional Memory using Three Paths
Concurrent Objects Companion slides for
Multiprocessor Cache Coherency
The University of Adelaide, School of Computer Science
Example Cache Coherence Problem
Changing thread semantics
Lecture: Consistency Models, TM
Lecture 6: Transactions
Lecture 21: Transactional Memory
Lecture 22: Consistency Models, TM
Lecture: Consistency Models, TM
Does Hardware Transactional Memory Change Everything?
Hybrid Transactional Memory
Kernel Synchronization II
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 23: Transactional Memory
Lecture 21: Transactional Memory
Lecture: Consistency Models, TM
Lecture: Transactional Memory
The University of Adelaide, School of Computer Science
Software Engineering and Architecture
I, for one, Welcome our new Multicore Overlords …
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

Transactional Memory Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Art of Multiprocessor Programming 1

Traditional Software Scaling 7x Speedup 3.6x 1.8x User code Traditional Uniprocessor Recall the traditional scaling process for software: write it once, trust Intel to make the CPU faster to improve performance. Time: Moore’s law

Multicore Software Scaling Speedup 1.8x 7x 3.6x User code With multicores, we will have to parallelize the code to make software faster, and we cannot do this automatically (except in a limited way on the level of individual instructions). Multicore Unfortunately, not so simple…

Real-World Multicore Scaling Speedup 2.9x 2x 1.8x User code Multicore This is because splitting the application up to utilize the cores is not simple, and coordination among the various code parts requires care. Parallelization and Synchronization require great care…

Why? As num cores grows the effect of 25% becomes more accute Amdahl’s Law: Speedup = 1/(ParallelPart/N + SequentialPart) Pay for N = 8 cores SequentialPart = 25% Speedup = only 2.9 times! As num cores grows the effect of 25% becomes more accute 2.3/4, 2.9/8, 3.4/16, 3.7/32….

Shared Data Structures Fine grained parallelism has huge performance benefit The reason we get only 2.9 speedup Coarse Grained Fine Grained c 25% Shared 25% Shared c c c The hard part about parallelizing applications is taking care of the fraction that is has to do with communication among threads, usually through shared objects… 75% Unshared 75% Unshared

A FIFO Queue a b c d Head Tail Dequeue() => a Enqueue(d) So lets look at what we have learned about parallelizing such objects…the game is making things more fine grained… Dequeue() => a Enqueue(d)

A Concurrent FIFO Queue Simple Code, easy to prove correct b c d Tail Head a P: Dequeue() => a Q: Enqueue(d) Object lock Don’t undersell simplicity… Contention and sequential bottleneck

Fine Grain Locks Finer Granularity, More Complex Code a b c d Head Tail a b c d P: Dequeue() => a Q: Enqueue(d) Verification nightmare: worry about deadlock, livelock…

Fine Grain Locks Complex boundary cases: empty queue, last item Tail Head a Head Tail a b c d P: Dequeue() => a Q: Enqueue(b) Worry how to acquire multiple locks

Moreover: Locking Relies on Conventions Relation between Lock bit and object bits Exists only in programmer’s mind Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul) /* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */ Art of Multiprocessor Programming 11

Lock-Free (JDK 1.5+) Even Finer Granularity, Even More Complex Code a Head Tail a b c d We can go even one step further…we have seen how to design lock free data structures…with finer granularity P: Dequeue() => a Q: Enqueue(d) Worry about starvation, subtle bugs, hardness to modify…

Composing Objects More than twice the worry… Complex: Move data atomically between structures b c d Tail Head a P: Dequeue(Q1,a) c d a Tail Head b Enqueue(Q2,a) More than twice the worry…

Transactional Memory Great Performance, Simple Code Head Tail a b c d P: Dequeue() => a Q: Enqueue(d) Don’t worry about deadlock, livelock, subtle bugs, etc…

Promise of Transactional Memory Don’t worry which locks need to cover which variables when… b Tail Head a Head Tail a b c d P: Dequeue() => a Q: Enqueue(d) TM deals with boundary cases under the hood

Composing Objects Provide Composability… Will be easy to modify multiple structures atomically b c d Tail Head a P: Dequeue(Q1,a) c d a Tail Head b Enqueue(Q2,a) Provide Composability…

The Transactional Manifesto Current practice inadequate to meet the multicore challenge Research Agenda Replace locking with a transactional API Design languages to support this model Implement the run-time to be fast enough Art of Multiprocessor Programming 17

Art of Multiprocessor Programming Transactions Atomic Commit: takes effect Abort: effects rolled back Usually retried Serizalizable Appear to happen in one-at-a-time order Art of Multiprocessor Programming 18

Art of Multiprocessor Programming Atomic Blocks atomic { x.remove(3); y.add(3); } atomic { y = null; } Art of Multiprocessor Programming 19

Art of Multiprocessor Programming Atomic Blocks atomic { x.remove(3); y.add(3); } atomic { y = null; } No data race Art of Multiprocessor Programming 20 20

Art of Multiprocessor Programming Designing a FIFO Queue Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Write sequential Code Art of Multiprocessor Programming 21 21

Art of Multiprocessor Programming Designing a FIFO Queue Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Art of Multiprocessor Programming 22

Enclose in atomic block Designing a FIFO Queue Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Enclose in atomic block Art of Multiprocessor Programming 23

Art of Multiprocessor Programming Warning Not always this simple Conditional waits Enhanced concurrency Complex patterns But often it is Works for sadistic homework Art of Multiprocessor Programming 24

Art of Multiprocessor Programming Composition Public void Transfer(Queue<T> q1, q2) { atomic { T x = q1.deq(); q2.enq(x); } Trivial or what? Art of Multiprocessor Programming 25 25

Roll back transaction and restart when something changes Public T LeftDeq() { atomic { if (this.left == null) retry; … } Roll back transaction and restart when something changes Art of Multiprocessor Programming 26

OrElse Composition Run 1st method. If it retries … atomic { x = q1.deq(); } orElse { x = q2.deq(); } Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries Art of Multiprocessor Programming 27

Art of Multiprocessor Programming Transactional Memory Software transactional memory (STM) Hardware transactional memory (HTM) Hybrid transactional memory (HyTM, try in hardware and default to software if unsuccessful) Art of Multiprocessor Programming 28

Hardware versus Software Do we need hardware at all? Analogies: Virtual memory: yes! Garbage collection: no! Probably do need HW for performance Do we need software? Policy issues don’t make sense for hardware Art of Multiprocessor Programming 29

Transactional Consistency Memory Transactions are collections of reads and writes executed atomically Tranactions should maintain internal and external consistency External: with respect to the interleavings of other transactions. Internal: the transaction itself should operate on a consistent state. When we design STMs, we need to worry about both external and internal consistency

Art of Multiprocessor Programming External Consistency Invariant x = 2y 8 4 4 x Transaction A: Write x Write y 2 y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2 Application Memory Art of Multiprocessor Programming

Art of Multiprocessor Programming Simple Lock-Based STM STMs come in different forms Lock-based Lock-free Here we will describe a simple lock-based STM Lets see how we design an STM to provide external consistency… Art of Multiprocessor Programming

Art of Multiprocessor Programming Synchronization Transaction keeps Read set: locations & values read Write set: locations & values to be written Deferred update Changes installed at commit Lazy conflict detection Conflicts detected at commit Art of Multiprocessor Programming

STM: Transactional Locking Map V# Array of Versioned- Write-Locks Application Memory V# V# Art of Multiprocessor Programming 34

Art of Multiprocessor Programming Reading an Object Mem Locks V# Put V#s & value in RS If not already locked V# V# V# V# Art of Multiprocessor Programming 35

Art of Multiprocessor Programming To Write an Object Mem Locks V# Add V# and new value to WS V# V# V# V# Art of Multiprocessor Programming 36

Art of Multiprocessor Programming To Commit Mem Locks V# Acquire W locks Check V#s unchanged In RS & WS Install new values Increment V#s Release … X V# V#+1 V# V# Y V#+1 V# Art of Multiprocessor Programming 37

Problem: Internal Inconsistency A Zombie is a currently active transaction that is destined to abort because it saw an inconsistent state If Zombies see inconsistent states errors can occur and the fact that the transaction will eventually abort does not save us

Art of Multiprocessor Programming Internal Consistency Invariant x = 2y 4 8 4 x Transaction B: Read x = 4 2 y Transaction A: Write x (kills B) Write y Zombie transactiosn are ones that are doomed to fail (here trans will fail validatuon at the end) but go on Computing in a way that can cause problems Transaction B: (zombie) Read y = 4 Compute z = 1/(x-y) Application Memory DIV by 0 ERROR Art of Multiprocessor Programming

Solution: The “Global Clock” Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent Art of Multiprocessor Programming

Read-Only Transactions Mem Locks Copy V Clock to RV Read lock,V# Read mem Check unlocked Recheck V# unchanged Check V# < RV 12 Reads form a snapshot of memory. No read set! 32 56 100 Shared Version Clock 100 19 17 Private Read Version (RV) Art of Multiprocessor Programming 41

Regular Transactions Mem Locks Copy V Clock to RV On read/write, check: Unlocked V# ≤ RV Add to R/W set 12 32 56 19 100 Shared Version Clock 100 17 Private Read Version (RV) Art of Multiprocessor Programming 42

Regular Transactions x y Mem Locks Acquire locks WV = F&Inc(V Clock) Check each V# ≤ RV Update memory Set write V#s to WV 12 x 100 32 56 19 101 100 100 y 100 17 Private Read Version (RV) Shared Version Clock Art of Multiprocessor Programming 43

Hardware Transactional Memory Exploit Cache coherence Already almost does it Invalidation Consistency checking Speculative execution Branch prediction = optimistic synch! Original TM proposal by Herlihy and Moss 1993… Art of Multiprocessor Programming 44

HW Transactional Memory read active T caches Interconnect memory Art of Multiprocessor Programming 45

Art of Multiprocessor Programming Transactional Memory read active active T T caches memory Art of Multiprocessor Programming 46

Art of Multiprocessor Programming Transactional Memory active committed active T T caches memory Art of Multiprocessor Programming 47

Art of Multiprocessor Programming Transactional Memory write committed active T D caches memory Art of Multiprocessor Programming 48

Art of Multiprocessor Programming Rewind write aborted active active T T D caches memory Art of Multiprocessor Programming 49

Art of Multiprocessor Programming Transaction Commit At commit point If no cache conflicts, we win. Mark transactional entries Read-only: valid Modified: dirty (eventually written back) That’s all, folks! Except for a few details … Art of Multiprocessor Programming 50

Not all Skittles and Beer Limits to Transactional cache size Scheduling quantum Transaction cannot commit if it is Too big Too slow Actual limits platform-dependent Art of Multiprocessor Programming 51

Art of Multiprocessor Programming TM Design Issues Implementation choices Language design issues Semantic issues Art of Multiprocessor Programming

Art of Multiprocessor Programming Granularity Object managed languages, Java, C#, … Easy to control interactions between transactional & non-trans threads Word C, C++, … Hard to control interactions between transactional & non-trans threads Art of Multiprocessor Programming

Direct/Deferred Update modify private copies & install on commit Commit requires work Consistency easier Direct Modify in place, roll back on abort Makes commit efficient Consistency harder Art of Multiprocessor Programming

Art of Multiprocessor Programming Conflict Detection Eager Detect before conflict arises “Contention manager” module resolves Lazy Detect on commit/abort Mixed Eager write/write, lazy read/write … Art of Multiprocessor Programming

Art of Multiprocessor Programming Conflict Detection Eager detection may abort transaction that could have committed. Lazy detection discards more computation. Art of Multiprocessor Programming

Contention Management & Scheduling How to resolve conflicts? Who moves forward and who rolls back? Lots of empirical work but formal work in infancy Art of Multiprocessor Programming

Contention Manager Strategies Exponential backoff Priority to Oldest? Most work? Non-waiting? None Dominates But needed anyway Judgment of Solomon Art of Multiprocessor Programming

Art of Multiprocessor Programming I/O & System Calls? Some I/O revocable Provide transaction-safe libraries Undoable file system/DB calls Some not Opening cash drawer Firing missile Art of Multiprocessor Programming

Art of Multiprocessor Programming I/O & System Calls One solution: make transaction irrevocable If transaction tries I/O, switch to irrevocable mode. There can be only one … Requires serial execution No explicit aborts In irrevocable transactions Art of Multiprocessor Programming

Art of Multiprocessor Programming Exceptions int i = 0; try { atomic { i++; node = new Node(); } } catch (Exception e) { print(i); Art of Multiprocessor Programming

Exceptions Throws OutOfMemoryException! int i = 0; try { atomic { i++; node = new Node(); } } catch (Exception e) { print(i); Art of Multiprocessor Programming

Exceptions Throws OutOfMemoryException! int i = 0; try { atomic { i++; node = new Node(); } } catch (Exception e) { print(i); What is printed? Art of Multiprocessor Programming

Art of Multiprocessor Programming Unhandled Exceptions Aborts transaction Preserves invariants Safer Commits transaction Like locking semantics What if exception object refers to values modified in transaction? Art of Multiprocessor Programming

Art of Multiprocessor Programming Nested Transactions atomic void foo() { bar(); } atomic void bar() { … Art of Multiprocessor Programming

Art of Multiprocessor Programming Nested Transactions Needed for modularity Who knew that cosine() contained a transaction? Flat nesting If child aborts, so does parent First-class nesting If child aborts, partial rollback of child only Art of Multiprocessor Programming

Open Nested Transactions Normally, child commit Visible only to parent In open nested transactions Commit visible to all Escape mechanism Dangerous, but useful What escape mechanisms are needed? Art of Multiprocessor Programming

Strong vs Weak Isolation How do transactional & non-transactional threads synchronize? Similar to memory-model theory? Efficient algorithms? Art of Multiprocessor Programming

I, for one, Welcome our new Multicore Overlords … Multicore forces us to rethink almost everything Art of Multiprocessor Programming

I, for one, Welcome our new Multicore Overlords … Multicore forces us to rethink almost everything Standard approaches too complex Art of Multiprocessor Programming

I, for one, Welcome our new Multicore Overlords … Multicore forces us to rethink almost everything Standard approaches won’t scale Transactions might make life simpler… Art of Multiprocessor Programming

I, for one, Welcome our new Multicore Overlords … Multicore forces us to rethink almost everything Standard approaches won’t scale Transactions might … Plenty more to do Art of Multiprocessor Programming