Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.


From the New York Times … SAN FRANCISCO, May Intel said on Friday that it was scrapping its development of two microprocessors, a move that is a shift in the company's business strategy….

Moore’s Law (hat tip: Herb Sutter) [Figure: clock speed flattening sharply; transistor count still rising]

Art of Multiprocessor Programming Multicore Architectures “Learn how the multi-core processor architecture plays a central role in Intel's platform approach. ….” “AMD is leading the industry to multi-core technology for the x86 based computing market …” “Sun's multicore strategy centers around multi-threaded software....”

© 2006 Herlihy & Shavit
Why do we care? Time no longer cures software bloat When you double your path length –You can’t just wait 6 months –Your software must somehow exploit twice as much concurrency

The Problem Cannot exploit cheap threads Today’s Software –Non-scalable methodologies Today’s Hardware –Poor support for scalable synchronization

Locking

Coarse-Grained Locking Easily made correct … But not scalable.

Fine-Grained Locking Here comes trouble …

Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use –Conservative –Deadlocks –Lost wake-ups Not Composable

Locks are not Robust If a thread holding a lock is delayed … No one else can make progress

Locking Relies on Conventions Relation between –Lock bit and object bits –Exists only in programmer’s mind
/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder,mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */
Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul)

Sadistic Homework enq(x) enq(y) Double-ended queue No interference if ends “far enough” apart

Sadistic Homework deq() enq() Double-ended queue Interference OK if ends “close enough” together

Sadistic Homework deq() Double-ended queue Make sure suspended dequeuers awake as needed

In Search of the Lost Wake-Up Waiting thread doesn’t realize when to wake up It’s a real problem in big systems –“Calling pthread_cond_signal() or pthread_cond_broadcast() when the thread does not hold the mutex lock associated with the condition can lead to lost wake-up bugs.” from Google™ search for “lost wake-up”
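
The lost wake-up hazard can be made concrete with a small blocking queue. The following is a minimal Java sketch, not code from the slides: the class name `WaitingQueue` is hypothetical, and it assumes a lock-plus-condition design. The two habits it illustrates are signalling while the lock is held (Java's `ReentrantLock` conditions enforce this, unlike the pthread calls quoted above) and re-checking the condition in a loop after waking.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class for illustration; not part of the original slides.
final class WaitingQueue<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final ArrayDeque<T> items = new ArrayDeque<>();

    void enq(T x) {
        lock.lock();
        try {
            items.addLast(x);
            // Signal while holding the lock: a dequeuer cannot observe
            // "empty" and then fall asleep between our add and our signal.
            notEmpty.signal();
        } finally {
            lock.unlock();
        }
    }

    T deq() {
        lock.lock();
        try {
            // Re-check in a loop: a wake-up may be spurious, or another
            // dequeuer may have taken the item before we reacquired the lock.
            while (items.isEmpty()) {
                notEmpty.awaitUninterruptibly();
            }
            return items.removeFirst();
        } finally {
            lock.unlock();
        }
    }
}
```

Even this small example depends on a convention (every access to `items` goes through `lock`) that exists only in the programmer's mind, which is the point the slides make about locking.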

You Try It … One lock? –Too Conservative Locks at each end? –Deadlock, too complicated, etc Waking blocked dequeuers? –Harder than it looks

Actual Solution Clean solution would be a publishable result [Michael & Scott, PODC 96] High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers

Locks do not compose Move item from hash table T1 to T2: delete(T1, item) then add(T2, item) Must lock T2 before deleting from T1 Must lock T1 before adding item (lock T2, lock T1) Exposing lock internals breaks abstraction

Monitor Wait and Signal If buffer is empty, wait for item to show up [Figure: thread sleeping (zzz) on an empty buffer]

Wait and Signal do not Compose Wait for either? [Figure: empty buffers, waiting thread (zzz…)]

The Transactional Manifesto What we do now is inadequate to meet the multicore challenge Research Agenda –Replace locking with a transactional API –Design languages to support this model –Implement the run-time to be fast enough

Transactions Atomic –Commit: takes effect –Abort: effects rolled back Usually retried Linearizable –Appear to happen in one-at-a-time order

Atomic Blocks No data race
atomic {
  x.remove(3);
  y.add(3);
}
atomic {
  y = null;
}

Sadistic Homework Revisited (1) Write sequential code
public void LeftEnq(item x) {
  Qnode q = new Qnode(x);
  q.left = this.left;
  this.left.right = q;
  this.left = q;
}

Sadistic Homework Revisited (1) Enclose in atomic block
public void LeftEnq(item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = this.left;
    this.left.right = q;
    this.left = q;
  }
}

Warning Not always this simple –Conditional waits –Enhanced concurrency –Overlapping locks But often it is –Works for sadistic homework

Composition (1) Trivial or what?
public void Transfer(Queue q1, Queue q2) {
  atomic {
    T x = q1.deq();
    q2.enq(x);
  }
}

Wake-ups: lost and found (1) Roll back transaction and restart when something changes
public T LeftDeq() {
  atomic {
    if (this.left == null)
      retry;
    …
  }
}

OrElse Composition Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries
atomic {
  x = q1.deq();
} orElse {
  x = q2.deq();
}

Transactional Memory Software transactional memory (STM) Hardware transactional memory (HTM) Hybrid transactional memory (HyTM, try in hardware and default to software if unsuccessful)

Hardware versus Software Do we need hardware at all? –Analogies: Virtual memory: yes! Garbage collection: no! –Probably do need HW for performance Do we need software? –Policy issues don’t make sense for hardware

Design Issues Implementation choices Language design issues Semantic issues

Granularity Object –managed languages, Java, C#, … –Easy to control interactions between transactional & non-trans threads Word –C, C++, … –Hard to control interactions between transactional & non-trans threads

Direct/Deferred Update Deferred –modify private copies & install on commit –Commit requires work –Consistency easier Direct –Modify in place, roll back on abort –Makes commit efficient –Consistency harder
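
The trade-off between the two update policies shows up in where the bookkeeping lives. Below is a minimal single-threaded Java sketch, assuming a toy "memory" modelled as a `Map<String, Integer>`; the class names `UpdatePolicies`, `DeferredTx` and `DirectTx` are hypothetical, and the sketch deliberately omits conflict detection and isolation.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: no conflict detection, no isolation between transactions.
final class UpdatePolicies {

    // Deferred update: writes go to a private buffer and reach memory at commit.
    static final class DeferredTx {
        private final Map<String, Integer> memory;
        private final Map<String, Integer> buffer = new HashMap<>();
        DeferredTx(Map<String, Integer> memory) { this.memory = memory; }

        Integer read(String key) {
            return buffer.containsKey(key) ? buffer.get(key) : memory.get(key);
        }
        void write(String key, int value) { buffer.put(key, value); }
        void commit() { memory.putAll(buffer); buffer.clear(); }  // commit requires work
        void abort() { buffer.clear(); }                          // abort is trivial
    }

    // Direct update: writes modify memory in place; an undo log allows rollback.
    static final class DirectTx {
        private final Map<String, Integer> memory;
        private final ArrayDeque<Map.Entry<String, Integer>> undoLog = new ArrayDeque<>();
        DirectTx(Map<String, Integer> memory) { this.memory = memory; }

        Integer read(String key) { return memory.get(key); }
        void write(String key, int value) {
            // Remember the old value (possibly null) before overwriting it.
            undoLog.push(new AbstractMap.SimpleEntry<>(key, memory.get(key)));
            memory.put(key, value);
        }
        void commit() { undoLog.clear(); }                        // commit is cheap
        void abort() {
            // Undo in reverse order so the oldest saved value wins.
            while (!undoLog.isEmpty()) {
                Map.Entry<String, Integer> old = undoLog.pop();
                if (old.getValue() == null) memory.remove(old.getKey());
                else memory.put(old.getKey(), old.getValue());
            }
        }
    }
}
```

With deferred update, other threads never see uncommitted values in memory, which is why consistency is easier; with direct update they can, so a real implementation needs extra machinery to hide or repair them.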

Conflict Detection Eager –Detect before conflict arises –“Contention manager” module resolves Lazy –Detect on commit/abort Mixed –Eager write/write, lazy read/write …

Conflict Detection Eager detection may abort transaction that could have committed. Lazy detection discards more computation.

Contention Managers Oracle that resolves conflicts –For eager conflict detection CM decides –Whether to abort other transaction –Or give it a chance to finish …

Contention Manager Strategies Exponential backoff Oldest gets priority Most work done gets priority Non-waiting has priority over waiting Lots of alternatives –None seems to dominate –But choice can have big impact
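
Two of the strategies listed above are easy to sketch. The Java below is a hypothetical helper (the names `BackoffManager`, `nextDelayNanos` and `shouldAbortOther` are assumptions, not an API from the slides): it implements randomized exponential backoff, and an "oldest gets priority" comparison based on transaction start times.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.LockSupport;

// Hypothetical contention-manager helpers; names are illustrative only.
final class BackoffManager {
    private final long maxDelayNanos;
    private long limitNanos;   // current upper bound on the random delay

    BackoffManager(long minDelayNanos, long maxDelayNanos) {
        if (minDelayNanos < 1 || maxDelayNanos < minDelayNanos) {
            throw new IllegalArgumentException("need 1 <= min <= max");
        }
        this.limitNanos = minDelayNanos;
        this.maxDelayNanos = maxDelayNanos;
    }

    // Exponential backoff: pick a random delay in [1, limit], then double
    // the limit (capped at the maximum) for the next conflict.
    long nextDelayNanos() {
        long delay = ThreadLocalRandom.current().nextLong(limitNanos) + 1;
        limitNanos = (limitNanos > maxDelayNanos / 2) ? maxDelayNanos : 2 * limitNanos;
        return delay;
    }

    // Give the conflicting transaction a chance to finish before retrying.
    void backoff() {
        LockSupport.parkNanos(nextDelayNanos());
    }

    // "Oldest gets priority": abort the other transaction only if we started first.
    static boolean shouldAbortOther(long myStartTime, long otherStartTime) {
        return myStartTime < otherStartTime;
    }
}
```

A transaction that detects a conflict would call `shouldAbortOther` to decide, and call `backoff()` when the answer is to wait. Randomizing the delay keeps two conflicting transactions from retrying in lockstep.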

I/O & System Calls? Some I/O revocable –Provide transaction-safe libraries –Undoable file system/DB calls Some not –Opening cash drawer –Firing missile

I/O & System Calls One solution: make transaction irrevocable –If transaction tries I/O, switch to irrevocable mode. There can be only one … –Requires serial execution No explicit aborts –In irrevocable transactions

Exceptions new Node() throws OutOfMemoryException! What is printed?
int i = 0;
try {
  atomic {
    i++;
    node = new Node();
  }
} catch (Exception e) {
  print(i);
}

Unhandled Exceptions Aborts transaction –Preserves invariants –Safer Commits transaction –Like locking semantics –What if exception object refers to values modified in transaction?

Nested Transactions
atomic void foo() {
  bar();
}
atomic void bar() {
  …
}

Nested Transactions Needed for modularity –Who knew that cosine() contained a transaction? Flat nesting –If child aborts, so does parent First-class nesting –If child aborts, partial rollback of child only

Open Nested Transactions Normally, child commit –Visible only to parent In open nested transactions –Commit visible to all –Escape mechanism –Dangerous, but useful What escape mechanisms are needed?

Privatization Thread 0 “privatizes” a node, then accesses it without transactional synchronization
Node n;
atomic {
  n = head;
  if (n != null)
    head = head.next;
}
// use n.val many times

Transactions Is this safe? For locks, yes. For STMs, maybe not.
Thread 0:
Node n;
atomic {
  n = head;
  if (n != null)
    head = head.next;
}
// use n.val many times
Thread 1:
atomic {
  Node n = head;
  while (n != null) {
    n.val++;
    n = n.next;
  }
}

Strong vs Weak Isolation Non-transactional accesses don’t synchronize Non-transactional thread may see delayed updates/recovery Solutions –Privatization barrier? –Other techniques?

Single Global Lock Semantics Each transaction acts as if it acquires a single global lock (SGL) Good: –Intuitively appealing Bad: –What about aborted transactions? –May be expensive to implement
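
Single global lock semantics can be read as a specification: an atomic block behaves as if it ran under one lock shared by all atomic blocks. The Java sketch below implements that reading literally. The class `GlobalLockAtomic` and its `atomic` and `transfer` methods are hypothetical names chosen for illustration; a real TM would run blocks concurrently and only appear to behave this way.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

// Reference semantics only: every "transaction" really does take one global lock.
final class GlobalLockAtomic {
    private static final ReentrantLock SGL = new ReentrantLock();

    // Runs the body as if it were an atomic block. The lock is reentrant,
    // so nested atomic(...) calls behave like flat nesting.
    static void atomic(Runnable body) {
        SGL.lock();
        try {
            body.run();
        } finally {
            SGL.unlock();
        }
    }

    // Composition: dequeue from one queue and enqueue on another as one step.
    static <T> void transfer(ArrayDeque<T> from, ArrayDeque<T> to) {
        atomic(() -> {
            T x = from.pollFirst();
            if (x != null) {
                to.addLast(x);
            }
        });
    }
}
```

The sketch has no abort or rollback: if the body throws halfway through, its partial effects stay visible. That gap is one face of the "what about aborted transactions?" question raised above.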

Simple Lock-Based STM STMs come in different forms –Lock-based –Lock-free Here we will describe a simple lock-based STM

Synchronization Transaction keeps –Read set: locations & values read –Write set: locations & values to be written Deferred update –Changes installed at commit Lazy conflict detection –Conflicts detected at commit

STM: Transactional Locking Map application memory to an array of versioned write-locks (V#)

Reading an Object Put V#s & value in RS If not already locked [Figure: Mem, Locks, V#]

To Write an Object Add V# and new value to WS [Figure: Mem, Locks, V#]

To Commit Acquire W locks Check V#s unchanged in RS & WS Install new values Increment V#s Release … [Figure: Mem, Locks; X and Y go from V# to V#+1]

Transactional Consistency Application Memory x y Invariant x = 2y Transaction A: Write x Write y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2

Zombies! A transaction that will abort because of a synchronization conflict But doesn’t know it yet And keeps running

Zombies! Application Memory x y Invariant x = 2y Transaction A: Write x (kills B) Write y Transaction B: Read x = 4 Transaction B: (zombie) Read y = 4 Compute z = 1/(x-y) DIV by 0 ERROR

Solution: The “Global Clock” Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent

Read-Only Transactions Copy V Clock to RV Read lock, V# Read mem Check unlocked Recheck V# unchanged Check V# ≤ RV Reads form a snapshot of memory. No read set! [Figure: Mem, Locks, Shared Version Clock, Private Read Version]

Regular Transactions Copy V Clock to RV On read/write, check: Unlocked, V# ≤ RV Add to R/W set [Figure: Mem, Locks, Shared Version Clock, Private Read Version]

Regular Transactions Acquire locks WV = F&Inc(V Clock) Check each V# ≤ RV Update memory Set write V#s to WV [Figure: Mem, Locks (x, y), Shared Version Clock, Private Read Version]
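
The steps above can be sketched in Java as a toy global-clock STM over `int` locations. This is a teaching sketch under stated assumptions, not the authors' implementation: the names `MiniStm`, `TVar`, `Tx` and `AbortException` are invented here, each location keeps a separate `ReentrantLock` and version number instead of one packed versioned lock word, lock acquisition uses `tryLock` and aborts on failure, and the write version is taken with `incrementAndGet` so that it is strictly greater than any read version sampled earlier.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative names throughout; a teaching sketch, not a tuned STM.
final class MiniStm {
    static final AtomicLong CLOCK = new AtomicLong();   // shared version clock

    static final class AbortException extends RuntimeException {}

    // One memory word plus its write lock and version number (V#).
    static final class TVar {
        final ReentrantLock lock = new ReentrantLock();
        volatile long version;
        volatile int value;
        TVar(int initial) { value = initial; }
    }

    static final class Tx {
        private final long rv = CLOCK.get();             // private read version
        private final Set<TVar> readSet = new HashSet<>();
        private final Map<TVar, Integer> writeSet = new LinkedHashMap<>();

        int read(TVar v) {
            Integer pending = writeSet.get(v);
            if (pending != null) return pending;         // read our own deferred write
            long before = v.version;
            int val = v.value;
            // Check unlocked, V# unchanged, and V# <= RV; otherwise abort
            // now rather than keep running as a zombie.
            if (v.lock.isLocked() || v.version != before || before > rv) {
                throw new AbortException();
            }
            readSet.add(v);
            return val;
        }

        void write(TVar v, int val) { writeSet.put(v, val); }   // deferred update

        void commit() {
            if (writeSet.isEmpty()) return;              // read-only: reads already validated
            List<TVar> held = new ArrayList<>();
            try {
                for (TVar v : writeSet.keySet()) {       // acquire write locks
                    if (!v.lock.tryLock()) throw new AbortException();
                    held.add(v);
                }
                long wv = CLOCK.incrementAndGet();       // write version, above every earlier RV
                for (TVar v : readSet) {                 // check each V# <= RV
                    boolean lockedByOther = v.lock.isLocked() && !v.lock.isHeldByCurrentThread();
                    if (lockedByOther || v.version > rv) throw new AbortException();
                }
                for (Map.Entry<TVar, Integer> e : writeSet.entrySet()) {
                    e.getKey().value = e.getValue();     // update memory
                    e.getKey().version = wv;             // set write V#s to WV
                }
            } finally {
                for (TVar v : held) v.lock.unlock();     // release
            }
        }
    }
}
```

Replaying the zombie scenario with x = 4 and y = 2: transaction B reads x, transaction A then commits x = 8 and y = 4, and B's later read of y throws `AbortException` because y's version now exceeds B's read version. B never gets to compute 1/(x-y) on the inconsistent pair. A caller would normally catch the exception and retry with a fresh `Tx`.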

Hardware Transactional Memory Exploit Cache coherence protocols Already do almost what we need –Invalidation –Consistency checking Exploit Speculative execution –Branch prediction = optimistic synch!

HW Transactional Memory [Figure: interconnect, caches, memory; active transaction; read; cache entry marked T]

Transactional Memory [Figure: caches, memory; active transaction; read; cache entries marked T, T]

Transactional Memory [Figure: caches, memory; active transaction; committed; cache entries marked T, T]

Transactional Memory [Figure: caches, memory; write; active transaction; committed; cache entries marked T, D]

Rewind [Figure: caches, memory; active transaction; write; aborted; cache entries marked T, T, D]

Transaction Commit At commit point –If no cache conflicts, we win. Mark transactional entries –Read-only: valid –Modified: dirty (eventually written back) That’s all, folks! –Except for a few details …

Not all Skittles and Beer Limits to –Transactional cache size –Scheduling quantum Transaction cannot commit if it is –Too big –Too slow –Actual limits platform-dependent

Approaches Virtualization –Spill transactional cache to memory Hybrid –Use limited HTM as HW accelerator for STM –Sun’s Rock™ processor …

Back to Amdahl’s Law: Speedup = 1/(ParallelPart/N + SequentialPart) Pay for N = 8 cores SequentialPart = 25% Speedup = only 2.9 times! Must parallelize applications on a very fine grain! So…How will we make use of multicores?
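
The arithmetic behind the 2.9 figure can be checked directly. A minimal Java sketch, with the class and method names (`Amdahl`, `speedup`) chosen here for illustration:

```java
// Hypothetical helper that evaluates the formula on the slide.
final class Amdahl {
    // Speedup = 1 / (ParallelPart / N + SequentialPart),
    // where ParallelPart = 1 - SequentialPart.
    static double speedup(int cores, double sequentialPart) {
        double parallelPart = 1.0 - sequentialPart;
        return 1.0 / (parallelPart / cores + sequentialPart);
    }

    public static void main(String[] args) {
        // N = 8, SequentialPart = 25%: 1 / (0.75 / 8 + 0.25) = 1 / 0.34375, about 2.91
        System.out.println(speedup(8, 0.25));
    }
}
```

Letting the core count grow without bound only pushes the result toward 1/SequentialPart, which is 4 here. That is why the slide insists on parallelizing at a very fine grain: the sequential fraction, not the number of cores, sets the ceiling.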

Traditional Scaling Process [Figure: user code on a traditional uniprocessor; speedup 1.8x, 3.6x, 7x over time: Moore’s law]

Multicore Scaling Process [Figure: user code on multicore; speedup 1.8x, 3.6x, 7x] Unfortunately, not so simple…

Actual Scaling Process [Figure: user code on multicore; speedup 1.8x, 2x, 2.9x] Parallelization and Synchronization require great care…

Need Ways to Make Scaling Smoother… [Figure: user code and TM code on multicore; speedup 1.8x, 3.6x, 7x]

I, for one, Welcome our new Multicore Overlords … Multicore architectures force us to rethink how we do synchronization Standard locking model won’t work Transactional model might –Software –Hardware A full-employment act!