Lecture 19: Transactional Memory


Credit: many of the slides in today's talk are borrowed from Professor Christos Kozyrakis (Stanford University, now EPFL).

Raising the level of abstraction for synchronization
Previous topic: machine-level atomic operations (fetch-and-op, test-and-set, compare-and-swap, load-linked/store-conditional). We then used these atomic operations to construct higher-level synchronization primitives in software: locks and barriers. We've seen how challenging it can be to produce correct programs using these primitives: it is easy to create bugs that violate atomicity, cause deadlock, and so on. Synchronization is hard to get right! Today we raise the level of abstraction for synchronization even further. The idea: transactional memory. (A refresher sketch of the lock-from-atomics approach follows.)
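
As a refresher, here is a minimal sketch of the kind of lock we previously built from a machine-level atomic operation, written with C11 atomics (illustrative, not from the lecture; initialize the flag with ATOMIC_FLAG_INIT):

    #include <stdatomic.h>

    // A test-and-set spin lock built on an atomic read-modify-write primitive.
    // Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT };
    typedef struct { atomic_flag held; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        // Test-and-set: atomically set the flag, looping while it was already set
        // (i.e., while some other thread holds the lock).
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ; // spin
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }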

What you should know:
- What a transaction is
- The difference (in semantics) between an atomic code block and lock/unlock primitives
- The basic design space of transactional memory implementations: data versioning policy, conflict detection policy, and granularity of detection
- The basics of a hardware implementation of transactional memory (consider how it relates to the cache coherence protocol implementations we've discussed previously in the course)

Review: ensuring atomicity via locks

    void deposit(Acct account, int amount) {
        lock(account.lock);
        int tmp = bank.get(account);
        tmp += amount;
        bank.put(account, tmp);
        unlock(account.lock);
    }

Deposit is a read-modify-write operation: we want "deposit" to be atomic with respect to other bank operations on this account. (If another deposit got in between the .get() and the .put(), you'd be pissed: one of the updates would be lost.) Locks are one mechanism to synchronize threads to ensure atomicity of the update (by ensuring mutual exclusion on the account).

Programming with transactions

Lock-based version:

    void deposit(Acct account, int amount) {
        lock(account.lock);
        int tmp = bank.get(account);
        tmp += amount;
        bank.put(account, tmp);
        unlock(account.lock);
    }

Transactional version:

    void deposit(Acct account, int amount) {
        atomic {
            int tmp = bank.get(account);
            tmp += amount;
            bank.put(account, tmp);
        }
    }

The atomic construct is declarative: the programmer states what to do (maintain atomicity of this code), not how to do it. There is no explicit use or management of locks; the system implements synchronization as necessary to ensure atomicity. The system could implement atomic { } using a lock (this is essentially what OpenMP critical sections do under the hood). The implementation discussed today instead uses optimistic concurrency: serialization occurs only in situations of true contention (R-W or W-W conflicts).

Declarative vs. imperative abstractions

Declarative: the programmer defines what should be done. Example: "execute these 1000 independent tasks."
Imperative: the programmer states how it should be done. Example: "spawn N worker threads; assign work to threads by removing tasks from a shared task queue."

We tried to make this distinction clear when talking about parallel execution (e.g., OpenMP, from your earlier programming-languages experience). Today the same distinction applies to synchronization:
Declarative: "perform this set of operations atomically."
Imperative: "acquire a lock, perform the operations, release the lock."

Transactional Memory (TM)

A memory transaction is an atomic and isolated sequence of memory accesses, inspired by database transactions.

Atomicity (all or nothing): upon transaction commit, all memory writes in the transaction take effect at once; on transaction abort, none of the writes appear to take effect (as if the transaction never happened).
Isolation: no other processor can observe the writes before the transaction commits.
Serializability (or consistency): transactions appear to commit in a single serial order, but the exact order of commits is not guaranteed by the semantics of the transaction system.

The system is going to preserve these properties: other processors don't see the writes until the transaction commits, and all of the transaction's memory operations commit at the SAME time. Everything we used to achieve for a single memory access now applies to a whole block of accesses. (Compare database ACID: atomicity, consistency, isolation, durability; TM provides the first three.)

Transactional Memory (TM)

In other words: many of the properties we maintained for a single address in a coherent memory system, we'd now like to maintain for sets of reads and writes in a transaction.

Transaction:
Reads: X, Y, Z
Writes: A, X

These memory writes will either all be observed by other processors, or none of them will (they effectively all happen at the same time).

Load-linked, store-conditional (LL/SC)

LL/SC is a lightweight version of transactional memory: a pair of corresponding instructions, not a single atomic instruction like compare-and-swap.
load_linked(x): load the value from address x.
store_conditional(x, value): store value to x, but only if x hasn't been written to since the corresponding LL.
The corresponding ARM instructions are LDREX and STREX.

How might LL/SC be implemented on a cache-coherent processor? ARM uses exclusive monitors: the line had better be tracked by the monitor (or in the S state in the cache) when the SC executes; the SC fails otherwise, without a write appearing on the bus. The implementation must not let the line fall out of the cache between the LL and the SC. Like the ticket lock, only one bus write occurs on a successful store. (A usage sketch follows.)
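
To make the LL/SC pairing concrete, here is a minimal sketch of an atomic fetch-and-add built from LDREX/STREX using ARMv7 inline assembly (illustrative and not from the lecture; the retry loop, register constraints, and omission of memory barriers are simplifying assumptions):

    // Atomic fetch-and-add on ARMv7 using LDREX/STREX (illustrative sketch).
    static inline int atomic_fetch_add(int *addr, int delta) {
        int old, tmp, failed;
        do {
            __asm__ volatile(
                "ldrex   %0, [%3]\n"     // load-linked: read *addr, arm the exclusive monitor
                "add     %1, %0, %4\n"   // compute the new value
                "strex   %2, %1, [%3]\n" // store-conditional: sets %2 = 1 if a write intervened
                : "=&r"(old), "=&r"(tmp), "=&r"(failed)
                : "r"(addr), "r"(delta)
                : "memory", "cc");
        } while (failed);                // retry until the store-conditional succeeds
        return old;
    }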

Motivating transactional memory

Another example: Java HashMap

A map (key → value), implemented as a hash table with a linked list per bucket:

    public Object get(Object key) {
        int idx = hash(key);          // compute hash
        HashEntry e = buckets[idx];   // find bucket
        while (e != null) {           // find element in bucket
            if (equals(key, e.key))
                return e.value;
            e = e.next;
        }
        return null;
    }

Bad: not thread-safe (when synchronization is needed). Good: no lock overhead when synchronization is not needed.

Synchronized HashMap

Java 1.4 solution: a synchronized wrapper layer that converts any map to a thread-safe variant, using explicit, coarse-grained locking specified by the programmer:

    public Object get(Object key) {
        synchronized (myHashMap) {   // guards all accesses to myHashMap
            return myHashMap.get(key);
        }
    }

synchronized is basically a lock (those are its semantics!), you just don't have to worry about releasing it. Note: this may seem naive, but it's basically what is done in practice in some systems (e.g., the Python interpreter's global lock).

Coarse-grained synchronized HashMap: Good: thread-safe, easy to program. Bad: limits concurrency, poor scalability.

Review from the earlier fine-grained synchronization lecture

What are solutions for making Java's HashMap thread-safe?

    public Object get(Object key) {
        int idx = hash(key);          // compute hash
        HashEntry e = buckets[idx];   // find bucket
        while (e != null) {           // find element in bucket
            if (equals(key, e.key))
                return e.value;
            e = e.next;
        }
        return null;
    }

One solution: use finer-grained synchronization, e.g., a lock per bucket (see the sketch below). The map is now thread-safe, but it incurs lock overhead even when synchronization is not needed.
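
For concreteness, here is a minimal sketch of per-bucket locking in C with pthreads (the slide's example is Java, but the idea is the same; the entry layout, bucket count, and hash function below are assumptions for illustration):

    #include <pthread.h>
    #include <string.h>

    #define NBUCKETS 256

    typedef struct entry {
        const char *key;
        void *value;
        struct entry *next;
    } entry_t;

    typedef struct {
        entry_t *buckets[NBUCKETS];
        pthread_mutex_t locks[NBUCKETS];  // one lock per bucket, not one global lock
    } hashmap_t;

    static unsigned hash(const char *key) {
        unsigned h = 5381;                // djb2, just for illustration
        while (*key) h = h * 33 + (unsigned char)*key++;
        return h;
    }

    void *hashmap_get(hashmap_t *m, const char *key) {
        unsigned idx = hash(key) % NBUCKETS;
        pthread_mutex_lock(&m->locks[idx]);   // threads contend only on this bucket
        void *result = NULL;
        for (entry_t *e = m->buckets[idx]; e != NULL; e = e->next) {
            if (strcmp(e->key, key) == 0) { result = e->value; break; }
        }
        pthread_mutex_unlock(&m->locks[idx]);
        return result;
    }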

Review: performance of fine-grained locking

Reducing contention via fine-grained locking leads to better performance. (Figure: throughput of hash-table and balanced-tree benchmarks under coarse- vs. fine-grained locking.)

Transactional HashMap

Simply enclose all operations in an atomic block. The semantics of the atomic block: the system ensures atomicity of the logic within the block. Note that this is not the same as a lock (we'll explain why shortly): the code is just saying the operation needs to be atomic, not mutually exclusive, and the system might be able to exploit the difference. (Review question: what's the difference between taking a lock and this code?)

    public Object get(Object key) {
        atomic {                 // system guarantees atomicity
            return m.get(key);
        }
    }

Good: thread-safe, easy to program. What about performance and scalability? That depends on the workload and on the implementation of atomic (to be discussed).

Another example: tree update by two threads

Goal: modify nodes 3 and 4 in a thread-safe way. To see how this might work, let's take a simple example: a tree modified in parallel by many threads. Traditionally, if you wanted to modify a node you would use some sort of hand-over-hand locking scheme to try to achieve scalability. (Figure: a tree with root 1, interior node 2, and leaves 3 and 4.)

Slide credit: Austen McDonald

Fine-grained locking example

Goal: modify nodes 3 and 4 in a thread-safe way, using hand-over-hand locking. (Figures: successive animation frames show a thread acquiring the lock on node 1, then node 2, then node 3, releasing ancestors' locks as it descends.)

Slide credit: Austen McDonald


Locking can prevent concurrency

Fine-grained locking example, concluded: the goal was to modify nodes 3 and 4 in a thread-safe way using hand-over-hand locking. The conclusion: locking can prevent concurrency. Here, the locks taken on nodes 1 and 2 during the update to node 3 could delay an update to node 4, so performance is degraded even in seemingly parallel cases.

Slide credit: Austen McDonald

Transactions example

The figure highlights data touched as part of each transaction. Transaction A descends the tree toward node 3.
Transaction A — READ: 1, 2, 3.

Slide credit: Austen McDonald

Transactions example (continued)

Transaction A — READ: 1, 2, 3; WRITE: 3.

Slide credit: Austen McDonald

Transactions example (continued)

Transaction B runs concurrently. There are NO READ-WRITE or WRITE-WRITE conflicts (no transaction writes to data that is accessed by the other transaction), so transactions allow fine-grained concurrency: both updates can proceed in parallel.
Transaction A — READ: 1, 2, 3; WRITE: 3.
Transaction B — READ: 1, 2, 4; WRITE: 4.

Slide credit: Austen McDonald

Transactions example #2

Now suppose the second transaction modifies node 3 instead of node 4, so both transactions modify node 3. Conflicts exist (both transactions write to node 3): the transactions must be serialized.
Transaction A — READ: 1, 2, 3; WRITE: 3.
Transaction B — READ: 1, 2, 3; WRITE: 3.

Slide credit: Austen McDonald

Performance: locks vs. transactions

(Figure: performance of lock-based vs. transactional implementations on HashMap and balanced-tree benchmarks. "TCC" is a TM system implemented in hardware.)

Failure atomicity: locks

    void transfer(A, B, amount) {
        synchronized (bank) {
            try {
                withdraw(A, amount);   // what if A is invalid or the balance is too low?
                deposit(B, amount);    // what if B is invalid?
            }
            catch (withdraw_exception) { /* undo code 1 */ }
            catch (deposit_exception) { /* undo code 2 */ }
            ...
        }
    }

Another motivation for transactions, and not really an example specific to parallelism: what if for some reason the deposit cannot complete (an authentication failure, the account is closed)? You end up with tricky rollback code covering all the things that can go wrong, and you need to make sure you release locks, etc. Manually catching exceptions is complex: the programmer provides "undo" code on a case-by-case basis and must track what to undo and how. Some side effects may already be visible to other threads, and an uncaught case can deadlock the system (e.g., a failing thread exits while still holding locks).

Failure atomicity: transactions

    void transfer(A, B, amount) {
        atomic {
            withdraw(A, amount);
            deposit(B, amount);
        }
    }

The system is now responsible for processing exceptions (all of them, except those explicitly managed by the programmer): the transaction is aborted and its memory updates are undone. Recall that a transaction either commits or it doesn't, so no partial updates are visible to other threads, and, for example, no locks are left held by a failing thread.

Composability: locks

    void transfer(A, B, amount) {
        synchronized (A) {
            synchronized (B) {
                withdraw(A, amount);
                deposit(B, amount);
            }
        }
    }

    Thread 0: transfer(x, y, 100);
    Thread 1: transfer(y, x, 100);

DEADLOCK! This is a great example: to make things atomic, each call takes a lock on the source account, then a lock on the destination account, and then goes. Thread 0 holds x's lock and waits for y's; thread 1 holds y's lock and waits for x's. Question: how would you fix this? (1) An ordering policy; (2) take one big lock, for example on the whole bank, which is fine.

Composing lock-based code can be tricky: it requires system-wide policies to get correct, and system-wide policies can break software modularity. The programmer is caught between an extra lock and a hard (to implement) place.* Coarse-grained locks: low performance. Fine-grained locking: good for performance, but can lead to deadlock.

* Yes, I was particularly proud of this one.

Composability: locks (continued)

The deadlock can arise even between different functions that take the same locks in different orders:

    void transfer(A, B, amount) {
        synchronized (A) {
            synchronized (B) {
                withdraw(A, amount);
                deposit(B, amount);
            }
        }
    }

    void transfer2(A, B, amount) {
        synchronized (B) {
            synchronized (A) {
                withdraw(A, 2*amount);
                deposit(B, 2*amount);
            }
        }
    }

DEADLOCK! Again, the fixes require system-wide policies: take one big lock (for example, on the whole bank), or adopt a known lock-ordering policy, e.g., always take the lock on the "minimum" account first (as sketched below). Such policies can break software modularity.
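
A minimal sketch of the lock-ordering fix mentioned above, in C with pthreads (the account type and the trivial helper bodies are assumptions for illustration; ordering locks by address is one common convention):

    #include <pthread.h>
    #include <stdint.h>

    typedef struct {
        pthread_mutex_t lock;
        int balance;
    } account_t;

    static void withdraw(account_t *a, int amount) { a->balance -= amount; }
    static void deposit(account_t *a, int amount)  { a->balance += amount; }

    // Always acquire account locks in a fixed global order (here: by address),
    // so two concurrent transfers can never wait on each other in a cycle.
    void transfer(account_t *A, account_t *B, int amount) {
        account_t *first  = ((uintptr_t)A < (uintptr_t)B) ? A : B;
        account_t *second = ((uintptr_t)A < (uintptr_t)B) ? B : A;
        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        withdraw(A, amount);
        deposit(B, amount);
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }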

Composability: transactions

    void transfer(A, B, amount) {
        atomic {
            withdraw(A, amount);
            deposit(B, amount);
        }
    }

    Thread 0: transfer(x, y, 100);
    Thread 1: transfer(y, x, 100);

This is really important: transactions compose gracefully (in theory). The programmer declares global intent (atomic execution of transfer) with no need to know about a global implementation strategy: the transaction in transfer subsumes any transactions defined in withdraw and deposit, and the outermost transaction defines the atomicity boundary. The system manages concurrency as well as possible, serializing only when necessary: serialization for transfer(A, B, 100) and transfer(B, A, 200), but concurrency for transfer(A, B, 100) and transfer(C, D, 200). It's sort of like the coherence protocol: it only kicks in if it matters.

Advantages (promise) of transactional memory

- Easy-to-use synchronization construct. It is difficult for programmers to get synchronization right; the programmer declares the need for atomicity, and the system implements it well. Claim: transactions are as easy to use as coarse-grained locks.
- Often performs as well as fine-grained locks. Provides automatic read-read concurrency and fine-grained concurrency, plus performance portability: a locking scheme for four CPUs may not be the best scheme for 64 CPUs. The productivity argument for transactional memory: system support for transactions can achieve 90% of the benefit of expert programming with fine-grained locks, with 10% of the development time.
- Failure atomicity and recovery. No locks are lost when a thread fails; failure recovery = transaction abort + restart.
- Composability. Safe and scalable composition of software modules.

Atomic { } ≠ lock() + unlock()

The difference: atomic is a high-level declaration of atomicity that does not specify the implementation of atomicity; a lock is a low-level blocking primitive that does not provide atomicity or isolation on its own. Keep in mind: locks can be used to implement an atomic block, but locks can also be used for purposes beyond atomicity, so you cannot replace all uses of locks with atomic regions. Atomic eliminates many data races, but programming with atomic blocks can still suffer from atomicity violations: e.g., the programmer erroneously splits a sequence that should be atomic into two atomic blocks. Make sure you understand this difference in semantics — this is an important slide!

Example integration with OpenMP (proposed)

Example: OpenTM = OpenMP + TM. OpenMP makes it easy to specify parallel loops and tasks; TM adds atomic and isolated execution, making it easy to specify synchronization and speculation. OpenTM features: transactions, transactional loops, and transactional sections; data directives for TM (e.g., thread-private data); runtime system hints for TM.

Code example (this is a histogram, so different iterations are not independent; compare using #pragma omp atomic on the update):

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp transaction
        {
            bin[A[i]]++;
        }
    }

What about replacing synchronized with atomic in this example?

    // Thread 1
    synchronized (lock1) {
        ...
        flagA = true;
        while (flagB == 0);
    }

    // Thread 2
    synchronized (lock2) {
        ...
        flagB = true;
        while (flagA == 0);
    }

Thread 1 sets flagA, then waits for flagB; thread 2 sets flagB, then waits for flagA. No problems, right? But you would not want to turn these regions (guarded by different locks) into atomic regions: if each block must execute atomically and in isolation, neither thread's flag write can become visible while the other spins, so there is NO PROGRESS. This code is really a barrier; it is not trying to achieve atomicity.

Atomicity violation due to programmer error

    // Thread 1
    atomic {
        ...
        ptr = A;
    }
    B = ptr->field;

    // Thread 2
    atomic {
        ...
        ptr = NULL;
    }

atomic doesn't mean you can't screw up. Programmer mistake: a logically atomic code sequence (in thread 1) is erroneously separated into two atomic blocks, allowing another thread to set the pointer to NULL in between.

Implementing transactional memory

Recall transactional semantics

Atomicity (all or nothing): at commit, all memory writes take effect at once; in the event of an abort, none of the writes appear to take effect.
Isolation: no other code can observe the writes before commit.
Serializability: transactions seem to commit in a single serial order, though the exact order is not guaranteed.

The system is going to preserve these properties: other processors don't see the writes until the transaction commits, and all memory operations appear to commit at the SAME time.

TM implementation basics

TM systems must provide atomicity and isolation without sacrificing concurrency. The basic implementation requirements:
- Data versioning: ALLOWS a transaction to abort. We need to keep versions of data around — the state before the transaction and the state during the transaction — to be able to support rollback.
- Conflict detection and resolution: determines WHEN a transaction must abort.

Implementation options: hardware transactional memory (HTM), software transactional memory (STM), and hybrid transactional memory (e.g., hardware-accelerated STMs).

Design tradeoffs

Versioning (eager vs. lazy): at what point are updates made? Eager: right away. Lazy: as part of commit.
Conflict detection (optimistic vs. pessimistic): when does the check for conflicts occur? Optimistic: wait until commit. Pessimistic: check on each operation.
Contention management (aggressive vs. polite): what should be done when a conflict is detected? Aggressive: force the other transaction(s) to abort. Polite: wait until the other transactions complete. Implementations usually use something in between.

Data versioning

Manage uncommitted (new) and previously committed (old) versions of data for concurrent transactions. Two approaches: eager versioning (undo-log based) and lazy versioning (write-buffer based). We first talk about versioning, since that gives the ability to roll back; forget about the "when to detect" question for now.

Eager versioning (immediate update)

Update memory immediately; maintain an "undo log" in case of abort. Walking through the example (X initially 10):
- Begin transaction: memory holds X: 10; the undo log is empty.
- Write x ← 15: memory is updated to X: 15; the old value X: 10 is recorded in the undo log.
- Commit transaction: memory already holds X: 15; simply discard the undo log.
- Abort transaction: restore X: 10 from the undo log to memory.

We call it eager because we go ahead and commit the data to memory right away (note: "memory" here can be the cache). This will work with pessimistic conflict detection (check for conflicts on each of my writes); it is not practical with optimistic detection. (A software sketch follows.)
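
A minimal software sketch of eager versioning with an undo log (illustrative only; a real STM/HTM also tracks read/write sets and detects conflicts, and all names and the fixed log size here are assumptions):

    #define MAX_UNDO 64

    typedef struct { int *addr; int old_val; } undo_entry_t;

    typedef struct {
        undo_entry_t log[MAX_UNDO];
        int n;
    } eager_txn_t;

    // Eager write: memory is updated immediately; the old value goes to the undo log.
    void eager_write(eager_txn_t *t, int *addr, int val) {
        t->log[t->n].addr = addr;      // per-store overhead: log the address...
        t->log[t->n].old_val = *addr;  // ...and the value being overwritten
        t->n++;
        *addr = val;                   // update memory in place
    }

    // Commit is fast: memory already holds the new values; just drop the log.
    void eager_commit(eager_txn_t *t) { t->n = 0; }

    // Abort is slow: walk the log backwards, restoring the old values.
    void eager_abort(eager_txn_t *t) {
        for (int i = t->n - 1; i >= 0; i--)
            *t->log[i].addr = t->log[i].old_val;
        t->n = 0;
    }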

Lazy versioning (deferred update)

Log memory updates in a transaction write buffer; flush the buffer to memory on commit. Walking through the example (X initially 10):
- Begin transaction: memory holds X: 10; the write buffer is empty.
- Write x ← 15: memory still holds X: 10; X: 15 is recorded in the write buffer.
- Commit transaction: flush the buffer, updating memory to X: 15.
- Abort transaction: discard the write buffer; memory still holds X: 10.

We call it lazy because we are only going to write to memory when we have to. You can see how it is much easier to abort, since memory has not been updated. (The matching sketch follows.)
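
The matching software sketch for lazy versioning with a write buffer (again illustrative, with the same assumptions as above; note that reads must check the buffer so a transaction sees its own pending writes):

    #define MAX_WB 64

    typedef struct { int *addr; int new_val; } wb_entry_t;

    typedef struct {
        wb_entry_t buf[MAX_WB];
        int n;
    } lazy_txn_t;

    // Lazy write: record the intended write; memory is untouched until commit.
    void lazy_write(lazy_txn_t *t, int *addr, int val) {
        t->buf[t->n].addr = addr;
        t->buf[t->n].new_val = val;
        t->n++;
    }

    // Reads must see the transaction's own pending writes (latest entry wins).
    int lazy_read(lazy_txn_t *t, int *addr) {
        for (int i = t->n - 1; i >= 0; i--)
            if (t->buf[i].addr == addr)
                return t->buf[i].new_val;
        return *addr;
    }

    // Commit is slower: flush every buffered write to memory.
    void lazy_commit(lazy_txn_t *t) {
        for (int i = 0; i < t->n; i++)
            *t->buf[i].addr = t->buf[i].new_val;
        t->n = 0;
    }

    // Abort is fast: just discard the buffer.
    void lazy_abort(lazy_txn_t *t) { t->n = 0; }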

Data versioning (summary)

Manage uncommitted (new) and committed (old) versions of data for concurrent transactions.

Eager versioning (undo-log based): update the memory location directly on each write and maintain undo information in a log (incurring a per-store overhead: each store becomes two writes). Good: faster commit, since the data is already in memory. Bad: slower aborts, and fault-tolerance issues (consider a crash in the middle of a transaction). Eager philosophy: write to memory immediately, hoping the transaction won't abort, and deal with aborts when you have to.

Lazy versioning (write-buffer based): buffer data in a write buffer until commit, then update the actual memory locations. Good: faster abort (just clear the buffer), no fault-tolerance issues. Bad: slower commits (each write is a single write, but you take the hit at commit time). Lazy philosophy: only write to memory when you have to.

Conflict detection

A TM system must detect and handle conflicts between transactions:
- Read-write conflict: transaction A reads address X, which was written by pending transaction B. We don't want transaction A to read what transaction B writes until transaction B is done: it might read a data structure in an intermediate state (consider the bank transfer example).
- Write-write conflict: transactions A and B are both pending, and both write to address X.

To do this, the system must track each transaction's read set (addresses read within the transaction) and write set (addresses written within the transaction). (A sketch of the set-based check follows.)
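
A minimal sketch of the set-based conflict check, with read/write sets represented as plain address arrays (an illustrative assumption; real systems use signatures, hash tables, or cache metadata instead of linear scans):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        void *read_set[64];  size_t nr;   // addresses read by the transaction
        void *write_set[64]; size_t nw;   // addresses written by the transaction
    } txn_sets_t;

    static bool intersect(void **a, size_t na, void **b, size_t nb) {
        for (size_t i = 0; i < na; i++)
            for (size_t j = 0; j < nb; j++)
                if (a[i] == b[j])
                    return true;
        return false;
    }

    // Two transactions conflict if one writes data the other reads or writes.
    bool txns_conflict(txn_sets_t *A, txn_sets_t *B) {
        return intersect(A->write_set, A->nw, B->write_set, B->nw)   // write-write
            || intersect(A->read_set,  A->nr, B->write_set, B->nw)   // read-write
            || intersect(B->read_set,  B->nr, A->write_set, A->nw);  // write-read
    }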

Pessimistic detection

Check for conflicts during loads and stores — immediately, after each memory operation. (A hardware implementation will check for conflicts through coherence actions; we will discuss this in detail later.) Philosophy: "I suspect conflicts might happen, so let's always check whether one has occurred after each memory operation; if I'm going to have to roll back, I might as well do it now to avoid wasted work." Pessimistic = assume badness has happened (or will), so go ahead and handle it now. A "contention manager" decides whether to stall or abort a transaction when a conflict is detected, with various priority policies to handle the common case fast.

Pessimistic detection examples

(Note: the diagrams assume an "aggressive" contention manager on writes — the writer wins; the requesting action wins, or stalls if the manager thinks it would otherwise have to roll back.) Four cases involving two transactions T0 and T1:

Case 1 — Success: the transactions access disjoint data; every per-operation check passes, and both commit. The easy case.
Case 2 — Early detect (and stall): T1 reads data written by the still-pending T0. The conflict is detected at T1's read; T1 can abort, or just stall for a bit until T0 commits (T1 hasn't read yet, so it can afford to wait).
Case 3 — Abort: T0 reads A, then T1 writes A. T0: "oh, you did write it? I need to restart, because I already read A, and the rest of my reads won't be atomic with that read if T1 commits." (Note: if T0 restarts immediately, a case-2 situation occurs, because the restarted T0's read of A will stall until T1 commits.)
Case 4 — No progress: both transactions read and then write A; with an aggressive contention manager, each write forces the other transaction to restart, over and over. (Question: how do we avoid this livelock?)

Optimistic detection

Detect conflicts when a transaction attempts to commit. (In hardware: validate the write set using coherence actions — get exclusive access to the cache lines in the write set.) Intuition: "let's hope for the best and sort out all the conflicts only when the transaction tries to commit." On a conflict, priority is given to the committing transaction; other transactions may abort later on. On conflicts between committing transactions, a contention manager decides priority. Note that optimistic and pessimistic schemes can be used together: several STM systems use optimistic detection for reads and pessimistic detection for writes.

Optimistic detection examples

(Optimistic detection must use lazy versioning. Default policy: the committing transaction wins.) Four cases involving two transactions T0 and T1:

Case 1 — Success: trivial; the transactions access disjoint data, the commit-time checks find no conflict, and both commit.
Case 2 — Abort: T1 read something that T0 wrote, in the middle of T1's transaction; when T0 commits, its check detects the conflict and T1 must restart, since T1's future reads would not be atomic with the stale one.
Case 3 — Success: T0 can commit even though T1 writes A, because T0 never "saw" T1's update of A before committing; T0 wins.
Case 4 — Forward progress: ditto — the committer wins, so unlike pessimistic case 4, one of the two conflicting transactions always commits.

Conflict detection trade-offs

Pessimistic conflict detection (unfortunately, some references call this "eager"):
Good: detects conflicts early (less work is undone, and some aborts become stalls).
Bad: no forward-progress guarantees, and more aborts in some cases.
Bad: fine-grained communication (a check on each load/store).
Bad: detection sits on the critical path.

Optimistic conflict detection (unfortunately, some references call this "lazy"):
Good: forward-progress guarantees.
Good: bulk communication and conflict detection.
Bad: detects conflicts late, and can still have fairness problems.

Conflict detection granularity

Object granularity (SW-based techniques): Good: reduced overhead (time/space), close to the programmer's reasoning. Bad: false sharing on large objects (e.g., arrays).
Machine-word granularity: Good: minimizes false sharing. Bad: increased overhead (time/space).
Cache-line granularity: a compromise between object and word granularity.
These can be mixed and matched to get the best of both worlds: word-level for arrays, object-level for other data, etc.

TM implementation space (examples)

Hardware TM systems:
- Lazy + optimistic: Stanford TCC, Intel VTM
- Lazy + pessimistic: MIT LTM
- Eager + pessimistic: Wisconsin LogTM
- Eager + optimistic: not practical

Software TM systems:
- Lazy + optimistic (rd/wr): Sun TL2
- Lazy + optimistic (rd) / pessimistic (wr): MS OSTM
- Eager + optimistic (rd) / pessimistic (wr): Intel STM
- Eager + pessimistic (rd/wr): Intel STM

The optimal design remains an open question, and may be different for HW, SW, and hybrid systems.

Q. Why is eager + pessimistic practical? On each access, one can check against the undo logs (and associated metadata) of other transactions to detect conflicts.
Q. Why is eager + optimistic not practical? Eager = write to memory immediately; optimistic = detect conflicts only at commit time. It is unclear how you would validate at commit: a conflicting previous transaction may already have committed, and there is no undo log left to check against.

Hardware transactional memory (HTM)

Data versioning is implemented in the caches: the cache holds the write buffer or the undo log, and new cache-line metadata tracks the transaction's read set and write set. Conflict detection happens through the cache coherence protocol: coherence lookups detect conflicts between transactions, and this works with both snooping and directory coherence. Note: a register checkpoint must also be taken at transaction begin, to restore execution context state on abort.

HTM design

Cache lines are annotated to track the read set and write set:
- R bit: indicates data read by the transaction (set on loads)
- W bit: indicates data written by the transaction (set on stores)
The R/W bits can be kept at word or cache-line granularity; the illustration on the slide tracks the read and write sets at word granularity (each line holds a MESI state bit, e.g., M, the tag, and R/W bits for every word 1..N). The R/W bits are gang-cleared on transaction commit or abort. For eager versioning, a second cache write is needed for the undo log.

Coherence requests check the R/W bits to detect conflicts:
- Observing a shared (read) request to a W-word is a read-write conflict.
- Observing an exclusive (intent-to-write) request to an R-word is a write-read conflict.
- Observing an exclusive (intent-to-write) request to a W-word is a write-write conflict.
(A sketch of these checks follows.)
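
A minimal sketch of the conflict checks an HTM cache might perform on an incoming coherence request, assuming per-line R/W tracking bits (all names here are hypothetical; real hardware does this in the coherence controller, not in software):

    #include <stdbool.h>

    typedef enum { REQ_SHARED, REQ_EXCLUSIVE } coherence_req_t;

    typedef struct {
        bool r;   // line was read by the pending local transaction
        bool w;   // line was written by the pending local transaction
    } tm_line_bits_t;

    // Returns true if the incoming request conflicts with the local transaction.
    bool htm_conflict(tm_line_bits_t bits, coherence_req_t req) {
        if (req == REQ_SHARED && bits.w)    return true;  // remote read of our write (R-W)
        if (req == REQ_EXCLUSIVE && bits.r) return true;  // remote write of our read (W-R)
        if (req == REQ_EXCLUSIVE && bits.w) return true;  // remote write of our write (W-W)
        return false;
    }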

Example HTM implementation: lazy-optimistic

CPU changes: the ability to checkpoint register state (available in many CPUs), and TM state registers (status, pointers to abort handlers, ...). (Figure: a CPU with registers, ALUs, and TM state, attached to a cache with valid/tag/data fields per line.)

Example HTM implementation: lazy-optimistic (continued)

Cache changes: an R bit indicates membership in the read set, and a W bit indicates membership in the write set. (Figure: each cache line now carries W and R bits alongside the valid bit, tag, and data.)

HTM transaction execution

Example transaction:

    Xbegin
    Load A
    Load B
    Store C ⇐ 5
    Xcommit

Transaction begin: initialize the CPU and cache state, and take a register checkpoint. (Figure: the cache initially holds line C with value 9; no R/W bits are set.)

HTM transaction execution (continued)

Load A: serve the cache miss if needed, and mark the data as part of the read set. (Figure: line A, value 33, is now in the cache with its R bit set.)

HTM transaction execution (continued)

Load B: again, serve the cache miss if needed and mark the data as part of the read set. (Figure: lines A and B both have their R bits set.)

HTM transaction execution (continued)

Store C ⇐ 5: service the cache miss if needed and mark the data as part of the write set. NOTE the big difference from MSI/MESI: this is a write, but the line is not loaded into the exclusive state! (Why? In a lazy-optimistic design the write stays buffered locally; exclusive access is requested only at commit, when the write set is validated.) (Figure: line C now holds 5 with its W bit set.)

HTM transaction execution: commit

Fast two-phase commit:
- Validate: request exclusive (RdX) access to the write-set lines, if needed. (Here: upgradeX C; result: C is now in the exclusive-dirty state.)
- Commit: gang-reset the R and W bits, turning the write-set data into valid (dirty) data.

HTM transaction execution: detect/abort

Assume a remote processor commits a transaction with writes to A and D. The local core observes the resulting coherence requests (upgradeX A, upgradeX D) from the other core's commit. D is not a problem: it is in neither the local read set nor the write set. A is a problem: the remote core's write of A conflicts with the local read of A, triggering an abort of the pending local transaction.

Fast conflict detection and abort:
- Check: look up incoming exclusive requests in the read set and write set.
- Abort: invalidate the write set, gang-reset the R and W bits, and restore the register checkpoint.

Hardware transactional memory support in the Intel Haswell architecture*

Transactional Synchronization Extensions (TSX): new instructions for "restricted transactional memory" (RTM):
- xbegin: takes a pointer to a "fallback address" used in case of abort (e.g., fall back to a code path that takes a spin lock)
- xend
- xabort

Implementation: tracks the read and write set in the L1 cache. The processor makes sure all memory operations commit atomically, but it may automatically abort a transaction for many reasons (e.g., eviction of a line in the read or write set will cause a transaction abort). The implementation does not guarantee progress (hence the fallback address). The Intel optimization guide (chapter 12) gives guidelines for increasing the probability that transactions will not abort: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

* Shipped with a bug that caused Intel to disable TSX when it was discovered in 2014; fixed in Broadwell-architecture chips.

GCC support

_xbegin(), _xend(), _xabort(), plus status macros:

    #include <immintrin.h>

    int n_tries, max_tries;
    unsigned status = _XABORT_EXPLICIT;
    ...
    for (n_tries = 0; n_tries < max_tries; n_tries++) {
        status = _xbegin();
        if (status == _XBEGIN_STARTED || !(status & _XABORT_RETRY))
            break;
    }
    if (status == _XBEGIN_STARTED) {
        ... transaction code...
        _xend();
    } else {
        ... non-transactional fallback path...
    }

TSX does not guarantee progress

Transactions fail for many reasons, so writing fallback paths still requires locks. The fallback path must coordinate with the transactional path: in particular, the lock path must prevent transactions from committing while it runs. For example:

    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (_stop_the_world)        // a fallback-path lock holder exists:
            _xabort(0xff);          // we cannot safely commit, so abort
        ...                         // transactional code
        _xend();
    } else {
        /* Fallback path */
        lock();
        _stop_the_world = true;     // aborts transactions that have read this flag
        ...
        _stop_the_world = false;
        unlock();
    }

Results collected by Mario Dehesa-Azuara and Nick Stanley as a Spring 2016 project.

TSX performance

TSX can only track a limited number of locations, so minimize the memory touched inside a transaction. Transactions also have a cost: approximately equal to the cost of six atomic primitives to the same cache line.

Results collected by Mario Dehesa-Azuara and Nick Stanley as a Spring 2016 project.

Summary: transactional memory

The atomic construct is a declaration of atomic behavior. The motivating idea: increase the simplicity of synchronization without (significantly) sacrificing performance.

Transactional memory implementation: many variants have been proposed (SW, HW, SW+HW). Implementations differ in versioning policy (eager vs. lazy), conflict detection policy (pessimistic vs. optimistic), and detection granularity.

Hardware transactional memory: versioned data is kept in the caches; conflict detection mechanisms are built upon the coherence protocol.