1 Hardware Transactional Memory Royi Maimon Merav Havuv 27/5/2007.


2 References M. Herlihy and J. E. B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993. C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie, "Unbounded Transactional Memory," 2005. L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis et al., "Transactional Memory Coherence and Consistency," June 2004.

3 Today What are transactions? What is Hardware Transactional Memory? Various implementations of HTM

4 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

5 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

6 Lock-free A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.

7 Lock-free (cont) Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems: – Priority inversion – Convoying: a process holding a lock is descheduled, and other runnable processes that need the lock cannot make progress until it runs again. – Deadlock

8 Priority inversion Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.

9 Deadlock Deadlock: two or more processes wait indefinitely for an event that can be caused by only one of the waiting processes. Let S and Q be two resources:
P0: Lock(S); Lock(Q)
P1: Lock(Q); Lock(S)

10 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

11 What is a transaction? A transaction is a sequence of memory loads and stores, executed by a single process, that either commits or aborts. If a transaction commits, all of its loads and stores appear to have executed atomically. If a transaction aborts, none of its stores take effect. A transaction's operations aren't visible to other processors until it commits or aborts.

12 Transaction properties A transaction satisfies the following properties: – Serializability – Atomicity This is a simplified version of the traditional database ACID properties (Atomicity, Consistency, Isolation, and Durability).

13 Transactional Memory A new multiprocessor architecture. The goal: implement lock-free synchronization that is – efficient – easy to use compared to conventional techniques based on mutual exclusion. Implemented by straightforward extensions to multiprocessor cache-coherence protocols.

14 An Example
Locks:
if (i < j) { a = i; b = j; }
else       { a = j; b = i; }
Lock(L[a]);
Lock(L[b]);
Flow[i] = Flow[i] - X;
Flow[j] = Flow[j] + X;
Unlock(L[b]);
Unlock(L[a]);

Transactional Memory:
StartTransaction;
Flow[i] = Flow[i] - X;
Flow[j] = Flow[j] + X;
EndTransaction;

15 Transactional Memory Transactions execute in commit order. (Diagram: transaction A stores to 0xbeef and commits; transaction C, which loaded 0xbeef earlier, detects a violation and re-executes with the new data; transaction B, which loads only 0xdddd and 0xbbbb, commits unaffected.)

16 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

17 Cache-Coherence Protocol A protocol for managing the caches of a multiprocessor system so that: – no data is lost – no data is overwritten before it is transferred from a cache to the target memory. In a multiprocessor, each processor may have its own cache, separate from the shared memory.

18 The Problem (Cache-Coherence) The problem can be solved in either of two ways: – directory-based coherence – a snooping system

19 Snoopy Cache All caches watch (snoop on) the activity on a global bus to determine whether they have a copy of the block of data that is requested on the bus.

20 Directory-based The shared data is tracked in a common directory that maintains coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.

21 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

22 How does it work? The following primitive instructions for accessing memory are provided: Load-transactional (LT): reads the value of a shared memory location into a private register. Load-transactional-exclusive (LTX): like LT, but "hints" that the location is likely to be modified. Store-transactional (ST): tentatively writes a value from a private register to a shared memory location. Commit (COMMIT): attempts to make the transaction's tentative changes permanent. Abort (ABORT): discards the transaction's tentative changes. Validate (VALIDATE): tests the current transaction status.

23 Some definitions Read set: the set of locations read with LT by a transaction. Write set: the set of locations accessed with LTX or ST by a transaction. Data set (footprint): the union of the read and write sets. A set of values in memory is inconsistent if it could not have been produced by any serial execution of transactions.

24 Intended Use Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would: 1. use LT or LTX to read from a set of locations 2. use VALIDATE to check that the values read are consistent, 3. use ST to modify a set of locations 4. use COMMIT to make the changes permanent. If either the VALIDATE or the COMMIT fails, the process returns to Step (1).

25 Implementation Transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols. We describe here how to extend the "snoopy" cache protocol for a shared bus to support transactional memory. Our transactions are short-lived activities with relatively small data sets.

26 The basic idea Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost. Once a transaction conflict is detected, it can be resolved in a variety of ways.

27 Implementation Each processor maintains two caches: – a regular cache for non-transactional operations – a transactional cache for transactional operations. The transactional cache holds all tentative writes, without propagating them to other processors or to main memory until commit. Why use two caches?

28 Cache line states Each cache line (regular or transactional) has one of the usual coherence states (INVALID, VALID, DIRTY, RESERVED). The transactional cache extends these with transactional tags: EMPTY, NORMAL, XCOMMIT (discard on commit), and XABORT (discard on abort).

29 Cleanup When the transactional cache needs space for a new entry, it searches, in order, for: – an EMPTY entry – failing that, a NORMAL entry – and finally an XCOMMIT entry.

30 Processor actions Each processor maintains two flags: – The transaction active (TACTIVE) flag: indicates whether a transaction is in progress – The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False) Non-transactional operations behave exactly as in original cache-coherence protocol

31 Example – LT operation:
1. Look for an XABORT entry. Found? Return its value.
2. Not found? Look for a NORMAL entry. Found? Change it to XABORT and allocate a second entry marked XCOMMIT.
3. Not found (cache miss)? Ask to read the block from shared memory.
   – Successful read: create two entries, XABORT and XCOMMIT.
   – Unsuccessful read: abort the transaction:
     TSTATUS = FALSE
     Drop all XABORT entries
     Set all XCOMMIT entries to NORMAL

32 Snoopy cache actions: Both the regular cache and the transactional cache snoop on the bus. A cache ignores any bus cycles for lines not in that cache. The transactional cache's behavior: – If TSTATUS = False, or if the operation isn't transactional, it acts just like the regular cache, but ignores entries whose state is other than NORMAL. – On another CPU's LT, if the line's state is VALID, the cache returns the value; for all other transactional operations it returns BUSY.

33 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

34 Simulation We'll see example code for a producer/consumer algorithm using the transactional memory architecture. The simulation runs on both cache-coherence protocols: snoopy and directory. It uses 32 processors and finishes when 2^16 operations have completed.

35 Part Of Producer/Consumer Code
typedef struct {
  Word deqs;               // Holds the head's index
  Word enqs;               // Holds the tail's index
  Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q) {
  unsigned head, tail, result;
  unsigned backoff = BACKOFF_MIN;
  unsigned wait;
  while (1) {
    result = QUEUE_EMPTY;
    tail = LTX(&q->enqs);
    head = LTX(&q->deqs);
    if (head != tail) {    /* queue not empty? */
      result = LT(&q->items[head % QUEUE_SIZE]);
      /* advance counter */
      ST(&q->deqs, head + 1);
    }
    if (COMMIT()) break;
    /* abort => backoff */
    wait = random() % (1 << backoff);
    while (wait--);
    if (backoff < BACKOFF_MAX) backoff++;
  }
  return result;
}

36 The results: (performance chart omitted from the transcript)

37 So Far: In both HTM and STM, transactions shouldn't touch many memory locations: there is a (small) bound on a transaction's footprint. In addition, there is a duration limit.

38 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

39 Unbounded Transactional Memory (UTM) UTM supports transactions of arbitrary footprint and duration. The UTM architecture allows: – transactions as large as virtual memory – transactions of unlimited duration – transactions that can migrate between processors. UTM also supports a semantics for nested transactions. In contrast to previous HTM implementations, UTM is optimized for transactions below a certain size but still operates correctly for larger transactions.

40 The Goal of UTM The primary goals: – make concurrent programming easier – reduce implementation overhead. Why do we want unbounded TM? Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.

41 UTM architecture The transaction log: a data structure that maintains bookkeeping information for a transaction. Why is it needed? – It enables transactions to survive time-slice interrupts. – It enables a process to migrate from one processor to another.

42 Two new instructions All the programmer must specify is where a transaction begins and ends. XBEGIN pc: begin a new transaction, with the entry point of an abort handler specified by pc. – If the transaction must fail, roll back the processor and memory state to what it was when XBEGIN was executed, and jump to pc. – We can think of an XBEGIN instruction as a conditional branch to the abort handler. XEND: end the current transaction. If XEND completes, the transaction has committed and appears atomic. – Nested transactions are subsumed into the outer transaction.

43 Transaction Semantics Two transactions: "A" has an abort handler at L1, "B" has an abort handler at L2. Here, the handlers very simplistically retry:

A:  L1: XBEGIN L1
        ADD R1, R1, R1
        ST 1000, R1
        XEND

B:  L2: XBEGIN L2
        ADD R1, R1, R1
        ST 2000, R1
        XEND

44 Register renaming A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between them. If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered. This technique, which eliminates name dependences in registers, is called register renaming. It can be done statically (by the compiler) or dynamically (by hardware).

45 Rolling back processor state After an XBEGIN instruction, the processor takes a snapshot of the register-rename table. To keep track of busy registers, it maintains an S (saved) bit for each physical register, indicating which registers are part of the active transaction, and includes the S bits with every rename-table snapshot. An active transaction's abort-handler address, nesting depth, and snapshot are part of its transactional state.

46 Memory State UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for “transaction state”).

47 Xstate Implementation The x-state contains a transaction log for each active transaction in the system. Each log consists of: – A commit record, which maintains the transaction's status: PENDING, COMMITTED, or ABORTED. – A vector of log entries, one per memory block the transaction has read or written. Each entry provides: a pointer to the block; the block's old value (for rollback); a pointer to the commit record; and pointers forming a linked list (the reader list) of all entries, across all transaction logs, that refer to the same block.

48 Xstate Implementation (Cont) The final part of the x-state consists of, for each memory block: – a log pointer – a read/write (RW) bit.

49 X-state Data Structure (diagram: two transaction logs, both PENDING, each holding a commit record and log entries for block 42; each entry carries a block pointer, the old value, a commit-record pointer, and a reader-list pointer; the application-memory block carries a log pointer and an R/W bit)

50 More on x-state When a processor references a block that is already part of a pending transaction, the system checks the RW bit and log pointer to determine the correct action: – use the old value – use the new value – abort the transaction

51 Commit action (diagram: transaction log 1's commit record changes from PENDING to COMMITTED, making its writes to the block permanent; transaction log 2 remains PENDING)

52 Cleanup action (diagram: after transaction log 1 commits, its log entries are discarded and the block's log pointer and RW bit are cleared; transaction log 2 remains PENDING)

53 Abort action (diagram: transaction log 1's commit record changes from PENDING to ABORTED, and the old value 42 from its log entry is written back to the block, undoing the tentative store)

54 Transaction Conflicts A conflict occurs when two or more pending transactions have accessed a block and at least one of the accesses is a write. Performing a transactional load: – check that the log pointer refers to an entry in the current transaction's log, or that the RW bit is R. Performing a transactional store: – check that the log pointer references no other transaction. In case of a conflict, some of the conflicting transactions are aborted. – Which transaction should be aborted?

55 Caching For a small transaction that fits in cache, UTM, like earlier proposed HTM systems, uses the cache-coherence protocol to identify conflicts. For transactions too big to fit in cache, the transaction's x-state overflows into the ordinary memory hierarchy: – most log entries never need to be created – a transaction log is created only when the transaction overflows the cache.

56 UTM's Goal Support transactions that: – run for an indefinite length of time – migrate from one processor to another – have footprints bigger than physical memory. The main technique proposed is to treat the x-state as a system-wide data structure that uses global virtual addresses.

57 Benefits and Limits of UTM Limits: – Complicated implementation Benefits: – Unlimited footprint – Unlimited duration – Migration possible – Good performance in the common case (small transactions)

58 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

59 LTM: Visible, Large, Frequent, Scalable “Large Transactional Memory” – Not truly unbounded, but simple and cheap Minimal architectural changes, high performance – Small modifications to cache and processor core – No changes to main memory, cache coherence protocol – Can be pin-compatible with conventional processors

60 LTM's Restrictions: – A transaction's footprint is limited to (nearly) the size of physical memory. – Its duration must be less than a time slice. – Transactions cannot migrate between processors. With these restrictions, LTM can be implemented by modifying only the cache and processor core.

61 LTM vs UTM Like UTM, LTM maintains data about pending transactions in the cache and detects conflicts using the cache-coherence protocol. Unlike UTM, LTM does not treat the transaction as a data structure; instead, it binds a transaction to a particular cache. – Transactional data overflows from the cache into a hash table in main memory. LTM and UTM have the same semantics: the XBEGIN and XEND instructions are the same. In LTM, the cache plays the major part…

62 Additions to the Cache LTM adds a bit (T) per cache line to indicate that the line has been accessed as part of a pending transaction, and an additional bit (O) per cache set to indicate that the set has overflowed.

63–68 Cache overflow mechanism (diagram sequence) The slides step through the instruction sequence below, showing how the per-line T bits, the per-set O bit, and the overflow hash table in main memory evolve:

ST 1000, 55     (non-transactional store)
XBEGIN L1
LD R1, 1000     (recording reads: T bit set on the loaded line)
ST 2000, 66     (recording writes: T bit set on the written line)
ST 3000, 77     (spilling: the set is full, so a transactional line is spilled to the overflow hash table and the set's O bit is set)
LD R1, 1000     (miss handling: a miss in an overflowed set also probes the hash table)
XEND

69 LTM - Summary Transactions as large as physical memory Scalable overflow and commit Easy to implement! Low overhead

70 Outline Lock-Free Hardware Transactional Memory (HTM)  Transactions  Cache coherence protocol  General Implementation  Simulation UTM LTM TCC (briefly) Conclusions

71 Transactional Memory Coherence and Consistency (TCC) Hammond, Wong, Chen, Carlstrom, Davis (June 2004), "Transactional Memory Coherence and Consistency." All transactions, all the time! Code is partitioned into transactions by the programmer or by tools – possibly at run time, for legacy code! All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit. Updates are broadcast to all CPUs, which detect conflicts with their own transactions and abort.

72 TCC Implementation (diagram: the CPU core issues loads and stores to the local cache hierarchy; stores also go into a write buffer; commit control broadcasts the buffered writes over a bus or network, which other nodes snoop to detect conflicts)

73 Conclusions Unbounded, scalable, and efficient transactional memory systems can be built. – They support large, frequent, and concurrent transactions. – They allow programmers to (finally!) use our parallel systems! Three architectures: – LTM: easy to realize, almost unbounded – UTM: truly unbounded – TCC: high performance

74 THE END… Royi Maimon Merav Havuv