
1 Hardware Transactional Memory (Herlihy, Moss, 1993) Some slides are taken from a presentation by Royi Maimon & Merav Havuv, prepared for a seminar given by Prof. Yehuda Afek.

2 Outline Hardware Transactional Memory (HTM)  Transactions  Caches and coherence protocols  General Implementation  Simulation

3 What is a transaction? A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts If a transaction commits, all the loads and stores appear to have executed atomically If a transaction aborts, none of its stores take effect Transaction operations aren't visible until they commit (if they do)

4 Transactional Memory A new multiprocessor architecture The goal: Implementing non-blocking synchronization that is – efficient – easy to use compared with conventional techniques based on mutual exclusion Implemented by straightforward extensions to multiprocessor cache-coherence protocols and / or by software mechanisms

5 5 Outline Hardware Transactional Memory (HTM)  Transactions  Caches and coherence protocols  General Implementation  Simulation

6 A cache is an associative (a.k.a. content-addressable) memory Conventional memory: given an address, it returns the data D stored there Associative memory: given data D, it returns an address A such that *A = D
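The lookup difference can be sketched in C. The tiny 4-line table, the field names, and the serial scan are illustrative assumptions only; real associative memories compare all tags simultaneously with per-line comparators:

```c
#include <assert.h>
#include <stdint.h>

/* Miniature content-addressable lookup: find an entry by tag, not by index. */
#define LINES 4

struct line { int valid; uint32_t tag; uint32_t data; };
struct line cam[LINES];

/* Returns 1 and fills *out on a hit, 0 on a miss. */
int assoc_lookup(uint32_t tag, uint32_t *out) {
    for (int i = 0; i < LINES; i++) {   /* "parallel" compare, done serially here */
        if (cam[i].valid && cam[i].tag == tag) {
            *out = cam[i].data;
            return 1;
        }
    }
    return 0;
}
```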

7 Cache Associativity

8 Fully associative cache

9 Cache tags and address structure (figure: main memory and cache) The tag is taken from the high-order address bits; the index comes from the bits just below it

10 Cache-Coherence Protocol In multiprocessors, each processor typically has its own local cache memory – Minimize average latency due to memory access – Decrease bus traffic – Maximize cache hit ratio A Cache-coherence protocol manages the consistency of caches and main memory: – Shared memory semantics maintained – Caches and main memory communicate to guarantee coherency

11 The need to maintain coherency Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson

12 Coherency requirements Text taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson

13 Snoopy Cache All caches monitor (snoop) the activity on a global bus/interconnect to determine if they have a copy of the block of data that is requested on the bus.

14 Coherence protocol types Write-through: the information is written both to the cache block and to the block in the lower-level memory Write-back: the information is written only to the cache block. The modified cache block is written to main memory only when it is replaced

15 3-state coherence protocol Invalid: the cache line/block does not contain valid information Shared: the cache line/block contains information that may be shared by other caches Modified/exclusive: the cache line/block was modified while in the cache and is exclusively owned by the current cache
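A minimal sketch of this 3-state (MSI-style) line state machine in C. The event names and the simplified transition rules are assumptions; a real protocol also distinguishes read-for-ownership, write-back completion, and so on:

```c
#include <assert.h>

/* Simplified per-line state machine for a 3-state invalidation protocol. */
typedef enum { INVALID, SHARED, MODIFIED } mstate;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_WRITE } mevent;

mstate next_state(mstate s, mevent e) {
    switch (e) {
    case PR_READ:   return (s == INVALID) ? SHARED : s;  /* miss fetches a shared copy */
    case PR_WRITE:  return MODIFIED;                     /* gain exclusive ownership   */
    case BUS_READ:  return (s == MODIFIED) ? SHARED : s; /* supply data and demote     */
    case BUS_WRITE: return INVALID;                      /* another cache writes: invalidate */
    }
    return s;
}
```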

16 Cache-coherency mechanism (write-back)

Cache-coherency mechanism – state transition diagram Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson; transitions are based on processor requests and on bus requests

18 MESI protocol
State           Line valid?   Main memory up to date?   Other cached copies?
I (Invalid)     No            —                         —
S (Shared)      Yes           Yes                       Maybe
E (Exclusive)   Yes           Yes                       No
M (Modified)    Yes           No                        No

19 Outline Hardware Transactional Memory (HTM)  Transactions  Caches and coherence protocols  General Implementation  Simulation

20 HTM-supported API The following primitive instructions for accessing memory are provided: Load-transactional (LT): reads the value of a shared memory location into a private register. Load-transactional-exclusive (LTX): like LT, but “hints” that the location is likely to be modified. Store-transactional (ST): tentatively writes a value from a private register to a shared memory location. Commit (COMMIT): attempts to make the transaction’s tentative changes permanent. Abort (ABORT): discards the transaction’s tentative changes. Validate (VALIDATE): tests the current transaction status.

21 Some definitions Read set: the set of locations read via LT by a transaction Write set: the set of locations accessed via LTX or ST by a transaction Data set (footprint): the union of the read and write sets A set of values in memory is inconsistent if it couldn’t have been produced by any serial execution of transactions

22 Intended Use Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would: 1. use LT or LTX to read from a set of locations 2. use VALIDATE to check that the values read are consistent, 3. use ST to modify a set of locations 4. use COMMIT to make the changes permanent. If either the VALIDATE or the COMMIT fails, the process returns to Step (1).
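The four steps can be sketched as a retry loop in C. The LT/VALIDATE/ST/COMMIT stubs below are not the hardware primitives from the paper; they merely model a single-threaded, conflict-free execution so the control flow of Steps 1–4 is visible:

```c
#include <assert.h>

/* Single-threaded stand-ins for the hardware primitives: VALIDATE never
 * detects a conflict and COMMIT always succeeds, so the loop runs once. */
static unsigned LT(unsigned *a)             { return *a; }
static int      VALIDATE(void)              { return 1; }
static void     ST(unsigned *a, unsigned v) { *a = v; }  /* tentative write */
static int      COMMIT(void)                { return 1; }

unsigned shared_counter = 0;

void atomic_increment(unsigned *ctr) {
    for (;;) {
        unsigned v = LT(ctr);       /* Step 1: read the location          */
        if (!VALIDATE()) continue;  /* Step 2: check values are consistent */
        ST(ctr, v + 1);             /* Step 3: tentative store            */
        if (COMMIT()) break;        /* Step 4: make permanent, else retry */
    }
}
```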

23 Implementation Hardware transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols Herlihy and Moss proposed extending the “snoopy” cache protocol for a shared bus to support transactional memory It supports short-lived transactions with a relatively small data set

24 The basic idea Any protocol capable of detecting memory access conflicts can also detect transaction conflicts at no extra cost Once a transaction conflict is detected, it can be resolved in a variety of ways

25 Implementation Each processor maintains two caches – a regular cache for non-transactional operations – a transactional cache: a small, fully associative cache for transactional operations. It holds all tentative writes, without propagating them to other processors or to main memory until commit An entry may reside in one cache or the other, but not in both

26 Cache line states Each cache line (regular or transactional) has one of the coherence-protocol states (e.g., invalid, shared, exclusive, modified) Each transactional cache line has (in addition) one of these states: TC_INVALID, TC_NORMAL, TC_COMMIT (holds the “old” value), or TC_ABORT (holds the “new”, tentative value)

27 Cleanup When the transactional cache needs space for a new entry, it searches for: – a TC_INVALID entry – if none, a TC_NORMAL entry – finally, a TC_COMMIT entry (such an entry can be replaced because it holds only an “old” value, which can first be written back to main memory and restored from there if the transaction aborts)
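A sketch of this replacement priority in C, assuming the TC_* states are modeled as an enum and the search is a simple linear scan over the transactional cache's entries:

```c
#include <assert.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_state;

/* Returns the index of the victim line, or -1 if only TC_ABORT entries
 * remain (the transaction's own tentative writes, which must not be evicted). */
int pick_victim(const tc_state *e, int n) {
    for (int i = 0; i < n; i++) if (e[i] == TC_INVALID) return i;
    for (int i = 0; i < n; i++) if (e[i] == TC_NORMAL)  return i;
    for (int i = 0; i < n; i++) if (e[i] == TC_COMMIT)  return i; /* old value can be
                                       written back to memory first, so it stays recoverable */
    return -1;
}
```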

28 Processor actions Each processor maintains two flags: – The transaction active (TACTIVE) flag: indicates whether a transaction is in progress – The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False) Non-transactional operations behave exactly as in original cache-coherence protocol

29 Example – LT operation:
1. Look for a TC_ABORT entry. Found? Return its value.
2. Not found? Look for a TC_NORMAL entry. Found? Change it to TC_ABORT and allocate another TC_COMMIT entry with the same value; return the value.
3. Not found (cache miss)? Ask to read this block from the shared memory.
– On a successful read: create two entries, TC_ABORT and TC_COMMIT, and return the value.
– On a BUSY signal: abort the transaction:  TSTATUS=FALSE  Drop TC_ABORT entries  All TC_COMMIT entries are set to TC_NORMAL
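The LT flow above can be sketched in C. The 8-entry table, the mem[] array, and the always-successful memory read are assumptions for illustration; the BUSY/abort path is omitted:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_state;
typedef struct { tc_state st; uint32_t addr, val; } entry;

#define TC_LINES 8
entry tc[TC_LINES];
static uint32_t mem[16];

static int alloc_slot(void) {               /* first free line; real hardware */
    for (int i = 0; i < TC_LINES; i++)      /* would run the cleanup policy   */
        if (tc[i].st == TC_INVALID) return i;
    return -1;
}

uint32_t lt(uint32_t addr) {
    for (int i = 0; i < TC_LINES; i++)      /* 1: hit on a TC_ABORT entry */
        if (tc[i].st == TC_ABORT && tc[i].addr == addr) return tc[i].val;
    for (int i = 0; i < TC_LINES; i++)      /* 2: hit on a TC_NORMAL entry */
        if (tc[i].st == TC_NORMAL && tc[i].addr == addr) {
            tc[i].st = TC_ABORT;            /* becomes the tentative copy  */
            tc[alloc_slot()] = (entry){TC_COMMIT, addr, tc[i].val};
            return tc[i].val;
        }
    uint32_t v = mem[addr];                 /* 3: miss - read shared memory */
    tc[alloc_slot()] = (entry){TC_ABORT, addr, v};
    tc[alloc_slot()] = (entry){TC_COMMIT, addr, v};
    return v;
}
```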

30 Snoopy cache actions: Both the regular cache and the transactional cache snoop on the bus. A cache ignores any bus cycles for lines not in that cache. The transactional cache’s behavior: – If TSTATUS=False, or if the operation isn’t transactional, it acts just like the regular cache, but ignores entries with a state other than TC_NORMAL – Otherwise: on an LT by another CPU, if the state is TC_NORMAL or the line has not been written to, the cache returns the value; in all other cases it returns BUSY

31 Committing/aborting a transaction Upon commit  Set all entries tagged TC_COMMIT to TC_INVALID  Set all entries tagged TC_ABORT to TC_NORMAL Upon abort  Set all entries tagged TC_ABORT to TC_INVALID  Set all entries tagged TC_COMMIT to TC_NORMAL Since the transactional cache is small, it is assumed that these operations can be done in parallel.
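A sketch of this bulk state update in C; the entries are modeled as a plain array walked in a loop, rather than the flash-updated hardware tags the slide assumes:

```c
#include <assert.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_state;

void finish_transaction(tc_state *e, int n, int committed) {
    for (int i = 0; i < n; i++) {
        if (committed) {
            if (e[i] == TC_COMMIT)     e[i] = TC_INVALID; /* drop old values       */
            else if (e[i] == TC_ABORT) e[i] = TC_NORMAL;  /* keep new values       */
        } else {
            if (e[i] == TC_ABORT)      e[i] = TC_INVALID; /* drop tentative writes */
            else if (e[i] == TC_COMMIT) e[i] = TC_NORMAL; /* restore old values    */
        }
    }
}
```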

32 Outline Hardware Transactional Memory (HTM)  Transactions  Caches and coherence protocols  General Implementation  Simulation

33 Simulation We’ll see example code for the producer/consumer algorithm using the transactional memory architecture. The simulation runs on both cache-coherence schemes: snoopy cache and directory-based. The simulation uses 32 processors and finishes when 2^16 operations have completed.

34 Part Of Producer/Consumer Code

typedef struct {
    Word deqs;                 /* holds the head's index */
    Word enqs;                 /* holds the tail's index */
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                           /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);                   /* advance counter */
        }
        if (COMMIT()) break;
        /* abort => exponential backoff */
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}
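The enqueue side is not shown on this slide; the following is a symmetric sketch, not the paper's code. The LTX/ST/COMMIT stubs are single-threaded and always commit, and QUEUE_SIZE plus the 0/QUEUE_FULL return convention are assumptions for illustration:

```c
#include <assert.h>

#define QUEUE_SIZE 8
#define QUEUE_FULL ((unsigned)-1)

typedef unsigned Word;
typedef struct { Word deqs; Word enqs; Word items[QUEUE_SIZE]; } queue;

/* Single-threaded, always-committing stand-ins for the HTM primitives. */
static Word LTX(Word *a)          { return *a; }
static void ST(Word *a, Word v)   { *a = v; }
static int  COMMIT(void)          { return 1; }

unsigned queue_enq(queue *q, Word item) {
    unsigned head, tail, result;
    while (1) {
        result = QUEUE_FULL;
        head = LTX(&q->deqs);
        tail = LTX(&q->enqs);
        if (tail - head < QUEUE_SIZE) {              /* room left? */
            ST(&q->items[tail % QUEUE_SIZE], item);
            ST(&q->enqs, tail + 1);                  /* advance counter */
            result = 0;
        }
        if (COMMIT()) break;                         /* real code would back off here */
    }
    return result;
}
```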

35 The results (charts): snoopy cache vs. directory-based coherency

36 Key Limitations:
– Transaction size is limited by cache size
– Transaction length is effectively limited by the scheduling quantum
– Process migration is problematic

37 MSA: A few sample research directions  Theoretic o Are there counters/stacks/queues with sub-linear write-contention? o What is the space complexity of obstruction-free read/write consensus? o What is the step complexity of a 1-time read/write counter? o...  (More) practical o The design of efficient lock-free/blocking concurrent objects o Defining more realistic metrics for blocking synchronization, and designing algorithms that are efficient w.r.t. these metrics o Improving the usability of transactional memory o...