Transactional Memory: Hardware Proposals Overview

Manu Awasthi
Architecture Reading Club, Fall 2006

Why do we care?
Multicore architectures (chip multiprocessors, CMPs) are on the rise, and they support lots of cheap threads. Synchronization will be an issue, because those threads perform concurrent updates on shared memory. Today’s methodology, locks, is not scalable and fails to exploit concurrency to the fullest.

Why are locks EVIL?
Locks are objects that only one thread can hold at a time. Organization: one lock for each shared structure. Usage: (block) → acquire → access → release.
Correctness issues: under-locking → data races; acquires in different orders → deadlock.
Performance issues: conservative serialization, overhead of acquiring, difficulty finding the right granularity, blocking.
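The deadlock bullet can be made concrete. A minimal sketch, assuming POSIX threads and a hypothetical Account type: if one thread locks (a, b) while another locks (b, a), each can end up holding one lock and waiting forever for the other. A common remedy, shown here, is to acquire locks in one global order (by address).

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared structure protected by its own lock. */
typedef struct {
    pthread_mutex_t lock;
    int balance;
} Account;

void account_init(Account *a, int balance) {
    pthread_mutex_init(&a->lock, NULL);
    a->balance = balance;
}

/* Acquire both locks in a single global order (lower address first).
   Without this, transfer(a, b) and transfer(b, a) running concurrently
   could each hold one lock while waiting for the other: deadlock. */
static void lock_pair(Account *x, Account *y) {
    if ((uintptr_t)x < (uintptr_t)y) {
        pthread_mutex_lock(&x->lock);
        pthread_mutex_lock(&y->lock);
    } else {
        pthread_mutex_lock(&y->lock);
        pthread_mutex_lock(&x->lock);
    }
}

static void unlock_pair(Account *x, Account *y) {
    pthread_mutex_unlock(&x->lock);
    pthread_mutex_unlock(&y->lock);
}

/* acquire -> access -> release */
void transfer(Account *from, Account *to, int amount) {
    assert(from != to); /* locking the same mutex twice would self-deadlock */
    lock_pair(from, to);
    from->balance -= amount;
    to->balance += amount;
    unlock_pair(from, to);
}
```

The ordering rule works only if every caller follows it; this is the kind of global discipline that makes locks hard to get right.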

Example of evil locks
    struct Shared_Structure {
        int shared_var1;
        /* ... */
    };

Coarse-Grained Locking
Easily made correct… but not scalable.

Fine-Grained Locking
More scalable, but with high overhead in acquire and release, and increased complexity.
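The two locking slides can be compared side by side. A minimal sketch, assuming POSIX threads and a hypothetical table of counters: the coarse version serializes every update behind one lock; the fine-grained version gives each bucket its own lock, so updates to different buckets can proceed in parallel, at the cost of more locks and more acquire/release work for operations that span buckets.

```c
#include <assert.h>
#include <pthread.h>

#define NBUCKETS 16

/* Hypothetical table of counters. A real table would use ONE of the two
   schemes; both are kept here only to compare them. */
typedef struct {
    int count[NBUCKETS];
    pthread_mutex_t table_lock;            /* coarse: one lock for everything */
    pthread_mutex_t bucket_lock[NBUCKETS]; /* fine: one lock per bucket */
} Table;

void table_init(Table *t) {
    pthread_mutex_init(&t->table_lock, NULL);
    for (int i = 0; i < NBUCKETS; i++) {
        t->count[i] = 0;
        pthread_mutex_init(&t->bucket_lock[i], NULL);
    }
}

/* Coarse-grained: easy to get right, but threads touching unrelated
   buckets still wait for each other. */
void incr_coarse(Table *t, unsigned key) {
    pthread_mutex_lock(&t->table_lock);
    t->count[key % NBUCKETS]++;
    pthread_mutex_unlock(&t->table_lock);
}

/* Fine-grained: only threads hitting the same bucket contend. */
void incr_fine(Table *t, unsigned key) {
    unsigned b = key % NBUCKETS;
    pthread_mutex_lock(&t->bucket_lock[b]);
    t->count[b]++;
    pthread_mutex_unlock(&t->bucket_lock[b]);
}

/* Operations spanning buckets need many locks, taken in a fixed
   (ascending) order to avoid deadlock: the added complexity. */
int sum_fine(Table *t) {
    int total = 0;
    for (int i = 0; i < NBUCKETS; i++) pthread_mutex_lock(&t->bucket_lock[i]);
    for (int i = 0; i < NBUCKETS; i++) total += t->count[i];
    for (int i = NBUCKETS - 1; i >= 0; i--) pthread_mutex_unlock(&t->bucket_lock[i]);
    return total;
}
```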

Enter Transactions…
Code segments with three features: atomicity, serialization only on conflicts, and rollback support.
    <begin_transaction> {
        statement_1;
        statement_2;
        statement_3;
        …
    } <end_transaction>
Generally, critical section = transaction: its instructions execute atomically.
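At the smallest scale, hardware already offers a one-word "transaction": compare-and-swap. A minimal sketch using C11 atomics (not a TM interface; atomic_update is a hypothetical helper): the new value is computed optimistically, published atomically, serialized only if another thread changed the word in between, and "rolled back" by discarding the computed value and retrying. Transactional memory generalizes this from one word to an arbitrary set of reads and writes.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically apply new = old * mul + add to a shared word, lock-free. */
long atomic_update(_Atomic long *shared, long mul, long add) {
    long old = atomic_load(shared);
    long desired;
    do {
        /* Speculative work on a private snapshot of the shared value. */
        desired = old * mul + add;
        /* Commit only if nobody wrote *shared since we read it. On failure,
           `old` is refreshed with the current value and the work is redone. */
    } while (!atomic_compare_exchange_weak(shared, &old, desired));
    return desired;
}
```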

Agenda
Transactions: what all the hoopla’s about. Research proposals: usages and implementations.
Disclaimer 1: covering only hardware support. Disclaimer 2: purely an overview.

Hardware Overview
Exploit cache coherence protocols: they already do almost what we need (invalidation, consistency checking).
Exploit speculative execution: branch prediction = optimistic synchronization!

Execution Strategy
Four main components: logging/buffering (speculative execution), conflict detection, abort/rollback, and commit. Each paper presents different methods of doing the above.
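The four components can be seen together in a small software model. A minimal, single-threaded sketch, assuming a hypothetical word-addressed memory with a version number per word (a software stand-in for what the hardware proposals track in caches): writes are buffered (logging/buffering), reads record the version they saw, commit validates those versions (conflict detection), and a failed validation drops the buffer (abort/rollback). A real implementation would also have to make the commit itself atomic.

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_WORDS 8
#define MAX_SET 8

/* Hypothetical shared memory: a value plus a version bumped on each commit. */
typedef struct { int value[MEM_WORDS]; unsigned version[MEM_WORDS]; } Memory;

typedef struct {
    int read_addr[MAX_SET]; unsigned read_ver[MAX_SET]; int nreads;
    int write_addr[MAX_SET]; int write_val[MAX_SET]; int nwrites;
} Tx;

void tx_begin(Tx *tx) { tx->nreads = 0; tx->nwrites = 0; }

/* Buffering: a transaction sees its own latest buffered write first. */
int tx_read(Tx *tx, const Memory *m, int addr) {
    for (int i = tx->nwrites - 1; i >= 0; i--)
        if (tx->write_addr[i] == addr) return tx->write_val[i];
    assert(tx->nreads < MAX_SET);
    tx->read_addr[tx->nreads] = addr;
    tx->read_ver[tx->nreads++] = m->version[addr];
    return m->value[addr];
}

void tx_write(Tx *tx, int addr, int val) {
    assert(tx->nwrites < MAX_SET);
    tx->write_addr[tx->nwrites] = addr;
    tx->write_val[tx->nwrites++] = val;
}

/* Conflict detection, then commit or abort. Returns true on commit. */
bool tx_commit(Tx *tx, Memory *m) {
    for (int i = 0; i < tx->nreads; i++)
        if (m->version[tx->read_addr[i]] != tx->read_ver[i]) {
            tx->nwrites = 0; /* abort: drop all speculative state */
            return false;
        }
    for (int i = 0; i < tx->nwrites; i++) { /* publish buffered writes */
        m->value[tx->write_addr[i]] = tx->write_val[i];
        m->version[tx->write_addr[i]]++;
    }
    return true;
}
```

Two transactions that both increment word 0 illustrate "serialization only on conflicts": whichever validates first commits, and the other sees a changed version and aborts.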

[Figure: HW Transactional Memory, animated over five steps (read, read, commit, write, Rewind): cache lines marked T (transactional) or D (dirty); transactions shown as active, committed, or aborted; caches connected to memory through the interconnect.]
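The abort ("Rewind") step of the hardware TM slides can be modeled in a few lines. A minimal, single-threaded sketch with hypothetical names: each cache line carries a T (transactional) bit; when the coherence protocol delivers an invalidation for a line the local transaction has accessed, the transaction aborts and all of its T lines are discarded.

```c
#include <assert.h>
#include <stdbool.h>

#define LINES 4

typedef enum { INVALID, VALID, DIRTY } LineState;

/* Hypothetical per-processor cache with a T bit per line.
   Sketch assumption: lines enter the transaction clean, so dropping
   them on abort loses no committed data. */
typedef struct {
    unsigned long tag[LINES];
    LineState state[LINES];
    bool t_bit[LINES];   /* line read or written inside the transaction */
    bool aborted;
} Cache;

/* Transactional access: bring the line in and mark it T. */
void tx_access(Cache *c, int way, unsigned long tag) {
    c->tag[way] = tag;
    if (c->state[way] == INVALID) c->state[way] = VALID;
    c->t_bit[way] = true;
}

/* Another processor's write invalidates `tag` in this cache. */
void on_invalidate(Cache *c, unsigned long tag) {
    for (int w = 0; w < LINES; w++) {
        if (c->state[w] == INVALID || c->tag[w] != tag) continue;
        if (c->t_bit[w]) c->aborted = true; /* conflict with our transaction */
        c->state[w] = INVALID;
    }
    if (c->aborted) /* rewind: throw away every speculative line */
        for (int w = 0; w < LINES; w++)
            if (c->t_bit[w]) { c->state[w] = INVALID; c->t_bit[w] = false; }
}
```

An invalidation for a line outside the transaction's footprint leaves it running; one that hits a T line triggers the rewind.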

Transaction Commit
At the commit point, if there were no cache conflicts, we win. Mark the transactional entries: read-only lines become valid; modified lines become dirty (eventually written back).
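The commit rule maps to a small state transition. A minimal sketch with hypothetical names: at a conflict-free commit point, transactionally written lines become dirty (written back later), transactionally read lines become ordinary valid lines, and the transactional marks are cleared.

```c
#include <assert.h>
#include <stdbool.h>

#define LINES 4

typedef enum { INVALID, VALID, DIRTY } LineState;

/* Hypothetical cache with separate transactional read/write marks. */
typedef struct {
    LineState state[LINES];
    bool t_read[LINES];   /* read inside the transaction */
    bool t_write[LINES];  /* written inside the transaction */
} TxCache;

/* Called only when no conflict was detected: "we win". */
void tx_commit_lines(TxCache *c) {
    for (int w = 0; w < LINES; w++) {
        if (c->t_write[w])
            c->state[w] = DIRTY;  /* modified: eventually written back */
        else if (c->t_read[w] && c->state[w] != DIRTY)
            c->state[w] = VALID;  /* read-only: plain valid copy */
        c->t_read[w] = c->t_write[w] = false;
    }
}
```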

But…
There are limits to transactions. A transaction cannot commit if it is too big (it exceeds the transactional cache size) or too slow (it outlasts the scheduling quantum). Actual limits are platform-dependent.
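Because these limits are platform-dependent, a bounded transaction is commonly paired with a non-speculative fallback. A minimal, single-threaded sketch with hypothetical names and an assumed capacity of four buffered writes: the transaction aborts once its write set no longer fits, and the same work is redone under a conventional lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define TX_CAPACITY 4   /* assumed size of the transactional buffer */

typedef struct { int *addr[TX_CAPACITY]; int val[TX_CAPACITY]; int n; } WriteBuf;

/* Buffer a speculative write; fails (capacity abort) when the buffer is full. */
static bool buf_write(WriteBuf *b, int *addr, int val) {
    if (b->n == TX_CAPACITY) return false;
    b->addr[b->n] = addr;
    b->val[b->n++] = val;
    return true;
}

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Set arr[0..len) to v. Returns true if it ran as a transaction, false if
   it fell back to the lock. Single-threaded model: a real design must also
   make the transactional path conflict with lock holders. */
bool fill_array(int *arr, int len, int v) {
    WriteBuf b = { .n = 0 };
    bool fits = true;
    for (int i = 0; i < len && fits; i++)
        fits = buf_write(&b, &arr[i], v);
    if (fits) {                     /* commit: publish the buffered writes */
        for (int i = 0; i < b.n; i++) *b.addr[i] = b.val[i];
        return true;
    }
    /* Too big: discard the buffer and redo the work non-speculatively. */
    pthread_mutex_lock(&fallback_lock);
    for (int i = 0; i < len; i++) arr[i] = v;
    pthread_mutex_unlock(&fallback_lock);
    return false;
}
```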

TLR (Transactional Lock Removal) / SLE (Speculative Lock Elision)
Transactional execution of critical sections [Rajwar & Goodman, ASPLOS ’02]. Locks define the scope of a transaction, so the programming model doesn’t change. Hardware identifies critical sections and executes them speculatively. Timestamps provide serializability.

SLE
A mechanism to identify lock acquires and releases; the enabling mechanism for TLR. Built on the concept of silent stores.
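The silent-store idea behind SLE can be illustrated in software. A minimal, single-threaded sketch with hypothetical names: an acquire stores HELD to a free lock and the matching release stores FREE back, so the pair leaves the lock word unchanged. SLE exploits this by eliding both stores and running the critical section speculatively while the lock still looks free to everyone else; on a conflict, it falls back to really acquiring the lock.

```c
#include <assert.h>
#include <stdbool.h>

enum { FREE = 0, HELD = 1 };

typedef struct { int lock; int data; } Shared;
typedef struct { int elided; int acquired; } SleStats;

/* Key observation: the acquire store changes the lock word, but the
   matching release restores it, so the two stores are silent as a pair. */
bool pair_is_silent(int before, int acquire_value, int release_value) {
    return acquire_value != before && release_value == before;
}

/* Increment s->data in a lock-protected critical section. `conflict`
   simulates another processor touching the data during speculation. */
void critical_increment(Shared *s, bool conflict, SleStats *st) {
    if (s->lock == FREE && pair_is_silent(s->lock, HELD, FREE)) {
        /* Elide the acquire: the lock word stays FREE in memory, so other
           threads may run the same critical section concurrently. */
        int spec_data = s->data + 1;   /* speculative, kept in a buffer */
        if (!conflict) {
            s->data = spec_data;       /* elide the release and commit */
            st->elided++;
            return;
        }
        /* Misspeculation: discard spec_data and fall through. */
    }
    /* Fallback: really acquire the lock. (Single-threaded model: assumes
       no other holder, so no spinning is needed.) */
    assert(s->lock == FREE);
    s->lock = HELD;
    s->data++;
    s->lock = FREE;
    st->acquired++;
}
```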

SLE Algorithm

Livelocks

TLR Algorithm
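TLR resolves conflicts between speculating processors with timestamps, which also addresses livelock: if the oldest transaction always wins, some transaction always makes progress. A minimal sketch with hypothetical names: each transaction keeps the timestamp it obtained when it first started (retained across restarts), ties are broken by processor ID, and on a conflict the younger transaction must wait or restart.

```c
#include <assert.h>
#include <stdbool.h>

/* Timestamp from the transaction's first start; kept on restart so a
   repeatedly aborted transaction eventually becomes the oldest. */
typedef struct { unsigned long ts; int cpu; } TxStamp;

/* True if `a` has priority over `b`: older timestamp wins,
   ties broken by the lower processor ID. */
bool has_priority(TxStamp a, TxStamp b) {
    if (a.ts != b.ts) return a.ts < b.ts;
    return a.cpu < b.cpu;
}

typedef enum { DEFER_REQUEST, YIELD_AND_RESTART } Decision;

/* The holder of a contended line receives a conflicting request. */
Decision on_conflicting_request(TxStamp holder, TxStamp requester) {
    /* An older holder keeps the line and defers the younger requester;
       a younger holder gives the line up and restarts (same timestamp). */
    return has_priority(holder, requester) ? DEFER_REQUEST : YIELD_AND_RESTART;
}
```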

TCC @ Stanford [Hammond+, ISCA ’04 & ASPLOS ’04]
TCC (Transactional Coherence and Consistency): again, speculative transaction execution. Identify the transaction start and end, track the read set and write set, and save the architectural state. Check memory references for conflicts by snooping the system bus for violations. To commit, fold the commit state into a packet and send it over the system bus. Centralized bus arbiter => scalability limits!!
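TCC detects conflicts lazily, at commit time. A minimal, single-threaded sketch with hypothetical names, using 64-bit masks (one bit per cache line, an assumed size) as read and write sets: the processor that wins bus arbitration broadcasts its write set, and every other processor whose read set overlaps it has been violated and must restart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPROC 4

/* One bit per cache line; 64 lines assumed for the sketch. */
typedef struct { uint64_t read_set, write_set; bool violated; } TccCpu;

/* Processor `winner` has won the bus arbiter and broadcasts its commit
   packet (its write set); the others snoop it. Returns the number of
   processors that must roll back. */
int tcc_commit(TccCpu cpu[], int winner) {
    uint64_t packet = cpu[winner].write_set;
    int restarts = 0;
    for (int p = 0; p < NPROC; p++) {
        if (p == winner) continue;
        if (cpu[p].read_set & packet) {   /* it read data that is now stale */
            cpu[p].violated = true;
            cpu[p].read_set = cpu[p].write_set = 0; /* discard and restart */
            restarts++;
        }
    }
    cpu[winner].read_set = cpu[winner].write_set = 0; /* next transaction */
    return restarts;
}
```

Because commits are serialized through the arbiter, only read-after-write overlaps matter here; that serialization point is also the scalability limit noted on the slide.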

TCC – Programming Model
Divide the program into transactions. Here, it’s the programmer’s job, but it is easier to do than with locks. Why?
Specify order: in case the relative ordering of transaction commits matters (e.g.?), assign phase numbers to transactions.
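The ordering rule can be sketched as a commit gate. A minimal sketch with hypothetical names: each transaction carries a programmer-assigned, non-negative phase number; a transaction may commit only once no uncommitted transaction has a lower phase, while transactions in the same phase may commit in any order.

```c
#include <assert.h>
#include <stdbool.h>

#define NTX 4

typedef struct { int phase; bool committed; } PhasedTx;

/* Oldest phase among transactions that have not committed yet. */
static int oldest_phase(const PhasedTx tx[], int n) {
    int oldest = -1;
    for (int i = 0; i < n; i++)
        if (!tx[i].committed && (oldest < 0 || tx[i].phase < oldest))
            oldest = tx[i].phase;
    return oldest;
}

/* Try to commit tx[i]; it must wait while an older phase is outstanding. */
bool try_commit(PhasedTx tx[], int n, int i) {
    if (tx[i].committed || tx[i].phase != oldest_phase(tx, n)) return false;
    tx[i].committed = true;
    return true;
}
```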

TCC Node

Some Results
Small read state (6-12 kB) and write state (4-8 kB), both per benchmark, per processor. Significant speedup. Not-so-modest bandwidth requirements.

UTM/LTM (Unbounded / Large Transactional Memory) @ MIT
Most transactions are small: 99.9% touch 54 cache lines or fewer. BUT some go up to 8000 lines (!!!!!). Thesis: the transaction footprint should be unbounded. UTM adds ISA support for this, with bookkeeping in memory (a transaction log), which helps transactions survive interrupts and process migration.

So, What’s New?
ISA support: XBEGIN pc and XEND, where pc is the address of the abort handler. Rollback support: a snapshot of the register rename tables. The Xstate data structure holds the memory state: log records of all active transactions, where each log = commit record + vector of log entries. Each memory block also carries a log pointer and an RW bit.
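The Xstate organization can be sketched as plain data structures. A simplified, single-threaded model with hypothetical field names: each transaction's log holds a commit record (pending, committed, or aborted) and a vector of entries recording a block and its old value; each memory block carries a log pointer to the transaction that last touched it and an RW bit marking a write. Aborting restores old values from the log, and a block whose owner is no longer pending is treated as free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed only for brevity; the point of UTM is that the log lives in
   memory and can grow with the transaction. */
#define MAX_ENTRIES 8

typedef enum { PENDING, COMMITTED, ABORTED } CommitRecord;

typedef struct Block Block;
typedef struct { Block *block; int old_value; } LogEntry;

typedef struct {
    CommitRecord status;
    LogEntry entry[MAX_ENTRIES];
    int n;
} TxLog;

struct Block {
    int value;
    TxLog *log_ptr;  /* transaction that last touched this block, if any */
    bool rw;         /* true if that transaction wrote the block */
};

/* Conflict: the block belongs to another, still-pending transaction. */
bool conflicts(const Block *b, const TxLog *me) {
    return b->log_ptr && b->log_ptr != me && b->log_ptr->status == PENDING;
}

/* Transactional write: log the old value, then update in place. */
bool utm_write(TxLog *tx, Block *b, int v) {
    if (conflicts(b, tx) || tx->n == MAX_ENTRIES) return false;
    tx->entry[tx->n++] = (LogEntry){ b, b->value };
    b->log_ptr = tx;
    b->rw = true;
    b->value = v;
    return true;
}

void utm_commit(TxLog *tx) { tx->status = COMMITTED; }

/* Abort: restore old values, newest first, and mark the commit record. */
void utm_abort(TxLog *tx) {
    for (int i = tx->n - 1; i >= 0; i--)
        tx->entry[i].block->value = tx->entry[i].old_value;
    tx->status = ABORTED;
}
```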

Processor Modifications

The Xstate Data Structure

Interesting Results

LogTM @ UW-Madison
Motivation: make the common case fast; commits are more frequent than aborts.
Basic strategy: similar to UTM. Store new values in place and old values in a log.
Log properties: one log per thread, cacheable, in virtual memory (i.e., part of the thread’s address space is reserved for logging). Log writes are mostly cache hits, and TLB translation overhead is low (for small transactions).
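"New values in place, old values in a log" makes commit cheap and abort comparatively expensive, which matches the "commits are more frequent than aborts" motivation. A minimal sketch with hypothetical names: a per-thread undo log records (address, old value) before an in-place store; commit just resets the log, while abort walks it backwards to restore memory. As a design choice of this sketch, each address is logged only on its first write, since later old values are never needed for rollback.

```c
#include <assert.h>
#include <stdbool.h>

#define LOG_CAP 16

/* Per-thread undo log (in LogTM it lives in the thread's virtual memory). */
typedef struct {
    int *addr[LOG_CAP];
    int old[LOG_CAP];
    int n;
} UndoLog;

static bool already_logged(const UndoLog *ul, const int *p) {
    for (int i = 0; i < ul->n; i++)
        if (ul->addr[i] == p) return true;
    return false;
}

void tx_begin(UndoLog *ul) { ul->n = 0; }

/* Eager versioning: save the old value once, then write in place. */
void tx_store(UndoLog *ul, int *p, int v) {
    if (!already_logged(ul, p)) {
        assert(ul->n < LOG_CAP);
        ul->addr[ul->n] = p;
        ul->old[ul->n++] = *p;
    }
    *p = v;
}

/* Common case: new values are already in memory, so commit is O(1). */
void tx_commit(UndoLog *ul) { ul->n = 0; }

/* Rare case: restore old values, newest entry first. */
void tx_abort(UndoLog *ul) {
    for (int i = ul->n - 1; i >= 0; i--) *ul->addr[i] = ul->old[i];
    ul->n = 0;
}
```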

Conflict Detection
Directory-based protocol: the requester sends a request to the directory, the directory forwards it to processors, and each processor checks for conflicts and replies Ack (no conflict) or Nack (conflict). The requester resolves the conflict based on the responses. Extended directory states take care of transactional line overflow.
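The request/forward/Ack-Nack exchange can be modeled as follows. A minimal, single-threaded sketch with hypothetical names: the directory forwards a request only to processors it lists as sharers; each checks its transactional read/write bits for the line and Nacks on a conflict (a write against any transactional access, or a read against a transactional write); the requester proceeds only if every response is an Ack.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 4

typedef enum { ACK, NACK } Response;

/* Per-processor transactional bits for one cache line. */
typedef struct { bool r, w; } TxBits;

/* Directory entry for the line: which processors may hold it. */
typedef struct { bool sharer[NPROC]; } DirEntry;

/* A processor checks a forwarded request against its own bits. */
static Response check(TxBits mine, bool is_write) {
    if (mine.w || (is_write && mine.r)) return NACK;
    return ACK;
}

/* Returns true if `req` may proceed; on a Nack, *nacker names the CPU. */
bool request_line(const DirEntry *dir, const TxBits bits[], int req,
                  bool is_write, int *nacker) {
    for (int p = 0; p < NPROC; p++) {
        if (p == req || !dir->sharer[p]) continue; /* directory filters */
        if (check(bits[p], is_write) == NACK) {
            *nacker = p; /* requester resolves: e.g. stall and retry, or abort */
            return false;
        }
    }
    return true;
}
```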

More Work @ UW-Madison
VTM (Virtualizing Transactional Memory; Rajwar+) and Thread-Level TM (Goodman+).
Goal: persistent transactions with less overhead. Approach: group transactions by process. Implementation: a buffer in the cache + an overflow table in virtual memory + various interesting optimizations.
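The "buffer in cache + overflow table in virtual memory" split can be sketched as a two-level lookup. A minimal sketch with hypothetical names and sizes: speculative lines live in a small cache buffer; once it is full, further lines spill into an overflow table (a plain array here, standing in for a structure in virtual memory), so the transaction keeps running instead of aborting, and lookups try the cache first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 2      /* assumed tiny on-chip buffer */
#define OVERFLOW_SLOTS 64  /* assumed overflow table size */

typedef struct { unsigned long addr; int val; } Entry;

typedef struct {
    Entry cache[CACHE_SLOTS];  int ncache;
    Entry ovf[OVERFLOW_SLOTS]; int novf;   /* stand-in for the VM table */
} SpecBuffer;

static Entry *find(Entry *e, int n, unsigned long addr) {
    for (int i = 0; i < n; i++)
        if (e[i].addr == addr) return &e[i];
    return NULL;
}

/* Buffer a speculative write: cache if possible, else the overflow table. */
bool spec_write(SpecBuffer *b, unsigned long addr, int val) {
    Entry *e = find(b->cache, b->ncache, addr);
    if (!e) e = find(b->ovf, b->novf, addr);
    if (e) { e->val = val; return true; }
    if (b->ncache < CACHE_SLOTS) { b->cache[b->ncache++] = (Entry){addr, val}; return true; }
    if (b->novf < OVERFLOW_SLOTS) { b->ovf[b->novf++] = (Entry){addr, val}; return true; }
    return false;  /* even the overflow table is full */
}

/* Speculative read: fast path in the cache, slow path in the table. */
bool spec_read(SpecBuffer *b, unsigned long addr, int *out) {
    Entry *e = find(b->cache, b->ncache, addr);
    if (!e) e = find(b->ovf, b->novf, addr);
    if (!e) return false;  /* not written by this transaction */
    *out = e->val;
    return true;
}
```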

Summary
Transactions: a promising approach to synchronization, offering a simple interface + efficient implementation.
Uses: optimistic lock removal, lock-free data structures, general-purpose synchronization, parallelization, ??
Challenges: implementation, interface, OS involvement, I/O + rollback.