TokenTM: Token-Based Hardware Transactional Memory

Presentation transcript:

TokenTM: Token-Based Hardware Transactional Memory
Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood
Multifacet Project (www.cs.wisc.edu/multifacet)
Dept. of Computer Sciences, University of Wisconsin-Madison

Executive Summary
  Current Hardware TMs
    Most Transactions Small & Short Running
    Penalize large/long transactions
    Too restrictive for widespread TM use?
  Hypothesis
    Must Support Efficient Large/Long Transactions As Well
    Is such an HTM even possible?
  Yes! TokenTM
    1. LogTM's log to buffer unbounded values
    2. Transactional Tokens for unbounded conflict detection
       Conflict state in memory metabits
       Concurrent updates via metastate fission/fusion
2/17/2019 Wisconsin Multifacet Project UW-Madison Architecture Seminar

Existing HTM Systems
  Assumption: Most transactions small & short running
    Optimized for small transactions
    Degrade with large, long-running transactions
  Non-localized overhead, e.g.,
    LogTM-SE [Yen07]: false conflicts
    OneTM [Blundell07]: serializes
  Complex, expensive operations, e.g.,
    XTM [Chung06] & PTM [Chuang06] manipulate page tables
  Premature optimization?
© 2008 Multifacet Project University of Wisconsin-Madison

Why Large Transactions?
  Programmers may want large (>> cache) and/or long (>> ctx switch) transactions
    HLL transactions invoke unpredictable lower-level code
    Replace critical sections containing syscalls or I/O
    Avoid concurrency bugs [Lu08]
  But "Most transactions small & short running"
    Restrict TM to use by gurus (like OS spin locks)?
    Self-fulfilling prophecy?
  Must Support Efficient Large/Long Transactions As Well

Toward a Large-Transaction TM
  Efficiently detect conflicts between in-flight transactions using read/write sets
    Unbounded
    Globally accessible
  Small Transactions: Low Overhead
    Fast read/write set ops, e.g., add to read set, clear read set
  Large Transactions: Localized Overhead
    Accessible read/write set (potentially unbounded)
  Minimal Changes to Coherence / VM
    No heavyweight eviction ops, negative acks, or additional page tables

Existing Mechanisms
  Synergy between cache coherence and conflict detection
    Hence, overload cache coherence
    + Excellent for bounded/small TM
  But,
    - 'Virtualization' on overflows
    - Tough to access 'virtualized' state
  Small Transactions: Low Overhead ✓
  Minimal Changes to Coherence / VM ×
  Large Transactions: Localized Overhead ×

TokenTM: a Large-Transaction TM
  New Conflict Detection Mechanism (this talk)
    Transactional Tokens in Tagged Memory
    Token Coherence [Martin03] at a different level
  Version Management
    Save old/new values for unbounded write set
    LogTM [Moore06] undo log

Outline
  Motivation
  Design
    Token-Based Conflict Detection
    Metadata Storage
  Implementation
  Results

Transactional Tokens
  Challenge: How to efficiently track read/write sets?
    Token Coherence [Martin03]: read/write sets for cache coherence
  Solution: Transactional Tokens
    T tokens per memory block
    At least one token to read, all T tokens to write (token conflict detection)
  Token Metadata
    <c0, c1, …, ci, …> where 0 ≤ ci ≤ T is the count of tokens held by the thread with TID i
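
The token rule on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the value T = 64, the list representation of the metadata vector, and the function names `can_read`, `can_write`, `acquire_read`, and `acquire_write` are all assumptions made for the example.

```python
T = 64  # tokens per memory block (assumed value, not from the slides)

def can_read(counts, tid):
    """A thread may read if it already holds a token or a free one exists."""
    return counts[tid] >= 1 or sum(counts) < T

def can_write(counts, tid):
    """A thread may write only if no other thread holds any token."""
    return sum(counts) - counts[tid] == 0

def acquire_read(counts, tid):
    """Take one token for a transactional load, or signal a conflict."""
    if not can_read(counts, tid):
        raise RuntimeError("conflict: insufficient tokens")
    if counts[tid] == 0:
        counts[tid] = 1

def acquire_write(counts, tid):
    """Take all T tokens for a transactional store, or signal a conflict."""
    if not can_write(counts, tid):
        raise RuntimeError("conflict: insufficient tokens")
    counts[tid] = T
```

In this model two readers of the same block coexist with one token each, while a store by either of them fails until the other releases its token.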

Tagged Memory
  Challenge: Where to store unbounded, globally accessible token metadata?
    Virtual memory is unbounded and globally accessible
  Solution, similar to OneTM [Blundell07]
    Tag virtual memory
    Piggyback on existing virtual memory and cache coherence mechanisms

TokenTM Logical Operation
  [Animated diagram: Thread X (BEGIN_XACT, Load A, Store B, COMMIT_XACT) and Thread Y (BEGIN_XACT, Load A, Store A, COMMIT_XACT), each with an undo log, operate on shared memory blocks A, B, and C. Each block holds data and metadata <cx, cy, …>. The metadata starts at <0,0,…>; the frames show <1,0,…>, <1,1,…>, and <T,0,…>, and an ABORT labeled "Insufficient tokens".]

Storing Metadata
  [Diagram: Threads X and Y each keep an undo log and a token log in software; tagged memory in hardware holds data and a metastate (Sum, TID) for blocks A, B, and C.]
  Software metadata <cx, cy, …>: unbounded, but difficult to access globally
  Hardware metastate (Sum, TID): concise, accessible, lossy summary

Hardware Metastate
  Metadata summary (sum, TID)
    sum: total number of tokens acquired
    TID: identifies the owner when sum = 1 or sum = T (optional)
  Concise -> Stored in packed field (e.g., State[1:2], Attr[3:16])
  Fast -> Accessed as part of normal memory operation
  Some summaries:
    <c0, c1, …, ci, …>    (sum, TID)
    <0, 0, 0, 0>          (0, -)
    <0, 0, 1, 0>          (1, 2)
    <0, T, 0, 0>          (T, 1)
    <0, 1, 1, 1>          (3, -)
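
As a rough illustration of how the full metadata vector collapses into the (sum, TID) summary, here is a small Python sketch. The function name `summarize`, the use of `None` for "-", and T = 64 are assumptions for the example; threads are numbered from 0, matching c0, c1, ….

```python
T = 64  # tokens per memory block (assumed value)

def summarize(counts):
    """Collapse <c0, c1, ...> into the lossy (sum, TID) metastate."""
    total = sum(counts)
    if total in (1, T):
        # A single thread holding everything can be named in the summary.
        holders = [tid for tid, c in enumerate(counts) if c == total]
        if len(holders) == 1:
            return (total, holders[0])
    return (total, None)  # None stands for "-" (no identifiable owner)
```

The summary is lossy: `summarize([1, 1, 0, 0])` and `summarize([0, 0, 1, 1])` both give `(2, None)`, which is why the per-thread token logs are still needed to recover who holds what.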

Token Logs
  Distributed structures for unbounded read/write sets
    Per-thread
    Stored in program memory (e.g., heap)
    List of <address, num_tokens>
  Accessible to hardware for fast ops
    Add to read set -> Append to token log
  Example token log: A: 1, B: T
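
A minimal software model of a per-thread token log might look like the following. The class name `TokenLog`, the dictionary `sums` standing in for the per-block token counts, and the idea that release simply walks the log are assumptions for illustration; the slides only state that the log is a list of <address, num_tokens> entries and that adding to the read set appends to it.

```python
T = 64  # tokens per memory block (assumed value)

class TokenLog:
    """Per-thread list of (address, num_tokens) entries."""

    def __init__(self):
        self.entries = []

    def acquire(self, sums, address, num_tokens):
        # Record the tokens in the block's count and append to the log.
        sums[address] = sums.get(address, 0) + num_tokens
        self.entries.append((address, num_tokens))

    def release_all(self, sums):
        # Walk the log, hand every token back, then clear the log
        # (assumed here to happen when the transaction ends).
        for address, num_tokens in self.entries:
            sums[address] -= num_tokens
        self.entries.clear()
```

With this model, a transaction that loads A and stores B ends up with the log `[("A", 1), ("B", T)]`, matching the example log on the slide.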

Double-entry Bookkeeping (Keeping Metadata Consistent)
  [Animated diagram: as Threads X and Y each execute Load A, each appends "A: 1" to its own token log (software) while the hardware metastate of block A goes from (0, -) to (1, X) to (2, -). The logical token state <cx, cy, …> goes from <0,0,…> to <1,0,…> to <1,1,…>. Blocks B and C stay at (0, -).]
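
The double-entry idea can be expressed as an invariant: for every block, the tokens recorded across all token logs must add up to the sum stored in that block's metastate. The Python check below is a hedged sketch of that invariant; the names `logical_sums` and `is_consistent` and the dictionary layout are assumptions for the example.

```python
from collections import Counter

def logical_sums(token_logs):
    """Add up logged tokens per block across all threads' token logs."""
    sums = Counter()
    for log in token_logs.values():
        for address, num_tokens in log:
            sums[address] += num_tokens
    return sums

def is_consistent(token_logs, metastate_sums):
    """True if every block's metastate sum equals its logged token total."""
    logged = logical_sums(token_logs)
    blocks = set(logged) | set(metastate_sums)
    return all(logged.get(b, 0) == metastate_sums.get(b, 0) for b in blocks)
```

For the state shown on the slide, `is_consistent({"X": [("A", 1)], "Y": [("A", 1)]}, {"A": 2, "B": 0, "C": 0})` holds, while dropping one log entry without updating the metastate breaks the invariant.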

Outline
  Motivation
  Design
  Implementation
    Metastate Fission/Fusion
  Results

Implementing Hardware Metastate
  [Animated diagram: private caches (tag, coherence state, data, Sum, TID) above a directory and main memory. Thread X's Load A sends GETS A and receives DATA A; the metastate of A in X's cache becomes (1, X). Thread Y's Load A sends GETS A, which the directory forwards as Fwd_GETS A. Directory state for A: Not Present, then Exclusive @ P1, then Shared @ P1,P2.]
  Shared copies cannot update metastate
  Solution: Fission / Fusion

Metastate Fission
  [Animated diagram: Thread Y's Load A sends GETS A; the directory forwards it (Fwd_GETS A) to X's cache, which holds A with metastate (1, X). The metastate undergoes fission into (1, X), kept with X's copy (now Owned), and (0, -), sent with Data A to Y's Shared copy, which then records Y's token. Directory state for A: Exclusive @ P1, then Shared @ P1,P2.]

Metastate Fusion
  On store, metastate copies are fused back
  Why does fission/fusion work?
    Store sees 'complete' metastate
    Load sees
      'complete' metastate, if writer exists
      'partial' metastate, otherwise

Hardware Cost
  Additional metabits in caches/memory
    Recoded ECC to cull metabits
  Changes to coherence protocols
    Additional payload on messages
    Minimal changes to protocol logic
    Requires non-silent eviction

Outline
  Motivation
  Design
  Implementation
  Results
    Do we meet the two performance goals?
      Small Transactions: Low Overhead
      Large Transactions: Localized Overhead

Evaluation Methodology
  Full System Simulation
    Multifacet GEMS
  Base System
    32-core CMP system, in-order, single-issue cores
    Private 4-way 32 KB writeback split I&D L1 caches
    Shared 8-way 8 MB writeback L2
    On-chip directory @ L2, MESI coherence
    Packet-switched interconnect in a tiled topology

TM Systems
  LogTM-SE [Yen07] variant
    Parallel Bloom filters for conflict detection
    4 2Kbit H3 filters
    + Compact, less hardware overhead
    - False conflicts
  LogTM-SE_Perfect
    + No false conflicts
    - Unimplementable
  TokenTM
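
To make the "false conflicts" drawback of signature-based detection concrete, here is a rough Python sketch of a parallel Bloom filter with H3-style hashing. It is not the LogTM-SE implementation: the class name, the reading of "4 2Kbit" as four banks of 2048 bits each, the 32-bit address width, and the seeded random masks are all assumptions. Each H3 index bit is the parity of the address ANDed with a fixed random mask.

```python
import random

class ParallelBloomSignature:
    """Read/write-set signature: one hash function per bank (assumed layout)."""

    def __init__(self, banks=4, bits_per_bank=2048, addr_bits=32, seed=0):
        rng = random.Random(seed)
        index_bits = bits_per_bank.bit_length() - 1  # 11 bits index 2048 entries
        # One random mask per index bit, per bank (the H3 hash matrices).
        self.masks = [[rng.getrandbits(addr_bits) for _ in range(index_bits)]
                      for _ in range(banks)]
        self.banks = [0] * banks  # each bank is a bit vector stored in an int

    def _index(self, bank, addr):
        idx = 0
        for bit, mask in enumerate(self.masks[bank]):
            parity = bin(addr & mask).count("1") & 1
            idx |= parity << bit
        return idx

    def insert(self, addr):
        for b in range(len(self.banks)):
            self.banks[b] |= 1 << self._index(b, addr)

    def may_contain(self, addr):
        # True for every inserted address; may also be True for others.
        return all((self.banks[b] >> self._index(b, addr)) & 1
                   for b in range(len(self.banks)))
```

A signature never misses an inserted address, but as more addresses are inserted, unrelated addresses increasingly test positive. Those false positives are the false conflicts that grow with transaction size, whereas per-block token counts record exactly which blocks are held.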

Results
  Large Transactions: Localized Overhead
    Minor degradation with large transactions
  Small Transactions: Low Overhead
    Comparable on small transactions

TokenTM Conflict Detection
  Large Transactions: Localized Overhead ✓
    Accessible read/write set (potentially unbounded)
  Small Transactions: Low Overhead ✓
    Fast read/write set ops, e.g., add to read set, clear read set
  Minimal Changes to Coherence / VM ✓
    No heavyweight eviction ops, negative acks, or additional page tables

In the paper…
  Fast Token Release
  TM 'virtualization' events
    Context switches, paging, etc.
    System V shared memory
  Long-running critical sections in server workloads
  Fission/fusion useful for other TM systems
    USTM [Baugh08]: set Fault-on-Write UFO bit without exclusive permission

Executive Summary
  Current Hardware TMs
    Most Transactions Small & Short Running
    Penalize large/long transactions
    Too restrictive for TM use up/down software stack?
  Hypothesis
    Must Support Efficient Large/Long Transactions As Well
    Is such an HTM even possible?
  Yes! TokenTM
    1. LogTM's log to buffer unbounded values
    2. Transactional Tokens for unbounded conflict detection
       Conflict state in memory metabits
       Concurrent updates via metastate fission/fusion


Common Token Ops
  Columns: Actions by thread X, Before (Sum, TID), After
    Acquire One Token    (0, -)    (1, X)
    Acquire T Tokens    (T, X)
    Release One Token    (v, -)    (v-1, -)
    Release T tokens
    Conflicting Load    (T, Y), Y≠X
    Conflicting Store    (v, -), v≠0

Workload Characteristics
  Columns: Benchmark, Input, Unit of Work, Units Measured, Num Xacts, Avg Read-Set, Avg Write-Set, Max Read-Set, Max Write-Set
    Barnes 512 bodies parallel phase 1 2,553 6.1 4.2 42 39
    Cholesky tk14.O factorization 60,203 2.4 1.7 6 4
    Radiosity batch 1 task 1024 21,786 1.8 1.5 25 24
    Raytrace teapot 47,783 5.1 2.0 594
    Delaunay gen2.2-m30 16,384 51.4 38.8 507 345
    Genome g1024-s32-n65536 100,115 14.5 2.1 768 18
    Vacation-Low low contention 16,399 70.7 18.1 162 75
    Vacation-High High contention 99.1 18.6 331 80

TokenTM Overheads

Results
  Minor degradation with large transactions
  Comparable on small transactions

Fast Release (optional)
  [Diagram: Thread X runs BEGIN_XACT, Load A, Store B, COMMIT_XACT with token log entries A: 1 and B: T. Per-thread state shown: Token LogPtr, TID, and a Fast-Release bit. Cache lines for A and B carry the fields R, W, R', W', R+, and Attr, with a Flash_Clear operation. A table relates each metastate (Sum, TID), namely (0, -), (u, -), (1, X), (1, Y), (T, X), and (T, Y), to these fields.]

Is Fast Release necessary?

Token Operations: Double-entry Bookkeeping
  [Animated diagram: Threads X and Y run their transactions while token logs (software) and per-block metastate (hardware) are updated together. Token log entries shown: A: 1 (both threads) and B: T. Logical token state <cx, cy, cz> shown: <0,0,0>, <1,0,0>, <1,1,0>, <T,0,0>. Metastates (Sum, TID) shown for blocks A, B, and C: (0, -), (1, X), (2, -), (T, X).]

Fission Rules
  [Table: Before, After Copy1, After Copy2. Metastates shown: (u, -), (0, -), (1, X), (T, X).]
  Assume Copy2 sent to new reader
  Is there a writer?
  No writer
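
One way to read the fission rule, pieced together from the fission example and the "Load sees 'complete' metastate, if writer exists; 'partial' metastate, otherwise" statement elsewhere in the talk, is sketched below in Python. The function name `fission`, the tuple encoding with `None` for "-", and T = 64 are assumptions, and the exact table entries are this sketch's interpretation rather than a transcription.

```python
T = 64  # tokens per memory block (assumed value)

def fission(state):
    """Split a metastate into (copy1, copy2), with copy2 sent to a new reader."""
    total, tid = state
    if total == T:
        # A writer exists: the new reader must see it, so both copies carry it.
        return (state, state)
    # No writer: the existing copy keeps the counts, the new copy starts empty.
    return (state, (0, None))
```

For example, `fission((1, "X"))` yields `((1, "X"), (0, None))`, matching the (1, X) to (1, X) and (0, -) split shown in the fission animation.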

Fusion Rules
  Copy1 = (u, -) fused with Copy2 =
    (v, -): (u + v, -)
    (1, Y): (1, Y) if u = 0, (u + 1, -) else
    (T, Y): (T, Y) if u = 0, error else
  Copy1 = (1, X) fused with Copy2 =
    (v, -): (1, X) if v = 0, (v + 1, -) else
    (1, Y): (2, -)
    (T, Y): error
  Copy1 = (T, X) fused with Copy2 =
    (v, -): (T, X) if v = 0, error else
    (T, Y): (T, X) if X = Y
  Add the two counts
  Forget token owner if count > 1
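
The fusion table can be sketched as a single Python function. This is an illustrative model only: the name `fuse`, the `"error"` sentinel, the tuple encoding with `None` for "-", and T = 64 are assumptions, and treating two writer copies with different owners as an error is this sketch's reading of the "(T, X) if X = Y" entry.

```python
T = 64  # tokens per memory block (assumed value)
ERROR = "error"

def fuse(copy1, copy2):
    """Fuse two metastate copies (sum, tid) back into one."""
    (s1, t1), (s2, t2) = copy1, copy2
    if s1 == T or s2 == T:
        if s1 == T and s2 == T:
            # Both copies came from the same writer's fission.
            return (T, t1) if t1 == t2 else ERROR
        writer, other_sum = (copy1, s2) if s1 == T else (copy2, s1)
        # A writer cannot coexist with tokens held in the other copy.
        return writer if other_sum == 0 else ERROR
    total = s1 + s2  # add the two counts
    if total == 0:
        return (0, None)
    if total == 1:
        return copy1 if s1 == 1 else copy2  # the single owner stays known
    return (total, None)  # forget the token owner if count > 1
```

For instance, `fuse((1, "X"), (1, "Y"))` gives `(2, None)`, which is the complete metastate a store needs to see in order to detect that it lacks the tokens to proceed.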

Metastate Fusion
  [Animated diagram: Thread Y's Store A sends Upgrade A; X's copy is invalidated (Inv A, Ack A) and its metastate (1, X) is fused with Y's (1, Y) to give (2, -) in Y's cache. Directory state for A: Shared @ P1,P2, then Modified @ P2. The store is then flagged as a conflict: insufficient tokens.]

Modifying Hardware Metastate (Take 1)
  [Animated diagram: Thread X's Load A sends GETS A to the directory and receives DATA A in the Exclusive state; the metastate of A held in main memory changes from (0, -) to (1, X). Directory state for A: Not Present, then Exclusive @ P1.]
  Extra main memory access on every metastate update