Dependence-Aware Transactional Memory for Increased Concurrency Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin.

Slides:



Advertisements
Similar presentations
Serializability in Multidatabases Ramon Lawrence Dept. of Computer Science
Advertisements

Transaction Management: Concurrency Control CS634 Class 17, Apr 7, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Principles of Transaction Management. Outline Transaction concepts & protocols Performance impact of concurrency control Performance tuning.
Monitoring Data Structures Using Hardware Transactional Memory Shakeel Butt 1, Vinod Ganapathy 1, Arati Baliga 2 and Mihai Christodorescu 3 1 Rutgers University,
Transactional Memory Supporting Large Transactions Anvesh Komuravelli Abe Othman Kanat Tangwongsan Hardware-based.
1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04 Selective, Accurate,
Transactional Memory Overview Olatunji Ruwase Fall 2007 Oct
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 MetaTM/TxLinux: Transactional Memory For An Operating System Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter and Owen S. Hofmann Presenter:
[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
Supporting Nested Transactional Memory in LogTM Authors Michelle J Moravan Mark Hill Jayaram Bobba Ben Liblit Kevin Moore Michael Swift Luke Yen David.
Concurrency. Busy, busy, busy... In production environments, it is unlikely that we can limit our system to just one user at a time. – Consequently, it.
Transaction Processing: Concurrency and Serializability 10/4/05.
Transaction Management
TxLinux: Using and Managing Hardware Transactional Memory in an Operating System Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan,
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel - Presentation By Sathish P.
Concurrency. Correctness Principle A transaction is atomic -- all or none property. If it executes partly, an invalid state is likely to result. A transaction,
Unbounded Transactional Memory Paper by Ananian et al. of MIT CSAIL Presented by Daniel.
Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.
KAUSHIK LAKSHMINARAYANAN MICHAEL ROZYCZKO VIVEK SESHADRI Transactional Memory: Hybrid Hardware/Software Approaches.
TRANSACTIONS AND CONCURRENCY CONTROL Sadhna Kumari.
Shuchang Shan † ‡, Yu Hu †, Xiaowei Li † † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences.
Transactional Memory CDA6159. Outline Introduction Paper 1: Architectural Support for Lock-Free Data Structures (Maurice Herlihy, ISCA ‘93) Paper 2: Transactional.
08_Transactions_LECTURE2 DBMSs should guarantee ACID properties (Atomicity, Consistency, Isolation, Durability). This is typically done by guaranteeing.
The Linux Kernel: A Challenging Workload for Transactional Memory Hany E. Ramadan Christopher J. Rossbach Emmett Witchel Operating Systems & Architecture.
Maximum Benefit from a Minimal HTM Owen Hofmann, Chris Rossbach, and Emmett Witchel The University of Texas at Austin.
Sutirtha Sanyal (Barcelona Supercomputing Center, Barcelona) Accelerating Hardware Transactional Memory (HTM) with Dynamic Filtering of Privatized Data.
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech.
EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris,
Hybrid Transactional Memory Sanjeev Kumar, Michael Chu, Christopher Hughes, Partha Kundu, Anthony Nguyen, Intel Labs University of Michigan Intel Labs.
II.I Selected Database Issues: 2 - Transaction ManagementSlide 1/20 1 II. Selected Database Issues Part 2: Transaction Management Lecture 4 Lecturer: Chris.
Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.
Transactions and Concurrency Control. Concurrent Accesses to an Object Multiple threads Atomic operations Thread communication Fairness.
December 1, 2006©2006 Craig Zilles1 Threads and Cache Coherence in Hardware  Previously, we introduced multi-cores. —Today we’ll look at issues related.
CS510 Concurrent Systems Why the Grass May Not Be Greener on the Other Side: A Comparison of Locking and Transactional Memory.
Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.
Transaction Management Overview. Transactions Concurrent execution of user programs is essential for good DBMS performance. – Because disk accesses are.
© 2008 Multifacet ProjectUniversity of Wisconsin-Madison Pathological Interaction of Locks with Transactional Memory Haris Volos, Neelam Goyal, Michael.
Hany E. Ramadan and Emmett Witchel University of Texas at Austin TRANSACT 2/15/2009 The Xfork in the Road to Coordinated Sibling Transactions.
Solving Difficult HTM Problems Without Difficult Hardware Owen Hofmann, Donald Porter, Hany Ramadan, Christopher Rossbach, and Emmett Witchel University.
The Standford Hydra CMP  Lance Hammond  Benedict A. Hubbert  Michael Siu  Manohar K. Prabhu  Michael Chen  Kunle Olukotun Presented by Jason Davis.
On Transactional Memory, Spinlocks and Database Transactions Khai Q. Tran Spyros Blanas Jeffrey F. Naughton (University of Wisconsin Madison)
Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.
Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,
4 November 2005 CS 838 Presentation 1 Nested Transactional Memory: Model and Preliminary Sketches J. Eliot B. Moss and Antony L. Hosking Presented by:
A BORTING C ONFLICTING T RANSACTIONS IN AN STM PP O PP’09 2/17/2009 Hany Ramadan, Indrajit Roy, Emmett Witchel University of Texas at Austin Maurice Herlihy.
Advanced Operating Systems (CS 202) Transactional memory Jan, 27, 2016 slide credit: slides adapted from several presentations, including stanford TCC.
Lecture 20: Consistency Models, TM
Maurice Herlihy and J. Eliot B. Moss,  ISCA '93
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Transactional Memory : Hardware Proposals Overview
University of Texas at Austin
Transactional Memory Coherence and Consistency
Changing thread semantics
Lecture 6: Transactions
Lecture 21: Concurrency & Locking
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E
Lecture 22: Consistency Models, TM
Hybrid Transactional Memory
Distributed Transactions
Lecture 21: Intro to Transactions & Logging III
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
Performance Pathologies in Hardware Transactional Memory
BulkCommit: Scalable and Fast Commit of Atomic Blocks
Performance Pathologies in Hardware Transactional Memory
Concurrent Cache-Oblivious B-trees Using Transactional Memory
Presentation transcript:

Dependence-Aware Transactional Memory for Increased Concurrency Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin

Concurrency Conundrum Challenge: CMP ubiquity Parallel programming with locks and threads is difficult –deadlock, livelock, convoys… –lock ordering, poor composability –performance-complexity tradeoff Transactional Memory (HTM) –simpler programming model –removes pitfalls of locking –coarse-grain transactions can perform well under low- contention 2 cores 16 cores 80 cores (neat!) 2

High contention: optimism warranted? LocksTransactions Read- Sharing Write- Sharing TM performs poorly with write-shared data –Increasing core counts make this worse Write sharing is common –Statistics, reference counters, lists… Solutions complicate programming model –open-nesting, boosting, early release, ANTs… TM’s need not always serialize write-shared Transactions! Dependence-Awareness can help the programmer (transparently!) 3

Outline Motivation Dependence-Aware Transactions Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 4

Two threads sharing a counter … load count,r inc r store r,count … Schedule: dynamic instruction sequence models concurrency by interleaving instructions from multiple threads Initially: count == 0 Result: count == 2 time Thread B … load count,r inc r store r,count … Thread A count: memory 5

count: TM shared counter xbegin load count,r inc r store r,count xend Conflict: both transactions access a datum, at least one is a write intersection between read and write sets of transactions Initially: count == 0 time Tx B xbegin load count,r inc r store r,count xend Tx A 0 1 Conflict! Current HTM designs cannot accept such schedules, despite the fact that they yield the correct result! memory 6

Does this really matter? xbegin load X,r0 inc r0 store r0,X xend Critsec B xbegin load X,r0 inc r0 store r0,X xend Critsec A xbegin load count,r inc r store r,count xend Critsec B xbegin load count,r inc r store r,count xend Critsec A DATM: use dependences to commit conflicting transactions Common Pattern! Statistics Linked lists Garbage collectors 7

count: memory DATM shared counter xbegin load count,r inc r store r,count xend time T2 xbegin load count,r inc r store r,count xend T Forward speculative data from T1 to satisfy T2’s load. T2 depends on T1 T1 must commit before T2 If T1 aborts, T2 must also abort If T1 overwrites count, T2 must abort Model these constraints as dependences Initially: count == 0 8

Conflicts become Dependences Dependence Graph Transactions: nodes dependences: edges edges dictate commit order no cycles  conflict serializable … write X write Y read Z … Tx B … read X write Y write Z … Tx A Tx B RWRW WWWW WRWR Read-Write: A commits before B Write-Write: A commits before B Write-Read: forward data overwrite  abort A commits before B 9

Enforcing ordering using Dependence Graphs xbegin load X,r0 store r0,X xend xbegin load X,r0 store r0,X xend Outstanding Dependences! xbegin load X,r0 store r0,X xend WRWR Wait for T1 T1T2 T1 must serialize before T2 T1T2 WRWR 10

Enforcing Consistency using Dependence Graphs xbegin load X,r0 store r0,X xend CYCLE! (lost update) xbegin load X, r0 store r0,X xend WWWW T1T2 RWRW T1T2 RWRW WWWW cycle  not conflict serializable restart some tx to break cycle invoke contention manager 11

Theoretical Foundation Serializability: results of concurrent interleaving equivalent to serial interleaving Conflict: Conflict-equivalent: same operations, order of conflicting operations same Conflict-serializability: conflict- equivalent to a serial interleaving 2-phase locking: conservative implementation of CS (LogTM, TCC, RTM, MetaTM, OneTM….) Serializable Conflict Serializable 2PL Correctness and optimality Proof: See [PPoPP 09] 12

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 13

DATM Requirements Mechanisms: Maintain global dependence graph –Conservative approximation –Detect cycles, enforce ordering constraints Create dependences Forwarding/receiving Buffer speculative data Implementation: coherence, L1, per-core HW structures 14

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences These are not programmer-visible! 15

Dependence-Aware Microarchitecture These are not programmer-visible! Order vector: dependence graph topological sort conservative approx. 16

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences These are not programmer-visible! 17

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences, no active dependences These are not programmer-visible! 18

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xABT xCMT 19

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xABT xCMT 20

FRMSI protocol: forwarding and receiving MSI states TxMSI states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xABT xCMT 21

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xABT xCMT 22

Main Memory 100 cnt Converting Conflicts to Dependences 1 xbegin 2 ld R, cnt 3 inc cnt 4 st cnt, R … N xend 1 xbegin 2 ld R, cnt 3 inc cnt 4 st cnt, R 5 xend … core 0 core 1 core_0 TXID R PC core_1 TXID R PC L1 cnt OVec S TXID data L1 cnt OVec S TXID data TxA 100 TxA TMF TS TMM 101 TxB [A,B] TR TMF TxB TMR Outstanding Dependence (stall) xCMT(A) CTM

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 24

Experimental Setup Implemented DATM by extending MetaTM –Compared against MetaTM: [Ramadan 2007] Simulation environment –Simics machine simulator –x86 ISA w/HTM instructions –32k 4-way tx L1 cache; 4MB 4-way L2; 1GB RAM –1 cycle/inst, 24 cyc/L2 hit, 350 cyc/main memory –8, 16 processors Benchmarks –Micro-benchmarks –STAMP [Minh ISCA 2007] –TxLinux: Transactional OS [Rossbach SOSP 2007] 25

DATM speedup, 16 CPUs 15x 26

Eliminating wasted work 16 CPUs Lower is better 2%3%22%39%15x Speedup: 27

Dependence Aware Contention Management T3 T7 T6 T1 T4 T2 WRWR WWWW WRWR WRWR WRWR WRWR Timestamp contention management T3 5 transactions must restart! Dependence Aware T1 Cascaded abort? T2, T3, T7, T6, T4, T1 Order Vector T2T2, T3, T7, T6, T4 28

Contention Management 29

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 30

Related Work HTM –TCC [Hammond 04], LogTM[-SE] [Moore 06], VTM [Rajwar 05], MetaTM [Ramadan 07, Rossbach 07], HASTM, PTM, HyTM, RTM/FlexTM [Shriraman 07,08] TLS –Hydra [Olukotun 99], Stampede [Steffan 98,00], Multiscalar [Sohi 95], [Garzaran 05], [Renau 05] TM Model extensions –Privatization [Spear 07], early release [Skare 06], escape actions [Zilles 06], open/closed nesting [Moss 85, Menon 07], Galois [Kulkarni 07], boosting [Herlihy 08], ANTs [Harris 07] TM + Conflict Serializability –Adaptive TSTM [Aydonat 08] 31

Conclusion DATM can commit conflicting transactions Improves write-sharing performance Transparent to the programmer DATM prototype demonstrates performance benefits Source code available! 32

Broadcast and Scalability Because each node maintains all deps, this design uses broadcast for: Commit Abort TXOVW (broadcast writes) Forward restarts New Dependences These sources of broadcast could be avoided in a directory protocol: keep only the relevant subset of the dependence graph at each node 33 Yes. We broadcast. But it’s less than you might think…

FRMSI Transient states 19 transient states: sources: forwarding overwrite of forwarded lines (TXOVW) transitions from MSI  T* states Ameliorating factors best-effort: local evictions  abort, reduces WB states T*R* states are isolated transitions back to MSI (abort, commit) require no transient states Signatures for forward/receive sets could eliminate five states. 34

Why isn’t this TLS? TLSDATMTM Forward data between threads Detect when reads occur too early Discard speculative state after violations Retire speculative Writes in order Memory renaming Multi-threaded workloads Programmer/Compiler transparency 1.Txns can commit in arbitrary order. TLS has the daunting task of maintaining program order for epochs 2. TLS forwards values when they are the globally committed value (i.e. through memory) 3.No need to detect “early” reads (before a forwardable value is produced) 4.Notion of “violation” is different—we must roll back when CS is violated, TLS, when epoch ordering 5.Retiring writes in order: similar issue—we use CTM state, don’t need speculation write buffers. 6.Memory renaming to avoid older threads seeing new values—we have no such issue 35

Doesn’t this lead to cascading aborts? YES Rare in most workloads 36

Inconsistent Reads OS modifications: Page fault handler Signal handler 37

Increasing Concurrency arbitrate Lazy/Lazy (e.g. TCC) Eager/Eager (MetaTM,LogTM) DATM xend xbegin xend xbegin xend xbegin xend xbegin xend conflict xbegin T1 T2 T1 T2 T1 T2 conflict arbitrate conflict time serialized overlapped retry no retry! tctc (Lazy/Lazy: Updates buffered, conflict detection at commit time) (Eager/Eager: Updates in-place, conflict detection at time of reference) 38

Design space highlights Cache granularity: –eliminate per word access bits Timestamp-based dependences: –eliminate Order Vector Dependence-Aware contention management 39

Design Points 40

Key Ideas:  Critical sections execute concurrently  Conflicts are detected dynamically  If conflict serializability is violated, rollback Key Abstractions: Primitives –xbegin, xend, xretry Conflict Contention Manager –Need flexible policy Hardware TM Primer “Conventional Wisdom Transactionalization”: Replace locks with transactions 41

Working Set R{} W{} 0: xbegin 1: read A 2: read B 3: if(cpu % 2) 4: write C 5: else 6: read C 7: … 8: xend cpu 0cpu 1 0: xbegin; 1: read A 2: read B 3: if(cpu % 2) 4: write C 5: else 6: read C 7: … 8: xend PC: 0 Working Set R{} W{} PC: 0 Working Set R{} W{} PC: 1PC: 0PC: 1 PC: 2 Working Set R{ } W{} A PC: 2 Working Set R{ } W{} A PC: 3 Working Set R{ } W{} A,B PC: 3 Working Set R{ } W{} A,B PC: 6 PC: 4 PC: 7 Working Set R{ } W{} A,B, C PC: 7 Working Set R{ } W{ } A,B C CONFLICT: C is in the read set of cpu0, and in the write set of cpu1 Assume contention manager decides cpu1 wins: cpu0 rolls back cpu1 commits PC: 0 PC: 8 Working Set R{} W{} Hardware TM basics: example 42