Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J

Presentation transcript:

An Integrated Hardware-Software Approach to Flexible Transactional Memory
Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J. Marathe, Sandhya Dwarkadas, and Michael L. Scott
www.cs.rochester.edu/research/synchronization

Transactional Memory Implementation
Hardware Transactional Memory (HTM), e.g., TCC, UTM, LogTM, VTM, PTM, BulkTM
  + library compatible, fast if no pathologies
  - rigid policy, virtualization support expensive, no migration path
Software Transactional Memory (STM), e.g., RSTM, DSTM, McRT, TL2, SXM
  + flexible policy (conflict, escape actions), hardware compatibility
  - slow (always?), library compatibility hard
Best-effort TMs, e.g., HyTM, Intel Hybrid TM
  + simplifies future hardware, runs on current hardware
  - rigid policy, hardware inflexible, performance cliffs
11/28/2018 An Integrated Hardware-Software Approach to Flexible Transactional Memory

Our Approach
Hardware-Software Transactions
  hardware to accelerate STMs and support your favorite policy
  hardware that supports flexible software implementation
  software routines to support uncommon events (i.e., overflows, context switches, paging)
  + flexible policy, supports today’s hardware, accelerates STMs, multiple uses for acceleration hardware
  - slower than HTMs, library compatibility (compiler support?)
  e.g., RTM (this talk), AOU_N (yesterday at SPAA 2007)

Data Structures in TM
[Figure: HTM cache entry (R, W, TAG, Data), STM organization (Meta Data, Data), and Flexible Transactional Memory (A, TAG, Meta Data; R, W, TAG, Data), annotated with conflict resolution and version management.]
Alert-On-Update for conflict detection
Programmable-Data-Isolation for data versioning

Why?
Decoupled conflict detection and version management for flexible policy and usage
Conflict detection
  Eager: at first read/write to shared data
  Lazy: prior to commit of speculative updates
  Mixed: eager write-write and lazy read-write
  and more...
Flexible software contention managers arbitrate among conflicting transactions
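The eager, lazy, and mixed options on this slide differ only in when a conflicting access is noticed. A minimal Python sketch of that distinction follows; the names `Policy` and `detects_at_access` are hypothetical and do not come from the talk or from RSTM.

```python
from enum import Enum


class Policy(Enum):
    EAGER = "eager"  # detect at the first conflicting access
    LAZY = "lazy"    # defer detection until commit time
    MIXED = "mixed"  # eager for write-write, lazy for read-write


def detects_at_access(policy, mine, theirs):
    """Return True if a conflict between two accesses to the same
    location is detected at access time, False if it is deferred to
    commit. `mine` and `theirs` are each "R" or "W" (illustrative
    encoding, not part of the talk)."""
    if mine == "R" and theirs == "R":
        raise ValueError("two reads never conflict")
    if policy is Policy.EAGER:
        return True
    if policy is Policy.LAZY:
        return False
    # MIXED: only write-write conflicts are caught eagerly.
    return mine == "W" and theirs == "W"
```

Under the mixed policy a reader and a writer may proceed concurrently until one of them commits, which is the read-write sharing the later slides credit to lazy detection.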

STM Overheads
RSTM [TRANSACT ’06]
[Chart: RSTM runtime breakdown on RBTree (runtime SW), marking the overheads targeted; values shown: 79%, 21%, 34%, 43%, 42%.]
Copying: buffering of speculative modifications to ensure isolation
Validation: verifying consistency of accessed locations
For workload description, please see the paper

Flexible Transactional Memory
Leave policy decisions in software
  multiple-writer coherence for data isolation at software’s behest
  HW provides conflict detection, SW specifies resolution policy
Minimize the validation overhead
  Alert-on-update provides fast event-based communication of remote memory operations
Eliminate copying overhead
  Programmable data isolation allows software to employ private caches as thread-local buffers
Use software mechanisms to accommodate virtualization (i.e., cache overflows, paging, thread switches)
Why keep policy in SW?

Alert-On-Update (AOU)
ISA includes an instruction, ALoad, that loads an address and marks the cache line
A-tagged line on invalidation
  jumps to a software handler
  masks further alerts until exit from alert handler
Alerts can be due to
  capacity: cache cannot track update events on evicted line
  coherence: remote processor has acquired exclusive access
Cache entry: A, TAG, Data
Caveat: AOU support cannot extend across events that exhaust space and time
Advantages: general, lightweight, simple, and fine-grained
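The alert-on-update behaviour described on this slide can be mimicked in a toy software model. The Python sketch below is only an illustration: the class `AouCache` and its methods are hypothetical, and queuing alerts that arrive while the handler runs is an assumption, since the slide says only that further alerts are masked.

```python
class AouCache:
    """Toy model of one processor's cache with alert-on-update marks."""

    def __init__(self, handler):
        self.marked = set()     # addresses of A-tagged lines
        self.handler = handler  # software alert handler
        self.masked = False     # True while the handler is running
        self.pending = []       # assumption: masked alerts are recorded here

    def aload(self, memory, addr):
        """Model of ALoad: load a value and A-tag its line."""
        self.marked.add(addr)
        return memory[addr]

    def invalidate(self, addr, reason):
        """Model of losing a line; `reason` is "coherence" (a remote
        processor acquired exclusive access) or "capacity" (eviction)."""
        if addr not in self.marked:
            return  # untagged lines never raise alerts
        self.marked.discard(addr)
        if self.masked:
            self.pending.append((addr, reason))
            return
        self.masked = True  # mask further alerts until the handler exits
        try:
            self.handler(addr, reason)
        finally:
            self.masked = False
```

A runtime would typically ALoad the metadata it depends on and treat any alert as a signal to validate or abort; the model shows only the marking and delivery mechanics.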

Programmable Data Isolation (PDI)
ISA provides TStore and TLoad to isolate data in cache line
TMI buffers/isolates TStores
  supports concurrent speculative writers; BusTRdX ignored
  supports concurrent readers; BusRd threatened and data response suppressed
TI isolates concurrent readers from speculative writers
  values written by other TStores are isolated; a threatened read results in dropping to TI

Programmable Data Isolation (PDI)
TI lines
  isolate concurrent readers from speculative writers
  are dropped without alerting processor
  allow caching; drop to I on revert or commit
TStored (TMI) lines
  buffer speculative stores
  must remain in cache or HW alerts active thread
  drop to M on commit, I on revert
Support R-W and W-W concurrent sharers (if SW wants)
  no global consensus in HW required for committing
  commit is entirely local; SW responsible for correctness
For details on coherence protocol and tag encoding, please see TR 910
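The commit and revert transitions listed for TI and TMI lines can be written down as a small table. The Python sketch below encodes only those stated transitions; the function name `end_transaction` and the dictionary representation of a cache are illustrative assumptions, and the full protocol is in TR 910.

```python
# Next state of a transactional line when the local transaction ends,
# as stated on the slide: TMI -> M on commit, TMI -> I on revert,
# TI -> I in both cases.
COMMIT_NEXT = {"TMI": "M", "TI": "I"}
REVERT_NEXT = {"TMI": "I", "TI": "I"}


def end_transaction(lines, committed):
    """Return the new state of every line in one private cache.

    `lines` maps an address to its coherence state. Lines in
    non-transactional states (for example "S" or "M") are unchanged.
    No other cache is consulted, mirroring the point that commit is
    entirely local."""
    table = COMMIT_NEXT if committed else REVERT_NEXT
    return {addr: table.get(state, state) for addr, state in lines.items()}
```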

Putting things together
Decoupled hardware for version management (PDI) and conflict detection (AOU)
  accelerating common TM operations
Many feasible software libraries to
  implement and export transaction constructs
  handle time and space exhaustion
  control runtime policy
RTM is an object-level, indirection-based TM.

RTM Data Structure
Runtime SW associates a metadata header with every object. An object can denote a semantic entity or a group of memory locations.
[Figure: per-object metadata (conflict detection) with Owner, Serial #, New Data (uncommitted), Current Data (committed; if versioning in SW), and Overflow Readers; a Transaction Descriptor with Status and Serial #; data versioning over N cache lines.]
Overflow Readers: reader bitmap to track transactions not using HW support
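As a rough picture of the per-object metadata named on this slide, the Python sketch below lays out the fields as plain records and shows the overflow-reader bitmap in use. The field names follow the slide labels; the types, default values, and the two helper functions are assumptions, and the real RTM layout may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TxDescriptor:
    status: str = "ACTIVE"  # assumed values: ACTIVE, COMMITTED, ABORTED
    serial: int = 0         # serial number; its role is not spelled out here


@dataclass
class ObjectHeader:
    owner: Optional[TxDescriptor] = None
    serial: int = 0
    new_data: Any = None       # uncommitted version (software versioning)
    current_data: Any = None   # committed version
    overflow_readers: int = 0  # bitmap: one bit per reader not using HW support


def add_overflow_reader(header, tx_id):
    """Set the bit for transaction slot `tx_id` (hypothetical numbering)."""
    header.overflow_readers |= 1 << tx_id


def is_overflow_reader(header, tx_id):
    """Report whether slot `tx_id` is registered as an overflow reader."""
    return bool(header.overflow_readers >> tx_id & 1)
```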

FastPath Transactions (Validation + Copying)
[Figure: program sequence Begin_hw_t abort_pc; ALD TxD_2; ALD OH(A); TLD A; TST A; CAS OH(A); CAS-Commit TxD_2. Data view: transaction descriptors TxD_1 and TxD_2 with COMMIT and ACTIVE status; object header OH(A) with Owner, #S, and Overflow Readers, covered by AOU; A (current) in cache, covered by PDI.]
Do not overflow time or space resources
ALoad descriptor to detect concurrent active transactions
ALoad object header to detect ownership changes
TStore updates are isolated in private cache
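The sequence on this slide ends with CAS-Commit on the transaction descriptor. The Python sketch below shows the idea of a status word that moves out of ACTIVE exactly once; a lock stands in for the atomicity of a hardware compare-and-swap, and the names `Descriptor`, `cas_commit`, and `abort` are hypothetical.

```python
import threading


class Descriptor:
    """Transaction descriptor holding a single status word."""

    def __init__(self):
        self.status = "ACTIVE"
        self._lock = threading.Lock()  # stand-in for hardware CAS atomicity

    def cas_status(self, expected, new):
        """Atomically replace the status if it equals `expected`."""
        with self._lock:
            if self.status == expected:
                self.status = new
                return True
            return False


def cas_commit(tx):
    """Commit succeeds only if nobody aborted the transaction first."""
    return tx.cas_status("ACTIVE", "COMMITTED")


def abort(tx):
    """A competitor aborts a transaction only while it is still active."""
    return tx.cas_status("ACTIVE", "ABORTED")
```

Because both operations compete on the same word, a transaction is either committed or aborted and never both, regardless of which side wins.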

Overflow Transactions
[Figure: program sequence Begin_sw_t abort_pc; ALD TxD_2; LD OH(A); ...; ST A’; CAS OH(A); CAS-Commit TxD_2. Data view: transaction descriptors TxD_1 and TxD_2 with COMMIT and ACTIVE status; object header OH(A) with Owner, #S, and Overflow Readers, covered by AOU; A (current) and A’ (new version) in cache.]
ALoad descriptor to detect concurrent active transactions
To read, update overflow-reader list to notify future requestors
To write, copy current version and buffer speculative updates
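The write rule on this slide (copy the current version, then buffer speculative updates in the copy) can be sketched in a few lines of Python. The class `VersionedObject` and its methods are hypothetical, and installing the clone by rebinding `current` is an assumption about how the new version becomes visible.

```python
import copy


class VersionedObject:
    """Object with a committed version and an optional private clone."""

    def __init__(self, data):
        self.current = data  # committed version, visible to other transactions
        self.new = None      # clone that buffers speculative updates

    def open_for_write(self):
        """Clone the current version on first write and return the clone."""
        if self.new is None:
            self.new = copy.deepcopy(self.current)
        return self.new

    def finish(self, committed):
        """Install the clone on commit, discard it on abort."""
        if committed and self.new is not None:
            self.current = self.new
        self.new = None
```

This copy is the overhead that the fast path avoids by letting TStore buffer the speculative update in the private cache.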

TMESI Prototype
Processors: 1P, 2P, ..., 16P; SPARC v9, 1.2GHz
L1 (I$, D$): 64KB I&D, 4-way, 2-cycle access, 32-entry VB, MESI coherence protocol
Snoopy interconnect: 4-ary ordered tree, 1-cycle link delay, 64 bytes/cycle
Shared L2$: 8MB, 8-way, 4 banks, 20-cycle bank delay
Memory: 100-cycle DRAM access
The simulation infrastructure is based on the SIMICS + Multifacet GEMS framework
Our thanks to the Wisconsin Multifacet group for distributing the GEMS toolset

Runtime Systems
  CGL (Coarse Grain Lock)
  RTM-F(astpath) - Validation, Copying
  RTM-O(verflow) - Validation, Copying
  RTM-Lite* - Validation, Copying
  RSTM (Invisible + Eager) [Transact’06]
Benchmarks
  33% lookup, 33% insert, 33% delete operations on HashTable (256 buckets), RBTree, RBTree-Large (256-byte entry), LinkedList-Rel, LFUCache (255 queue + 2048 array), RandomGraph
What is invisible?
* For a detailed description of Lite transactions, please see the paper

RTM-F Scales
[Chart: RBTree-Large, normalized throughput (CGL, 1 thread = 1) versus threads, up to 16, for CGL, RTM-F, RTM-Lite, RTM-O, and RSTM; annotations: 1.9X, 2X, 2X.]
RTM-F improves performance and provides good scalability: at 2 threads it is 50% slower than CGL1, but at 16 threads it is 1.8X faster
RTM-O’s performance is as good as RSTM on a CMP (avg: 6% variation)

Hardware accelerates Software
[Chart: 16 threads, throughput normalized to CGL, 1 thread = 1; annotations: 1.5X, 1.6X, 1.7X, 1.7X, 1.8X.]
RTM-F’s speedup over RTM-Lite is proportional to copying overhead: HashTable (5%), LFUCache (14%), RBTree-Large (45%)
RTM-Lite presents an attractive HW cost/performance tradeoff: 45% slower than RTM-F on our most copy-heavy benchmark

Conflict Policy Important!
[Charts: normalized throughput versus threads (1, 2, 4, 8, 16) for Eager and Lazy conflict detection; panels: Hash and RandomGraph; livelock is marked on the RandomGraph panel.]

Conflict Policy Important!
In applications with low degree of sharing
  Eager as good as lazy
  Lazy imposes higher bookkeeping overheads
In applications with high degree of sharing
  Lazy eliminates livelock anomalies
  Lazy exploits R-W and W-W sharing
  Lazy narrows conflict window to attain more commits
HashTable (Eager is 21% faster) and RBTree (Eager is 10% slower)
LFUCache (Lazy is 28% faster) and RandomGraph (Lazy eliminates livelocks)

To Take Home
Decouple hardware for versioning and conflict detection to enable flexible software TM policy and non-TM uses
Flexible conflict detection and management to eliminate performance anomalies
Use software to handle the uncommon cases

Questions
Arrvindh, Mike, Hemayet, Virendra, Sandhya, Michael
Download RSTM version 3.0 at http://www.cs.rochester.edu/research/synchronization/

Backup

Future Work
How to enable flexible usage of hardware?
  semantics, concurrent use, programmer interface
Simplify metadata organization
Extend to scalable protocols and compare with pure HTM system
Strong isolation and privatization

RTM Interface
Z = X + Y is equivalent to:
BEGIN_TX (handler_ptr, mode [H/S])
  const integer* rd_X = X->open_RO()
  const integer* rd_Y = Y->open_RO()
  integer* wr_Z = Z->open_RW()
  *wr_Z = (*rd_X) + (*rd_Y)
END_TX
1. Start transaction in (Fastpath/Overflow) mode and save abort-handler PC
2. Open object metadata before reading/writing object data
3. Read and speculatively update objects
4. Acquire ownership of written objects in their metadata at either
   open (i.e., eager): + reduces wasted work; - possible livelock, reduced concurrency (not even R-W sharing)
   end_tx (i.e., lazy): + increased concurrency, livelock freedom; - more wasted work, requires lazy versioning
5. If Active, switch status to committed.

Protocol Animation
[Animation, steps 1 to 5: processors P0, P1, and P2, each with a private L1 above a shared L2, issue TLoad A, TStore A, TLoad B, and TStore B (TGetX requests); object headers OH(A) and OH(B) appear in AS and AE states, and data lines A and B in TEE, TMI, and TII states.]
Cache-line-size objects: A, B
Object metadata: OH(A), OH(B)

Protocol Animation
[Animation, steps 1 to 7: the transactions on P0, P1, and P2 finish with two commits and one abort, using Acquire OH(A) (GetX) and CAS-Commit; object headers OH(A) and OH(B) move among AS, S, M, and I, and data lines A and B move from TMI or TII to M or I.]
Cache-line-size objects: A, B
Object metadata: OH(A), OH(B)

Lite Transaction (Validation)
To read
  ALoad object header to detect object ownership acquisition
To write
  ALoad descriptor to detect concurrent transactions stealing ownership
  Clone object and buffer modifications
  Acquire ownership and pointers to perform logical update

What is the serial number for?
How do A-tags differ from Intel HASTM?
Privatization
2X is not enough, why are you slow?
What about strong isolation?
What about 2 modified lines?