EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris,

Presentation transcript:

EazyHTM: Eager-Lazy Hardware Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC; BITS Pilani; Microsoft Research Cambridge

Why Transactional Memory?
Lock-based parallel programming has problems
– Deadlocks, races, complexity, performance, …
Transactional Memory (TM) to the rescue
– Optimistic concurrency control mechanism
– Easy to use
– Deadlock free
– Supports composability
– Protects data in critical sections
Hardware-TM (HTM), Software-TM (STM) and hybrid

HTM terminology
Atomic section/transaction: a group of instructions that appears to take effect instantaneously
Version management (where speculative values are stored):
– in place, logging the original value, or
– buffered in private storage and published on commit
Conflict: one TX writes a location that another TX reads or writes
– Detection: the action of checking for conflicts
– Resolution: the action taken to resolve a conflict (abort, stalling the execution, …)

Eager HTM
A.k.a. pessimistic
Writes in place; detects and resolves conflicts on every access
LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
– Limited concurrency
– Fast commit
– Slow abort
[Figure: timeline of TX 1, TX 2 and TX 3 with write (W) and read (R) accesses, a stall, and a fast commit]

Lazy HTM
A.k.a. optimistic
Writes buffered; detects and resolves conflicts on commit
TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
– Fast abort
– Complex commit
– Good concurrency
[Figure: timeline of TX 1, TX 2 and TX 3 with write (W) and read (R) accesses and a complex commit: validate + write]

The Motivation
Splitting conflict management into conflict detection and conflict resolution:
– Eager detection, eager resolution: LogTM
– Lazy detection, lazy resolution: TCC, S-TCC
– Lazy detection, eager resolution: impossible
– Eager detection, lazy resolution: EazyHTM (fast commit, good concurrency)
An eager-lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
– Software begin, commit and abort
– Probabilistic (signature-based) conflict detection
EazyHTM is the first pure-hardware eager-lazy TM

Outline
– Motivation
– Contributions
– Hardware changes
– The Protocol
– Evaluation
– Conclusions

EazyHTM Contributions
The best of two worlds
– Eager conflict detection: simple commit, exact list of conflicts known in advance
– Lazy conflict resolution: good concurrency
Parallel commits of non-conflicting TXs
Designed for CMPs (Chip-Multiprocessors)
– Exploits the proximity of cores
– Implemented as an upgrade to the MESI/MOESI protocol (easier verification)

Hardware changes
Added state:
– Racers list: 1 bit per core
– Killers list: 1 bit per core
– SR: 1 bit per line
– SM: 1 bit per line
– TD: 1 bit per line
– Register file checkpoint
[Figure: block diagram of one core: CPU with the register file checkpoint, racers list and killers list; private cache(s) with SR and SM bits next to the existing cache logic; directory with TD bits next to the existing directory logic. Annotations: "tracks conflicts; bit-vector, 32 bits for 32 cores" (racers and killers lists), "holds read/write set", "read-only optimization bit (details in the paper)".]

Racers and killers list
If a line is shared between two TXs:
– Read-Read: no conflict
– Write-Read, Read-Write, Write-Write: conflict
Writer adds the reader TX to its "racers" list
– the "TXs that I have to abort" list, if I commit first
Reader adds the writer TX to its "killers" list
– the "TXs that can abort me" list, if they commit first
We illustrate only the Write-after-Read (WAR) conflict
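The bookkeeping described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the names `CoreTxState`, `record_conflict` and `members` are hypothetical, and the lists are modelled as integers used as bit-vectors with one bit per core, as the hardware slide suggests for 32 cores.

```python
from dataclasses import dataclass

NUM_CORES = 32  # the slides size the bit-vectors for 32 cores


@dataclass
class CoreTxState:
    """Per-core transactional state (hypothetical model, one TX per core)."""
    core_id: int
    racers: int = 0   # bit i set: abort the TX on core i if I commit first
    killers: int = 0  # bit i set: the TX on core i can abort me if it commits first


def record_conflict(writer: CoreTxState, reader: CoreTxState) -> None:
    """Apply the slide's rule for a line shared by a writer and a reader."""
    writer.racers |= 1 << reader.core_id
    reader.killers |= 1 << writer.core_id


def members(bits: int) -> list:
    """Decode a bit-vector into the list of core ids whose bit is set."""
    return [i for i in range(NUM_CORES) if (bits >> i) & 1]


# WAR example from the slides: TX 0 read A, then TX 2 writes A.
tx0, tx2 = CoreTxState(0), CoreTxState(2)
record_conflict(writer=tx2, reader=tx0)
print(members(tx2.racers), members(tx0.killers))
```

Read-Read sharing never calls `record_conflict`, so both lists stay empty in that case.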

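EazyHTM Protocol: Conflict Detection (1/2)
Running example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX
– The transactional request to the directory replaces GETS/GETX
– Directory reply: no other sharers
[Figure: TX 0 and TX 2, each with empty racers and killers lists, and the directory; messages 1 and 2 are exchanged with the directory]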

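EazyHTM Protocol: Conflict Detection (2/2)
Running example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX
– Directory reply to the writer: 1 other sharer
– A txAccessor message between the writer and the reader signals a potential conflict
– Writer (TX 2) remembers: abort TX#0 on commit
– Reader (TX 0) remembers: TX#2 can abort me
[Figure: TX 0 and TX 2 with their racers and killers lists, and the directory]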

EazyHTM Protocol: Conflict Resolution
Running example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX
– TX#2 came to the commit point first: abort TX#0!
– Abort from TX#2 (commit)
– Abort Ack from TX#0
[Figure: TX 0 and TX 2 with their racers and killers lists, and the directory]
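The resolution step can be illustrated with a small sequential simulation. It is a sketch under simplifying assumptions: transactions are plain dictionaries, messages are replaced by direct updates, and the `commit` function and its field names are hypothetical rather than taken from the paper.

```python
def commit(txs, committer):
    """Commit `committer` and abort every running TX on its racers list.

    Returns the core ids that acknowledged the abort.
    """
    me = txs[committer]
    acks = []
    for victim in sorted(me["racers"]):
        other = txs[victim]
        if other["status"] == "running":
            other["status"] = "aborted"   # models "Abort from TX#committer"
            other["racers"].clear()
            other["killers"].clear()
            acks.append(victim)           # models "Abort Ack from TX#victim"
    # In this sequential model all acks have arrived by the time the loop ends.
    me["status"] = "committed"
    me["racers"].clear()
    me["killers"].clear()
    return acks


# State after conflict detection in the running example:
# TX 2 (writer) has TX 0 in racers; TX 0 (reader) has TX 2 in killers.
txs = {
    0: {"status": "running", "racers": set(), "killers": {2}},
    2: {"status": "running", "racers": {0}, "killers": set()},
}
acks = commit(txs, committer=2)
print(acks, txs[0]["status"], txs[2]["status"])
```

Because the list of conflicting transactions is already known at the commit point, the committer only has to contact the cores on its racers list.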

EazyHTM Protocol: Disjoint data => parallel commit
Example: TX 0 executes BTX, WR A, CTX; TX 2 executes BTX, WR B, CTX
– Directory reply to each write: 0 other sharers
– Both transactions commit at the same time: NO SERIALIZATION
[Figure: TX 0 and TX 2 with their racers and killers lists, and the directory; both commits proceed in parallel]
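Whether two transactions can commit in parallel comes down to whether their read and write sets overlap in a conflicting way. The check below is a minimal sketch of that rule at cache-line granularity; the `conflicts` function and the set-based representation are assumptions made for illustration, not the hardware mechanism.

```python
def conflicts(a, b):
    """True if either TX writes a line that the other one reads or writes."""
    return bool(a["writes"] & (b["reads"] | b["writes"])
                or b["writes"] & a["reads"])


# Disjoint data: TX 0 writes A, TX 2 writes B.
tx0 = {"reads": set(), "writes": {"A"}}
tx2 = {"reads": set(), "writes": {"B"}}
print(conflicts(tx0, tx2))        # no overlap, so both may commit in parallel

# WAR example: one TX reads A, the other writes A.
reader = {"reads": {"A"}, "writes": set()}
writer = {"reads": set(), "writes": {"A"}}
print(conflicts(reader, writer))  # overlap on A

# Read-Read sharing is not a conflict.
print(conflicts(reader, {"reads": {"A"}, "writes": set()}))
```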

Implementation
Implemented in M5, a full-system simulator (Alpha)
– Private L1 (32KB, 4-way, 64B CL, 2 cycles)
– Private L2 (512KB, 8-way, 64B CL, 10 cycles)
– Memory (with directory, 100 cycles)
– ICN (2D mesh, 10 cycles per hop)
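As a rough back-of-the-envelope check, the cache parameters above can be combined with the per-core and per-line bits listed on the hardware slide. The helper names are hypothetical, and the last calculation assumes, purely for illustration, that the SR and SM bits are attached to every L1 line; the slides do not state which cache level holds them.

```python
def cache_lines(size_bytes, line_bytes=64):
    """Number of cache lines in a cache of the given size."""
    return size_bytes // line_bytes


def list_bits_per_core(num_cores=32):
    """Racers list plus killers list, 1 bit per core each."""
    return 2 * num_cores


l1_lines = cache_lines(32 * 1024)    # private L1: 32KB, 64B lines
l2_lines = cache_lines(512 * 1024)   # private L2: 512KB, 64B lines

# Assumption for illustration only: SR and SM (2 bits) on each L1 line.
l1_sr_sm_bits = 2 * l1_lines

print(l1_lines, l2_lines, list_bits_per_core(), l1_sr_sm_bits)
```

Under that assumption the per-line bits cost 1024 bits (128 bytes) per L1, and the two lists add 64 bits per core in a 32-core system.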

Evaluation
Evaluated with the STAMP benchmarks
Compared with a Scalable-TCC-like HTM
– Same base simulator
– Implemented specialized directory protocol
Compared with an ideal lazy HTM (MESI based)
– Magical conflict detection
– Instant conflict resolution
– Parallel write-back commit

Kmeans Low
K-means: groups points in an N-dimensional space into K clusters
– Small TXs (read set, RS: 15 cache lines, CL; write set, WS: 5 CL)
– Low contention (10% aborts)
– Profile similar to "replacing locks with atomic"
– Most of the SPLASH-2 suite has a similar profile
– Near ideal performance

SSCA2
SSCA2: large directed graph operations
– Small TXs (RS 50 CL, WS 10 CL)
– Low contention (1.2% aborts)
– Near ideal performance
– Scalability affected by barriers, not by contention

Yada
Yada: Delaunay mesh refinement
– Large TXs (260 CL RS, 140 CL WS)
– Moderate contention (35% aborts)
– Good performance also for large TXs

Intruder
Intruder: signature-based network intrusion detection system
– Medium TXs (53 CL RS, 20 CL WS)
– High contention (85% aborts)
– Very bad scalability for all HTMs
– Every transaction detects conflicts over and over again; the large number of conflict detection messages slows down the execution

Only high-conflict STAMP
– >50% abort rate only
– High-contention, high-core-count executions should be optimized
– Averages: Labyrinth, Intruder, Kmeans-Hi
– Results highly affected by Intruder

Only low-conflict STAMP
– <50% abort rate only
– Low abort rate necessary for scaling
– Excludes: Labyrinth 8-32, Intruder, Kmeans-Hi 32

Conclusions
Introduced EazyHTM, a new HTM implementation
– Eager conflict detection, lazy conflict resolution
– Fast: performs well for low-conflict parallel applications
– Minimal changes to directory protocols (easier verification)
– As scalable as a standard directory protocol
The EazyHTM mechanism could allow (future work):
– Simpler transaction prioritization
– Less wasted work
– Better performance optimization
– Power-efficient TM mechanisms

Thank you! Questions?