Published by Jared Dickerson. Modified over 9 years ago.
1
EazyHTM: Eager-Lazy Hardware Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC; BITS Pilani; Microsoft Research Cambridge
2
Why Transactional Memory?
Lock-based parallel programming has problems
– Deadlocks, races, complexity, performance, …
Transactional Memory (TM) to the rescue
– Optimistic concurrency control mechanism
– Easy to use
– Deadlock free
– Supports composability
– Protects data in critical sections
Hardware-TM (HTM), Software-TM (STM) and hybrid
3
HTM terminology
Atomic section/transaction: a group of instructions that appears to take effect instantaneously
Version management (where speculative values are stored):
– in place, logging the original value, or
– buffered in private storage and published on commit
Conflict: one TX writes a line that another TX reads or writes
– Detection: the action of checking for conflicts
– Resolution: the action taken to resolve a conflict, e.g. aborting or stalling a TX
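The two version-management options above can be sketched in software. This is only an illustration of the idea, not how any HTM implements it: the class names `EagerTx` and `LazyTx` and the dictionary standing in for memory are assumptions made for this example.

```python
class EagerTx:
    """In-place writes plus an undo log (hypothetical illustration)."""

    def __init__(self, memory):
        self.memory = memory      # dict standing in for shared memory
        self.undo_log = []        # (address, original value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # speculative value lives in place

    def commit(self):
        self.undo_log.clear()      # nothing to copy: commit is cheap

    def abort(self):
        # Restore original values in reverse order: abort is the slow path.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Writes buffered privately and published on commit (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}     # address -> speculative value

    def read(self, addr):
        # A TX must see its own buffered writes.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        self.memory.update(self.write_buffer)  # publish: the slow path
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # discard the buffer: abort is cheap
```

The sketch shows why the two designs trade commit cost against abort cost: the undo log makes commit trivial and abort expensive, and the write buffer does the opposite.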
4
Eager HTM
A.k.a. pessimistic
Writes in place, detects and resolves conflicts on every access
LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
[Figure: timelines of TX 1, TX 2 and TX 3 with one write (W), two reads (R) and a stall; fast commit]
Limited concurrency
Fast commit
Slow abort
5
Lazy HTM
A.k.a. optimistic
Writes buffered, detects and resolves conflicts on commit
TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
[Figure: timelines of TX 1, TX 2 and TX 3 with one write (W) and two reads (R); complex commit: validate + write]
Fast abort
Complex commit
Good concurrency
6
The Motivation
Splitting conflict management
Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
– Software begin, commit and abort
– Probabilistic (signature based) conflict detection
EazyHTM is the first pure-hardware eager-lazy TM
Conflict detection / conflict resolution design space:
– Eager detection, eager resolution: LogTM
– Lazy detection, lazy resolution: TCC, S-TCC
– Lazy detection, eager resolution: impossible
– Eager detection, lazy resolution: EazyHTM (fast commit, good concurrency)
7
Outline
– Motivation
– Contributions
– Hardware changes
– The Protocol
– Evaluation
– Conclusions
8
EazyHTM Contributions
The best of two worlds
– Eager conflict detection: simple commit, exact list of conflicts known in advance
– Lazy conflict resolution: good concurrency
Parallel commits of non-conflicting TXs
Designed for CMPs (Chip-Multiprocessors)
– Exploits the proximity of cores
– Built as an upgrade of the MESI/MOESI protocol (easier verification)
9
Hardware changes
CPU:
– Racers list – 1 bit per core (tracks conflicts; bit-vector, 32 bits for 32 cores)
– Killers list – 1 bit per core (tracks conflicts; bit-vector, 32 bits for 32 cores)
– Register file checkpoint
Private cache(s), next to the existing cache logic:
– SR – 1 bit per line
– SM – 1 bit per line
– Together these hold the read/write set
Directory, next to the existing directory logic:
– TD – 1 bit per line (read-only optimization bit, details in the paper)
10
Racers and killers list
If a line is shared between two TXs:
– Read-Read: no conflict
– Write-Read, Read-Write, Write-Write: conflict
Writer adds reader TX into its “racers” list
– “TXs that I have to abort” list, if I commit first
Reader adds writer TX into its “killers” list
– “TXs that can abort me” list, if they commit first
We illustrate only the Write-after-Read (WAR) conflict
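The bookkeeping rule above can be sketched with one bit per core, as on the hardware slide. The names `CoreTxState` and `record_sharing` are hypothetical, and treating Write-Write symmetrically (each writer adds the other to both lists) is an assumption of this sketch; the slides only spell out the writer/reader case.

```python
NUM_CORES = 32  # the hardware slide uses 32 bits for 32 cores


class CoreTxState:
    """Per-core conflict lists, stored as bit-vectors (hypothetical names)."""

    def __init__(self, core_id):
        assert 0 <= core_id < NUM_CORES
        self.core_id = core_id
        self.racers = 0   # bit i set: abort the TX on core i if I commit first
        self.killers = 0  # bit i set: the TX on core i can abort me


def record_sharing(first, first_writes, second, second_writes):
    """Update both lists when two TXs touch the same line.

    Returns True if the accesses conflict, False for Read-Read.
    """
    if not first_writes and not second_writes:
        return False  # Read-Read: no conflict, nothing recorded
    pairs = ((first, first_writes, second), (second, second_writes, first))
    for tx, writes, other in pairs:
        if writes:
            tx.racers |= 1 << other.core_id      # writer remembers its victim
            other.killers |= 1 << tx.core_id     # victim remembers the writer
    return True
```

For the WAR example in the following slides, TX 0 reads line A and TX 2 then writes it, so `record_sharing(tx0, False, tx2, True)` sets bit 0 in TX 2's racers list and bit 2 in TX 0's killers list.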
11
EazyHTM Protocol: Conflict Detection (1/2)
TX 0: BTX, RD A, CTX
TX 2: BTX, WR A, CTX
Messages (steps 1–2 in the figure):
– txMark @A (replaces GETS/GETX)
– ACK @A, 0 (no other sharers)
[Figure: racers and killers lists of TX 0 and TX 2; directory with the sharers of @A]
12
EazyHTM Protocol: Conflict Detection (2/2)
TX 0: BTX, RD A, CTX
TX 2: BTX, WR A, CTX
Messages (steps 1–5 in the figure):
– txMark @A
– ACK @A, 1 (1 other sharer: potential conflict)
– txAccessor #2, @A
– Reader #0, @A
– Writer #2, @A
TX 2 (racers list): remember to abort TX#0 on commit
TX 0 (killers list): remember that TX#2 can abort me
[Figure: racers and killers lists of TX 0 and TX 2; directory with the sharers of @A]
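The directory's part in conflict detection can be sketched as a sharer table. This is a simplified reading of the two slides above, not the protocol itself: the `Directory` class and `tx_mark` method are hypothetical names, and the returned list stands in for the sharer count carried by the ACK message and for the cores that would receive txAccessor.

```python
class Directory:
    """Tracks which cores have touched each line inside a TX (hypothetical)."""

    def __init__(self):
        self.sharers = {}  # line address -> set of core ids

    def tx_mark(self, core_id, addr):
        """Register core_id as a sharer of addr.

        Returns the other cores already sharing the line; its length is
        the count reported in the ACK ("ACK @A, 0", "ACK @A, 1").
        """
        line = self.sharers.setdefault(addr, set())
        others = sorted(line - {core_id})
        line.add(core_id)
        return others


directory = Directory()
first = directory.tx_mark(0, "A")   # TX 0 reads A: no other sharers
second = directory.tx_mark(2, "A")  # TX 2 writes A: core 0 already shares it
```

In this sketch `first` is empty and `second` contains core 0, which is the point at which the requester and the existing sharer would exchange the Reader/Writer messages and update their racers and killers lists.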
13
EazyHTM Protocol: Conflict Resolution
TX 0: BTX, RD A, CTX
TX 2: BTX, WR A, CTX
TX#2 reached the commit point first, so it aborts TX#0:
– Abort from TX#2
– Abort Ack from TX#0
– WR @A (commit)
[Figure: racers and killers lists of TX 0 and TX 2; directory with the sharers of @A]
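Lazy resolution means the racers list is consulted only at the commit point. A minimal sketch, assuming racers lists are kept as sets of core ids and that an abort is acknowledged immediately; `commit_tx` and its parameters are hypothetical names, and the cleanup of other cores' killers lists is left out for brevity.

```python
def commit_tx(core_id, racers, active):
    """Commit the TX on core_id and abort the TXs in its racers list.

    racers: dict mapping core id -> set of core ids to abort on commit
    active: set of core ids with a running TX (modified in place)
    Returns the set of core ids that were aborted.
    """
    aborted = set()
    for victim in sorted(racers.get(core_id, ())):
        if victim in active:          # "Abort from TX#core_id"
            active.discard(victim)    # victim rolls back, sends "Abort Ack"
            racers.pop(victim, None)  # an aborted TX forgets its own victims
            aborted.add(victim)
    # With all acks in, the committer writes back its lines and finishes.
    active.discard(core_id)
    racers.pop(core_id, None)
    return aborted
```

With `racers = {2: {0}}` and both TXs active, committing TX 2 aborts TX 0, as on the slide. With empty racers lists, two TXs commit independently and neither waits for the other.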
14
EazyHTM Protocol: Disjoint data => parallel commit
TX 0: BTX, WR A, CTX
TX 2: BTX, WR B, CTX
TX#0 works with line @A; TX#2 works with line @B
Steps, performed by both TXs in parallel:
1. txMark @A / txMark @B
2. ACK @A, 0 / ACK @B, 0 (0 other sharers)
3. WR @A (commit) / WR @B (commit)
NO SERIALIZATION
[Figure: racers and killers lists of TX 0 and TX 2; directory with the sharers of @A and @B]
15
Implementation
Implemented in M5, a full-system simulator (Alpha)
– Private L1 (32KB, 4-way, 64B CL, 2 cycles)
– Private L2 (512KB, 8-way, 64B CL, 10 cycles)
– Memory (with directory, 100 cycles)
– ICN (2D Mesh, 10 cycles per hop)
16
Evaluation
Evaluated STAMP benchmarks
Compared with Scalable-TCC-like HTM
– Same base simulator
– Implemented specialized directory protocol
Compared with ideal lazy HTM (MESI based)
– magical conflict detection
– instant conflict resolution
– parallel write-back commit
17
Kmeans Low
Small TXs (RS 15 CL; WS 5 CL), where RS/WS is the read/write set size and CL is cache lines
Low contention (10% aborts)
Similar profile to “replacing locks with atomic”
Near ideal performance
K-means: groups points in an N-dimensional space into K clusters
Most of the SPLASH-2 suite has a similar profile
18
SSCA2
Small TXs (RS 50 CL, WS 10 CL)
Low contention (1.2% aborts)
Near ideal performance
Scalability affected by barriers, not by contention
SSCA2: large directed graph operations
19
Yada
Large TXs (260 CL RS, 140 CL WS)
Moderate contention (35% aborts)
Performance is also good for large TXs
Yada: Delaunay mesh refinement
20
Intruder
Medium TXs (53 CL RS, 20 CL WS)
High contention (85% aborts)
Very bad scalability for all HTMs
Every transaction detects conflicts over and over again; the many conflict detection messages slow down the execution
Intruder: signature-based network intrusion detection system
21
Only high-conflict STAMP
>50% abort rate only
High-contention, high-core-count execution should be optimized
Averages: Labyrinth, Intruder, Kmeans-Hi
Results highly affected by Intruder
22
Only low-conflict STAMP
<50% abort rate only
Low abort rate necessary for scaling
Excludes: Labyrinth 8-32, Intruder 16-32, Kmeans-Hi 32
23
Conclusions
Introduced EazyHTM, a new HTM implementation
– Eager conflict detection, lazy conflict resolution
– Fast: performs well for low-conflict parallel applications
– Minimal changes to directory protocols (easier verification)
– As scalable as a standard directory protocol
The EazyHTM mechanism could allow (future work):
– Simpler transaction prioritization
– Less wasted work
– Better performance optimization
– Power-efficient TM mechanisms
24
Thank you! Questions?
sasa.tomic@bsc.es