Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory
Alexander Matveev, Nir Shavit
MIT

Good: Hardware Transactional Memory (HTM)
Bad: The HTM is "best-effort"
A hardware transaction is never guaranteed to commit. It may fail due to:
1. L1 cache capacity
2. Interrupt
3. Unsupported instruction
To ensure progress, we need a software fallback

A Possible Solution is: Lock Elision
Elided lock operations: 1. Lock, 2. Unlock
Thread 1 and Thread 2 each run:
1. HTM Start
2. Read lock and check it is free
3. … code …
4. HTM Commit
No conflict: the HTMs commit concurrently

A Possible Solution is: Lock Elision
Thread 1 and Thread 2:
1. HTM Start
2. Read lock and check it is free
3. … code … CONFLICT … HTM Restart, wait for lock
Thread 3:
1. HTM Start
2. Read lock and check it is free
3. … code … FAIL … HTM Restart
Thread 3 fallback:
1. Acquire Lock
2. … code …
3. Release Lock
No concurrency between hardware and software

A Possible Solution is: Lock Elision
Good:
- Simple: No need to instrument reads and writes
Bad:
- Serial fallback: A software fallback grabs the global lock and aborts all hardware transactions

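The lock elision pattern above can be sketched in C++ as follows. This is a minimal sketch, not the authors' code. It assumes Intel RTM intrinsics (`_xbegin`, `_xend`, `_xabort`) when the compiler defines `__RTM__`, and otherwise always takes the lock-based fallback. The names `global_lock`, `htm_begin`, `htm_commit`, `htm_abort`, and `elided_critical_section` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#if defined(__RTM__)
#include <immintrin.h>  // _xbegin, _xend, _xabort
#endif

// Hypothetical global fallback lock for this sketch.
std::atomic<bool> global_lock{false};

// Returns true when execution continues inside a hardware transaction.
inline bool htm_begin() {
#if defined(__RTM__)
    return _xbegin() == _XBEGIN_STARTED;
#else
    return false;  // no RTM at compile time: always use the fallback
#endif
}

inline void htm_commit() {
#if defined(__RTM__)
    _xend();
#endif
}

inline void htm_abort() {
#if defined(__RTM__)
    _xabort(0x01);  // control resumes at _xbegin, which then reports an abort
#endif
}

template <class Body>
void elided_critical_section(Body&& body, int max_htm_attempts = 3) {
    assert(max_htm_attempts >= 0);
    for (int attempt = 0; attempt < max_htm_attempts; ++attempt) {
        // Wait until the lock is free so the transaction is not doomed from the start.
        while (global_lock.load(std::memory_order_acquire)) {}
        if (htm_begin()) {
            // Reading the lock adds it to the read set: a later software
            // acquisition conflicts with this transaction and aborts it.
            if (global_lock.load(std::memory_order_relaxed)) htm_abort();
            body();
            htm_commit();
            return;
        }
    }
    // Serial fallback: grab the global lock, which aborts all running HTMs.
    while (global_lock.exchange(true, std::memory_order_acquire)) {}
    body();
    global_lock.store(false, std::memory_order_release);
}
```

The body itself needs no read or write instrumentation, which is the "simple" part. The cost is in the last three lines: once any thread takes the fallback, every hardware transaction that read `global_lock` aborts.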
Another Approach is: Hybrid Transactional Memory
Thread 1 and Thread 2:
1. HTM Start
2. Read lock and check it is free
3. … code …
Thread 3:
1. HTM Start
2. Read lock and check it is free
3. … code … FAIL … HTM Restart
Thread 3 fallback:
1. STM Start
2. … code …
3. …
… more code … more code … more code
STM and HTM execute concurrently

Another Approach is: Hybrid Transactional Memory
Good:
- Hardware-Software Concurrency
Bad:
- Complex:
  1. Hard to coordinate hardware and software (our focus)
  2. Hard to apply to code due to instrumentation (GCC C/C++ TM helps here a lot)

Hybrid TM History
2006: First Hybrid TM [Damron, Fedorova, Lev, Luchangco, Moir, Nussbaum]
- Key Idea: Use per-location metadata (version-locks) to coordinate hardware and software
Bad:
- Hardware is slow: each read/write must also read the version-lock and execute a conditional branch

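To make the per-access cost concrete, here is a minimal sketch of an instrumented hardware read under per-location version-locks. It illustrates the general technique only and is not the 2006 design itself. The striped lock table, the low-bit-means-locked encoding, and all names (`version_locks`, `lock_for`, `htm_instrumented_read`, `sw_try_lock`, `sw_unlock_and_bump`) are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical striped table of version-locks: low bit set = locked by software.
constexpr std::size_t kStripes = 1024;
std::atomic<std::uint64_t> version_locks[kStripes];

inline std::atomic<std::uint64_t>& lock_for(const void* addr) {
    auto a = reinterpret_cast<std::uintptr_t>(addr);
    return version_locks[(a >> 3) % kStripes];
}

// What a hardware transaction would execute for every read: one extra
// metadata load plus one extra branch. Returns false when the location is
// locked by software, in which case the transaction should abort.
inline bool htm_instrumented_read(const long* addr, long& out) {
    std::uint64_t v = lock_for(addr).load(std::memory_order_acquire);
    if (v & 1) return false;
    out = *addr;
    return true;
}

// Software writer side: lock the location, then unlock with a new version.
inline bool sw_try_lock(const void* addr) {
    auto& l = lock_for(addr);
    std::uint64_t v = l.load(std::memory_order_relaxed);
    return !(v & 1) && l.compare_exchange_strong(v, v | 1, std::memory_order_acquire);
}

inline void sw_unlock_and_bump(const void* addr) {
    auto& l = lock_for(addr);
    std::uint64_t v = l.load(std::memory_order_relaxed);
    assert(v & 1);  // must currently be locked
    l.store(v + 1, std::memory_order_release);  // odd + 1 = next even version
}
```

The load and branch in `htm_instrumented_read` are the overhead the slide refers to: they are paid on every access even when no software transaction is running.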
Hybrid TM History
2007: Phased TM [Lev, Moir, Nussbaum]
- Key Idea: Run in HTM mode or STM mode, but never in both at the same time
Bad:
- Expensive to switch modes: a single fallback must stop all hardware transactions

Hybrid TM History
2011: Hybrid NOrec (state-of-the-art) [Dalessandro, Carouge, White, Lev, Moir, Scott, Spear]
- Key Idea: No metadata + global clock for coordination

Hybrid NOrec
Good:
- No metadata: Efficient for low concurrency
Bad:
- Limited scalability: too many aborts due to global clock updates
  - A software write must abort all hardware transactions
  - A hardware write must abort all software transactions

Hybrid NOrec
Slow-Path: Software
- Read clock
- Read X (verify clock)
- Lock clock
- X = 4
- Unlock clock
Fast-Path: Hardware
- Read X (pure)
- Update clock
Interactions:
- Software locks the clock: hardware ABORT
- Hardware updates the clock: software RESTART
Read X: check clock => changed => restart/revalidate

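The clock protocol on this slide can be sketched as below. This is a simplified single-variable illustration under stated assumptions, not the Hybrid NOrec implementation: the clock is modeled as a sequence counter where an odd value means "locked", a changed clock always forces a restart (real NOrec may revalidate by value instead), and the hardware side is reduced to the clock update it performs at commit. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical global clock: odd = locked by a software writer.
std::atomic<std::uint64_t> global_clock{0};

struct SoftwareTx {
    std::uint64_t snapshot = 0;  // clock value observed at start
};

// Slow-path start: "Read clock" (wait while a writer holds it).
inline void sw_begin(SoftwareTx& tx) {
    std::uint64_t c;
    while ((c = global_clock.load(std::memory_order_acquire)) & 1) {}
    tx.snapshot = c;
}

// Slow-path read: "Read X (verify clock)". Returns false when the clock
// changed, meaning the caller must restart or revalidate.
inline bool sw_read(const SoftwareTx& tx, const std::atomic<long>& x, long& out) {
    out = x.load(std::memory_order_acquire);
    return global_clock.load(std::memory_order_acquire) == tx.snapshot;
}

// Slow-path write commit: "Lock clock", write, "Unlock clock".
// Locking the clock is what aborts every hardware transaction that read it.
inline bool sw_commit_write(const SoftwareTx& tx, std::atomic<long>& x, long value) {
    assert(!(tx.snapshot & 1));
    std::uint64_t expected = tx.snapshot;
    if (!global_clock.compare_exchange_strong(expected, tx.snapshot + 1)) return false;
    x.store(value, std::memory_order_relaxed);
    global_clock.store(tx.snapshot + 2, std::memory_order_release);
    return true;
}

// Stand-in for a committing hardware writer: "Update clock".
// In a real system this store happens inside the hardware transaction.
inline void hw_commit_update_clock() {
    global_clock.fetch_add(2);
}
```

Both scalability problems are visible here: `sw_commit_write` touches the one word every hardware transaction depends on, and `hw_commit_update_clock` invalidates the snapshot of every running software transaction.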
A Possible Solution
2011: Hybrid NOrec 2 [Riegel, Marlier, Nowack, Felber, Fetzer]
- Key Idea: Use non-speculative reads inside HTM to verify the global clock and avoid unnecessary aborts
Bad:
- The HTMs of Intel and IBM do not support non-speculative reads

A Recent Approach
2014: Invyswell Hybrid [Calciu, Gottschlich, Shpeisman, Pokam, Herlihy]
- Key Idea: Allow unsafe concurrency between hardware and software, and use HTM sandboxing to detect and handle errors

Invyswell
Slow-Path: Software
- Lock clock
- X = 4 (NEW)
- Y = 8 (NEW)
- Unlock clock
- Update clock
Fast-Path: Hardware
- Read X (NEW)
- Read Y (OLD)
- Func(X, Y): Unsafe, hopes the HTM aborts
Diagram labels: NO ABORT, FUTURE

Invyswell
Good:
- Far fewer aborts than Hybrid NOrec
Bad:
- Unfortunately, HTM sandboxing may miss errors, so a corrupted transaction may commit and crash the system
- This problem was shown in a recent work: "Pitfalls of Lazy Subscription" by [Dice, Harris, Kogan, Lev, Moir]

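The window that makes this unsafe can be shown with a small deterministic simulation. It does not use real HTM and is not Invyswell's code. It assumes the hardware fast path checks the clock lock only just before commit (the lazy subscription named in the cited work) and contrasts that with checking it first. The invariant `y == 2 * x` and all names are made up for the example.

```cpp
#include <cassert>

// Shared state with the invariant y == 2 * x (hypothetical example data).
struct Shared {
    long x = 2;
    long y = 4;
    bool clock_locked = false;  // set while a software writer updates in place
};

struct ReadResult {
    bool committed;        // false = the simulated hardware transaction aborted
    bool saw_mixed_state;  // true = Func(X, Y) ran on a NEW x and an OLD y
};

// Software slow-path writer, split in two steps so a reader can run in between.
void writer_begin(Shared& s) {
    assert(!s.clock_locked);
    s.clock_locked = true;  // Lock clock
    s.x = 4;                // X = 4 (NEW)
}

void writer_finish(Shared& s) {
    assert(s.clock_locked);
    s.y = 8;                 // Y = 8 (NEW)
    s.clock_locked = false;  // Unlock clock
}

// Eager subscription: check the clock lock before touching any data.
ReadResult eager_reader(const Shared& s) {
    if (s.clock_locked) return {false, false};
    return {true, s.y != 2 * s.x};
}

// Lazy subscription: use the data first, check the clock lock before commit.
ReadResult lazy_reader(const Shared& s) {
    bool mixed = (s.y != 2 * s.x);  // Func(X, Y) has already run on this state
    if (s.clock_locked) return {false, mixed};
    return {true, mixed};
}
```

In the simulation the lazy reader still aborts, which is the hoped-for outcome. The danger is what `Func(X, Y)` does with the mixed state before that check: if it corrupts control flow or memory in a way the sandbox does not catch, the check may never be reached.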
Our New Approach
2015: RH NOrec [Matveev, Shavit]
- Key Idea: Use a "mixed" fallback path that combines software with short hardware transactions

RH NOrec
Mixed Slow-Path
- Lock clock
- HTM: X = 4 (HIDDEN), Y = 8 (HIDDEN)
- Writes are speculative (invisible)
- Unlock clock
- Update clock
Fast-Path: Hardware
- Read X (OLD)
- Read Y (OLD)
- Func(X, Y): Safe!
- X and Y are both OLD or both NEW, never a mix
Previously (software slow-path without the HTM): Read X (NEW), Read Y (OLD), Func(X, Y): Unsafe, hopes the HTM aborts

RH NOrec
Key Point 1: Execute software writes in a short hardware transaction
- No need to abort hardware transactions
- Full safety
In practice this works well
- Due to the 80:20 rule: a typical operation has 80% reads and 20% writes

RH NOrec
Key Point 2: Execute a maximal amount of initial software reads in a read-only hardware transaction
- Allows deferring the global clock read, which significantly reduces software restarts/revalidations

Fast-Path: Hardware
- HTM start
- …reads/writes…
- Update clock
- HTM commit
Mixed Path
- Read clock
- …reads in software… (verifies clock)
- Read some X: check clock => changed => restart/revalidate
- RESTART

Fast-Path: Hardware
- HTM start
- …reads/writes…
- Update clock
- HTM commit
Mixed Path
HTM Prefix:
- HTM start
- …reads in HTM… (pure/direct)
- Read clock
- HTM commit
Result: NO ABORT

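Fast-Path: Hardware (first transaction)
- HTM start
- …reads/writes…
- Update clock
- HTM commit
Mixed Path
HTM Prefix:
- HTM start
- …reads in HTM… (pure/direct)
- Read clock
- HTM commit
Software:
- …reads in software…
HTM Postfix (between HTM start and HTM commit):
- Lock clock
- …writes in HTM…
- Unlock clock
Fast-Path: Hardware (second transaction)
- HTM start
- …reads/writes…
- Update clock
- HTM commit
Result: NO ABORT
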
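The mixed path above (HTM prefix, software reads, HTM postfix) can be sketched for one concrete transaction. This is a simplified sketch under several assumptions, not the RH NOrec implementation. It uses Intel RTM intrinsics only when the compiler defines `__RTM__` and otherwise runs both phases in software. The flag `global_htm_lock`, the odd-means-locked clock encoding, the choice to lock the clock before starting the postfix, and all function names are hypothetical. The hardware fast path is not shown. It is assumed to read `global_htm_lock` at start and to check and update the clock only at commit.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#if defined(__RTM__)
#include <immintrin.h>  // _xbegin, _xend, _xabort
#endif

std::atomic<std::uint64_t> global_clock{0};  // odd = locked by a slow-path writer
std::atomic<bool> global_htm_lock{false};    // set while writes run in pure software

// Returns true when execution continues inside a hardware transaction.
inline bool htm_begin() {
#if defined(__RTM__)
    return _xbegin() == _XBEGIN_STARTED;
#else
    return false;  // no RTM at compile time: take the software branch
#endif
}

inline void htm_commit() {
#if defined(__RTM__)
    _xend();
#endif
}

inline void htm_abort() {
#if defined(__RTM__)
    _xabort(0x01);  // control resumes at _xbegin, which then reports an abort
#endif
}

// One attempt to move `amount` from `from` to `to` on the mixed slow-path.
// Returns false if the clock changed, in which case the caller restarts.
bool mixed_slow_path_transfer(std::atomic<long>& from, std::atomic<long>& to, long amount) {
    std::uint64_t snapshot = 0;
    long f = 0, t = 0;

    // HTM prefix: initial reads run directly; the clock read is deferred to the end.
    if (htm_begin()) {
        f = from.load(std::memory_order_relaxed);
        t = to.load(std::memory_order_relaxed);
        snapshot = global_clock.load(std::memory_order_relaxed);
        if (snapshot & 1) htm_abort();  // a writer holds the clock
        htm_commit();
    } else {
        // Software reads: read the clock first, verify it after the reads.
        do {
            while ((snapshot = global_clock.load(std::memory_order_acquire)) & 1) {}
            f = from.load(std::memory_order_acquire);
            t = to.load(std::memory_order_acquire);
        } while (global_clock.load(std::memory_order_acquire) != snapshot);
    }

    // Lock clock: succeeds only if nothing committed since the snapshot.
    std::uint64_t expected = snapshot;
    if (!global_clock.compare_exchange_strong(expected, snapshot + 1)) return false;

    // HTM postfix: both writes become visible atomically at commit.
    if (htm_begin()) {
        from.store(f - amount, std::memory_order_relaxed);
        to.store(t + amount, std::memory_order_relaxed);
        htm_commit();
    } else {
        // Postfix failed: write in software while keeping hardware transactions out.
        global_htm_lock.store(true);
        from.store(f - amount, std::memory_order_relaxed);
        to.store(t + amount, std::memory_order_relaxed);
        global_htm_lock.store(false);
    }

    // Unlock clock and advance it to the next even value.
    global_clock.store(snapshot + 2, std::memory_order_release);
    return true;
}
```

Two design points carry over from the slides. The prefix reads the clock last, so a fast-path clock update during the reads does not force a software restart. The postfix publishes both writes in one hardware commit, so a concurrent hardware reader sees both values old or both new.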
Throughput on 8-core Intel (GCC C/C++)
Conclusion
RH NOrec: a new Hybrid TM that is safe and scalable
Key Idea: Use a "mixed" fallback path that uses two short hardware transactions:
1. HTM Prefix: Executes a maximal amount of initial reads, which defers the global clock read
2. HTM Postfix: Executes the software writes, which preserves safety and allows hardware-software concurrency

Thank You