Maximum Benefit from a Minimal HTM Owen Hofmann, Chris Rossbach, and Emmett Witchel The University of Texas at Austin.

Slides:



Advertisements
Similar presentations
Copyright 2008 Sun Microsystems, Inc Better Expressiveness for HTM using Split Hardware Transactions Yossi Lev Brown University & Sun Microsystems Laboratories.
Advertisements

Transactional Memory Parag Dixit Bruno Vavala Computer Architecture Course, 2012.
Maurice Herlihy (DEC), J. Eliot & B. Moss (UMass)
Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.
Transactional Memory Supporting Large Transactions Anvesh Komuravelli Abe Othman Kanat Tangwongsan Hardware-based.
Pessimistic Software Lock-Elision Nir Shavit (Joint work with Yehuda Afek Alexander Matveev)
Hybrid Transactional Memory Nir Shavit MIT and Tel-Aviv University Joint work with Alex Matveev (and describing the work of many in this summer school)
Transactional Memory Overview Olatunji Ruwase Fall 2007 Oct
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.
Distributed Systems 2006 Styles of Client/Server Computing.
1 MetaTM/TxLinux: Transactional Memory For An Operating System Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter and Owen S. Hofmann Presenter:
[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 Lecture 8: Transactional Memory – TCC Topics: “lazy” implementation (TCC)
1 Lecture 24: Transactional Memory Topics: transactional memory implementations.
TxLinux: Using and Managing Hardware Transactional Memory in an Operating System Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan,
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel - Presentation By Sathish P.
1 Last Class: Introduction Operating system = interface between user & architecture Importance of OS OS history: Change is only constant User-level Applications.
Unbounded Transactional Memory Paper by Ananian et al. of MIT CSAIL Presented by Daniel.
1 OS & Computer Architecture Modern OS Functionality (brief review) Architecture Basics Hardware Support for OS Features.
Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.
A Transaction-Friendly Dynamic Memory Manager for Embedded Multicore Systems Maurice Herlihy Joint with Thomas Carle, Dimitra Papagiannopoulou Iris Bahar,
KAUSHIK LAKSHMINARAYANAN MICHAEL ROZYCZKO VIVEK SESHADRI Transactional Memory: Hybrid Hardware/Software Approaches.
Sangman Kim, Michael Z. Lee, Alan M. Dunn, Owen S. Hofmann, Xuan Wang, Emmett Witchel, Donald E. Porter 1.
An Integrated Hardware-Software Approach to Transactional Memory Sean Lie Theory of Parallel Systems Monday December 8 th, 2003.
The Linux Kernel: A Challenging Workload for Transactional Memory Hany E. Ramadan Christopher J. Rossbach Emmett Witchel Operating Systems & Architecture.
Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory Alexander Matveev Nir Shavit MIT.
Sutirtha Sanyal (Barcelona Supercomputing Center, Barcelona) Accelerating Hardware Transactional Memory (HTM) with Dynamic Filtering of Privatized Data.
EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris,
CS5204 – Operating Systems Transactional Memory Part 2: Software-Based Approaches.
WG5: Applications & Performance Evaluation Pascal Felber
Distributed Shared Memory Based on Reference paper: Distributed Shared Memory, Concepts and Systems.
Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics Minjia Zhang, 1 Jipeng Huang, Man Cao, Michael D. Bond.
Hybrid Transactional Memory Sanjeev Kumar, Michael Chu, Christopher Hughes, Partha Kundu, Anthony Nguyen, Intel Labs University of Michigan Intel Labs.
Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.
CS510 Concurrent Systems Why the Grass May Not Be Greener on the Other Side: A Comparison of Locking and Transactional Memory.
Technology from seed Exploiting Off-the-Shelf Virtual Memory Mechanisms to Boost Software Transactional Memory Amin Mohtasham, Paulo Ferreira and João.
CS510 Concurrent Systems Jonathan Walpole. A Methodology for Implementing Highly Concurrent Data Objects.
© 2008 Multifacet ProjectUniversity of Wisconsin-Madison Pathological Interaction of Locks with Transactional Memory Haris Volos, Neelam Goyal, Michael.
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear.
By: Rob von Behren, Jeremy Condit and Eric Brewer 2003 Presenter: Farnoosh MoshirFatemi Jan
Hardware and Software transactional memory and usages in MRE
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.
Solving Difficult HTM Problems Without Difficult Hardware Owen Hofmann, Donald Porter, Hany Ramadan, Christopher Rossbach, and Emmett Witchel University.
Transactional Memory Student Presentation: Stuart Montgomery CS5204 – Operating Systems 1.
On Transactional Memory, Spinlocks and Database Transactions Khai Q. Tran Spyros Blanas Jeffrey F. Naughton (University of Wisconsin Madison)
Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.
ECE 1747: Parallel Programming Short Introduction to Transactions and Transactional Memory (a.k.a. Speculative Synchronization)
Adaptive Software Lock Elision
Maurice Herlihy and J. Eliot B. Moss,  ISCA '93
Mihai Burcea, J. Gregory Steffan, Cristiana Amza
Free Transactions with Rio Vista
Speculative Lock Elision
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Transactional Memory : Hardware Proposals Overview
Part 2: Software-Based Approaches
PHyTM: Persistent Hybrid Transactional Memory
Faster Data Structures in Transactional Memory using Three Paths
Changing thread semantics
Lecture 6: Transactions
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E
Transactional Memory An Overview of Hardware Alternatives
Free Transactions with Rio Vista
Lecture 22: Consistency Models, TM
Hybrid Transactional Memory
Kernel Synchronization II
Lecture 23: Transactional Memory
Advanced Operating Systems (CS 202) Memory Consistency and Transactional Memory Feb. 6, 2019.
Presentation transcript:

Maximum Benefit from a Minimal HTM Owen Hofmann, Chris Rossbach, and Emmett Witchel The University of Texas at Austin

Concurrency is here Core count ever increasing Parallel programming is difficult ◦ Synchronization perilous ◦ Performance/complexity Many ideas for simpler parallelism ◦ TM, Galois, MapReduce 2 core 16 core 80 core

Transactional memory Better performance from simple code ◦ Change performance/complexity tradeoff Replace locking with memory transactions ◦ Optimistically execute in parallel ◦ Track read/write sets, roll back on conflict ◦ W A ∩ (R B ∪ W B ) != ∅ ◦ Commit successful changes

TM ain't easy TM must be fast ◦ Lose benefits of concurrency TM must be unbounded ◦ Keeping within size not easy programming model TM must be realizable ◦ Implementing TM an important first step

TM ain't easy Version and detect conflicts with existing structures ◦ Cache coherence, store buffer FastRealizableUnbounded Best-effort HTM ✔

TM ain't easy Version and detect conflicts with existing structures ◦ Cache coherence, store buffer Simple modifications to processor ◦ Very realizable (stay tuned for Sun Rock) FastRealizableUnbounded Best-effort HTM ✔✔

TM ain't easy Version and detect conflicts with existing structures ◦ Cache coherence, store buffer Simple modifications to processor ◦ Very realizable (stay tuned for Sun Rock) Resource-limited ◦ Cache size/associativity, store buffer size FastRealizableUnbounded Best-effort HTM ✔✔✖

TM ain't easy Software endlessly flexible ◦ Transaction size limited only by virtual memory Slow ◦ Instrument most memory references FastRealizableUnbounded Best-effort HTM ✔✔✖ STM ✖ ✔✔

TM ain't easy Versioning unbounded data in hardware is difficult ◦ Unlikely to be implemented FastRealizableUnbounded Best-effort HTM ✔✔✖ STM ✖ ✔✔ Unbounded HTM ✔✖✔

TM ain't easy Tight marriage of hardware and software ◦ Disadvantages of both? FastRealizableUnbounded Best-effort HTM ✔✔✖ STM ✖ ✔✔ Unbounded HTM ✔✖✔ Hybrid TM ~~ ✔

FastRealizableUnbounded Best-effort HTM ✔✔✖ Back to basics Cache-based HTM ◦ Speculative updates in L1 ◦ Augment cache line with transactional state ◦ Detect conflicts via cache coherence Operations outside transactions can conflict ◦ Asymmetric conflict ◦ Detected and handled in strong isolation

FastRealizableUnbounded Best-effort HTM ✔✔✖ Back to basics Transactions bounded by cache ◦ Overflow because of size or associativity ◦ Restart, return reason Not all operations supported ◦ Transactions cannot perform I/O

FastRealizableUnbounded Best-effort HTM ✔✔✖ Back to basics Transactions bounded by cache ◦ Software finds another way Not all operations supported ◦ Software finds another way

FastRealizableUnbounded Best-effort HTM ✔✔✔ Maximum benefit Creative software and ISA makes best-effort unbounded TxLinux ◦ Better performance from simpler synchronization Transaction ordering ◦ Make best-effort unbounded

Linux: HTM proving ground Large, complex application(s) ◦ With different synchronization Jan. 2001: Linux 2.4 ◦ 5 types of synchronization ◦ ~8,000 dynamic spinlocks ◦ Heavy use of Big Kernel Lock Dec. 2003: Linux 2.6 ◦ 8 types of synchronization ◦ ~640,000 dynamic spinlocks ◦ Restricted Big Kernel Lock use

Linux: HTM proving ground Large, complex application ◦ With evolutionary snapshots Linux 2.4 ◦ Simple, coarse synchronization Linux 2.6 ◦ Complex, fine-grained synchronization

Linux: HTM proving ground Modified Andrew Benchmark

Linux: HTM proving ground Modified Andrew Benchmark

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary acquire_lock(lock) release_lock(lock) acquire_lock(lock) release_lock(lock)

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock)

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) TX A TX B

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) TX A do_IO() TX B

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) TX A Locking B

HTM can help 2.4 Software must back up hardware ◦ Use locks Cooperative transactional primitives ◦ Replace locking function ◦ Execute speculatively, concurrently in HTM ◦ Tolerate overflow, I/O ◦ Restart, (fairly) use locking if necessary cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) TX A Locking B

Adding HTM Spinlocks: good fit for best-effort transactions ◦ Short, performance-critical synchronization ◦ cxspinlocks (SOSP '07) 2.4 needs cooperative transactional mutexes ◦ Must support blocking ◦ Complicated interactions with BKL ◦ cxmutex ◦ Must modify wakeup behavior

Adding HTM, cont. Reorganize data structures ◦ Linked lists ◦ Shared counters ◦ ~120 lines of code Atomic lock acquire ◦ Record locks ◦ Acquire in transaction ◦ Commit changes Linux 2.4   TxLinux 2.4 ◦ Change synchronization, not use cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) stat++ TX B

Adding HTM, cont. Reorganize data structures ◦ Linked lists ◦ Shared counters ◦ ~120 lines of code Atomic lock acquire ◦ Record locks ◦ Acquire in transaction ◦ Commit changes Linux 2.4   TxLinux 2.4 ◦ Change synchronization, not use cx_begin(lock) cx_end(lock) cx_begin(lock) cx_end(lock) stat[CPU]++ TX B

Adding HTM, cont. Reorganize data structures ◦ Linked lists ◦ Shared counters ◦ ~120 lines of code Atomic lock acquire ◦ Record locks ◦ Acquire in transaction ◦ Commit changes Linux 2.4   TxLinux 2.4 ◦ Change synchronization, not use cx_begin(lock) do_IO() cx_end(lock) cx_begin(lock) do_IO() cx_end(lock) TX B

Adding HTM, cont. Reorganize data structures ◦ Linked lists ◦ Shared counters ◦ ~120 lines of code Atomic lock acquire ◦ Record locks ◦ Acquire in transaction ◦ Commit changes Linux 2.4   TxLinux 2.4 ◦ Change synchronization, not use cx_begin(lock) do_IO() cx_end(lock) cx_begin(lock) do_IO() cx_end(lock) acquire_locks() TX B

Adding HTM, cont. Reorganize data structures ◦ Linked lists ◦ Shared counters ◦ ~120 lines of code Atomic lock acquire ◦ Record locks ◦ Acquire in transaction ◦ Commit changes Linux 2.4   TxLinux 2.4 ◦ Change synchronization, not use cx_begin(lock) do_IO() cx_end(lock) cx_begin(lock) do_IO() cx_end(lock) do_IO() Locking B

Evaluating TxLinux MAB  Modified Andrew Benchmark dpunish  Stress dcache synchronization find  Parallel find + grep config  Parallel software package configure pmake  Parallel make

Evaluation: MAB 2.4 wastes 63% kernel time synchronizing

Evaluation: dpunish 2.4 wastes 57% kernel time synchronizing

Evaluation: config 2.4 wastes 30% kernel time synchronizing

From kernel to user Best-effort HTM means simpler locking code ◦ Good programming model for kernel ◦ Fall back on locking when necessary ◦ Still permits concurrency HTM promises transactions ◦ Good model for user ◦ Need software synchronization fallback ◦ Don’t want to expose to user ◦ Want concurrency

Software, save me! HTM falls back on software transactions ◦ Global lock ◦ STM Concurrency ◦ Conflict detection ◦ HTM workset in cache ◦ STM workset in memory ◦ Global lock – no workset Communicate between disjoint SW and HW ◦ No shared data structures

Hardware, save me! HTM has strong isolation ◦ Detect conflicts with software ◦ Restart hardware transaction ◦ Only if hardware already has value in read/write set Transaction ordering ◦ Commit protocol for hardware ◦ Wait for concurrent software TX ◦ Resolve inconsistencies ◦ Hardware/OS contains bad side effects

Transaction ordering char* r int idx

Transaction ordering Transaction A begin_transaction() r[idx] = 0xFF end_transaction() Transaction A begin_transaction() r[idx] = 0xFF end_transaction() Transaction B begin_transaction() r = new_array idx = new_idx end_transaction() Transaction B begin_transaction() r = new_array idx = new_idx end_transaction() Invariant: idx is valid for r char* r int idx

Transaction ordering Transaction A begin_transaction() r[idx] = 0xFF end_transaction() Transaction A begin_transaction() r[idx] = 0xFF end_transaction() Transaction B begin_transaction() r = new_array idx = new_idx end_transaction() Transaction B begin_transaction() r = new_array idx = new_idx end_transaction() Invariant: idx is valid for r Inconsistent read causes bad data write char* r int idx

Transaction ordering A(HW) B(HW) read: write: read: write: r = old_array idx = old_idx time r=new_array idx=new_idx r[idx] = 0xFF

Transaction ordering A(HW) B(HW) read: write: r read: write: r = new_array idx = old_idx time r[idx] = 0xFF r=new_array idx=new_idx

r[idx] = 0xFF time Transaction ordering A(HW) B(HW) read: write: r read: r write: r = new_array idx = old_idx r=new_array idx=new_idx Conflict!

r[idx] = 0xFF time Transaction ordering A(HW) B(HW) read: write: r read: write: r = new_array idx = old_idx r=new_array idx=new_idx Restart

Transaction ordering A(HW) B(HW) read: write: read: write: r = old_array idx = old_idx time r=new_array idx=new_idx r[idx] = 0xFF

r=new_array idx=new_idx Transaction ordering A(SW) B(HW) read: write: idx = old_idx r = old_array time r[idx] = 0xFF

r=new_array idx=new_idx Transaction ordering A(SW) B(HW) read: write: idx = old_idx r = new_array time r[idx] = 0xFF

r=new_array idx=new_idx r[idx] = 0xFF time Transaction ordering A(SW) B(HW) read: r write: idx = old_idx r = new_array Conflict not detected

Transaction ordering new_array[old_idx] = 0xFF A(SW) B(HW) read: r, idx write: r[idx] idx = old_idx r = new_array time r=new_array idx=new_idx r[idx] = 0xFF

Transaction ordering new_array[old_idx] = 0xFF A(SW) B(HW) read: r, idx write: r[idx] idx = old_idx r = new_array time r=new_array idx=new_idx r[idx] = 0xFF Oh, no!

time Transaction ordering Hardware contains effects of B ◦ Unless B commits A(SW) B(HW) read: r, idx write: r[idx] idx = old_idx r = new_array r=new_array idx=new_idx r[idx] = 0xFF

time Transaction ordering A(SW) B(HW) read: r, idx write: r[idx] idx = old_idx r = new_array r=new_array idx=new_idx r[idx] = 0xFF Hardware contains effects of B ◦ Unless B commits

time Transaction ordering A(SW) B(HW) read: r, idx write: r[idx] r = new_array r=new_array idx=new_idx r[idx] = 0xFF Hardware contains effects of B ◦ Unless B commits idx = new_idx

time Transaction ordering A(SW) B(HW) read: r, idx write: r[idx] r = new_array r=new_array idx=new_idx r[idx] = 0xFF Hardware contains effects of B ◦ Unless B commits idx = new_idx Asymmetric conflict! Asymmetric conflict!

time Transaction ordering A(SW) B(HW) read: a, idx write: a[idx] r = new_array r=new_array idx=new_idx r[idx] = 0xFF Restart Hardware contains effects of B ◦ Unless B commits idx = new_idx

Software + hardware mechanisms Commit protocol ◦ Hardware commit waits for any current software TX ◦ Implemented as sequence lock Operating system ◦ Inconsistent data can cause spurious fault ◦ Resolve faults by TX restart Hardware ◦ Even inconsistent TX must commit correctly ◦ Pass commit protocol address to transaction_begin()

Transaction ordering Safe, concurrent hardware and software Evaluated on STAMP benchmarks ◦ ssca2 – graph kernels ◦ vacation – reservation system  High-contention  Low-contention ◦ yada – Delauney mesh refinement Best-effort + Single Global Lock STM (ordered) Idealized HTM (free)

Evaluation: ssca2 No overflow – performance == ideal

Evaluation: vacation-low <1% overflow – performance == ideal

Evaluation: vacation-high ~3% overflow – performance near ideal

Evaluation: yada ~11% overflow – software bottleneck 85% execution spent in software

Evaluation Small overflow rates, performance near ideal ◦ Typical overflow unknown ◦ TxLinux 2.4: <1% Can be limited by software synchronization ◦ Global lock: yada has long critical path ◦ STM can help

Related work Best-effort HTM ◦ Herlihy & Moss ISCA '93 ◦ Sun Rock (Dice et al. ASPLOS '09) Speculative Lock Elision ◦ Rajwar & Goodman MICRO '01, ASPLOS '02 Hybrid TM ◦ Damron et al. ASPLOS '06 ◦ Saha et al. MICRO '06 ◦ Shriraman et al. TRANSACT '06

We have the technology TxLinux 2.4 ◦ Add concurrency to simpler locking Transaction ordering ◦ Best-effort becomes unbounded Creative software + simple ISA additions