Copyright © 2006, CS 612 Transactional Memory Architectural Support for a Lock-Free Data Structure Some material borrowed from : Konrad Lai, Microprocessor Technology Labs, Intel Intel Multicore University Research Conference Dec 8, 2005 By: Major Bhadauria
2 Copyright © 2006, CS 612 Motivation Multiple cores face a serious programmability problem Writing correct parallel programs is very difficult Transactional Memory addresses key part of the problem Makes parallel programming easier by simplifying coordination Requires hardware support for performance Implements lock-free data structures
3 Copyright © 2006, CS 612 Benefits of Going Lock-Free Priority inversion: lower-priority process is preempted while holding lock needed by a higher-priority process Convoying: A process holding a lock is de-scheduled, possibly stopping others programs from progressing needlessly. Deadlock: Common problem of process A and B both needing to lock C and D. Process B has lock for D. Process A has lock for C DEADLOCK ALLOWS US TO AVOID:
4 Copyright © 2006, CS 612 Problem: Lock-Based Synchronization Software engineering problems Lock-based programs do not compose Performance and correctness tightly coupled Timing dependent errors are difficult to find and debug Performance problems High performance requires finer grain locking More and more locks add more overhead Need a better concurrency model for multi-core software Lock-based synchronization of shared data access Is fundamentally problematic
5 Copyright © 2006, CS 612 What is Transactional Memory (TM)? Basic mechanisms: Isolation: Track read and writes, detect when conflicts occur Version management: Record new/old values Atomicity: Commit new values or abort back to old values Transactional Memory (TM) allows arbitrary multiple memory locations to be updated atomically and serially. begin_xaction A = A – 20 B = B + 20 A = A – B C = C + 20 end_xaction Thread 1 begin_xaction C = C - 30 A = A + 30 end_xaction Thread 2 Thread 1’s accesses and updates to A, B, C are atomic Thread 2 sees either “all” or “none” of Thread 1’s updates
6 Copyright © 2006, CS 612 Transactional Memory benefits Focus: Multithreaded programmability crisis Programmability & performance Allows conservative synchronization Programmer focuses on parallelism & correctness, HW extracts performance Software engineering and composability Allows library re-use and composition (locks make this very difficult) Critical for wide multi-core demand Makes high performance MT programming easier Captures a fundamental, well-known, intuitive “atomic” construct Been around for decades Similar to a “critical section” but without its problems No deadlocks, priority inversion, data races, unnecessary serialization
7 Copyright © 2006, CS 612 Software Transactional Memory Software Transactional Memory (1995 until now) Significant work from Sun, Brown, Cambridge, Microsoft Serious performance limitations Degrades “common” case of no conflicts/contention >90% of transactions are no conflicts –90% of critical sections are uncontended: what if all these slowed down by 5X? Serious deployability limitations Relies on special runtime support Invasive to applications and libraries Is STM is too slow and too invasive to deploy? Could there be better implementation? But great to understand complex usage models of the future
8 Copyright © 2006, CS 612 Herlihy & Moss’ Implementation Minimum implementation builds on previous LL/SC structure: Loading a shared value to read Writing to a shared value Commit/Abort Functions to commit or flush new values Extensions for performance: Store copy of old value in cache – reduce bus traffic, latency Store committed data in cache – reduce bus traffic Have a read value that you can later change. – reduce bus traffic Have Validate function to check current status – reduce number of orphan transactions Busy bus signal for transactions - reduce number of aborts
9 Copyright © 2006, CS 612 Herlihy & Moss’ Implementation Transactional Memory primitives: Load-transactional (LT): reads value of a shared memory location into a private register. Load-transactional-exclusive (LTX): similar to LT, but indicates that likely to be updated. Store-transactional (ST): tentatively writes a value to a shared memory location, but not visible until a successful committal. TM instructions- commit, abort, validate. Commit makes tentative changes permanent if the transaction’s read set (locations referenced by LT) has not changed, and no other process had read any location within the write set (locations referenced by LTX and ST) Abort discards updates to write set. Validate –tests the current transaction status, T-continue, F-Abort
10 Copyright © 2006, CS 612 Example
11 Copyright © 2006, CS 612 Herlihy & Moss’ Implementation Hardware Modifications (somewhat) OLD NEW
12 Copyright © 2006, CS 612 Hardware Support for Performance A.sum = 100 B.sum = 200 Core 1 begin_xaction A.withdraw(20) B.deposit(20) end_xaction Architectural Memory state A.sum = 100 B.sum = A.sum = 80 B.sum = 220 A.sum = 80 B.sum = Record recovery state 2.Buffer updates/track accesses 3.Commit if no external access (discard all updates if conflict) 1 Core 2 1 begin_xaction Sum = A.sum + B.sum end_xaction Coherence protocol for conflicts Core 2 sees $300 – never $280 or $320 A.sum = 100 B.sum = Abort & restart
13 Copyright © 2006, CS 612 Herlihy & Moss’ Benchmarks Compared against: Test-and-test-and-set (TTS) w/ exponential backoff Software queuing (QOSB mechanism) LOAD_LINKED/STORE_COND (LL/SC)
14 Copyright © 2006, CS 612 Herlihy & Moss’ Test 1 Results
15 Copyright © 2006, CS 612 Herlihy & Moss’ Implementation Limitations Transaction size implementation-dependent Must finish in a single scheduling quantum Locations accessed cannot exceed architecturally specified limit Short duration, small data sets Requires cache coherence model that’s sequentially consistent (otherwise may need fences) Nested transactions problematic Hence Programmer Must Be Aware of Hardware
16 Copyright © 2006, CS 612 What if HW is not sufficient? This is the deployability challenge The missing piece of the puzzle for all prior work… Resource limitations are fundamental Space: caches More HW delays the inevitable: will always be an n+1 case Time: scheduling quanta Programmers have no control over time Affects functionality, not just performance Some transactions may never complete Making HW limit explicit is difficult Limited usage only Unreasonable for high level languages How do you architect it in an evolvable manner?
17 Copyright © 2006, CS 612 Virtualizing Transactional Memory Hides hardware limitations like virtual memory hides physical memory limitations 2 modes 1 for common case which is built into HW 2 nd for buffer overflow, page faults, context switches or thread migration which is built using SW and HW data structures Virtualization allows transactions to: be suspended, migrate, or overflow state from local buffers, and allows nested transactions Extra challenge: Unlike normal TM, can’t only use cache coherence mechanisms, since need to detect conflicts b/w active transactions and transactions whose state partially or completely overflowed to virtual memory
18 Copyright © 2006, CS 612 Virtualize for Completeness Timer interrupts, Context switches, Exceptions,… Limited buffers Core Virtual TM virtual address space Log/buffer space Overflow management Using virtual memory Software libs. and microcode Out-of-band concurrency control Programmer transparent Performance isolation Suspendable/swappable
19 Copyright © 2006, CS 612 Recent TM Research Recently, focus on solving the harder problem of TM Making the model immune to cache buffer size limitations, scheduling limitations, etc. TCC (Stanford) (2004) Same limitation as Herlihy/Moss for TM (size limited to local caches) LTM (MIT), VTM (Intel), LogTM (Wisconsin) (2005) Assume hardware TM support Add support to allow transactions to be immune to resource limitations Goals of each similar, approaches very different LTM: only resource overflow VTM: complete virtualization LogTM: only resource overflow, AND optimize for COMMITs
20 Copyright © 2006, CS 612 Some Research Challenges Large transactions Language extensions IO, loophole, escape hatches, … Interaction and co-existence with Other synchronization schemes: locks, flags, … Other transactions Database transaction System transaction (Microsoft) Other libraries, system software, operating system, … Performance monitor, tuning, debugging, … Open vs Closed Nesting Interaction between transaction & non-transaction Usage & Workload PLDI workshop
21 Copyright © 2006, CS 612 TM – First Decade IBM 801 Database Storage (1980s) Lock attribute bits on virtual memory (via TLBs, PTEs) at 128 byte granularity Load-Linked/Store Conditional (Jensen et al. 1987) Optimistic update of single cache line (Alpha, MIPS, PowerPC) Transactional Memory (Herlihy&Moss 1993) Coined term; TM generalization of LL/SC Instructions explicitly identify transactional loads and stores Used dedicated transaction cache Size of transactions limited to transaction cache Oklahoma Update (Stone et al./IBM 1993) Similar to TM, concurrently proposed Didn’t use cache but dedicated monitored registers to operate upon
22 Copyright © 2006, CS 612 References 1.“Transactional Memory: Architectural Support for Lock-Free Data Structures”, Moss et. al, ISCA “Virtualizing Transactional Memory” K. Lai et. al, ISCA “LogTM: Log-based Transactional Memory”, Wood et. al, HPCA-12