Hardware Support for Efficient Transactional and Supervised Memory Systems
  Jayaram Bobba
  Dissertation Defense, 1/14/2010
  Dept. of Computer Sciences, University of Wisconsin–Madison
  Overview: 1) Research Area 2) Challenges/Contributions 3) Big Picture
Research Area
  Device scaling -> abundant transistors -> emergence of CMPs -> hard to program
  Hardware support to improve productivity
  Supervised systems: empty/full-bits, Transactional Memory, MemTracker, Deterministic Memory
  9/18/2018 Wisconsin Multifacet Project
Challenges
  Supervised Systems: sequential-consistency only; ad hoc hardware; lack of formalism
    Contribution 1: Supervised Memory (TSOdata, Safe Supervision)
  Transactional Memory: “Most transactions are small” is self-fulfilling; limited applicability
    Contribution 2: TokenTM
    Contribution 3: StealthTest
Big Picture
  [Layer diagram from applications down to hardware] Applications; Software Tools: StealthTest; Supervised Systems: TokenTM; TSOdata and Safe Supervision; Supervised Memory; Hardware
Outline
  Motivation
  Supervised Memory
  TokenTM
  StealthTest
  Conclusion
On Software Productivity
  More software needs more productivity. Yannis’s “Law”: programmer productivity doubles every 6 years
  Better hardware delivers more performance: Moore’s Law
  Moore’s Law will continue. But Yannis’s Law?
What has changed?
  “A Fundamental Turn towards Concurrency in Software” [Herb Sutter, 2005]
  Past: Moore’s Law -> better sequential computers
  Memory wall, power wall, etc.
  Current: attack of the killer CMPs*
  How to program? Expose parallelism to software; parallel programs are hard to write
  * Adapted from “Attack of the killer micros” by Eugene Brooks
Who solves the productivity issue?
  Why, of course, hardware architects!
  Long live Moore’s Law: spend some transistors on productivity issues
  Architectural support for enhancing productivity:
    for language features
    for bug avoidance
    for debugging
    for performance feedback
    and so on…
Seriously, who should solve it?
  HW architects or SW engineers? There was a ‘software crisis’ in the past too…
  Why HW architects?
    More bang for the buck (economic): Software/IT (1,152 billion) vs Hardware (138 billion) [Wen-mei Hwu, MICRO-39 keynote]
    SW cannot do it alone (technical): decades of automatic parallelization efforts
    Virtual Memory, Tagged Memory for LISP-like languages
  “We must now reconsider the balance of hardware and software and to provide more specialized function in hardware than we have previously, in order to drastically simplify the programming process”
    Edward A. Feustel, IEEE TOC, July 1973, in support of Tagged Memory
Outline
  Motivation
  Supervised Memory
    Background/Motivation
    Explore relaxed supervised systems
    Define Supervised Memory
    Propose formal models
  TokenTM
  StealthTest
  Conclusion
Why Supervised Systems?
  Synchronization: hardware TM systems; empty/full-bits [Berry et al. 2006]: graph processing algorithms on a 4-processor MTA > 64K BG/L
  Controlled non-determinism: deterministic/interleaving-constrained multiprocessing
  Debugging: log-based architectures
  Safety: heap checkers, bounds checkers
  Language features: hardware-assisted garbage collection
What are Supervised Systems? (synthesized definition)
  Out-of-band metadata per data block
  Monitor and control (supervise) memory accesses to data
  Execute handlers on specific metadata states
  Pure software is possible but inefficient: shadow memory, e.g., Valgrind, mean slowdown 22X [Nethercote et al., VEE 2007]
State-of-the-Art
  Expect Sequentially-Consistent (SC) hardware; most hardware is not
  Ad hoc; whither primitives?
  Informal treatment of memory consistency; ambiguous/incorrect
Contributions
  Expect Sequentially-Consistent (SC) hardware; most hardware is not -> Explore relaxed supervised systems
  Ad hoc; whither primitives? -> Define Supervised Memory
  Informal treatment of memory consistency; ambiguous/incorrect -> Propose formal memory models
TSO-lite: A TSO-compliant system (Explore relaxed supervised systems)
  [Animation: a processor (PC, registers r1-r3) with a store buffer in front of memory; memory holds data and metadata for blocks A, B and C. Program: ST 1,[A]; LD [B],r1; ST 2,[C]; LD [C],r3. Stores to A and C pass through the store buffer.]
Empty/Full-Bits on TSO-lite (Explore relaxed supervised systems)
  [Animation: the program ST 1,[A]; LD [B],r1; ST 2,[C]; LD [C],r3 on TSO-lite with Empty/Full metadata per block; loads and stores can raise exceptions depending on the metadata]
  I1: No load bypass
  I2: Late exceptions
Deterministic Shared Memory (DMP) [Devietti et al., ASPLOS 2009] (Explore relaxed supervised systems)
  “depending upon the consistency model of the underlying hardware, threads must perform a memory fence at the edge of a quantum”
  Insert a fence after the last operation in the quantum
  Insert a fence before the first shared operation in the quantum
  I3: Reordered metabit-reads
What is Supervised Memory? (Define Supervised Memory)
  Each memory location A has data (A.d) and metadata (A.m)
  New operations: Supervised Load (sLD A), Supervised Store (sST A)
  Jump on reading special metadata
  (Optionally) hardware exception
Supervised Operations (Define Supervised Memory)
  sLD A =>
  Start: atomic {
      curm = Val[R A.m]                      // Read metadata
      nextm = NEXT(Load, curm)               // Check software-specified FSM
      if nextm == EXCEPTION then jump to Handler
      R A.d                                  // Read data
      if nextm != curm then W A.m, nextm     // Update metadata
  }
  Handler: …
Using Supervised Memory (Define Supervised Memory)
  Software assigns semantics to metadata
    Metastates stored as metadata, e.g., Initialized, Uninitialized
    Metastate transition function (NEXT)
  Use supervised operations to monitor/control data operations, e.g., catch read access to uninitialized data
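A minimal Python sketch of the heap-checker use named on this slide, assuming a single-threaded model so that each operation is trivially atomic. The names `SupervisedMemory`, `next_state`, `SupervisionFault` and `heap_checker_demo` are hypothetical, and the body of the supervised store is an assumption that mirrors the supervised load (the slides only spell out sLD).

```python
EXCEPTION = "EXCEPTION"                        # sentinel: NEXT asks for the handler
UNINIT, INIT = "Uninitialized", "Initialized"  # metastates from the slide's example


def next_state(op, curm):
    """Software-specified metastate transition function (NEXT)."""
    table = {
        ("Load", UNINIT): EXCEPTION,   # catch reads of uninitialized data
        ("Load", INIT): INIT,
        ("Store", UNINIT): INIT,       # the first store initializes the location
        ("Store", INIT): INIT,
    }
    return table[(op, curm)]


class SupervisionFault(Exception):
    """Stands in for the jump to the software handler."""


class SupervisedMemory:
    def __init__(self):
        self.d = {}   # A.d: data per location
        self.m = {}   # A.m: metadata per location

    def allocate(self, addr):
        self.d[addr] = 0
        self.m[addr] = UNINIT

    def _supervise(self, op, addr):
        curm = self.m[addr]              # read metadata
        nextm = next_state(op, curm)     # check the software-specified FSM
        if nextm == EXCEPTION:
            raise SupervisionFault(f"{op} to {addr} in metastate {curm}")
        return curm, nextm

    def sld(self, addr):
        curm, nextm = self._supervise("Load", addr)
        value = self.d[addr]             # read data
        if nextm != curm:
            self.m[addr] = nextm         # update metadata only on a change
        return value

    def sst(self, addr, value):
        # Assumed to mirror sLD: check metadata, write data, update metadata.
        curm, nextm = self._supervise("Store", addr)
        self.d[addr] = value
        if nextm != curm:
            self.m[addr] = nextm


def heap_checker_demo():
    mem = SupervisedMemory()
    mem.allocate("A")
    try:
        mem.sld("A")                     # read before any store
        caught = False
    except SupervisionFault:
        caught = True
    mem.sst("A", 42)
    return caught, mem.sld("A"), mem.m["A"]
```

In this sketch the read of a freshly allocated location is caught, while the read after the first store succeeds. Swapping in a different `next_state` table gives a different supervised system without touching the load/store logic.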
TSO Axioms [Hangal et al., ISCA 2004] (Propose formal models)
  Order: total order on all write accesses
  Atomicity: no intervening accesses for atomic operations
  Termination: all write accesses eventually complete
  Value: reads return the latest value from memory or the store buffer
  Memory Barrier: no reordering across a barrier
  ReadAny: accesses cannot pass outstanding reads
  WriteWrite: write accesses cannot pass outstanding writes
  Reordering axioms: ReadAny, WriteWrite
    Rd A, Rd B: ordered
    Rd A, Wr B: ordered
    Wr A, Wr B: ordered
    Wr A, Rd B: may reorder (allows store buffers)
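The one relaxation TSO permits (a later read passing an earlier write to a different address) can be illustrated with a toy store-buffer model. This is a sketch under simplifying assumptions, not the formal model from the cited paper: each processor has a FIFO store buffer, loads check the local buffer before memory, and the interleaving is fixed by hand. `TSOProcessor` and `litmus` are hypothetical names.

```python
from collections import deque


class TSOProcessor:
    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.store_buffer = deque()   # FIFO of (address, value)

    def store(self, addr, value):
        # Stores retire into the buffer, not directly into memory.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Value axiom: latest value from the local store buffer, else memory.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def fence(self):
        # Memory barrier: drain buffered stores in FIFO order.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


def litmus(use_fence):
    """P0: ST A=1; LD B.   P1: ST B=1; LD A."""
    memory = {"A": 0, "B": 0}
    p0, p1 = TSOProcessor(memory), TSOProcessor(memory)
    p0.store("A", 1)
    if use_fence:
        p0.fence()
    r1 = p0.load("B")
    p1.store("B", 1)
    if use_fence:
        p1.fence()
    r2 = p1.load("A")
    p0.fence()
    p1.fence()
    return r1, r2, dict(memory)
```

Without fences both loads return 0 in this interleaving, an outcome sequential consistency forbids; with a fence after each store, at least one load observes the other processor's store.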
TSOall: A Consistency Model for Supervised Memory (Propose formal models)
  TSO axioms applied to all accesses, data and metadata
    + (Simple) Like TSO
    - (Slow) Prohibits optimizations
  Thread: sST A -> [Rd A.m, Wr A.d, Wr A.m]; sLD B -> [Rd B.m, Rd B.d] => store buffers ineffective
  Tension: ease of reasoning vs performance
Blast from the Past [Adve and Hill, ISCA 1990] (Propose formal models)
  Ease of reasoning (SC) vs performance (RC)
  Observation: simple programs rely only on certain SC orders
    Ignore non-essential orders; execution still appears as SC
  Challenge: Simple? Non-essential orders?
  Solution: data-race-freedom
    For data-race-free programs, RC = SC
Safe Supervision Motivation (Propose formal models)
  Ease of reasoning (TSOall) vs performance (?)
  Observation: simple supervised programs rely only on certain TSOall orders
    Ignore non-essential orders; execution still appears as TSOall
  Challenge: Simple? Non-essential orders?
  Solution: Safe Supervision
    For safely supervised programs, ? = TSOall
Safe Supervision
  Metadata accesses to location A are not used to order operations to a different location B
  Most uses of supervision are safely supervised, e.g.,
    Heap checker: initialized/uninitialized values
    Transactional Memory: conflict detection information
  Example (A.m is used to order accesses to B.d):
    Initially, A.m = Empty, B.d = 0
    Thread 1: B.d = 1; A.m = Full
    Thread 2: while (A.m == Empty); read B.d
TSOdata: Fast Yet Simple (Propose formal models)
  Order: total order on all write accesses
  Atomicity: no intervening accesses for atomic operations
  Termination: all write accesses eventually complete
  Value: reads return the latest value from memory or the store buffer
  Memory Barrier: no reordering across a barrier
  ReadAny: data accesses cannot pass outstanding data reads
  WriteWrite: data writes cannot pass outstanding data writes
  Reordering axioms: ReadAny, WriteWrite
  Thread: sST A -> [Rd A.m, Wr A.d, Wr A.m]; sLD B -> [Rd B.m, Rd B.d] => store buffers can be used
  For safely supervised programs, TSOdata = TSOall
TSOdata on OpenSPARC T2
  Goal: explore low-level issues on a real design
  Late exceptions with deferred handlers
    Dump store buffer entries on exception
    Enhance store buffer to carry the Virtual Address (VA)
    ~200 cycles to read out 4 entries
  Disable store buffer bypassing for supervised loads
  Low space overhead for adding metabits (~4%)
Supervised Memory Summary
  Expects Sequentially-Consistent (SC) hardware; most hardware is not -> Explore relaxed memory systems
  Ad hoc; whither primitives? -> Define Supervised Memory
  Informal treatment of memory consistency; ambiguous/incorrect -> Propose formal memory models
Outline
  Motivation
  Supervised Memory
  TokenTM [ISCA 2008]
  StealthTest
  Conclusion
TokenTM Summary
  Current hardware TMs
    Assume most transactions are small and short running
    Penalize large/long transactions
    Too restrictive for widespread TM use?
  Hypothesis: must support efficient large/long transactions as well
  Is such an HTM even possible? Yes! TokenTM
    1. LogTM’s log to buffer unbounded values
    2. Transactional tokens for unbounded conflict detection; conflict state in memory metabits
Transactional Tokens
  Challenge: how to efficiently track read/write sets?
  Token Coherence [Martin03]: read/write sets for cache coherence
  Solution: transactional tokens
    T tokens per memory block
    At least one token to read, all T tokens to write (token conflict detection)
    Token metadata <c0,c1,…,ci,…>, where 0≤ci≤T is the count of tokens held by the thread with TID i
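The token rule (one token to read, all T to write) can be sketched in a few lines of Python. This is an illustrative model only: `BlockTokens`, `TokenConflict` and `trace` are hypothetical names, T = 8 is an arbitrary choice, and raising an exception stands in for whatever conflict resolution the real system applies.

```python
T = 8  # tokens per block; arbitrary value chosen for this sketch


class TokenConflict(Exception):
    pass


class BlockTokens:
    def __init__(self, num_threads):
        self.c = [0] * num_threads   # c[i]: tokens held by thread with TID i

    def acquire_read(self, tid):
        if self.c[tid] >= 1:
            return                   # already holds a token for this block
        if sum(self.c) >= T:
            raise TokenConflict("no free token for read")
        self.c[tid] = 1

    def acquire_write(self, tid):
        held_by_others = sum(self.c) - self.c[tid]
        if held_by_others > 0:
            raise TokenConflict("insufficient tokens for write")
        self.c[tid] = T              # a writer holds all T tokens

    def release(self, tid):
        self.c[tid] = 0


def trace():
    """Two threads X and Y: both read A, X writes B, then Y tries to write A."""
    X, Y = 0, 1
    A, B = BlockTokens(2), BlockTokens(2)
    A.acquire_read(X)
    A.acquire_read(Y)
    B.acquire_write(X)
    try:
        A.acquire_write(Y)           # X still holds a token on A
        conflict = False
    except TokenConflict:
        conflict = True
    return A.c, B.c, conflict
```

In this trace the metadata for A ends as <1,1>, B as <T,0>, and Y's write to A is reported as a conflict because X still holds a read token.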
Tokens and Supervised Memory
  Challenge: where to store unbounded, globally accessible token metadata?
  Solution: Supervised Memory’s metadata
    Piggyback on existing virtual memory and cache coherence mechanisms
TokenTM: a Large-Transaction TM
  New conflict detection mechanism: transactional tokens in Supervised Memory; Token Coherence [Martin03] at a different level
  Version management: save old/new values for unbounded write set; LogTM [Moore06] undo log
Outline
  Motivation
  Supervised Memory
  TokenTM
  StealthTest [PACT 2009]
  Conclusion
StealthTest Summary (1/2): The Problem: fork Overhead
  Software testing is hard; multithreading makes it harder
  Online software testing can help: run tests on deployed software, e.g., Delta Execution for patch testing [Tucek et al., ASPLOS 2009]
  Non-intrusive mechanisms: low overhead, functionally hidden, good scaling
  Existing mechanism: fork
StealthTest Summary (2/2): Solution: TM for testing
  Leverage Transactional Memory for online testing
  Non-intrusive? transaction { test(); abort }
  Fast TM mechanisms: low overhead, functionally hidden, good scaling
  Demonstrate two uses: Delta Execution, in vivo testing
Outline
  Motivation
  Supervised Memory
  TokenTM
  StealthTest
    Online software testing, e.g., patch validation
    StealthTest: TM for online testing
    Delta Execution using StealthTest
    In vivo testing using StealthTest (optionally)
  Conclusion
Online Patch Validation
  Bug fixes can introduce more bugs; patches must be validated
  Online validation [Nagaraja et al., OSDI 2004]: increased resource usage, lockstep execution
  [Figure: the same input feeds the production and testing versions; their outputs are diffed]
Delta Execution [Tucek et al., ASPLOS 2009]
  Online patch validation
  Most patches are small; patched and unpatched executions are similar
  Delta Execution: run together except when they differ
  [Table: prior work vs Delta Execution on increased resource usage and lockstep execution]
Delta Execution using fork
  [Timeline figure with steps: merged execution; fork; compute delta data; isolate delta data; patched execution (testing) alongside unpatched execution (production); install delta data; merged execution]
Multi-threading and fork
  [Timeline figure with steps: ‘park’ all other threads; fork; compute delta data; isolate delta data; patched execution (testing) alongside unpatched execution (production); install delta data; merged execution]
  Stop all threads to get a consistent memory snapshot
fork
  Poor performance: ~9.8 ms for split, ~106 ms for merge [Tucek et al., ASPLOS 2009]
  Poor scalability: web-server response rate reduced by 43%
  Want an alternate mechanism
Delta Execution using StealthTest
  Delta Execution need -> StealthTest (Transactional Memory) vs fork
    Isolate patched execution -> transaction{…} vs execute on child process
    Introspect patched execution -> version management tracks new/old values vs page diffing
    Monitor delta data access -> conflict detection monitors accesses vs mprotect
StealthTest Interface
  Delta Execution need -> StealthTest call -> Transactional Memory mechanism
    Isolate patched execution -> ST_begin_transaction, ST_abort_transaction -> transaction{…}
    Introspect patched execution -> ST_get_old, ST_get_new -> version management tracks new/old values
    Monitor delta data access -> ST_protect_set, ST_protect_clear -> conflict detection monitors accesses
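To make the isolate/introspect roles concrete, here is a toy Python model of the pattern transaction { test(); abort }. It is only a sketch: the slides name the ST_* calls but not their signatures, so `ToyStealthTest`, its methods and `run_hidden_test` are hypothetical stand-ins built on a simple undo log, with memory modeled as a pre-populated dict.

```python
class ToyStealthTest:
    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, assumed pre-populated
        self.undo = None       # address -> old value; None outside a transaction

    def begin_transaction(self):
        self.undo = {}

    def write(self, addr, value):
        # Version management: save the old value on the first write in a transaction.
        if self.undo is not None and addr not in self.undo:
            self.undo[addr] = self.memory[addr]
        self.memory[addr] = value

    def get_old(self, addr):
        # Pre-transaction value of addr.
        return self.undo.get(addr, self.memory[addr])

    def get_new(self, addr):
        # Value of addr as seen inside the transaction.
        return self.memory[addr]

    def abort_transaction(self):
        # Roll back every write made since begin_transaction.
        for addr, old in self.undo.items():
            self.memory[addr] = old
        self.undo = None


def run_hidden_test(st, test):
    """Run test inside a transaction, always abort, and return its result."""
    st.begin_transaction()
    try:
        return test(st)
    finally:
        st.abort_transaction()


def example():
    memory = {"x": 1}
    st = ToyStealthTest(memory)

    def test(s):
        s.write("x", 2)
        return s.get_old("x"), s.get_new("x")

    result = run_hidden_test(st, test)
    return result, memory["x"]
```

The test observes both versions of `x` while it runs, yet the application's memory is unchanged afterwards; only the returned result escapes the transaction.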
Requirements from TM
  Strong atomicity [Martin et al., CAL 2006]: transactions isolated from non-transactions => test transactions isolated from application code
  Flexible conflict resolution: can abort transactions if necessary => abort tests if they block the application
  Communication from within transactions => expose the result of a test
Delta Execution using StealthTest
  [Timeline, fork-based: merged execution; fork; compute and isolate delta data; patched execution (testing) alongside unpatched execution (production); install delta data; merged execution]
  [Timeline, StealthTest (production): merged execution; transaction: compute and isolate delta data, patched execution, introspect and rollback; unpatched execution; install delta data; merged execution]
  Calls shown: ST_begin_tran…, ST_protect_set, ST_get_new, ST_get_old, ST_abort_tran…
Multi-threaded Delta Execution
  [Timeline, fork-based: merged execution; fork; compute and isolate delta data; patched execution (testing) alongside unpatched execution (production); install delta data; merged execution]
  [Timeline, StealthTest (original): merged execution; transaction: compute and isolate delta data, patched execution, introspect and rollback; unpatched execution; install delta data; merged execution]
  Calls shown: ST_begin_tran…, ST_protect_set, ST_get_new, ST_get_old, ST_abort_tran…
Evaluation
  (1) Effective? (2) Non-intrusive?
  Workloads: collection of multi-threaded server apps, same as Tucek et al., ASPLOS 2009
  Pin-based TM emulation
  2-way SMP with 2.4 GHz Pentium 4 CPUs and 2.5 GB RAM
(1) Effective? Program Description Patch Patch Verified? fork StealthTest Crafty Chess App Code refactoring P Raytrace Raytracer Result reporting fix Tar Archive Util Incremental archiving fix Apache1 Web Server Buffer overflow fix Apache2 DNSCache DNS Cache Behavior Change MySQL5.0 DB Server Extra permission checks O OpenSSL Security Lib Added bug in TLS handling Squid Web Cache ATPhttpd Works Memory allocation sockets
(2) Non-intrusive? Program Description fork ForkOverhead(%) PatchDuration(%) Crafty Chess App 0.1 <0.1 Raytrace Raytracer 0.2 0.5 Tar Archive Util 41 7.3 Apache1 Web Server 2.8 Apache2 12 DNSCache DNS Cache 65 MySQL5.0 DB Server 4.7 5.0 OpenSSL Security Lib Squid Web Cache 2.9 ATPhttpd 0.8
Outline
  Motivation
  Supervised Memory
  TokenTM
  StealthTest
    Online software testing, e.g., patch validation
    StealthTest: TM for online testing
    Delta Execution using StealthTest
    In vivo testing using StealthTest
  Conclusion
StealthTest Summary
  Software testing is hard; online software testing can help
  Existing mechanisms are inadequate
  StealthTest leverages TM for non-intrusive online testing: low overhead, functionally hidden, good scaling
  Demonstrate two uses: Delta Execution, in vivo testing
Outline
  Motivation
  Supervised Memory
  TokenTM
  StealthTest
  Conclusion
Contribution 1: Supervised Memory [Under Submission]
  Supervised systems: useful, renewed interest
  Problem
    SC only, while most systems are not
    Ad hoc hardware, specific to a supervised system
    No formalism, which leads to ambiguity/incorrectness
  Contributions
    Explore non-SC systems
    General model for supervision: Supervised Memory
    Formal specification
Contribution 2: TokenTM [Bobba et al., ISCA 2008]
  Transactional Memory, a supervised system
  Problem
    “Most transactions are small” is a self-fulfilling assumption
    Penalizes large/long transactions
    Too restrictive for widespread TM use?
  Contributions
    TokenTM: first HTM to support efficient large/long transactions as well
    Follow-up: Purdue’s LiteTM [Jafri et al., HPCA 2010]
Contribution 3: StealthTest [Bobba et al., PACT 2009]
  Using transactional memory for testing
  Problem: existing fork-based mechanisms have high overhead and poor scalability
  Contributions
    StealthTest, a low-overhead interface for online testing
    Two StealthTest-based testing frameworks
Other Research and Contributions
  Performance Pathologies: Bobba et al., ISCA 2007; Bobba et al., IEEE Micro Top Picks, Jan 2008
  LogTM-SE: Yen et al., HPCA 2007
  Nested LogTM: Moravan et al., ASPLOS 2006
  LogTM: Moore et al., HPCA 2006
  GEMS: LogTM-SE implementation; development, release and support
Acknowledgments
  Advisors: Mark Hill, David Wood
  Mike Swift, Ben Liblit, Shan Lu, Karu Sankaralingam, Mikko Lipasti, Jeffrey Naughton
  Co-authors: Kevin Moore, Luke Yen, Haris Volos, Michelle Moravan, Weiwei Xiong, Neelam Goyal
  Colleagues: Alaa Alameldeen, Arkaprava Basu, Brad Beckmann, Polina Dudnik, Dan Gibson, Mike Marty, Somayeh Sardashti, Rathijit Sen, Cong Wang, Yasuko Watanabe, Min Xu, Matt Allen, Piramanayagam Arumuga Nainar, Siddharth Barman, Koushik Chakraborthy, Venkat Govindraju, Amit Kumar, Srinath Sridharan, Philip Wells
Key Contributions
  [Layer diagram from applications down to hardware] Applications; Software Tools: StealthTest; Supervised Memory Systems: TokenTM; TSOdata and Safe Supervision; Hardware
Backup
CPU Trends
  [Figure from “The Free Lunch Is Over” by Herb Sutter]
The ‘Re-Birth’ of Parallel Programming
  Sequential computers
  Memory wall, power wall, etc.
  Attack of the killer CMPs*: general-purpose parallel computers
  How to program? Expose parallelism to software
  “The Free Lunch is Over”, Herb Sutter, 2005
  * Adapted from “Attack of the killer micros” by Eugene Brooks
Parallel Programming is Hard (currently)
  Hard for programmers
    Correctness: synchronization, data races, atomicity violations
    Performance: communication, scheduling, load balancing, critical path
  Hard for tools
    Compilers, static analysis: intractable/inefficient
Houston, We Have a Problem!
  Who should solve this problem?
  Yannis’s Law: programmer productivity doubles every 6 years
    http://ix.cs.uoregon.edu/~yannis/law.html
  Proebsting's Law: compiler advances double computing power every 18 years
    http://research.microsoft.com/en-us/um/people/toddpro/papers/law.htm
Parallel Algorithms vs Moore’s Law
  “Improvement resulting from … algorithmic speedup is comparable to that resulting from the hardware speedup due to Moore’s Law over the same length of time”
    David E. Keyes, “A Science-Based Case for Large-Scale Simulation”, July 2003
In the “Landscape of Parallel Computing Research…” [Asanovic et al., 2006]
Why not Tagged Memory? [Gehringer and Keedy, CAN 1985]
  Type information in tags
  Arguments do not apply to dynamically-typed languages like Lisp
  For other languages
    Simpler but more specialized designs
    Compilers improved to make the proposals moot
Existing proposals assume SC (Explore relaxed memory systems)
  Assume SC or don’t deal with multiprocessors
  Proposal Base Architecture Implementation WWT MIPS SC Tapeworm LogTM SPARC OneTM Informing Memory MIPS, Alpha SafeMem x86 MemTracker DMP
DMP Correctness
Non-TSOall Executions
TSOdata is Complex (Propose formal models)
  [FSM diagram for empty/full-bits: states Empty and Full with sLD and sST transitions; other cases raise an exception]
  Initial state: A.d = 0, A.m = None; B.d = 0, B.m = Empty
  T0: dST 1, A; sLD B
  T1: sST B, 1; dLD A
  Can dLD A return 0?
Safe Supervision
TokenTM Logical Operation
  [Animation: Thread X runs BEGIN_XACT; Load A; Store B; COMMIT_XACT. Thread Y runs BEGIN_XACT; Load A; Store A; COMMIT_XACT. Token metadata <cx, cy, …> per block: A goes <0,0,…> -> <1,0,…> -> <1,1,…>; B goes <0,0,…> -> <T,0,…>, with X's undo log saving B's old value 0x..00.. before 0x..11.. is written; Y's Store A finds insufficient tokens and aborts.]
Existing HTM Systems
  Assumption: most transactions are small and short running
  Optimized for small transactions; degrade with large, long-running transactions
    Non-localized overhead, e.g., LogTM-SE [Yen07] false conflicts, OneTM [Blundel07] serializes
    Complex, expensive operations, e.g., XTM [Chung06] and PTM [Chuang06] manipulate page tables
  Premature optimization?
Why Large Transactions?
  Programmers may want large (>> cache) and/or long (>> ctx switch) transactions
    HLL transactions invoke unpredictable lower-level code
    Replace critical sections containing syscalls or I/O
    Avoid concurrency bugs [Lu08]
  But “most transactions small & short running”
    Restrict TM to use by gurus (like OS spin locks)?
    Self-fulfilling prophecy?
  Must support efficient large/long transactions as well
Toward a Large-Transaction TM
  Efficiently detect conflicts between in-flight transactions using read/write sets
    Unbounded
    Globally accessible
    Fast read/write set ops, e.g., add to read set, clear read set
  Small transactions: low overhead
  Large transactions: localized overhead; accessible read/write set (potentially unbounded)
  Minimal changes to coherence/VM; avoid heavyweight eviction ops, negative acks, additional page tables
Existing Mechanisms
  Synergy between cache coherence and conflict detection; hence, overload cache coherence
    + Excellent for bounded/small TM
    - ‘Virtualization’ on overflows
    - Tough to access ‘virtualized’ state
  Small transactions: low overhead
  Minimal changes to coherence/VM: not met
  Large transactions: localized overhead: not met
TokenTM: a Large-Transaction TM
  New conflict detection mechanism: transactional tokens in Tagged Memory; Token Coherence [Martin03] at a different level
  Version management: save old/new values for unbounded write set; LogTM [Moore06] undo log
Tagged Memory
  Challenge: where to store unbounded, globally accessible token metadata?
  Virtual memory is unbounded and globally accessible
  Solution, similar to OneTM [Blundel07]: tag virtual memory
    Piggyback on existing virtual memory and cache coherence mechanisms
Storing Metadata
  Software: per-thread token logs next to the undo logs; the full metadata <cx, cy, …> is unbounded and difficult to access globally
  Hardware (tagged memory): per-block metastate (Sum, TID), e.g., (0, -); concise and accessible, but a lossy summary
Hardware Metastate
  Metadata summary (sum, TID)
    sum: total number of tokens acquired
    TID: identifies the owner when sum = 1 or sum = T (optional)
  Concise -> stored in a packed field (e.g., State[1:2], Attr[3:16])
  Fast -> accessed as part of a normal memory operation
  Some summaries, <c0, c1, …, ci, …> -> (sum, TID):
    <0, 0, 0, 0> -> (0, -)
    <0, 0, 1, 0> -> (1, 2)
    <0, T, 0, 0> -> (T, 2)
    <0, 1, 1, 1> -> (3, -)
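The summary function described on this slide can be written down directly. The following Python sketch assumes 0-based TIDs and uses `None` for the "-" entry; `summarize` is a hypothetical name, and the rule for when a single owner exists is my reading of "sum = 1 or sum = T".

```python
def summarize(counts, T):
    """Collapse per-thread token counts <c0, c1, ...> into (sum, TID).

    TID is reported only when one thread holds every acquired token,
    which can happen when sum == 1 or sum == T; otherwise it is None.
    """
    total = sum(counts)
    tid = None
    if total in (1, T):
        holders = [i for i, c in enumerate(counts) if c == total]
        if len(holders) == 1:
            tid = holders[0]
    return total, tid


def is_lossy_example(T=8):
    # Two different token vectors collapse to the same summary,
    # which is why the per-thread token logs are still needed.
    return summarize([1, 1, 1, 0], T) == summarize([0, 1, 1, 1], T)
```

With T = 8, `summarize([0, 0, 1, 0], 8)` gives `(1, 2)` and `summarize([0, 1, 1, 1], 8)` gives `(3, None)`, which shows where the summary stops identifying individual holders.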
Token Logs
  Distributed structures for unbounded read/write sets
    Per-thread
    Stored in program memory (e.g., heap)
    List of <address, num_tokens>, e.g., A: 1, B: T
  Accessible to hardware for fast ops
    Add to read set -> append to token log
Double-entry Bookkeeping (Keeping Metadata Consistent)
  [Animation: Threads X and Y each run BEGIN_XACT … COMMIT_XACT; X does Load A, Store B and Y does Load A, Store A. Each acquired token is recorded twice: in the thread's software token log (e.g., A: 1) and in the block's hardware metastate (Sum, TID). For block A the metastate goes (0, -) -> (1, X) -> (2, -) while the logical token state <cx, cy, …> goes <0,0,…> -> <1,0,…> -> <1,1,…>.]
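A small Python sketch of the double-entry idea, under the assumption that every token acquisition updates both the per-block counts and the acquiring thread's token log, and that commit or abort walks the log to return tokens. `TxThread`, `acquire`, `release_all`, `books_balance` and `bookkeeping_demo` are hypothetical names.

```python
class TxThread:
    def __init__(self, tid):
        self.tid = tid
        self.token_log = []   # list of (address, num_tokens)


def acquire(metadata, thread, addr, num_tokens):
    # Entry 1: per-block token counts. Entry 2: the thread's token log.
    metadata[addr][thread.tid] += num_tokens
    thread.token_log.append((addr, num_tokens))


def release_all(metadata, thread):
    # On commit or abort, walk the log and return every logged token.
    for addr, n in thread.token_log:
        metadata[addr][thread.tid] -= n
    thread.token_log.clear()


def books_balance(metadata, threads):
    # The two entries agree when each count equals the tokens logged for it.
    for addr, counts in metadata.items():
        for t in threads:
            logged = sum(n for a, n in t.token_log if a == addr)
            if counts[t.tid] != logged:
                return False
    return True


def bookkeeping_demo(T=8):
    metadata = {"A": [0, 0], "B": [0, 0]}
    x, y = TxThread(0), TxThread(1)
    acquire(metadata, x, "A", 1)     # X: Load A
    acquire(metadata, y, "A", 1)     # Y: Load A
    acquire(metadata, x, "B", T)     # X: Store B
    during = ({k: list(v) for k, v in metadata.items()},
              books_balance(metadata, [x, y]))
    release_all(metadata, x)         # X commits
    return during, metadata, books_balance(metadata, [x, y])
```

After X commits, only Y's read token on A remains, and the log and the counts still agree.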
Implementing Hardware Metastate
  [Animation: metastate (Sum, TID) is stored with each block in the private caches and in main memory behind the directory; coherence requests shown include GETS A, Fwd_GETS A and Upgrade A]
  Shared copies cannot update metastate
  Solution: fission/fusion
Metastate Fission
  [Animation: when Thread Y's GETS A is forwarded to Thread X's cache, the metastate (1, X) fissions into (1, X) and (0, -), one per cached copy; Y's Load A then records its own token in its copy]
Metastate Fusion
  On store, metastate copies are fused back
  Why does fission/fusion work?
    Store sees ‘complete’ metastate
    Load sees ‘complete’ metastate if a writer exists, ‘partial’ metastate otherwise
Hardware Cost
  Additional metabits in caches/memory
    Recoded ECC to cull metabits
  Changes to coherence protocols
    Additional payload on messages
    Minimal changes to protocol logic
    Requires non-silent eviction
Evaluation Methodology
  Full-system simulation: Multifacet GEMS
  Base system
    32-core CMP, in-order, single-issue cores
    Private 4-way 32 KB writeback split I&D L1 caches
    Shared 8-way 8 MB writeback L2
    On-chip directory @ L2, MESI coherence
    Packet-switched interconnect in a tiled topology
TM Systems
  LogTM-SE [Yen07] variant: parallel Bloom filters for conflict detection; 4 2Kbit H3 filters
    + Compact, less hardware overhead
    - False conflicts
  LogTM-SE_Perfect
    + No false conflicts
    - Unimplementable
  TokenTM
Results
  Small transactions, low overhead: comparable on small transactions
  Large transactions, localized overhead: minor degradation with large transactions
In vivo Testing [Murphy et al. TR 2007, Chu et al. ICST 2008]
  Run unit tests on deployed software
    + More testing
    + More realistic
  Catch bugs early
In vivo Testing using StealthTest
  ST_begin_transaction();
  try {
      test();
      ST_begin_escape();               // step outside the transaction to report the result
      fprintf(log, "…", success);
      ST_end_escape();
  } catch/except() {
      fprintf(log, "…", fail);
  }
  ST_abort_transaction(NO_RETRY);      // always roll back the test's effects
Evaluation
  Workloads
    Bugbench: server workloads
    STAMP: Transactional Memory benchmarks
  Implementation
    Intel STM: language-based TM
    TL2 STM: library-based TM
  Quad-core workstation with RHEL5
(1) Effective? Built on Intel STM. Run tests on Bugbench applications Program Description Size (LOC) Bug Type Error Detected? NCOM file compress 1.9K Stack Smash Yes POLY file “unixier” 0.7K GZIP 8.2K Buffer Overflow MAN documentation 4.7K BC calculator 17.0K HTPD1 web server 224K Atomicity SQUD proxy cache 93.5K Possible CVS version control 114.5K Double Free MSQL2 DBMS 514K MSQL3 1028K Works Unsupported Library Calls
(2) Non-intrusive?
  Built on TL2 STM. Run tests on STAMP applications (1000 tests per min)
Atomicity Violation Bugs?
Degree-2 Transactions
  Isolate only writes
  Implementation
    Reads in escape action
    Early release
    Add new type of transaction to TM
StealthTest Wish List
  Hardware support
  System calls within transactions
  Interaction between locks and transactions
Related Work
  Binary translation (SPROCKETS)
  Code emulation (STEM)
  TLS (Oplinger & Lam, PathExpander)
In vivo Testing Motivation
  Ordering bug in MySQL
  In vivo test: data consistency checks
  Buggy code (in mysys/thr_lock.c):
    void thr_lock_delete(THR_LOCK *lock)
    {
      …
      pthread_mutex_destroy(&lock->mutex);
      list_delete(thr_lock_thread_list, &lock->list);
    }