Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha,


Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

2 Motivation
Multi-core architectures are mainstream
–Software concurrency needed for scalability
–Concurrent programming is hard
–Difficult to reason about shared data
Traditional mechanism: Lock-based Synchronization
–Hard to use
–Must be fine-grain for scalability
–Deadlocks
–Not easily composable
New Solution: Transactional Memory (TM)
–Simpler programming model: Atomicity, Consistency, Isolation
–No deadlocks
–Composability
–Optimistic concurrency
–Analogy: GC : Memory allocation ≈ TM : Mutual exclusion

3 Composability
class Bank {
    ConcurrentHashMap accounts;
    …
    void deposit(String name, int amount) {
        synchronized (accounts) {
            int balance = accounts.get(name);  // Get the current balance
            balance = balance + amount;        // Increment it
            accounts.put(name, balance);       // Set the new balance
        }
    }
    …
}
Thread-safe – but no scaling
ConcurrentHashMap (Java 5/JSR 166) does not help
Performance requires redesign from scratch & fine-grain locking

4 Transactional solution
class Bank {
    HashMap accounts;
    …
    void deposit(String name, int amount) {
        atomic {
            int balance = accounts.get(name);  // Get the current balance
            balance = balance + amount;        // Increment it
            accounts.put(name, balance);       // Set the new balance
        }
    }
    …
}
Underlying system provides:
–isolation (thread safety)
–optimistic concurrency
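The point that ConcurrentHashMap alone does not make deposit atomic can be reproduced in standard Java. The following is a minimal sketch, not code from the talk: the class name BankDemo and the methods depositRacy, depositLocked and run are illustrative assumptions. It contrasts an unprotected get-then-put sequence with the coarse lock shown on the Composability slide.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch (not from the slides): a thread-safe map does not
// make a read-modify-write sequence atomic.
class BankDemo {
    final ConcurrentHashMap<String, Integer> accounts = new ConcurrentHashMap<>();

    // Each map call is thread-safe on its own, but another thread can run
    // between get and put, so concurrent deposits may be lost.
    void depositRacy(String name, int amount) {
        int balance = accounts.getOrDefault(name, 0);
        accounts.put(name, balance + amount);
    }

    // One coarse lock around the whole sequence: correct, but every
    // deposit serializes on the same monitor.
    void depositLocked(String name, int amount) {
        synchronized (accounts) {
            int balance = accounts.getOrDefault(name, 0);
            accounts.put(name, balance + amount);
        }
    }

    // Runs `threads` workers that each deposit 1 unit `perThread` times
    // and returns the final balance.
    static int run(boolean locked, int threads, int perThread) {
        BankDemo bank = new BankDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    if (locked) bank.depositLocked("alice", 1);
                    else bank.depositRacy("alice", 1);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return bank.accounts.getOrDefault("alice", 0);
    }

    public static void main(String[] args) {
        System.out.println("locked: " + run(true, 4, 100_000));
        System.out.println("racy:   " + run(false, 4, 100_000)); // may be lower
    }
}
```

The locked variant always reports the full total. The racy variant can report less, because updates that interleave between get and put overwrite each other.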

5 Transactions are Composable Scalability on 16-way 2.2 GHz Xeon System

6 Our System
A Java Software Transactional Memory (STM) System
–Pure software implementation
–Language extensions in Java
–Integrated with JVM & JIT
Novel Features
–Rich transactional language constructs in Java
–Efficient, first class nested transactions
–RISC-like STM API
–Compiler optimizations
–Per-type word and object level conflict detection
–Complete GC support

7 System Overview
[Figure: compilation pipeline]
Transactional Java → Polyglot → Java + STM API → StarJIT → Transactional STIR → Optimized T-STIR → Native Code
Runtime: ORP VM, McRT STM

8 Transactional Java
Java + new language constructs:
Atomic: execute block atomically
    atomic {S}
Retry: block until alternate path possible
    atomic {… retry;…}
Orelse: compose alternate atomic blocks
    atomic {S1} orelse{S2} … orelse{Sn}
Tryatomic: atomic with escape hatch
    tryatomic {S} catch(TxnFailed e) {…}
When: conditionally atomic region
    when (condition) {S}
Builds on prior research
–Concurrent Haskell, CAML, CILK, Java
–HPCS languages: Fortress, Chapel, X10

9 Transactional Java → Java
Transactional Java:
    atomic {
        S;
    }
STM API:
    txnStart[Nested]
    txnCommit[Nested]
    txnAbortNested
    txnUserRetry
    ...
Standard Java + STM API:
    while(true) {
        TxnHandle th = txnStart();
        try {
            S’;
            break;
        } finally {
            if(!txnCommit(th))
                continue;
        }
    }
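The retry loop that an atomic block lowers to can be exercised in plain Java with a stand-in runtime. This is a sketch under stated assumptions: FakeStm is a hypothetical mock whose commit fails a fixed number of times, not the McRT-STM API, and the loop is restructured so that the finally block records the commit result instead of using continue.

```java
import java.util.function.Supplier;

// Sketch of the start / run / commit-or-retry loop from the slide.
class AtomicLoopSketch {
    // Hypothetical stand-in for the STM runtime: commit fails a fixed
    // number of times to simulate conflicts with other transactions.
    static final class FakeStm {
        int conflictsLeft;
        int attempts = 0;

        FakeStm(int conflicts) { conflictsLeft = conflicts; }

        Object txnStart() {
            attempts++;
            return new Object(); // opaque transaction handle
        }

        boolean txnCommit(Object handle) {
            if (conflictsLeft > 0) {
                conflictsLeft--;
                return false;
            }
            return true;
        }
    }

    // Re-executes `body` until a commit succeeds, then returns its result.
    static <T> T atomic(FakeStm stm, Supplier<T> body) {
        while (true) {
            Object th = stm.txnStart();
            boolean committed;
            T result = null;
            try {
                result = body.get();
            } finally {
                // Commit is attempted even if the body throws.
                committed = stm.txnCommit(th);
            }
            if (committed) return result;
            // Commit failed: loop and run the body again.
        }
    }

    public static void main(String[] args) {
        FakeStm stm = new FakeStm(2);
        int value = atomic(stm, () -> 42);
        System.out.println(value + " after " + stm.attempts + " attempts");
    }
}
```

One difference from the slide is deliberate: here an exception thrown by the body always propagates after the commit attempt, whereas a continue inside finally would discard it when the commit fails.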

10 JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
–Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
–Supports both word & object granularity
Native method invocation throws exception inside transaction
–Some intrinsic functions allowed
Runtime STM API
–Wrapper around McRT-STM API
–Polyglot / StarJIT automatically generates calls to API

11 Background: McRT-STM
STM for
–C / C++ (PPoPP 2006)
–Java (PLDI 2006)
Writes:
–strict two-phase locking
–update in place
–undo on abort
Reads:
–versioning
–validation before commit
Granularity per type
–Object-level : small objects
–Word-level : large arrays
Benefits
–Fast memory accesses (no buffering / object wrapping)
–Minimal copying (no cloning for large objects)
–Compatible with existing types & libraries

12 Ensuring Atomicity: Novel Combination
Mode ↓ / Memory ops →      Reads                       Writes
Pessimistic concurrency                                + In place updates
                                                       + Fast commits
                                                       + Fast reads
Optimistic concurrency     + Caching effects
                           + Avoids lock operations
Quantitative results in PPoPP’06

13 McRT-STM: Example
Source:
    …
    atomic {
        B = A + 5;
    }
    …
With STM barriers:
    …
    stmStart();
    temp = stmRd(A);
    stmWr(B, temp + 5);
    stmCommit();
    …
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts

14 Transaction Record
Pointer-sized record per object / word
Two states:
Shared (low bit is 1)
–Read-only / multiple readers
–Value is version number (odd)
Exclusive
–Write-only / single owner
–Value is thread transaction descriptor (4-byte aligned)
Mapping
–Object : slot in object
–Field : hashed index into global record table
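The two record states can be told apart with a single bit test, because version numbers are odd and aligned descriptor addresses are even. A minimal sketch, assuming the record is modelled as a Java long; the class and method names are hypothetical, and advancing a version by 2 is an assumption that simply keeps it odd.

```java
// Sketch of the transaction-record encoding: odd word = shared version,
// even word = address of the owning transaction descriptor.
class TxnRecordWord {
    static final long INITIAL_VERSION = 1L;

    // Shared state: the low bit is 1 and the word is a version number.
    static boolean isShared(long word) {
        return (word & 1L) == 1L;
    }

    // Exclusive state: the word is an aligned descriptor address, so its
    // low bit is 0.
    static boolean isExclusive(long word) {
        return !isShared(word);
    }

    // Assumed release step: advance by 2 so the version stays odd.
    static long nextVersion(long version) {
        if (!isShared(version)) {
            throw new IllegalArgumentException("not a version number");
        }
        return version + 2;
    }

    // A 4-byte-aligned address has its two low bits clear.
    static boolean isAlignedDescriptor(long address) {
        return (address & 3L) == 0L;
    }

    public static void main(String[] args) {
        System.out.println(isShared(INITIAL_VERSION));   // true
        System.out.println(isExclusive(0x1000L));        // true
        System.out.println(nextVersion(INITIAL_VERSION)); // 3
    }
}
```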

15 Transaction Record: Example
Every data item has an associated transaction record
class Foo { int x; int y; }
Object granularity
–Extra transaction record field
–Object fields: vtbl, TxR, x, y
Word granularity
–Object words hash into table of TxRs (TxR 1, TxR 2, TxR 3, …, TxR n)
–Hash is f(obj.hash, offset)
–Object fields: vtbl, hash, x, y
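For word granularity, the slide only states that the hash is some f(obj.hash, offset). The sketch below picks one arbitrary mixing function to show the shape of the lookup; the function, the table size, and the names TxnRecordTable, indexFor and recordFor are all assumptions, not the McRT-STM implementation.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of a global transaction-record table indexed by a hash of
// (object hash, field offset).
class TxnRecordTable {
    private final AtomicLongArray records;
    private final int mask;

    TxnRecordTable(int sizePowerOfTwo) {
        if (sizePowerOfTwo <= 0 || Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        records = new AtomicLongArray(sizePowerOfTwo);
        for (int i = 0; i < sizePowerOfTwo; i++) {
            records.set(i, 1L); // every record starts shared, at version 1
        }
        mask = sizePowerOfTwo - 1;
    }

    // Hypothetical f(obj.hash, offset): any mixing function works as long
    // as the same (hash, offset) pair always maps to the same slot.
    int indexFor(int objectHash, int fieldOffset) {
        int h = objectHash * 31 + fieldOffset;
        h ^= h >>> 16;
        return h & mask;
    }

    // Looks up the record guarding one word of an object.
    long recordFor(Object obj, int fieldOffset) {
        return records.get(indexFor(System.identityHashCode(obj), fieldOffset));
    }
}
```

Distinct words may hash to the same record. That costs only false conflicts, not correctness, since a shared record makes conflict detection more conservative.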

16 Transaction Descriptor
Descriptor per thread
–Info for version validation, lock release, undo on abort, …
Read and Write set : {<Ti, Ni>}
–Ti: transaction record
–Ni: version number
Undo log : {<Ai, Oi, Vi, Ki>}
–Ai: field / element address
–Oi: containing object (or null for static)
–Vi: original value
–Ki: type tag (for garbage collection)
In atomic region
–Read operation appends to read set
–Write operation appends to write set and undo log
–GC enumerates read/write/undo logs
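The logs in the descriptor can be sketched as append-only lists, with abort replaying the undo log in reverse. This is a simplified illustration, not the McRT-STM layout: an int[] plus index stands in for the address and containing object, the type tag and write set are omitted, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-thread descriptor holding a read set and an undo log.
class TxnDescriptorSketch {
    // <Ti, Ni>: which record was read and the version seen at that time.
    static final class ReadEntry {
        final Object record;
        final long version;
        ReadEntry(Object record, long version) {
            this.record = record;
            this.version = version;
        }
    }

    // <Ai, Oi, Vi>: location (container + index) and its original value.
    static final class UndoEntry {
        final int[] container;
        final int index;
        final int oldValue;
        UndoEntry(int[] container, int index, int oldValue) {
            this.container = container;
            this.index = index;
            this.oldValue = oldValue;
        }
    }

    final List<ReadEntry> readSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    void logRead(Object record, long version) {
        readSet.add(new ReadEntry(record, version));
    }

    // In-place update: save the old value first, then overwrite.
    void write(int[] container, int index, int newValue) {
        undoLog.add(new UndoEntry(container, index, container[index]));
        container[index] = newValue;
    }

    // Abort: restore values newest-first so the oldest value wins.
    void undoAll() {
        for (int i = undoLog.size() - 1; i >= 0; i--) {
            UndoEntry e = undoLog.get(i);
            e.container[e.index] = e.oldValue;
        }
        undoLog.clear();
    }
}
```

Replaying in reverse matters when one location is written twice: the last entry restored is the first one logged, which holds the pre-transaction value.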

17 McRT-STM: Example
class Foo { int x; int y; };
Foo bar, foo;
T1:
    atomic {
        t = foo.x;
        bar.x = t;
        t = foo.y;
        bar.y = t;
    }
T2:
    atomic {
        t1 = bar.x;
        t2 = bar.y;
    }
T1 copies foo into bar
T2 reads bar, but should not see intermediate values

18 McRT-STM: Example
T1:
    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x,t);
    t = stmRd(foo.y);
    stmWr(bar.y,t);
    stmCommit();
T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
T1 copies foo into bar
T2 reads bar, but should not see intermediate values

19 McRT-STM: Example
T1:
    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x,t);
    t = stmRd(foo.y);
    stmWr(bar.y,t);
    stmCommit();
T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
[Animation: foo holds x = 9, y = 7 and bar holds x = 0, y = 0; each object header carries a version number. T1 logs its reads of foo, then its writes and undo values as it stores x = 9 and y = 7 into bar. T2 waits on bar, then either aborts or commits.]
T2 should read [0, 0] or should read [9,7]
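The protocol in this example (optimistic, versioned reads validated at commit; pessimistic, in-place writes with undo on abort) can be sketched as a small object-granularity STM in plain Java. This is an illustration under assumptions, not McRT-STM: the names MiniStm, TObject and Txn are hypothetical, an even counter-derived tag stands in for the descriptor pointer, and a transaction that meets an exclusively owned record aborts and retries instead of waiting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class MiniStm {
    // One transaction record per object (object granularity).
    static final class TObject {
        final AtomicLong record = new AtomicLong(1); // odd = shared version
        final AtomicIntegerArray fields;
        TObject(int... init) { fields = new AtomicIntegerArray(init); }
    }

    static final class Conflict extends RuntimeException {}

    private static final AtomicLong NEXT_ID = new AtomicLong(1);

    static final class Txn {
        // Even, non-zero tag standing in for an aligned descriptor pointer.
        final long owner = NEXT_ID.getAndIncrement() << 2;
        final List<TObject> readObjs = new ArrayList<>();
        final List<Long> readVersions = new ArrayList<>();
        final Map<TObject, Long> acquired = new HashMap<>(); // version when locked
        final List<Runnable> undoLog = new ArrayList<>();

        // Optimistic read: remember the version, take no lock.
        int read(TObject o, int field) {
            long rec = o.record.get();
            if (rec != owner) {
                if ((rec & 1) == 0) throw new Conflict(); // owned by another txn
                readObjs.add(o);
                readVersions.add(rec);
            }
            return o.fields.get(field);
        }

        // Pessimistic write: take ownership, log old value, update in place.
        void write(TObject o, int field, int value) {
            long rec = o.record.get();
            if (rec != owner) {
                if ((rec & 1) == 0 || !o.record.compareAndSet(rec, owner)) {
                    throw new Conflict();
                }
                acquired.put(o, rec);
            }
            int old = o.fields.get(field);
            undoLog.add(() -> o.fields.set(field, old));
            o.fields.set(field, value);
        }

        // Validate every read version, then release ownership.
        boolean commit() {
            for (int i = 0; i < readObjs.size(); i++) {
                TObject o = readObjs.get(i);
                long now = o.record.get();
                if (now == owner) now = acquired.get(o);
                if (now != readVersions.get(i)) { abort(); return false; }
            }
            release();
            return true;
        }

        void abort() {
            for (int i = undoLog.size() - 1; i >= 0; i--) undoLog.get(i).run();
            release();
        }

        // Bump versions on commit and on abort, so that a reader which saw
        // a value later rolled back still fails validation.
        private void release() {
            acquired.forEach((o, version) -> o.record.set(version + 2));
            acquired.clear();
            undoLog.clear();
        }
    }

    static void atomic(Consumer<Txn> body) {
        while (true) {
            Txn t = new Txn();
            try {
                body.accept(t);
                if (t.commit()) return;
            } catch (Conflict e) {
                t.abort();
            }
            Thread.yield();
        }
    }
}
```

With foo = new MiniStm.TObject(9, 7) and bar = new MiniStm.TObject(0, 0), T1 becomes MiniStm.atomic(t -> { t.write(bar, 0, t.read(foo, 0)); t.write(bar, 1, t.read(foo, 1)); }). A concurrent reader of bar either hits the owned record and retries, or fails validation because bar's version changed, so a committed reader only ever observes [0, 0] or [9, 7].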

20 Early Results: Overhead breakdown
Time breakdown on single processor
STM read & validation overheads dominate
→ Good optimization targets

21 System Overview
[Figure: compilation pipeline]
Transactional Java → Polyglot → Java + STM API → StarJIT → Transactional STIR → Optimized T-STIR → Native Code
Runtime: ORP VM, McRT STM

22 Leveraging the JIT
StarJIT: High-performance dynamic compiler
–Identifies transactional regions in Java+STM code
–Differentiates top-level and nested transactions
–Inserts read/write barriers in transactional code
–Maps STM API to first class opcodes in STIR
Good compiler representation → greater optimization opportunities

23 Representing Read/Write Barriers
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
With barriers:
    …
    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
        stmWr(&a.x, 0);
        stmWr(&a.z, t3)
    }
Traditional barriers hide redundant locking/logging

24 An STM IR for Optimization
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Redundancies exposed:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

25 Optimized Code
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Optimized:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Fewer & cheaper STM operations
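The redundancy removal on this slide can be imitated with a toy pass over a linear list of operations. This is a sketch, not StarJIT: the string encoding ("openW a", "openR a", "log a.x") is an assumption made for the example, and the pass is valid only when every earlier operation dominates every later one within one transaction, which holds for this example but not for arbitrary control flow.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy redundancy elimination over decomposed STM operations.
class BarrierElim {
    // Recognized ops: "openW <obj>", "openR <obj>", "log <obj.field>".
    // Anything else (stores, branches, braces) is passed through.
    static List<String> optimize(List<String> ops) {
        Set<String> openRead = new HashSet<>();
        Set<String> openWrite = new HashSet<>();
        Set<String> logged = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            String[] parts = op.split(" ", 2);
            boolean keep;
            switch (parts[0]) {
                case "openW": // already open for write: drop the repeat
                    keep = openWrite.add(parts[1]);
                    break;
                case "openR": // open-for-write subsumes open-for-read
                    keep = !openWrite.contains(parts[1]) && openRead.add(parts[1]);
                    break;
                case "log":   // old value of this field is already saved
                    keep = logged.add(parts[1]);
                    break;
                default:
                    keep = true;
            }
            if (keep) out.add(op);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ir = List.of(
            "openW a", "log a.x", "a.x = t1",
            "openW a", "log a.y", "a.y = t2",
            "openR a", "if (a.z == 0) {",
            "openW a", "log a.x", "a.x = 0",
            "openW a", "log a.z", "a.z = t3", "}");
        optimize(ir).forEach(System.out::println);
    }
}
```

On the slide's example the pass keeps one open-for-write and one log per field, which matches the optimized listing: the second log of a.x disappears because the pre-transaction value of a.x is already in the undo log.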

26 Compiler Optimizations for Transactions
Standard optimizations
–CSE, Dead-code-elimination, …
–Careful IR representation exposes opportunities and enables optimizations with almost no modifications
–Subtle in presence of nesting
STM-specific optimizations
–Immutable field / class detection & barrier removal (vtable/String)
–Transaction-local object detection & barrier removal
–Partial inlining of STM fast paths to eliminate call overhead

27 Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
–L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)
Workloads
–Hashtable, Binary tree, OO7 (OODBMS)
–Mix of gets, in-place updates, insertions, and removals
Object-level conflict detection by default
–Word / mixed where beneficial

28 Effect of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
–< 40% over no concurrency control
–< 30% over synchronization

29 Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
–Thread-unsafe w/o Concurrency Control
Synchronized
–Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
–Multi-year effort: JSR 166 -> Java 5
–Optimized for concurrent gets (no locking)
–For updates, divides bucket array into 16 segments (size / locking)
Atomic
–Transactional version via “AtomicMap” wrapper
Atomic Prime
–Transactional version with minor hand optimization
–Tracks size per segment à la ConcurrentHashMap
Execution
–10,000,000 operations / 200,000 elements
–Defaults: load factor, threshold, concurrency level

30 Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales

31 Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales

32 20% Inserts and Removes
Atomic conflicts on entire bucket array
–The array is an object

33 20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap

34 20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size / segment – lowering bottleneck
No degradation, modest performance gain
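The idea behind tracking size per segment can be sketched outside an STM. Standard Java has no atomic blocks, so the sketch below uses an AtomicIntegerArray as a stand-in for transactional fields; the class SegmentedSize, its method names and the segment-selection rule are assumptions, not the hand-optimized map from the talk.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: one size counter per segment instead of a single size field,
// so updates to different segments do not touch the same word.
class SegmentedSize {
    private final AtomicIntegerArray counts;

    SegmentedSize(int segments) {
        counts = new AtomicIntegerArray(segments); // all counters start at 0
    }

    // Hypothetical segment choice: non-negative hash modulo segment count.
    private int segmentFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % counts.length();
    }

    void onInsert(Object key) {
        counts.incrementAndGet(segmentFor(key));
    }

    void onRemove(Object key) {
        counts.decrementAndGet(segmentFor(key));
    }

    // Total size is the sum over segments; only this call reads them all.
    int size() {
        int total = 0;
        for (int i = 0; i < counts.length(); i++) {
            total += counts.get(i);
        }
        return total;
    }
}
```

In a transactional map the same layout means two inserts conflict on the size only when their keys fall in the same segment, rather than on every insert.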

35 20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
–word-level for arrays
–object-level for non-arrays

36 Scalability: java.util.TreeMap
100% Gets
80% Gets
Results similar to HashMap

37 Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization
Operations & traversal over synthetic database

38 Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
–% over thread-unsafe
–% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed-level conflict detection provides best of both

39 Research challenges
Performance
–Compiler optimizations
–Hardware support
–Dealing with contention
Semantics
–I/O & communication
–Strong atomicity
–Nested parallelism
–Open transactions
Debugging & performance analysis tools
System integration

40 Conclusions
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support
