Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon, Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

2 Motivation
Multi-core architectures are mainstream
–Software concurrency needed for scalability
–Concurrent programming is hard
–Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
–Hard to use
–Must be fine-grain for scalability
–Deadlocks
–Not easily composable
New solution: Transactional Memory (TM)
–Simpler programming model: atomicity, consistency, isolation
–No deadlocks
–Composability
–Optimistic concurrency
–Analogy: GC : memory allocation ≈ TM : mutual exclusion

3 Composability
    class Bank {
        ConcurrentHashMap accounts;
        …
        void deposit(String name, int amount) {
            synchronized (accounts) {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
        …
    }
Thread-safe, but no scaling
ConcurrentHashMap (Java 5 / JSR 166) does not help: get followed by put is still a compound operation that must be atomic as a whole
Performance requires redesign from scratch & fine-grain locking

4 Transactional solution
    class Bank {
        HashMap accounts;
        …
        void deposit(String name, int amount) {
            atomic {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
        …
    }
Underlying system provides:
–isolation (thread safety)
–optimistic concurrency

5 Transactions are Composable
[Chart: scalability on a 16-way 2.2 GHz Xeon system]

6 Our System
A Java Software Transactional Memory (STM) system
–Pure software implementation
–Language extensions in Java
–Integrated with JVM & JIT
Novel features
–Rich transactional language constructs in Java
–Efficient, first-class nested transactions
–RISC-like STM API
–Compiler optimizations
–Per-type word- and object-level conflict detection
–Complete GC support

7 System Overview
[Diagram] Compilation pipeline: Transactional Java → Polyglot → Java + STM API → StarJIT → Transactional STIR → Optimized T-STIR → Native Code
[Diagram] Runtime components: ORP VM, McRT STM

8 Transactional Java
Java + new language constructs:
–Atomic: execute block atomically
    atomic {S}
–Retry: block until alternate path possible
    atomic {… retry; …}
–Orelse: compose alternate atomic blocks
    atomic {S1} orelse {S2} … orelse {Sn}
–Tryatomic: atomic with escape hatch
    tryatomic {S} catch (TxnFailed e) {…}
–When: conditionally atomic region
    when (condition) {S}
Builds on prior research
–Concurrent Haskell, CAML, CILK, Java
–HPCS languages: Fortress, Chapel, X10

9 Transactional Java → Java
Transactional Java:
    atomic {
        S;
    }
STM API:
–txnStart[Nested]
–txnCommit[Nested]
–txnAbortNested
–txnUserRetry
–...
Standard Java + STM API:
    while (true) {
        TxnHandle th = txnStart();
        try {
            S’;
            break;
        } finally {
            if (!txnCommit(th))
                continue;   // commit failed: re-execute the transaction
        }
    }

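The start / run / commit / re-execute loop above can be sketched in plain Java. This is a minimal illustration under stated assumptions: `ToyStm`, its `atomic` helper, and the simulated conflict counter are hypothetical and are not the McRT-STM or Polyglot-generated API. Unlike the slide's version, which uses `continue` inside `finally`, this variant lets an exception thrown by the body propagate after the commit attempt.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the STM runtime; names mirror the slide,
// but this is not the McRT-STM API.
final class ToyStm {
    private int conflictsLeft;   // commits that will still report a conflict
    int attempts = 0;            // number of transactions started so far

    ToyStm(int simulatedConflicts) {
        this.conflictsLeft = simulatedConflicts;
    }

    int txnStart() {
        return ++attempts;       // toy transaction handle
    }

    // Returns false to signal "conflict detected, re-execute the body".
    boolean txnCommit(int handle) {
        if (conflictsLeft > 0) {
            conflictsLeft--;
            return false;
        }
        return true;
    }

    // Same shape as the translated loop: start, run body, commit, retry on failure.
    <T> T atomic(Supplier<T> body) {
        while (true) {
            int th = txnStart();
            T result;
            boolean committed = false;
            try {
                result = body.get();
            } finally {
                committed = txnCommit(th);
            }
            if (committed) {
                return result;
            }
        }
    }
}
```

With two simulated conflicts, the body runs three times before the result is returned, which is why code inside a transaction must be safe to re-execute.
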
10 JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
–Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
–Supports both word & object granularity
Native method invocation throws exception inside transaction
–Some intrinsic functions allowed
Runtime STM API
–Wrapper around McRT-STM API
–Polyglot / StarJIT automatically generates calls to API

11 Background: McRT-STM
STM for
–C / C++ (PPoPP 2006)
–Java (PLDI 2006)
Writes:
–strict two-phase locking
–update in place
–undo on abort
Reads:
–versioning
–validation before commit
Granularity per type
–Object-level: small objects
–Word-level: large arrays
Benefits
–Fast memory accesses (no buffering / object wrapping)
–Minimal copying (no cloning for large objects)
–Compatible with existing types & libraries

12 Ensuring Atomicity: Novel Combination
Mode per memory operation: reads use optimistic concurrency, writes use pessimistic concurrency
Optimistic reads
+ Caching effects
+ Avoids lock operations
Pessimistic writes
+ In-place updates
+ Fast commits
+ Fast reads
Quantitative results in PPoPP’06

13 McRT-STM: Example
Source:
    …
    atomic {
        B = A + 5;
    }
    …
With STM barriers:
    …
    stmStart();
    temp = stmRd(A);
    stmWr(B, temp + 5);
    stmCommit();
    …
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts

14 Transaction Record
Pointer-sized record per object / word
Two states:
Shared (low bit is 1)
–Read-only / multiple readers
–Value is version number (odd)
Exclusive
–Write-only / single owner
–Value is thread transaction descriptor (4-byte aligned)
Mapping
–Object: slot in object
–Field: hashed index into global record table

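The two record states can be told apart with a single bit test, since version numbers are odd and descriptor addresses are 4-byte aligned. A minimal sketch, assuming the record is modeled as a Java `long`; the class `TxRecord`, its method names, and the "advance the version by 2" rule are illustrative assumptions, not the McRT-STM implementation.

```java
// Hypothetical encoding helper for a transaction record word.
final class TxRecord {
    private TxRecord() {}

    // Shared state: low bit is 1, so the value is an odd version number.
    static boolean isShared(long rec) {
        return (rec & 1L) == 1L;
    }

    // Exclusive state: the value is an aligned descriptor address, so the low bit is 0.
    static boolean isExclusive(long rec) {
        return !isShared(rec);
    }

    // Assumption: versions advance by 2 so that they always stay odd.
    static long nextVersion(long version) {
        if (!isShared(version)) {
            throw new IllegalArgumentException("not a version number: " + version);
        }
        return version + 2;
    }

    // Store the owner's descriptor address; 4-byte alignment keeps the low bits clear.
    static long lockFor(long descriptorAddress) {
        if ((descriptorAddress & 3L) != 0) {
            throw new IllegalArgumentException("descriptor must be 4-byte aligned");
        }
        return descriptorAddress;
    }
}
```
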
15 Transaction Record: Example
Every data item has an associated transaction record
Object granularity: extra transaction record field in each object
    class Foo { int x; int y; }
    [Diagram] object contents: vtbl, TxR, x, y
Word granularity: object words hash into a table of TxRs (TxR 1, TxR 2, TxR 3, …, TxR n)
    class Foo { int x; int y; }
    [Diagram] object contents: vtbl, hash, x, y
    Hash is f(obj.hash, offset)

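For word granularity, the slide only states that the table index is some function f(obj.hash, offset). The sketch below shows one possible shape of such a mapping; the mixing function, the power-of-two table size, and the class name `TxRecordTable` are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical global table of transaction records for word-level conflict detection.
final class TxRecordTable {
    private final long[] records;   // one transaction record word per slot

    TxRecordTable(int size) {
        if (size <= 0 || (size & (size - 1)) != 0) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        records = new long[size];
        Arrays.fill(records, 1L);   // every slot starts shared, at version 1
    }

    // Assumed f(obj.hash, offset): combine both inputs, mix the high bits in,
    // then mask down to a valid slot index.
    int indexFor(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        h ^= (h >>> 16);
        return h & (records.length - 1);
    }

    long recordFor(int objHash, int fieldOffset) {
        return records[indexFor(objHash, fieldOffset)];
    }
}
```

Distinct words can still hash to the same slot, which produces false conflicts but never missed ones.
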
16 Transaction Descriptor
Descriptor per thread
–Info for version validation, lock release, undo on abort, …
Read and write set: { <Ti, Ni> }
–Ti: transaction record
–Ni: version number
Undo log: { <Ai, Oi, Vi, Ki> }
–Ai: field / element address
–Oi: containing object (or null for static)
–Vi: original value
–Ki: type tag (for garbage collection)
In atomic region
–Read operation appends to read set
–Write operation appends to write set and undo log
–GC enumerates read/write/undo logs

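The undo log is what makes update-in-place safe: each write saves the original value first, and an abort replays the log in reverse. A minimal single-threaded sketch, assuming an `int[]` plus index as a stand-in for the (address, containing object) pair; `ToyDescriptor` is hypothetical and omits the read set, write set, locking, and type tags.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-thread descriptor that keeps only an undo log.
final class ToyDescriptor {
    private static final class UndoEntry {
        final int[] container;   // stands in for the containing object
        final int index;         // stands in for the field / element address
        final int oldValue;      // original value to restore on abort

        UndoEntry(int[] container, int index, int oldValue) {
            this.container = container;
            this.index = index;
            this.oldValue = oldValue;
        }
    }

    private final Deque<UndoEntry> undoLog = new ArrayDeque<>();

    // Write barrier: log the original value, then update in place.
    void write(int[] container, int index, int newValue) {
        undoLog.push(new UndoEntry(container, index, container[index]));
        container[index] = newValue;
    }

    int undoLogSize() {
        return undoLog.size();
    }

    // Abort: restore values newest-first so the oldest saved value wins.
    void abort() {
        while (!undoLog.isEmpty()) {
            UndoEntry e = undoLog.pop();
            e.container[e.index] = e.oldValue;
        }
    }

    // Commit: the in-place values are already final, so just drop the log.
    void commit() {
        undoLog.clear();
    }
}
```
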
17 McRT-STM: Example
    class Foo { int x; int y; };
    Foo bar, foo;
T1:
    atomic {
        t = foo.x;
        bar.x = t;
        t = foo.y;
        bar.y = t;
    }
T2:
    atomic {
        t1 = bar.x;
        t2 = bar.y;
    }
T1 copies foo into bar
T2 reads bar, but should not see intermediate values

18 McRT-STM: Example
T1:
    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x, t);
    t = stmRd(foo.y);
    stmWr(bar.y, t);
    stmCommit();
T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
T1 copies foo into bar
T2 reads bar, but should not see intermediate values

19 McRT-STM: Example
T1:
    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x, t);
    t = stmRd(foo.y);
    stmWr(bar.y, t);
    stmCommit();
T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
[Animation: foo = {x = 0, y = 0} with version 5; bar = {x = 9, y = 7} with version 3. T1 records its reads, locks bar for writing, and saves x = 9 and y = 7 in its undo log; T2 waits. Steps shown: Reads, Writes, Undo, T2 waits, Abort, Commit.]
T2 should read [0, 0] or should read [9, 7]

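The reader side of this example, versioned reads followed by validation before commit, can be sketched as a single-threaded simulation. `ToyObject` and `ToyReadTxn` are hypothetical; the writer is simulated by direct field updates plus a version bump, and the exclusive locking that makes T2 wait in the real system is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical object with an object-level transaction record.
final class ToyObject {
    long record;   // odd value = version number (shared state)
    int x;
    int y;

    ToyObject(long version, int x, int y) {
        this.record = version;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical read-only transaction: logs the version seen at first read,
// then checks before commit that no logged version has changed.
final class ToyReadTxn {
    private final Map<ToyObject, Long> readSet = new HashMap<>();

    private void logRead(ToyObject o) {
        readSet.putIfAbsent(o, o.record);
    }

    int readX(ToyObject o) {
        logRead(o);
        return o.x;
    }

    int readY(ToyObject o) {
        logRead(o);
        return o.y;
    }

    // Validation: every object read must still carry the version that was logged.
    boolean validate() {
        for (Map.Entry<ToyObject, Long> e : readSet.entrySet()) {
            if (e.getKey().record != e.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

If a writer updates bar between T2's two reads, T2 may observe the mixed pair [9, 0], but validation fails and the transaction is re-executed, so the mixed pair is never committed.
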
20 Early Results: Overhead breakdown
[Chart: time breakdown on a single processor]
STM read & validation overheads dominate
–Good optimization targets

21 System Overview
[Diagram] Compilation pipeline: Transactional Java → Polyglot → Java + STM API → StarJIT → Transactional STIR → Optimized T-STIR → Native Code
[Diagram] Runtime components: ORP VM, McRT STM

22 Leveraging the JIT
StarJIT: high-performance dynamic compiler
–Identifies transactional regions in Java+STM code
–Differentiates top-level and nested transactions
–Inserts read/write barriers in transactional code
–Maps STM API to first-class opcodes in STIR
Good compiler representation → greater optimization opportunities

23 Representing Read/Write Barriers
Source:
    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
        …
    }
With traditional barriers:
    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if (stmRd(&a.z) == 0) {
        stmWr(&a.x, 0)
        stmWr(&a.z, t3)
    }
Traditional barriers hide redundant locking/logging

24 An STM IR for Optimization
Source:
    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Redundancies exposed:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if (a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

25 Optimized Code
Source:
    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Optimized:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if (a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Fewer & cheaper STM operations

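The effect of this optimization can be reproduced with a small redundancy-elimination pass over a flattened operation list. This is a sketch under simplifying assumptions: operations are plain strings, the code is treated as straight-line (valid here because the earlier barriers dominate the ones inside the `if`), and `BarrierElim` is a hypothetical helper rather than StarJIT's actual pass, which the slides describe as standard optimizations such as CSE applied to the STM IR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pass: drops STM operations whose effect is already guaranteed
// by an earlier operation in the same transaction.
final class BarrierElim {
    private BarrierElim() {}

    // Each op is "<kind> <operand>", e.g. "openForWrite a" or "log a.x".
    static List<String> eliminate(List<String> ops) {
        Set<String> openForWrite = new HashSet<>();
        Set<String> openForRead = new HashSet<>();
        Set<String> logged = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            String[] parts = op.split(" ", 2);
            String kind = parts[0];
            String arg = parts.length > 1 ? parts[1] : "";
            boolean redundant;
            switch (kind) {
                case "openForWrite":
                    redundant = !openForWrite.add(arg);   // already opened for write
                    break;
                case "openForRead":
                    // an object opened for write is also safe to read
                    redundant = openForWrite.contains(arg) || !openForRead.add(arg);
                    break;
                case "log":
                    redundant = !logged.add(arg);         // original value already saved
                    break;
                default:
                    redundant = false;                    // loads and stores are kept
            }
            if (!redundant) {
                out.add(op);
            }
        }
        return out;
    }
}
```

Feeding it the fourteen operations of the unoptimized IR leaves nine: one `openForWrite`, three log operations, and the original loads and stores.
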
26 Compiler Optimizations for Transactions
Standard optimizations
–CSE, dead-code elimination, …
–Careful IR representation exposes opportunities and enables optimizations with almost no modifications
–Subtle in presence of nesting
STM-specific optimizations
–Immutable field / class detection & barrier removal (vtable / String)
–Transaction-local object detection & barrier removal
–Partial inlining of STM fast paths to eliminate call overhead

27 Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
–L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
–Hashtable, Binary tree, OO7 (OODBMS)
–Mix of gets, in-place updates, insertions, and removals
Object-level conflict detection by default
–Word / mixed where beneficial

28 Effectiveness of Compiler Optimizations
[Chart: 1P overheads over thread-unsafe baseline]
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
–< 40% over no concurrency control
–< 30% over synchronization

29 Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
–Thread-unsafe w/o concurrency control
Synchronized
–Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
–Multi-year effort: JSR 166 → Java 5
–Optimized for concurrent gets (no locking)
–For updates, divides bucket array into 16 segments (size / locking)
Atomic
–Transactional version via “AtomicMap” wrapper
Atomic Prime
–Transactional version with minor hand optimization
–Tracks size per segment à la ConcurrentHashMap
Execution
–10,000,000 operations / 200,000 elements
–Defaults: load factor, threshold, concurrency level

30 Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales

31 Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales

32 20% Inserts and Removes
Atomic conflicts on entire bucket array
–The array is an object

33 20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap

34 20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, reducing the bottleneck
No degradation, modest performance gain

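The per-segment size idea can be sketched as a striped counter: each insert or remove touches only the counter of the segment its key hashes to, so transactions on different segments no longer conflict on one shared size field. `SegmentedSize` and its method names are hypothetical, and the class is not thread-safe on its own; in the setting described here it would be accessed inside atomic blocks with word-level conflict detection on the array.

```java
// Hypothetical striped size counter, one slot per segment.
final class SegmentedSize {
    private final int[] counts;

    SegmentedSize(int segments) {
        if (segments <= 0) {
            throw new IllegalArgumentException("segments must be positive");
        }
        counts = new int[segments];
    }

    // floorMod keeps the segment index non-negative for negative hash codes.
    int segmentFor(int keyHash) {
        return Math.floorMod(keyHash, counts.length);
    }

    void added(int keyHash) {
        counts[segmentFor(keyHash)]++;
    }

    void removed(int keyHash) {
        counts[segmentFor(keyHash)]--;
    }

    // The total size is the sum over all segments; only this call reads every slot.
    int size() {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }
}
```

The trade-off is that `size()` now reads every segment, so it conflicts with any concurrent insert or remove, while the far more frequent updates conflict only within a segment.
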
35 20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
–word-level for arrays
–object-level for non-arrays

36 Scalability: java.util.TreeMap
[Charts: 100% Gets, 80% Gets]
Results similar to HashMap

37 Scalability: OO7 – 80% Reads
Operations & traversal over synthetic database
“Coarse” atomic is competitive with medium-grain synchronization

38 Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
–< 40% over thread-unsafe (1P)
–< 30% over synchronized (1P)
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed-level conflict detection provides the best of both

39 Research challenges
Performance
–Compiler optimizations
–Hardware support
–Dealing with contention
Semantics
–I/O & communication
–Strong atomicity
–Nested parallelism
–Open transactions
Debugging & performance analysis tools
System integration

40 Conclusions
Rich transactional language constructs in Java
Efficient, first-class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection
Complete GC support