Architectural Features of Transactional Memory Designs for an Operating System
Chris Rossbach, Hany Ramadan, Don Porter
Advanced Computer Architecture, Fall
Prof. Burger
Motivation
What would a realistic HTM system actually support? (primitives/design choices)
Current Transactional Memory proposals make architectural design choices with inadequate information:
–shared counter, linked list benchmarks
–focus on user mode: avoids OS issues
HTM + OS: are you nuts?
Large concurrent program with complex data access patterns
Complex code: simplify programming model
Many apps spend a lot of time in kernel
Diverse synchronization primitives
–spinlocks, semaphores, per-CPU variables, RCU, seqlocks, completions, mutexes
Our HTM System
Basic primitives:
–xbegin, xend
OS-specific primitives:
–xpush, xpop
–stack management: interrupts on x86 re-use the stack
Configurable hardware parameters
–Conflict detection granularity
–Commit & abort penalties
–Overflow costs
Configurable contention management
–Conflict resolution policies: which tx restarts?
–Backoff policies: how long to wait before restart
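The backoff knob on this slide can be pictured as a function of how many times a transaction has already restarted. The following is a minimal Python sketch, not the simulator's implementation: the function name `backoff_cycles`, the policy names, the 100-cycle base delay, and the cap are all illustrative assumptions, and real policies often add randomization that is omitted here.

```python
def backoff_cycles(policy, restarts, base=100, cap=100_000):
    """Cycles a restarted transaction waits before its next attempt.

    policy   -- "none", "linear", or "exponential" (hypothetical names)
    restarts -- number of times this transaction has already restarted
    base     -- assumed base delay in cycles (not taken from the slides)
    cap      -- assumed upper bound on the delay
    """
    if restarts < 0:
        raise ValueError("restarts must be non-negative")
    if policy == "none":
        delay = 0
    elif policy == "linear":
        # Delay grows by a fixed step per restart.
        delay = base * restarts
    elif policy == "exponential":
        # Delay doubles with each restart after the first.
        delay = base * 2 ** (restarts - 1) if restarts > 0 else 0
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min(delay, cap)
```

Under these assumed constants, a transaction on its fifth restart waits 500 cycles with linear backoff and 1,600 cycles with exponential backoff, which gives a feel for how quickly the two policies diverge.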
An Issue Unique to an OS: Using transactions in interrupt handlers
Memory addresses: 0x10 0x20 0x30 0x40
system_call() { XBEGIN modify 0x10 XEND }
intr_handler() { XPUSH XBEGIN modify 0x30 XEND XPOP }
An interrupt arrives while TX #1 { 0x10 } is active. Options:
–No tx in interrupts
–Interrupts abort active tx
–Nest the transactions: TX #1 { 0x10, 0x30 }
–Multiple active transactions: TX #1 { 0x10 }, TX #2 { 0x30 }
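The "multiple active transactions" option can be illustrated with a toy model of one CPU's transactional state. This is a sketch under stated assumptions, not the hardware design: it assumes xpush suspends whatever transaction is running and xpop resumes it, and the class `TxCpu` and its methods are hypothetical names that only track write sets.

```python
class TxCpu:
    """Toy model of one CPU's transactional state (not real hardware)."""

    def __init__(self):
        self.active = None    # write set of the running transaction, or None
        self.saved = []       # transactions suspended by xpush
        self.committed = []   # write sets committed so far, in commit order

    def xbegin(self):
        if self.active is not None:
            raise RuntimeError("transaction already active; xpush first")
        self.active = set()

    def write(self, addr):
        # Track the address only when running inside a transaction.
        if self.active is not None:
            self.active.add(addr)

    def xend(self):
        if self.active is None:
            raise RuntimeError("xend without xbegin")
        self.committed.append(frozenset(self.active))
        self.active = None

    def xpush(self):
        # Suspend whatever is running (possibly nothing) and save it.
        self.saved.append(self.active)
        self.active = None

    def xpop(self):
        if self.active is not None:
            raise RuntimeError("xpop with a transaction still active")
        self.active = self.saved.pop()


def interrupted_system_call():
    cpu = TxCpu()
    cpu.xbegin()      # system_call(): XBEGIN
    cpu.write(0x10)   #   modify 0x10
    cpu.xpush()       # interrupt arrives: intr_handler() XPUSH
    cpu.xbegin()      #   XBEGIN
    cpu.write(0x30)   #   modify 0x30
    cpu.xend()        #   XEND (TX #2 commits on its own)
    cpu.xpop()        #   XPOP (TX #1 resumes)
    cpu.xend()        # system_call(): XEND
    return cpu.committed
```

In this model the handler's transaction commits with write set { 0x30 } before the interrupted one commits with { 0x10 }, so the two stay independent rather than merging into a single nested transaction.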
Converting Linux to TxLinux
TxLinux based on kernel
Converted “core” primitives to use transactions
–spin-locks, RCU primitives, r/w locks
–critical sections become transactions
Converted high-traffic subsystems
–memory allocators, FS directory cache, address-to-page mapping data structures, memory mapping of files into address spaces, IP routing, and socket locking
Modified interrupt-handling code to use the OS-specific primitives in our HTM model (xpush, xpop)
HTM Implementation
Implemented HTM model as x86 extensions
Simulation environment
–Simics machine simulator
–transactional L1 cache (variable: 4k-32k)
–4MB L2; 1GB RAM
–1 cycle/instruction, 16 cycles/L1 miss, 200 cycles/L2 miss
–4 & 8 processors
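The timing parameters on this slide suggest a simple back-of-the-envelope cost model. The sketch below assumes the miss penalties are charged per miss event and simply added to the base instruction cost, which the slide does not state; the function name `estimated_cycles` and the example counts are hypothetical.

```python
def estimated_cycles(instructions, l1_misses, l2_misses,
                     cycles_per_instruction=1,
                     l1_miss_penalty=16,
                     l2_miss_penalty=200):
    """Rough cycle count: base instruction cost plus additive miss penalties.

    Default costs are the slide's simulation parameters; treating them
    as purely additive is an assumption of this sketch.
    """
    if min(instructions, l1_misses, l2_misses) < 0:
        raise ValueError("counts must be non-negative")
    return (instructions * cycles_per_instruction
            + l1_misses * l1_miss_penalty
            + l2_misses * l2_miss_penalty)
```

For a made-up run of 1,000,000 instructions with 10,000 L1 misses and 500 L2 misses, the model gives 1,000,000 + 160,000 + 100,000 = 1,260,000 cycles.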
Experimental Setup
Benchmarks
–micro: kernalloc, Counter, directory cache “punisher”
–macro: pmake, netcat, MAB, configure, find
Measurements
–Execution time
–Transaction statistics: created/restarted/overflowed, working sets, footprint
–Cache statistics (e.g. miss rate)
Variables
–Contention management (conflict/backoff policies)
–Transactional cache size
–Commit, abort, overflow penalties
–Conflict granularity (byte vs. word vs. cache line)
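The conflict granularity variable decides when two accesses count as touching "the same" data. A minimal sketch follows, assuming a 4-byte word and a 64-byte cache line (neither size is given on the slide); `blocks_touched` and `conflicts` are hypothetical helper names.

```python
# Unit sizes in bytes; the word and cache-line sizes are assumptions.
GRANULARITY_BYTES = {"byte": 1, "word": 4, "cache_line": 64}


def blocks_touched(addr, size, granularity):
    """Set of granularity-sized blocks covered by an access."""
    if size <= 0:
        raise ValueError("size must be positive")
    unit = GRANULARITY_BYTES[granularity]
    first = addr // unit
    last = (addr + size - 1) // unit
    return set(range(first, last + 1))


def conflicts(access_a, access_b, granularity):
    """True if two accesses from different transactions conflict.

    Each access is a tuple (addr, size_in_bytes, is_write).
    """
    addr_a, size_a, write_a = access_a
    addr_b, size_b, write_b = access_b
    if not (write_a or write_b):
        return False  # two reads never conflict
    overlap = (blocks_touched(addr_a, size_a, granularity)
               & blocks_touched(addr_b, size_b, granularity))
    return bool(overlap)
```

With these assumed sizes, 4-byte writes to 0x10 and 0x30 do not conflict at byte or word granularity but do conflict at cache-line granularity, because both addresses fall in the same 64-byte line. Coarser granularity is cheaper to track but reports such false conflicts.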
TxLinux Results (4 processors)
Performance change minimal, lots of transactions
Unique transaction restarts were < 0.07%
Data cache miss rates do not change appreciably
Transactions created: 105,972 | 425,888 | 475,860 | 1,810,602 | 1,408,610 | 243,934
Contention Management Matters! linear back off policy, 4 processors
Conclusions
TxLinux is cooler than, and has comparable performance to, Linux
Cache line granularity is good enough
16KB transactional cache covers the vast majority of transactions
Best contention management policy is workload dependent
Exponential backoff is too conservative
Backup Slides
Contention Management Restart Rates
Conflict Granularity & Backoff Policy
Stack Management Issue
Treating the stack as a shared resource
–Checkpoint
–Partition
Tx’l Memory Allocator Investigation
Examine Tx complexity/performance trade-off
The “slab” is the default kernel memory allocator
–Highly tuned for performance
–Avoids contention/locks, uses per-CPU structures
–About 3,880 lines of code
The “slob” is a drop-in replacement
–Designed for minimal bookkeeping memory overhead
–Uses two coarse-grained locks (386 lines)
The “slob-opt” is “slob” with modifications
–Removed “obvious” transaction bottlenecks
–Only a couple of dozen lines of code changed
Tx’l Memory Allocator Results (4 proc)
Benchmarks: Kernalloc, Pmake, MAB, configure, Find
Metrics: execution time (in seconds), unique restarts
slab: % 0.04% 0.07% 0.04% 0%
slob: % 19.72% 5.93% 0.71%
slob-optimized: % 0.45% 8.48% 1.42% 0.12%
Transactional Memory Issues
Hardware vs. Software
–Different interfaces
–strong (HW) vs. weak (SW) atomicity
Will transactions make programming easier?
Transactions for blocking primitives?
Using transactions for security?