Dynamic Performance Tuning of Word-Based Software Transactional Memory

Presentation transcript:

Dynamic Performance Tuning of Word-Based Software Transactional Memory Pascal Felber University of Neuchatel Pascal.Felber@unine.ch Christof Fetzer, Torvald Riegel Dresden University of Technology PPoPP 2008

STM in a nutshell Multicores and MPs will be everywhere The “free ride” is over Concurrent programming necessary for speedup Hard to get right, impact on many developers STM can simplify concurrent programming Sequence of instructions executed atomically BEGIN … LOAD / STORE … COMMIT Optimistic execution, abort and retry on conflict A “universal” synchronization construct Transactions are composable 7/23/2019 Dynamic Performance Tuning of Word-Based Software Transactional Memory — P. Felber

Agenda
Motivations
TINYSTM: a lightweight STM design
Dynamic tuning in TINYSTM
Experimental evaluation
Conclusions

Motivations
Performance of TM depends on many factors:
  TM design choices, e.g., word-based vs. object-based, visible vs. invisible reads, lock-based vs. non-blocking, write-through vs. write-back, encounter-time vs. commit-time locking, etc.
  TM configuration parameters, e.g., number of locks and hash function, CM strategy and parameters, etc.
…which in turn depends on runtime factors: CPU type, size of cache lines, etc.

Motivations
Most importantly, it depends on the workload, e.g., ratio of update to read-only transactions, number of locations read or written, contention on shared memory locations, etc.
There is no “one-size-fits-all” STM.
We could benefit from dynamic tuning mechanisms.

TINYSTM: a lightweight design
Word-based, lock-based STM implementation
  Written in portable C, 32/64-bit
  Small code base (<1000 LOC), GPL
  Memory management operations
Time-based algorithm like LSA [DISC06] & TL2 [DISC06]
  Versioned locks used to build consistent snapshot
“Classical” word-based STM design
  Per-stripe locks, encounter-time locking (ETL)
  Write-through and write-back versions
Used as underlying STM in TANGER [TRANSACT07]
Shared clock with roll-over
Encounter-time locking: First, our empirical observations appear to indicate that detecting conflicts early often increases the transaction throughput because transactions do not perform useless work. Commit-time locking may help avoid some read-write conflicts, but in general conflicts discovered at commit time cannot be solved without aborting at least one transaction. Second, encounter-time locking allows us to efficiently handle reads-after-writes without requiring expensive or complex mechanisms. This feature is especially valuable when write sets have non-negligible size.

Basic data structures
Figure: a tx descriptor (timestamp, read-set, write-set), the shared clock, and a lock array of L entries (up to index L-1), each with a lock bit and either an address or a version. Memory maps to locks one-to-many in units of sizeof(word): locks[(addr >> #shifts) % L]. Example addresses shown: &p->next, &n->val.
Example transaction: stm_start(tx); … n = stm_load(tx, &p->next); v = stm_load(tx, &n->val); stm_store(tx, &p->next, n); stm_commit(tx);
LOAD(addr) by transaction tx
  Find lock for addr and read lock, value, lock
  If lock is owned by tx, return latest value
  If lock is free and version ≤ tx.ts, return latest value
  If lock is free and version > tx.ts, can try to “extend” snapshot (requires validation)
  Otherwise, abort (or defer to CM)
STORE(addr) by transaction tx
  Find lock for addr and read lock
  If lock is owned by tx, write new value
  If lock is free, try to acquire it atomically (CAS)
  Otherwise, abort (or defer to CM)
COMMIT by transaction tx
  Acquire unique timestamp from clock
  If tx is not read-only and time has advanced, validate read set
  Write values and release locks
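
The address-to-lock mapping and the lock checks in LOAD, STORE and COMMIT can be sketched in C11 as follows. This is a minimal illustration, not TinySTM's code: the lock-word encoding (bit 0 as lock bit, remaining bits as version or owner id), the array size, and all function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: bit 0 is the lock bit; the other bits hold the
 * version when the lock is free, or an owner id when it is held. */
#define LOCK_BIT ((uintptr_t)1)

enum { NUM_LOCKS = 1 << 8 };               /* L; illustrative size only */
static _Atomic uintptr_t locks[NUM_LOCKS]; /* zero-initialized: free, version 0 */

/* locks[(addr >> #shifts) % L] from the slide. */
static inline size_t lock_index(const void *addr, unsigned shifts)
{
    return (size_t)(((uintptr_t)addr >> shifts) % NUM_LOCKS);
}

static inline bool is_locked(uintptr_t l)          { return (l & LOCK_BIT) != 0; }
static inline uintptr_t version_of(uintptr_t l)    { return l >> 1; }
static inline uintptr_t make_free(uintptr_t ver)   { return ver << 1; }
static inline uintptr_t make_owned(uintptr_t owner){ return (owner << 1) | LOCK_BIT; }

/* STORE path: returns true if tx `owner` holds the lock afterwards.
 * false means "abort, or defer to the contention manager". */
static inline bool try_acquire(_Atomic uintptr_t *lock, uintptr_t owner)
{
    uintptr_t seen = atomic_load(lock);
    if (is_locked(seen))
        return seen == make_owned(owner);  /* already ours, or someone else's */
    /* Free: take it with a single compare-and-swap. */
    return atomic_compare_exchange_strong(lock, &seen, make_owned(owner));
}

/* LOAD path: a free lock whose version is <= tx.ts can be read directly;
 * a newer version would require extending the snapshot first. */
static inline bool readable_without_extension(uintptr_t l, uintptr_t tx_ts)
{
    return !is_locked(l) && version_of(l) <= tx_ts;
}

/* COMMIT path: release the lock and publish the commit timestamp as version. */
static inline void release(_Atomic uintptr_t *lock, uintptr_t commit_ts)
{
    atomic_store(lock, make_free(commit_ts));
}
```

With #shifts = 2, two adjacent 4-byte words map to adjacent locks; with #shifts = 3 they share one lock, which is the kind of trade-off the tuning slides explore.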

Write-through vs. write-back
Write-through (ETL)
  Writes to memory (undo log)
  Uses incarnation numbers on versions (ABA problem)
  Faster commit
  Faster RW-after-write, enables compiler optimizations
Write-back (ETL)
  Buffered writes (redo log)
  Locks point directly to entries in redo log
  Faster abort
  Version numbers don’t change on abort (no ABA problem)
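
The difference between the two logs can be sketched as below. This is a single-threaded illustration under stated assumptions: locking and versioning are left out, the log is a fixed-size array, and the type and function names are hypothetical. The write-back lookup here is a linear scan for brevity, whereas the slide notes that locks point directly to redo-log entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LOG_CAP = 64 };   /* illustrative fixed capacity */

typedef struct { uintptr_t *addr; uintptr_t value; } log_entry;
typedef struct { log_entry e[LOG_CAP]; size_t n; } tx_log;

/* Write-through: update memory in place and remember the old value. */
static inline void wt_store(tx_log *undo, uintptr_t *addr, uintptr_t v)
{
    assert(undo->n < LOG_CAP);
    undo->e[undo->n++] = (log_entry){ addr, *addr };
    *addr = v;
}

/* Commit is cheap (drop the log); abort must replay the log backwards. */
static inline void wt_commit(tx_log *undo) { undo->n = 0; }
static inline void wt_abort(tx_log *undo)
{
    while (undo->n > 0) {
        log_entry *le = &undo->e[--undo->n];
        *le->addr = le->value;
    }
}

/* Write-back: buffer the new value; memory stays untouched until commit. */
static inline void wb_store(tx_log *redo, uintptr_t *addr, uintptr_t v)
{
    for (size_t i = 0; i < redo->n; i++)
        if (redo->e[i].addr == addr) { redo->e[i].value = v; return; }
    assert(redo->n < LOG_CAP);
    redo->e[redo->n++] = (log_entry){ addr, v };
}

/* A read after a write must consult the redo log first. */
static inline uintptr_t wb_load(const tx_log *redo, const uintptr_t *addr)
{
    for (size_t i = 0; i < redo->n; i++)
        if (redo->e[i].addr == addr) return redo->e[i].value;
    return *addr;
}

/* Abort is cheap (drop the log); commit must copy the values to memory. */
static inline void wb_abort(tx_log *redo) { redo->n = 0; }
static inline void wb_commit(tx_log *redo)
{
    for (size_t i = 0; i < redo->n; i++)
        *redo->e[i].addr = redo->e[i].value;
    redo->n = 0;
}
```

The sketch shows where each design pays: write-through does its work on abort, write-back on commit and on reads after writes.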

On validation costs
Observation: long update transactions may have large validation overhead (e.g., linked list)
Reducing the # of locks increases false sharing
Our approach: “hierarchical locking”
  Smaller array of H << L counters mapped to locks
  H partitions in read set, read and write masks
  Counters are atomically updated on first write of transaction to partition (keep track of progress)
  Validation of partition skipped if counter did not change or was only updated by current transaction
  Efficient with large read sets and few writes
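
The partition-skipping check can be sketched in C11 as below. It is a simplified illustration, not TinySTM's implementation: the descriptor layout, the `seen` snapshot array, H = 4, and the function names are assumptions, and the per-entry validation that runs when a partition cannot be skipped is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { H = 4 };                       /* illustrative; H << L in the slides */
static _Atomic unsigned counters[H];  /* shared hierarchical array */

typedef struct {
    uint32_t read_mask;   /* bit h set: tx has read from partition h */
    uint32_t write_mask;  /* bit h set: tx has bumped counters[h] */
    unsigned seen[H];     /* counter value when tx first touched partition h */
} tx_desc;

/* counters[(addr >> #shifts) % H] from the slide. */
static inline size_t partition_of(const void *addr, unsigned shifts)
{
    return (size_t)(((uintptr_t)addr >> shifts) % H);
}

/* Record a read: snapshot the counter on first contact with the partition. */
static inline void note_read(tx_desc *tx, size_t h)
{
    uint32_t bit = UINT32_C(1) << h;
    if (!((tx->read_mask | tx->write_mask) & bit))
        tx->seen[h] = atomic_load(&counters[h]);
    tx->read_mask |= bit;
}

/* Record a write: bump the counter atomically, but only on the first write. */
static inline void note_write(tx_desc *tx, size_t h)
{
    uint32_t bit = UINT32_C(1) << h;
    if (tx->write_mask & bit)
        return;
    unsigned before = atomic_fetch_add(&counters[h], 1u);
    if (!(tx->read_mask & bit))
        tx->seen[h] = before;
    tx->write_mask |= bit;
}

/* True if the counter is unchanged, or changed only by tx's own first write,
 * so the read-set entries of partition h need no individual validation. */
static inline bool can_skip_validation(const tx_desc *tx, size_t h)
{
    uint32_t bit = UINT32_C(1) << h;
    unsigned own = (tx->write_mask & bit) ? 1u : 0u;
    return atomic_load(&counters[h]) == tx->seen[h] + own;
}
```

A transaction with a large read set spread over partitions that nobody else writes can then validate with H counter comparisons instead of one check per read-set entry.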

Hierarchical locking
Figure: the tx descriptor (timestamp, read-set[H], write-set, read-mask:H, write-mask:H), the shared clock, the lock array of L entries (lock bit, address or version), and the hierarchical array of H counters. Memory maps one-to-many, in units of sizeof(word), to both arrays: counters[(addr >> #shifts) % H] and locks[(addr >> #shifts) % L]. Example addresses shown: &p->next, &n->val.

Throughput (red-black tree)
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit); L=2^20, #shifts=2/3
All designs scale well. 64-bit version noticeably faster. Performance of CTL and ETL is comparable (little contention).

Throughput (linked list)
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit); L=2^20, #shifts=2/3
All designs scale well. 64-bit version noticeably faster. CTL suffers more from long transactions (no CM).

Size and update rates
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit); L=2^20, #shifts=2/3
Linked list more sensitive to size than red-black tree (linear vs. logarithmic). Read-only much faster.

Dynamic tuning
Three main tuning parameters in TINYSTM:
  Mapping of addresses to locks (#shifts + 2/3)
  Size of lock array (L, #locks)
  Size of hierarchical array (H)
Goal: find a good combination of these parameters for the workload at runtime
…but, do they really have much impact?
More parameters to come

Impact of #shifts and #locks
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
The numbers of shifts and locks both affect throughput. The “sweet spots” are not the same for all workloads.

Impact of H
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
The hierarchical array helps much for large read sets. The best value for H is not the same for all workloads.

Throughput improvement
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
Larger #locks help initially but then throughput flattens.
Best #shifts depends on spatial locality of shared structure.
Best H depends on size of transaction’s read set.
H: too small => full validation anyhow; too large => overhead from atomic operations on counters.

Dynamic tuning strategy
Start with some initial values: #locks = 2^8, #shifts = 0, H = 1
Measure throughput
Periodically update parameters at runtime (approx. every second)
Hill-climbing algorithm with memory and forbidden areas to find good configuration
Update parameters: costly operation (requires barrier)

Hill-climbing algorithm
8 moves:
  #locks: *=2, /=2
  #shifts: ++, --
  H: *=2, /=2
  nop
  revert to best configuration
Principle: move, then verify effectiveness
  If performance drops significantly or when too far from best configuration, revert
  If performance drop is too high, forbid move
  Moves selected at random to explore uncharted configurations
  If throughput of best configuration drops, switch to second best, etc.
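
The move-then-verify loop can be sketched in C as follows. This is a reduced illustration under stated assumptions, not the authors' tuner: the two drop thresholds are hypothetical parameters (the slides give no numbers), forbidden moves are tracked globally instead of per area, and the "too far from best" and "switch to second best" rules are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* #locks and H are kept as log2 values so that *=2 and /=2 become +1 and -1. */
typedef struct { unsigned locks_log2, shifts, h_log2; } config;

/* The eight moves listed on the slide. */
typedef enum { LOCKS_UP, LOCKS_DOWN, SHIFTS_UP, SHIFTS_DOWN,
               H_UP, H_DOWN, NOP, REVERT, NUM_MOVES } move;

typedef struct {
    config cur, best;
    double best_tput;            /* throughput measured under `best` */
    move   last;                 /* move that produced `cur` */
    bool   forbidden[NUM_MOVES]; /* simplification: global, not per area */
} tuner;

static inline config apply_move(config c, move m, config best)
{
    switch (m) {
    case LOCKS_UP:    c.locks_log2++; break;
    case LOCKS_DOWN:  if (c.locks_log2 > 0) c.locks_log2--; break;
    case SHIFTS_UP:   c.shifts++; break;
    case SHIFTS_DOWN: if (c.shifts > 0) c.shifts--; break;
    case H_UP:        c.h_log2++; break;
    case H_DOWN:      if (c.h_log2 > 0) c.h_log2--; break;
    case REVERT:      c = best; break;
    default:          break;      /* NOP */
    }
    return c;
}

/* Verify the last move using the throughput measured under t->cur.
 * Throughput is assumed non-negative. Returns true if the tuner reverted. */
static inline bool tuner_verify(tuner *t, double tput,
                                double revert_drop, double forbid_drop)
{
    if (tput >= t->best_tput) {   /* new best: remember it */
        t->best = t->cur;
        t->best_tput = tput;
        return false;
    }
    double drop = 1.0 - tput / t->best_tput;
    if (drop > forbid_drop && t->last < NOP)
        t->forbidden[t->last] = true;   /* nop and revert are never forbidden */
    if (drop > revert_drop) {
        t->cur = t->best;
        return true;
    }
    return false;
}

/* Pick the next move at random among those not forbidden. */
static inline void tuner_move(tuner *t)
{
    move m;
    do {
        m = (move)(rand() % NUM_MOVES);
    } while (t->forbidden[m]);
    t->last = m;
    t->cur = apply_move(t->cur, m, t->best);
}
```

A driver would call `tuner_verify` with the throughput of the period that just ended, then `tuner_move`, and apply `t->cur` to the STM behind a barrier, roughly once per second as the previous slide describes.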

Red-black tree
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
Throughput more than doubles from initial configuration

Linked list
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
Throughput almost doubles from initial configuration

Validation costs (linked list)
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit)
Dynamic tuning allows skipping most of the validation checks.

Conclusions
Performance of STM depends on design and configuration parameters, and workload
  No “one-size-fits-all” STM
Dynamic tuning adapts configuration to workload
  Simple hill-climbing algorithm shows significant performance improvements
  More configuration parameters to explore
http://www.tinystm.org

Thank you!

Abort rates
8-core Intel Xeon at 2 GHz, Linux 2.6.18-4 (64-bit); L=2^20, #shifts=2/3
Abort rates increase upon contention, as expected. 64-bit has higher abort rate. CTL has slightly fewer aborts.

ETL vs. CTL
Encounter-time locking
  Acquire locks when memory is written
  Detects conflicts early
  Avoids executing doomed transactions
  Fast RW-after-write
Commit-time locking
  Acquire locks at commit time
  Detects conflicts late
  May reduce conflicts with some workloads