A Dynamic Binary-Rewriting Approach to Software Transactional Memory. Marek Olszewski, Jeremy Cutler, Greg Steffan, University of Toronto. Appeared in PACT 2007, Brasov, Romania.

Presentation transcript:

A Dynamic Binary-Rewriting Approach to Software Transactional Memory
Marek Olszewski, Jeremy Cutler, Greg Steffan
University of Toronto
Appeared in PACT 2007, Brasov, Romania

2 The Parallel Programming Challenge
- Coarse-grained locking
  - Easy to program
  - Scales poorly
- Fine-grained locking
  - Scales well
  - Hard to get right, e.g., deadlock, priority inversion, etc.
- The promise of Transactional Memory
  - As easy to program as coarse-grained locking
  - Performance/scalability of fine-grained locking

3 Transactional Memory (TM)
Source code: ... atomic { ... access_shared_data(); ... } ...
- Programmer: specifies threads/transactions in source code
- TM system: executes transactions optimistically in parallel
  1) Checkpoints execution
  2) Detects conflicts
  3) Commits, or aborts and re-executes
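The three TM-system steps on this slide can be sketched as a retry loop. This is a single-threaded toy model, not JudoSTM code: the names run_atomic, conflict_detected and simulated_conflicts are hypothetical, setjmp/longjmp stands in for the checkpoint, and conflicts are simulated by a counter rather than detected from real read/write sets.

```c
#include <setjmp.h>

/* Hypothetical single-threaded model of a TM retry loop. */
static jmp_buf tx_checkpoint;    /* step 1: saved execution state        */
static int tx_attempts;          /* how many times the body has run      */
static int simulated_conflicts;  /* pretend this many commit tries fail  */

/* Step 2: stand-in conflict detector. A real TM system would compare
 * the transaction's read/write sets against other transactions. */
int conflict_detected(void) {
    if (simulated_conflicts > 0) {
        simulated_conflicts--;
        return 1;
    }
    return 0;
}

/* Runs body on a private copy of *arg and publishes the result only if
 * no conflict is reported. Returns the number of attempts needed. */
int run_atomic(void (*body)(int *), int *arg, int conflicts) {
    simulated_conflicts = conflicts;
    tx_attempts = 0;
    setjmp(tx_checkpoint);           /* aborted attempts resume here      */
    tx_attempts++;
    int tentative = *arg;            /* work on a private copy            */
    body(&tentative);
    if (conflict_detected())
        longjmp(tx_checkpoint, 1);   /* step 3: abort and re-execute      */
    *arg = tentative;                /* step 3: commit                    */
    return tx_attempts;
}

/* Example transaction body. */
void increment(int *x) { (*x)++; }
```

Calling run_atomic(increment, &v, 2) runs the body three times but applies the increment to v only once, because the two aborted attempts never publish their private copy.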

4 TM Implementations
- Flavors of TM: Hardware (HTM), Software (STM), Hybrid (HyTM)
- STM is especially compelling
  - Exploit current commodity hardware (multicores)
  - Learn about real TM systems and apps
- Current STM systems:
  - Java: DSTM, ASTM
  - C or C++: McRT icc, TL2, RSTM, OSTM
  - Object-based or programmer intensive (or both)
- Our focus: arbitrary C/C++, realistic environment

5 Programming with STM
Source code:
    #include <glib.h>
    GTree *tree;
    ...
    atomic { g_tree_insert(tree, &key, &val); }
    ...
Diagram labels: STM compiler; executable (my_app); shared library (glib); loader; kernel; running application.
Not handled by current compiler/library-based STMs: "legacy locks", pre-compiled binaries, system calls.

6 JudoSTM: An Overview
- Key design choices:
  1) Dynamic Binary Rewriting (DBR): insert instrumentation to implement STM
  2) Value-based conflict detection
- Resulting key features:
  1) Privileged transactions (support system calls)
  2) Legacy lock elision
  3) Efficient invisible readers

7 JudoSTM Design Choice 1
- Dynamic Binary Rewriting (DBR)
  - Judo DBR framework (user-space version of JIFL †)
† JIT Instrumentation - A Novel Approach To Dynamically Instrument Operating Systems, SIGOPS EuroSys 2007

8-10 Dynamic Binary Rewriting
[Animated figure over three slides: Judo sits between the original code (basic blocks bb1, bb2, bb3, bb4) and the code cache. The code cache holds bb1, then bb1 and bb2, then bb1, bb2 and bb4.]

11 Judo - Performance
[Chart: normalized runtime overhead]
Overhead low enough to implement STM?

12 DBR-Based STM Goal: Perform These Efficiently
- For all non-stack write instructions
  - Track write addresses and values (write-set)
  - Write-buffer the values from regular memory
- For all non-stack read instructions
  - Redirect to the write-buffer
  - If miss: track read addresses and values (read-set)
- When a transaction completes:
  1) Acquire commit lock(s)
  2) Validate read-set (value-based conflict detection)
  3) Commit write-set to memory
  4) Release commit lock(s)
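The operations listed on this slide can be sketched in portable C. This is only an illustration of the read-set/write-set bookkeeping: the type and function names (Tx, tx_read, tx_write, tx_commit) are hypothetical, the sets are searched linearly instead of through hashtables, capacity overflow is not handled, and a single spin lock stands in for the commit lock(s). JudoSTM inserts the equivalent logic by binary rewriting rather than through explicit calls.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 64   /* illustrative fixed capacity, no overflow handling */

typedef struct { uint32_t *addr; uint32_t value; } Entry;

typedef struct {
    Entry reads[MAX_ENTRIES];  size_t n_reads;   /* read-set  */
    Entry writes[MAX_ENTRIES]; size_t n_writes;  /* write-set */
} Tx;

static atomic_flag commit_lock = ATOMIC_FLAG_INIT;

void tx_begin(Tx *tx) { tx->n_reads = tx->n_writes = 0; }

/* Instrumented store: buffer the value instead of touching memory. */
void tx_write(Tx *tx, uint32_t *addr, uint32_t value) {
    for (size_t i = 0; i < tx->n_writes; i++)
        if (tx->writes[i].addr == addr) { tx->writes[i].value = value; return; }
    tx->writes[tx->n_writes++] = (Entry){ addr, value };
}

/* Instrumented load: try the write-buffer first; on a miss, read memory
 * and record the address and value in the read-set. */
uint32_t tx_read(Tx *tx, uint32_t *addr) {
    for (size_t i = 0; i < tx->n_writes; i++)
        if (tx->writes[i].addr == addr) return tx->writes[i].value;
    uint32_t v = *addr;
    tx->reads[tx->n_reads++] = (Entry){ addr, v };
    return v;
}

/* Returns 1 if the transaction committed, 0 if it must abort. */
int tx_commit(Tx *tx) {
    int ok = 1;
    while (atomic_flag_test_and_set(&commit_lock)) { }     /* 1) acquire  */
    for (size_t i = 0; i < tx->n_reads; i++)                /* 2) validate */
        if (*tx->reads[i].addr != tx->reads[i].value) { ok = 0; break; }
    if (ok)
        for (size_t i = 0; i < tx->n_writes; i++)            /* 3) commit   */
            *tx->writes[i].addr = tx->writes[i].value;
    atomic_flag_clear(&commit_lock);                         /* 4) release  */
    return ok;
}
```

A store made through tx_write stays invisible in memory until tx_commit succeeds, while a later tx_read of the same address inside the transaction sees the buffered value.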

13 DBR: Attractive Properties for STM
- Performance: overheads are amortized via the code cache
- Can handle arbitrary code and shared libraries
  - Any/all code is transactionalized as it executes
- Sandboxed transactions
  - Typical STM: inconsistent values could make execution stray, i.e., into non-transactionalized code (very bad!)
    - Solution: frequent & costly read-set validation
  - DBR-based STM: any/all code is transactionalized as it executes
Tough problems for conventional STMs are addressed by DBR

14 JudoSTM Design Choice 2
- Value-Based Conflict Detection (as opposed to location-based)

15-19 Location-Based Conflict Detection
[Animated figure over five slides: Transaction 1 and Transaction 2 read and write main memory, which is divided into strips. Each strip has a version number; the strip versions start at 0 0 0 and become 0 1 0 after Transaction 2 commits. Legend: Read, Written.]
- Transaction 2 commits: 1) validate read set, 2) publish writes (and increment version #s)
- Transaction 1 commits: 1) validate read set: abort!
Note: all transactions must maintain strip version #s
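A minimal sketch of the version-based scheme pictured on these slides, for comparison with the value-based approach. The strip size, table size and all names (strip_version, vtx_read, publish_write, vtx_validate) are assumptions made for illustration; they do not come from any particular STM.

```c
#include <stddef.h>
#include <stdint.h>

#define N_STRIPS    256   /* illustrative version-table size */
#define STRIP_SHIFT 4     /* assumed 16-byte strips          */
#define MAX_READS   64    /* no overflow handling            */

/* One version number per strip of memory, shared by all transactions. */
static uint32_t strip_version[N_STRIPS];

static size_t strip_of(const void *addr) {
    return (size_t)(((uintptr_t)addr >> STRIP_SHIFT) % N_STRIPS);
}

typedef struct { size_t strip; uint32_t version; } ReadRecord;
typedef struct { ReadRecord reads[MAX_READS]; size_t n_reads; } VersionedTx;

/* A read records which strip was touched and the version seen there. */
uint32_t vtx_read(VersionedTx *tx, const uint32_t *addr) {
    size_t s = strip_of(addr);
    tx->reads[tx->n_reads++] = (ReadRecord){ s, strip_version[s] };
    return *addr;
}

/* Publishing a write also increments the version of its strip. */
void publish_write(uint32_t *addr, uint32_t value) {
    *addr = value;
    strip_version[strip_of(addr)]++;
}

/* Validation compares versions, never the data itself. */
int vtx_validate(const VersionedTx *tx) {
    for (size_t i = 0; i < tx->n_reads; i++)
        if (strip_version[tx->reads[i].strip] != tx->reads[i].version)
            return 0;   /* some writer published to this strip: abort */
    return 1;
}
```

Because only versions are compared, a reader in this sketch aborts even when the published write stored the same value that was already in memory, or touched a different word in the same strip.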

20-23 Value-Based Conflict Detection
[Animated figure over four slides: Transaction 1 and Transaction 2 read and write main memory (contents shown as 2356). Legend: Read, Written.]
- Transaction 2 commits: 1) validate read set, 2) publish writes
- Transaction 1 commits: 1) validate read set: abort!
Note: no version information to maintain

24 JudoSTM Feature 1: Privileged Transactions
- Can execute (but not roll back) system calls
- Grab commit lock(s) when about to make a syscall
  - Release when transaction completes
- Only one privileged transaction exists at a time

25-27 Privileged Transactions
[Animated figure over three slides: Transaction 1 and Transaction 2 over main memory (contents shown as 2356). Legend: Read, Written.]
- Transaction 2 is privileged: it can write directly to memory; privileged code (syscalls) may be uninstrumented
- Transaction 1 commits: 1) validate read set: abort!
Value-based conflict detection facilitates system calls within transactions!
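One way the switch to privileged mode could look is sketched below. The slides say only that the commit lock(s) are grabbed before the syscall and released when the transaction completes; validating the read-set and flushing the write-buffer at the moment of the switch is an assumption of this sketch, as are all names (tx_become_privileged, tx_end_privileged) and the single spin lock.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 32   /* illustrative fixed capacity, no overflow handling */

typedef struct { uint32_t *addr; uint32_t value; } Entry;

typedef struct {
    Entry reads[MAX_ENTRIES];  size_t n_reads;
    Entry writes[MAX_ENTRIES]; size_t n_writes;
    int privileged;            /* 1 while this transaction holds the lock */
} Tx;

static atomic_flag commit_lock = ATOMIC_FLAG_INIT;

/* Called just before a system call. Returns 0 if the transaction saw
 * stale data and must abort instead of becoming privileged. */
int tx_become_privileged(Tx *tx) {
    while (atomic_flag_test_and_set(&commit_lock)) { }   /* only one at a time */
    for (size_t i = 0; i < tx->n_reads; i++)              /* assumed: validate  */
        if (*tx->reads[i].addr != tx->reads[i].value) {
            atomic_flag_clear(&commit_lock);
            return 0;
        }
    for (size_t i = 0; i < tx->n_writes; i++)              /* assumed: flush     */
        *tx->writes[i].addr = tx->writes[i].value;
    tx->n_reads = tx->n_writes = 0;
    tx->privileged = 1;
    return 1;
}

/* Stores are buffered normally, but go straight to memory once privileged. */
void tx_write(Tx *tx, uint32_t *addr, uint32_t value) {
    if (tx->privileged) { *addr = value; return; }
    tx->writes[tx->n_writes++] = (Entry){ addr, value };
}

/* The lock is released only when the privileged transaction completes. */
void tx_end_privileged(Tx *tx) {
    tx->privileged = 0;
    atomic_flag_clear(&commit_lock);
}
```

Other transactions need no notification of the direct writes: if a privileged write changes a value they have read, their own value-based validation fails at commit time.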

28 JudoSTM Feature 2: Legacy Lock Elision
- Safely ignore locks within legacy code

29-35 Legacy Lock Elision
[Animated figure over seven slides: Transaction 1 and Transaction 2 each run legacy code that acquires and releases the same lock, with per-transaction Read/Write columns beside main memory. Legend: Read, Written.]
Steps shown:
- Both transactions perform the lock acquire
- One transaction performs the lock release, then commits: 1) validate read set, 2) publish writes; the write to the lock is a silent store
- The other transaction performs the lock release, then commits: 1) validate read set, 2) publish writes
Value-based conflict detection facilitates the elision of legacy locks!
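The silent-store effect can be modelled with a single lock word. This sketch tracks only the lock itself (one read-set entry and one write-buffer entry per transaction); the ElidingTx type and the etx_ and legacy_ function names are hypothetical, and 0 is assumed to mean "lock free".

```c
#include <stdint.h>

/* Hypothetical model: one legacy lock word and one transaction's view of it. */
typedef struct {
    uint32_t *lock_word;  /* the legacy lock in shared memory (0 = free) */
    uint32_t seen;        /* value recorded in the read-set              */
    uint32_t buffered;    /* value held in the write-buffer              */
    int has_read;
    int has_write;
} ElidingTx;

/* Transactional load of the lock word. */
uint32_t etx_load(ElidingTx *tx) {
    if (tx->has_write) return tx->buffered;
    if (!tx->has_read) { tx->seen = *tx->lock_word; tx->has_read = 1; }
    return tx->seen;
}

/* Transactional store to the lock word: buffered, not written to memory. */
void etx_store(ElidingTx *tx, uint32_t v) {
    tx->buffered = v;
    tx->has_write = 1;
}

/* Legacy locking code, executed inside a transaction. */
int legacy_lock_acquire(ElidingTx *tx) {
    if (etx_load(tx) != 0) return 0;   /* looks held */
    etx_store(tx, 1);
    return 1;
}

void legacy_lock_release(ElidingTx *tx) { etx_store(tx, 0); }

/* Value-based commit: returns 1 on success, 0 on abort. */
int etx_commit(ElidingTx *tx) {
    if (tx->has_read && *tx->lock_word != tx->seen) return 0;
    if (tx->has_write) *tx->lock_word = tx->buffered;  /* stores 0 over 0 */
    return 1;
}
```

Two transactions can both "hold" the lock at once here, and both commit, because the lock word in memory never changes value. Real conflicts on the data the lock protects would still be caught by validating the rest of the read-set.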

36 JudoSTM Feature 3: Efficient Invisible Readers

37 Supporting Invisible Readers
- Invisible readers: don't report reads to others
  - Good performance, but can lead to inconsistent read data: errors!
- Data errors: segfault, divide by zero
  - Cheap solution: catch with trap/signal handlers
- Control errors: jump to non-instrumented code
  - Typical solution: verify read-set after every load
    - Expensive! O(N^2)
  - DBR solution: prevented by sandboxing
    - DBR instruments all code as it executes

38 JudoSTM Details: Implementation

39 (reminder) Goal: Perform These Efficiently
- For all non-stack write instructions
  - Track write addresses and values (write-set)
  - Buffer the values from regular memory
- For all non-stack read instructions
  - Redirect to the write-buffer
  - If miss: track read addresses and values (read-set)
- When a transaction completes:
  1) Acquire commit lock(s)
  2) Validate read-set (value-based conflict detection)
  3) Commit write-set to memory
  4) Release commit lock(s)

40 Read/Write Buffer Implementation
Diagram labels: Address; Read Hashtable; Read Buffer; Write Hashtable; Write Buffer.
- Linear-probed, open-addressed hashtables
- Efficient lookup: 5 insts for a hit (+ state-saving?)
- Efficient validate and commit?
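A linear-probed, open-addressed hashtable keyed by address might look like the sketch below. The table size, hash function, bucket layout and names (AddrTable, table_lookup, table_insert) are all assumptions; the slides do not show JudoSTM's actual layout, and this C version makes no attempt to match the 5-instruction hit path.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u   /* power of two, so masking replaces modulo   */
#define EMPTY_KEY  0u      /* assumes address 0 is never stored as a key */

/* Each bucket maps an accessed address to its slot in the read or write buffer. */
typedef struct { uintptr_t addr; uint32_t slot; } Bucket;
typedef struct { Bucket buckets[TABLE_SIZE]; } AddrTable;   /* zero-initialise */

static size_t hash_addr(uintptr_t addr) {
    return (size_t)((addr >> 2) & (TABLE_SIZE - 1));   /* drop word-alignment bits */
}

/* Returns a pointer to the stored slot, or NULL on a miss. */
const uint32_t *table_lookup(const AddrTable *t, uintptr_t addr) {
    size_t i = hash_addr(addr);
    for (size_t n = 0; n < TABLE_SIZE; n++) {
        const Bucket *b = &t->buckets[i];
        if (b->addr == addr) return &b->slot;       /* hit                   */
        if (b->addr == EMPTY_KEY) return NULL;      /* empty bucket: miss    */
        i = (i + 1) & (TABLE_SIZE - 1);             /* linear probe forward  */
    }
    return NULL;
}

/* Inserts or updates an entry. Returns 0 only if the table is full. */
int table_insert(AddrTable *t, uintptr_t addr, uint32_t slot) {
    size_t i = hash_addr(addr);
    for (size_t n = 0; n < TABLE_SIZE; n++) {
        Bucket *b = &t->buckets[i];
        if (b->addr == addr || b->addr == EMPTY_KEY) {
            b->addr = addr;
            b->slot = slot;
            return 1;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

With this hash, addresses 0x1000 and 0x2000 land in the same bucket, so the second insert is placed in the next bucket and found again by probing.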

41-45 Efficient Commit: Executable Write-Buffer
[Animated figure over five slides: the write hashtable and a top pointer refer to the write buffer, which fills with one move instruction per buffered write. Final contents shown:]
    movl $0x ,0x
    movl $0x80B10CFC,0x80B10CA4
    movl $0x0000ab42,0x80B10BCC
    movl $0x ,0x80B10BB8
    ret
- Pre-allocated buffer of move instructions
- Emit value-address pairs as transaction executes
- Execute the write-buffer to commit!
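The buffer layout can be sketched by encoding the instructions by hand. This assumes 32-bit x86, where movl $imm32, addr32 encodes as C7 05 followed by the little-endian address and value, and ret as C3. Filling the buffer backwards from the ret, so that the top pointer is also the entry point, is one reading of the animation rather than a documented detail. The sketch only builds the bytes; it does not execute them, and the names (WriteBuffer, wb_init, wb_emit) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOV_LEN  10   /* C7 05 <addr32> <imm32> */
#define MAX_MOVS 16   /* illustrative capacity  */

typedef struct {
    uint8_t code[MAX_MOVS * MOV_LEN + 1];   /* mov templates followed by ret */
    size_t  top;                            /* offset of the entry point     */
} WriteBuffer;

static void put_u32(uint8_t *p, uint32_t v) {   /* little-endian store */
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Pre-allocate: every slot already holds the mov opcode bytes. */
void wb_init(WriteBuffer *wb) {
    for (size_t i = 0; i < MAX_MOVS; i++) {
        uint8_t *p = wb->code + i * MOV_LEN;
        memset(p, 0, MOV_LEN);
        p[0] = 0xC7;   /* mov r/m32, imm32     */
        p[1] = 0x05;   /* ModRM: disp32 only   */
    }
    wb->code[MAX_MOVS * MOV_LEN] = 0xC3;   /* ret */
    wb->top = MAX_MOVS * MOV_LEN;          /* empty buffer: entry is the ret */
}

/* Record one buffered write by patching the next template above the
 * current entry point. Returns 0 if the buffer is full. */
int wb_emit(WriteBuffer *wb, uint32_t addr, uint32_t value) {
    if (wb->top == 0) return 0;
    wb->top -= MOV_LEN;
    put_u32(wb->code + wb->top + 2, addr);
    put_u32(wb->code + wb->top + 6, value);
    return 1;
}
```

On a 32-bit x86 system with the buffer in executable memory, committing would amount to calling the code at offset top; every patched mov then runs in sequence until the ret.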

46-50 Efficient Validation: Executable Read-Buffer
[Animated figure over five slides: the read hashtable and a top pointer refer to the read buffer, which fills with one compare-and-jump pair per read. Final contents shown:]
    cmp $0x , 0x
    jne,pn judostm_trans_abort
    cmp $0x , 0x80B10BCC
    jne,pn judostm_trans_abort
    cmp $0x , 0x80B10BB8
    jne,pn judostm_trans_abort
    cmp $0x00000a34, 0x80B10CA4
    jne,pn judostm_trans_abort
    ret
- Pre-allocated buffer of compare & jump instructions
- Emit value-address pairs as transaction executes
- Execute the read-buffer to validate the read-set!
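One read-buffer entry can be encoded the same way. Assuming 32-bit x86, cmpl $imm32, addr32 is 81 3D followed by the address and the expected value, and jne,pn is the 2E "predict not taken" hint prefix followed by 0F 85 and a 32-bit displacement relative to the end of the jump. The function name emit_check and the idea of passing buffer and abort addresses as plain 32-bit numbers are illustrative assumptions; the bytes are built but never executed.

```c
#include <stddef.h>
#include <stdint.h>

#define CHECK_LEN 17   /* 10 bytes of cmp + 7 bytes of hinted jne */

static void put_u32(uint8_t *p, uint32_t v) {   /* little-endian store */
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Writes "cmp $expected, addr ; jne,pn abort_addr" into out.
 * out_addr is the address at which these bytes would later execute,
 * needed because the jump displacement is relative. Returns the length. */
size_t emit_check(uint8_t *out, uint32_t out_addr, uint32_t addr,
                  uint32_t expected, uint32_t abort_addr) {
    out[0] = 0x81;                 /* cmp r/m32, imm32 (opcode 81 /7) */
    out[1] = 0x3D;                 /* ModRM: /7, disp32 only          */
    put_u32(out + 2, addr);
    put_u32(out + 6, expected);
    out[10] = 0x2E;                /* branch hint: predict not taken  */
    out[11] = 0x0F;                /* jne rel32                       */
    out[12] = 0x85;
    put_u32(out + 13, abort_addr - (out_addr + CHECK_LEN));
    return CHECK_LEN;
}
```

Validation then costs one compare and one rarely-taken branch per read-set entry, with no loop or table walk at commit time.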

51 Evaluation
- JudoSTM performance
  - Comparison with Rochester’s RSTM

52 RSTM vs JudoSTM: Design

                     RSTM                                 JudoSTM
Language             C++                                  C/C++
Programming model    Library API, rewrite code            atomic{…}
Conflict detection   Object-level, location-based         Value-based
Memory allocation    Custom                               “Hoard” scalable parallel allocator
Fast commit          Object-cloning & pointer-switching   Executable write-buffer

JudoSTM more flexible, less intrusive; but performance?

53 Experimental Framework
- RSTM micro-benchmarks
  - Linked List, Hash Table, RBTree
  - Equal mix of insert, remove, and lookup
  - Measure throughput (transactions/sec)
- Test platform
  - 4-way SMP Intel Pentium 4 Xeon, 2.8GHz
  - L1d/L2/L3 cache sizes: 8KB/512KB/2MB
  - Linux with per-thread signal handler support

54 Linked List
[Chart] Coarse-grained locking best, but not scaling

55 Linked List – Zoomed in
[Chart] Single-lock JudoSTM scaling nicely; RSTM flatlined

56 Hash Table
[Chart] Distributed-lock JudoSTM beats CG-locking, tracks RSTM

57 RBTree
[Chart] JudoSTM on track to scale past CG-locking; RSTM flatlined

58 Conclusions
- Judo: highly-efficient DBR framework
  - Beats DynamoRIO on SPEC benchmarks
- JudoSTM: first STM based on DBR
  - Value-based conflict detection
  - Executable read/write buffers
- Desirable features:
  - Efficient invisible readers (sandboxing)
  - Legacy lock elision
  - Privileged transactions (system call support)
  - Performance comparable to RSTM
Facilitates STM for real programs & environments!

59 Backups

60 JudoSTM Details: Programming with JudoSTM

61 Programming with JudoSTM
Source code:
    #include <glib.h>
    GTree *tree;
    ...
    atomic { g_tree_insert(tree, &key, &val); }
    ...
After the atomic macro is expanded:
    #include <glib.h>
    GTree *tree;
    ...
    judostm_start();
    g_tree_insert(tree, &key, &val);
    judostm_stop();
    ...
Diagram labels: gcc; executable (my_app); library (judoSTM); shared library (glib); loader; kernel; running application: instrumented my_app + glib in the code cache.
- Easy to use, with no compiler support!
Header file:
    #ifndef JUDOSTM_H
    #define JUDOSTM_H
    extern void judostm_start(void);
    extern void judostm_stop(void);
    #define atomic \
        asm __volatile__ ("":::"eax", "ecx", "edx", "ebx", "edi", \
                          "esi", "flags", "memory"); \
        int __count = 0; \
        judostm_start(); \
        for (; __count < 1; judostm_stop(), __count++)
    #endif
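The for-loop trick in the atomic macro can be tried out without JudoSTM. In this sketch judostm_start and judostm_stop are stand-in stubs that only count calls, the x86-specific asm clobber statement is left out so the code compiles on any target, and the loop counter is renamed count_ (an arbitrary choice) to stay clear of reserved identifiers.

```c
/* Stub counters standing in for the real JudoSTM runtime. */
static int starts;
static int stops;

void judostm_start(void) { starts++; }
void judostm_stop(void)  { stops++; }

/* Same shape as the slide's macro, minus the asm clobber statement:
 * the block that follows becomes the body of a loop that runs exactly
 * once, and judostm_stop() runs as the loop's increment step. */
#define atomic \
    int count_ = 0; \
    judostm_start(); \
    for (; count_ < 1; judostm_stop(), count_++)

int demo(void) {
    int total = 0;
    atomic {
        total += 41;
        total += 1;
    }
    return total;
}
```

Because the macro declares a variable in the enclosing scope, this form allows only one atomic block per scope; a second block needs its own braces around it.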