Nested Parallelism in Transactional Memory. Kunal Agrawal, Jeremy T. Fineman, and Jim Sukha (MIT).

Program Representation

  ParallelIncrement() {
    parallel          // P1
      { x ← x+1 }     // S1
      { x ← x+1 }     // S2
  }

The parallel keyword allows the two following code blocks (enclosed in {...}) to execute in parallel. We model the execution of a multithreaded program as a walk of a series-parallel computation tree. Internal nodes of the tree are S (series) or P (parallel) nodes; the leaves of the tree are memory operations. All child subtrees of an S node must execute in series, in left-to-right order; the child subtrees of a P node can potentially execute in parallel.

[Figure: computation tree with root S0 and parallel node P1 whose children S1 and S2 each contain the operations R x and W x, labeled u1 through u4.]

Data Races

  ParallelIncrement() {
    parallel          // P1
      { x ← x+1 }     // S1
      { x ← x+1 }     // S2
  }

Two (or more) parallel accesses to the same memory location, where at least one of the accesses is a write, constitute a data race. (In the tree, two accesses can happen in parallel if their least common ancestor is a P node.) Here there are races between u1 and u4, u3 and u2, and u2 and u4. Data races lead to nondeterministic program behavior. Traditionally, locks are used to prevent data races.
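The least-common-ancestor test described above is easy to state in code. A minimal sketch (our own illustration; the Node class and helper names are not from the paper):

```python
# Sketch of the race test above: two accesses to the same location may
# race iff at least one is a write and their least common ancestor in
# the series-parallel tree is a P node.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind      # 'S', 'P', or 'leaf' (a memory operation)
        self.parent = parent

def lca(a, b):
    # Walk b's ancestor chain until we hit an ancestor of a.
    ancestors_of_a = set()
    n = a
    while n is not None:
        ancestors_of_a.add(id(n))
        n = n.parent
    n = b
    while id(n) not in ancestors_of_a:
        n = n.parent
    return n

def may_race(u, v, u_is_write, v_is_write):
    return (u_is_write or v_is_write) and lca(u, v).kind == 'P'

# The tree from the slide: S0 -> P1 -> (S1, S2); each Si does R x, W x.
s0 = Node('S'); p1 = Node('P', s0)
s1 = Node('S', p1); s2 = Node('S', p1)
u1, u2 = Node('leaf', s1), Node('leaf', s1)   # R x, W x under S1
u3, u4 = Node('leaf', s2), Node('leaf', s2)   # R x, W x under S2
```

With this tree, u1 and u4 race (their LCA is P1), while u1 and u2 do not (their LCA is the series node S1).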

Transactional Memory

  ParallelIncrement() {
    parallel                     // P1
      { atomic { x ← x+1 } }     // S1, transaction A
      { atomic { x ← x+1 } }     // S2, transaction B
  }

Transactional memory has been proposed as an alternative to locks. The programmer simply encloses the critical region in an atomic block. The runtime system ensures that the region executes atomically by tracking its reads and writes, detecting conflicts, and aborting and retrying if necessary.

Nested Parallelism

One can generate more parallelism by nesting parallel blocks.

  ParallelIncrement() {
    parallel            // P1
      { x ← x+1 }       // S1
      { x ← x+1
        parallel        // P2
          { x ← x+1 }   // S3
          { x ← x+1 }   // S4
      }                 // S2
  }

[Figure: computation tree with P2 nested under S2; the eight memory operations are labeled u1 through u8.]

Nested Parallelism in Transactions

Again, we use transactions to prevent data races. (Notice the parallelism inside transaction B.)

  ParallelIncrement() {
    parallel                     // P1
      { atomic { x ← x+1 } }     // S1, transaction A
      { atomic {
          x ← x+1
          parallel               // P2
            { x ← x+1 }          // S3
            { x ← x+1 }          // S4
        }                        // transaction B
      }                          // S2
  }

Unfortunately, this program still has data races: the parallel increments inside B (S3 and S4) are unprotected against each other.

Nested Parallelism and Nested Transactions

Adding more transactions removes the remaining races:

  ParallelIncrement() {
    parallel
      { atomic { x ← x+1 } }         // transaction A
      { atomic {
          x ← x+1
          parallel
            { atomic { x ← x+1 } }   // transaction C
            { atomic { x ← x+1 } }   // transaction D
        }                            // transaction B
      }                              // S2
  }

Transactions C and D are nested inside transaction B. Therefore transaction B has both nested transactions and nested parallelism.

Our Contribution

We describe CWSTM, a theoretical design for a software transactional memory system that allows nested parallelism in transactions for dynamic multithreaded languages that use a work-stealing scheduler. Our design efficiently supports nesting and parallelism of unbounded depth. CWSTM supports:
– efficient eager conflict detection, and
– eager updates (fast commits).
We prove that CWSTM exhibits small overhead on a program with transactions compared to the same program with all atomic blocks removed.

More Precisely…

A work-stealing scheduler guarantees that a transaction-less program with work T1 and critical path T∞ running on P processors completes in time O(T1/P + T∞).
– This provides linear speedup when T1/T∞ >> P.

If a program has no aborts and no read contention*, then CWSTM completes the program with transactions in time O(T1/P + P·T∞).
– This provides linear speedup when T1/T∞ >> P².

*In the presence of multiple readers, a write to a memory location has to check for conflicts against multiple readers.
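To see which term of the CWSTM bound dominates, one can plug in numbers; a toy sketch of the bound's shape (our own illustration, ignoring constant factors):

```python
# Toy illustration of the CWSTM bound O(T1/P + P*T_inf): when
# T1/T_inf >> P^2, the T1/P term dominates, which is what "linear
# speedup" means here. Not code from the paper.

def cwstm_time_bound(T1, Tinf, P):
    return T1 / P + P * Tinf
```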

Outline
– Introduction
– Semantics of TM
– Difficulty of Conflict Detection
– Access Stack
– Lazy Access Stack
– Intuition for Final Design Using Traces and Analysis
– Conclusions and Future Work

Conflicts in Transactions

  parallel
    { atomic {
        x ← 1
        y ← 2
      }                // transaction A
    }                  // S1
    { atomic {
        z ← 3
        atomic {
          z ← 4
          x ← 5
        }              // transaction C
      }                // transaction B
    }                  // S2

Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T. Active transactions conflict iff they are in parallel with each other and their write sets overlap. In the depicted execution, A has written x and B has written z, so W(A) = {x} and W(B) = {z}; the nested transaction C then writes z and x, so W(C) = {z, x}. C's write to x overlaps W(A), and A and C are in parallel: CONFLICT!
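The conflict condition itself reduces to a set intersection plus a parallelism check. A hedged sketch (our own, not CWSTM code; write sets are represented as Python sets):

```python
# Eager conflict detection between two active transactions, given
# their write sets and whether they execute in parallel (i.e.,
# neither is an ancestor of the other). Illustrative only.

def conflicts(writes_a, writes_b, in_parallel):
    return in_parallel and bool(set(writes_a) & set(writes_b))
```

On the slide's example, A and C are in parallel and both wrote x, so they conflict; B and its nested child C share z but are not in parallel, so they do not.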

Nested Transactions: Commit and Abort

If two transactions conflict, one of them is aborted (and possibly retried); when a transaction aborts, its write set is discarded. If a transaction completes without a conflict, it is committed and its write set is merged with its parent transaction's write set.

In the example, W(A) = {y} and W(B) = {z, x}; when the nested transaction C, with W(C) = {z, u}, commits, its write set merges into its parent's, giving W(B) = {z, x, u}.

[Figure: computation tree with transactions A, B, C, D and writes to x, y, z, and u.]

Conflicts in Serial Transactions

Virtually all proposed TM systems focus on the case where transactions are serial (no P nodes in subtrees of transactions). Then two writes to the same memory location cause a conflict if and only if they are on different threads, so the TM system can simply check whether some other thread wrote to the memory location, e.g., by keeping a table that maps each location to the thread that last wrote it:

  location  thread
  x         2
  z         1

Here the table records that thread 1 wrote z and thread 2 wrote x; when thread 2 later writes z, the table shows that a different thread wrote it: CONFLICT!

[Figure: computation tree split between two threads, each executing a series of (possibly nested) serial transactions writing x and z.]
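This serial-transactions check can be modeled as a last-writer table. A toy sketch under the slide's assumption that transactions are serial per thread (all names are our own illustration):

```python
# Toy model of the serial-transactions check: a table mapping each
# location to the last thread that wrote it. A write by a different
# thread than the recorded one signals a conflict.

last_writer = {}

def tm_write(location, thread):
    prev = last_writer.get(location)
    if prev is not None and prev != thread:
        return 'CONFLICT'
    last_writer[location] = thread
    return 'ok'
```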

Thread ID is not enough

A work-stealing scheduler does not create a thread for every S node; instead, it schedules the computation on a fixed number of worker threads. The runtime therefore cannot simply compare worker ids to determine whether two transactions conflict.

Example: both Z1 and Y1 execute on the same worker, and X, Y1, Z1, and Z2 all have x in their write sets. Z2 conflicts with Z1, but not with Y1, even though Z1 and Y1 carry the same worker id.

[Figure: a deep computation tree in which transactions X, Y1, Z1, and Z2 all have x in their write sets.]

CWSTM Invariant: Conflict-Free Execution

INVARIANT 1: At any time, for any given location L, all active transactions that have L in their write set fall along a single root-to-leaf chain. Let X be the end (deepest transaction) of the chain.

INVARIANT 2: When a transaction Z tries to access location L:
– no conflict if X is an ancestor of Z (e.g., Z1);
– conflict if X is not an ancestor of Z (e.g., Z2).

[Figure: computation tree showing active and inactive transactions that accessed L; the active ones form a chain ending at X, with Z1 in X's subtree and Z2 outside it.]

Design Attempt 1

For every location L, keep an access stack for L, holding the chain of active transactions that have L in their write set.
– Access stacks are changed on commits and aborts. In the example stack [Y0, Y1, Y3], with X = Y3 on top: if Y3 commits, it is replaced on the stack by Y2; if Y3 aborts, it disappears from the stack and Y1 is at the top.

Let X be the top of the access stack for L. When transaction Z tries to access L, report a conflict if and only if X is not an ancestor of Z.
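Design Attempt 1 can be mocked up as a small stack protocol. This sketch is our own simplification (a single location's stack, with ancestry supplied as a predicate), not CWSTM code:

```python
# Mock-up of Design Attempt 1 for a single location L. The stack holds
# the chain of active transactions with L in their write set.

access_stack = []   # bottom ... top

def on_write(txn, top_is_ancestor_of_txn):
    # Conflict iff the stack top is not an ancestor of the writer.
    if access_stack and not top_is_ancestor_of_txn(access_stack[-1]):
        return 'CONFLICT'
    if not access_stack or access_stack[-1] != txn:
        access_stack.append(txn)
    return 'ok'

def on_commit(txn, parent_txn):
    # A committing transaction is replaced on the stack by its parent.
    if access_stack and access_stack[-1] == txn:
        access_stack[-1] = parent_txn

def on_abort(txn):
    # An aborting transaction simply disappears from the stack.
    if access_stack and access_stack[-1] == txn:
        access_stack.pop()
```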

Maintenance of the access stack on commit

Consider a serial program with a chain of nested transactions Y0, Y1, …, Yd, where each Yi accesses a unique location Li.
– Total work with no transactions: O(d).

On commit of a transaction, the access stacks of all the memory locations in its write set must be updated. Because write sets merge into the parent on commit, by the time Yi commits its write set has grown to {Li, Li+1, …, Ld}, so (d−i+1) access stacks must be updated.
– Overhead due to transaction commits: O(d²), versus O(d) total work.
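The quadratic overhead can be checked by summing the per-commit costs claimed above; a quick sanity check (our own arithmetic, not code from the paper):

```python
# Committing Y_i updates (d - i + 1) access stacks, so the total over
# the chain Y_0 ... Y_d is quadratic in the nesting depth d.

def total_commit_work(d):
    return sum(d - i + 1 for i in range(d + 1))
```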

Lazy Access Stack

Don't update access stacks on commits. Instead, every transaction Y on the stack (which may since have committed) implicitly represents its closest active transactional ancestor.

[Figure: a lazy access stack for L, containing both active and committed transactions, shown next to the equivalent non-lazy stack obtained by mapping each entry to its closest active transactional ancestor.]

The Oracle

When a transaction Z tries to access location L, and L has transaction Y (possibly inactive) on top of its access stack:

  TheOracle(Y, Z) {
    X ← Y's closest active ancestor transaction
    if (X is an ancestor of Z) return "no conflict"
    else return "conflict"
  }

Closest Active Ancestor

Solution 1: to answer the oracle query, walk up the tree from Y to find X, its closest active ancestor transaction.

PROBLEM: each memory access might take Θ(d) time, where d is the nesting depth.

The XConflict Oracle
When a transaction Z tries to access location L, and L has transaction Y (possibly inactive) on top of its access stack:

XConflict_Oracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z) return "no conflict"
  else return "conflict";
}

– Solution 2: Use data structures to maintain X.
[Figure: a chain of nested transactions Y0, Y1, …, Yd with the corresponding lazy stack; inactive transactions with L in their writeset are shaded.]

The XConflict Oracle
When a transaction Z tries to access location L, and L has transaction Y (possibly inactive) on top of its access stack:

XConflict_Oracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z) return "no conflict"
  else return "conflict";
}

– Solution 2: Use data structures to maintain X.
– Problem: The data structures have to be modified on every commit, leading to synchronization overhead.
[Figure: a chain of nested transactions Y0, Y1, …, Yd with the corresponding lazy stack; inactive transactions with L in their writeset are shaded.]

Closest Active Ancestor
When a transaction Z tries to access location L, and L has transaction Y (possibly inactive) on top of its access stack:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z) return "no conflict"
  else return "conflict";
}

– CWSTM uses an XConflict data structure that supports the above query in O(1) time, because it does not always need to find X to answer the query.
[Figure: a chain of nested transactions Y0, Y1, …, Yd with the corresponding lazy stack; inactive transactions that accessed L are shaded.]

Outline
– Introduction
– Computation Tree
– Definition of Conflicts and Design Attempt 1
– Access Stack
– Lazy Access Stack
– Intuition for Final Design Using Traces and Analysis
– Conclusions and Future Work

Traces
[Figure: a computation tree with transactions X0–X2, Y1–Y2, Z1–Z2 and spawn/sync nodes P1–P6, S1–S13, partitioned into traces across 2 workers.]
– To support XConflict queries efficiently, we group sections of the computation tree into traces.
– Every trace executes serially on one processor; there is no synchronization overhead within a trace.
– Traces are created and modified only on steals. In CWSTM: #traces = O(#steals).
– Work-stealing theorem: the number of steals is small, so the overhead of maintaining traces is small.
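The "traces change only on steals" rule can be sketched as follows. The names (`Trace`, `run_locally`, `on_steal`) are hypothetical, not from the CWSTM runtime; the sketch just shows that serial work reuses the current trace while each steal splits off a new one, so the number of traces is bounded by the number of steals.

```python
# Sketch: traces are created only when work is stolen.

class Trace:
    count = 0                        # total traces ever created

    def __init__(self, parent=None):
        self.parent = parent
        Trace.count += 1

def run_locally(trace):
    # Work executed by the same worker stays inside its current trace:
    # no new trace is created and no synchronization is needed.
    return trace

def on_steal(victim_trace):
    # A steal splits off a fresh trace for the stolen continuation,
    # so #traces = O(#steals).
    return Trace(parent=victim_trace)
```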

Ancestor Relationships and Traces
[Figure: a computation tree with transactions X0–X6 and Z1–Z3, partitioned into complete traces and active traces.]
– CWSTM can use traces to answer a restricted class of ancestor queries efficiently.
– THEOREM: For any active node X and running node Z, X is an ancestor of Z iff trace(X) is an ancestor of trace(Z).
– Example: X4 is an ancestor of Z2; X1 is not an ancestor of Z2.

The XConflict Query with Traces
When a transaction Z tries to access location L, and L has transaction Y (possibly inactive) on top of its access stack:

Oracle Query:
TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z) return "no conflict"
  else return "conflict";
}

Actual CWSTM Query:
XConflict(Y, Z) {
  U_X ← trace containing X  // (X is Y's closest active ancestor transaction)
  if (U_X is an ancestor of trace containing Z) return "no conflict"
  else return "conflict";
}
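The trace-level query can be sketched directly from the theorem on the previous slide. This is an illustrative model, not the CWSTM data structure: it answers the query entirely on the (much smaller) trace tree, and the ancestor test is written as a plain parent walk, whereas CWSTM answers it in O(1) with order-maintenance structures.

```python
# Sketch of the trace-based XConflict test (illustrative names).
# By the theorem: for active X and running Z, X is an ancestor of Z
# iff trace(X) is an ancestor of trace(Z).

class Trace:
    def __init__(self, parent=None):
        self.parent = parent

def trace_is_ancestor(u, v):
    # Plain O(depth) walk for clarity; CWSTM makes this query O(1).
    while v is not None:
        if v is u:
            return True
        v = v.parent
    return False

def xconflict(trace_of_x, trace_of_z):
    return ("no conflict"
            if trace_is_ancestor(trace_of_x, trace_of_z)
            else "conflict")
```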

Sources of Overhead in CWSTM
– Building the computation tree.
– Queries to XConflict: at most one on every memory access.
– Updates to traces for XConflict:
  – Creating/splitting traces.
  – Building an order-maintenance data structure on traces for ancestor queries.
  – Merging complete traces together.
– No rollbacks or retries if we assume no conflicts.*
*Assuming no concurrent reads to the same location and no aborts.

Sources of Overhead in CWSTM
– Building the computation tree.
– Queries to XConflict: at most one for every memory access.*
– Updates to traces for XConflict:
  – Creating/splitting traces.
  – Maintaining data structures for ancestor queries on traces.
  – Merging complete traces together.
– No rollbacks or retries if we assume no conflicts.
THEOREM: For a computation with no transaction conflicts and no concurrent readers of a shared-memory location, CWSTM executes the computation in O(T1/P + PT∞) time.
– O(1)-factor increase on the total work (T1).
– Increases the critical path to O(PT∞).
*Assuming no concurrent reads to the same location and no aborts.

Future Work
CWSTM is the first design that supports nested parallelism and nested transactions in TM and guarantees (asymptotically) low overhead. In the future:
– Implement CWSTM in the Cilk runtime system and evaluate its performance.
– Is there a better design that handles concurrent readers more efficiently?
– Nested parallelism in TM for other schedulers.