Transactions with Nested Parallelism

Transactions with Nested Parallelism (Adding Transactions to Cilk) Kunal Agrawal, Jeremy T. Fineman, and Jim Sukha MIT CSAIL TRANSACT August 16, 2007

A Sample Cilk Program
    int supply1[10000];
    int supply2[10000];
    int N = 4;
    cilk int main() {
        spawn buy_computer(supply1);
        spawn buy_computer(supply2);
        sync;
        return 0;
    }
    cilk void buy_computer(int* c) {
        int i, j;
        i = rand() % (10000 - N);
        for (j = 0; j < N; j++)
            spawn buy_part(c, i + j);
        sync;
    }
    cilk void buy_part(int* c, int i) {
        c[i]--;
    }
A Cilk program that updates the inventory after buying two computers. Purchasing a computer decrements parts from the inventory arrays, supply1 and supply2, potentially in parallel.
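To try the sample program without a Cilk compiler, one option is its serial C elision: drop the `cilk` keyword, turn each `spawn` into a plain call, and delete each `sync`. The sketch below does that. The names `SUPPLY_SIZE`, `net_change`, and `run_purchases` are hypothetical helpers added for illustration and do not appear on the slide.

```c
#include <assert.h>
#include <stdlib.h>

#define SUPPLY_SIZE 10000

/* Globals start at zero, so entries go negative here; only the change matters. */
int supply1[SUPPLY_SIZE];
int supply2[SUPPLY_SIZE];
int N = 4;                      /* parts per computer, as on the slide */

/* Serial elision of buy_part: the "cilk" keyword is dropped. */
void buy_part(int *c, int i) {
    assert(0 <= i && i < SUPPLY_SIZE);
    c[i]--;
}

/* Serial elision of buy_computer: each spawn becomes a call, sync disappears. */
void buy_computer(int *c) {
    int i = rand() % (SUPPLY_SIZE - N);
    for (int j = 0; j < N; j++)
        buy_part(c, i + j);
}

/* Hypothetical helper: net change summed over one inventory array. */
long net_change(const int *c) {
    long sum = 0;
    for (int k = 0; k < SUPPLY_SIZE; k++)
        sum += c[k];
    return sum;
}

/* Serial elision of the slide's main(); returns the combined net change. */
long run_purchases(void) {
    buy_computer(supply1);
    buy_computer(supply2);
    return net_change(supply1) + net_change(supply2);
}
```

Whatever index `rand()` picks, each call removes N distinct parts, so one run of `run_purchases()` changes each array by -N in total.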

Cilk Runtime and Performance
For the sample program above, run with 4 worker threads: T1 = 16, T∞ = 5, P = 4.
The Cilk runtime schedules the program using work-stealing. Cilk executes a computation with work T1 and span (critical path) T∞ on P processors in time O(T1 / P + T∞), w.h.p.
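As a rough illustration, the slide's numbers can be plugged into the bound. Dropping the constant hidden by the O-notation is an assumption made only for this example.

```latex
\[
\frac{T_1}{P} + T_\infty = \frac{16}{4} + 5 = 9,
\qquad
\frac{T_1}{T_\infty} = \frac{16}{5} = 3.2 .
\]
```

The second ratio, work divided by span, is the parallelism of the computation. At about 3.2 it is already below the 4 workers, so the span term is the larger of the two terms here.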

Transactions in Cilk?
    int supply1[10000];
    int N = 4;
    cilk int main() {
        spawn buy_computer(supply1);  // X1
        spawn buy_computer(supply1);  // X2
        sync;
        return 0;
    }
    cilk void buy_computer(int* c) {
        int i, j;
        atomic {
            i = rand() % (10000 - N);
            for (j = 0; j < N; j++)
                spawn buy_part(c, i + j);
            sync;
        }
    }
    cilk void buy_part(int* c, int i) {
        c[i]--;
    }
Suppose both computers require parts from the same inventory. Can we use transactions to ensure both calls to buy_computer() are atomic, even though X1 and X2 contain nested parallelism?
[Figure: transactions X1 and X2 running on 4 workers.]

Transactions with Nested Parallelism and Nested Transactions?
    int supply1[10000];
    int N = 4;
    cilk int main() {
        spawn buy_computer(supply1);  // X1
        spawn buy_computer(supply1);  // X2
        sync;
        return 0;
    }
    cilk void buy_computer(int* c) {
        int i, j;
        atomic {
            for (j = 0; j < N; j++) {
                i = rand() % 10000;
                spawn buy_part(c, i);
            }
            sync;
        }
    }
    cilk void buy_part(int* c, int i) {
        atomic { c[i]--; }
    }
Suppose each computer can use more than one of the same part. Can we have parallel transactions nested inside X1 and X2?
[Figure: transactions X1 and X2 running on 4 workers.]

Motivation: Library Functions
    int supply1[10000];
    int N = 4;
    cilk int main() {
        spawn buy_computer(supply1);  // X1
        spawn buy_computer(supply1);  // X2
        sync;
        return 0;
    }
    cilk void buy_computer(int* c) {
        atomic {
            spawn foo(c);
            sync;
        }
    }
    cilk void foo(int* c) {
        ???  // spawn?
    }
If transactions can have nested parallelism and nested transactions, then we can composably call some library functions written using Cilk inside a transaction without knowing their exact implementation.
[Figure: transactions X1 and X2, each with unknown ("???") contents.]

XCilk Design
We describe XCilk, a theoretical design for a software transactional memory system for Cilk which supports transactions with nested parallelism and nested transactions, both of unbounded nesting depth.
XCilk uses an XConflict data structure to efficiently check for transaction conflicts.
XCilk lazily cleans up memory locations on aborts.

XCilk Bounds on Overhead
XCilk provides a provable bound on the overhead of TM in the following special case: for a computation with no transaction conflicts and no concurrent readers to a shared memory location, if the computation has work T1 and critical path T∞, XCilk executes the computation on P processors in time O(T1 / P + PT∞).
Linear speedup if P = O(√(T1/T∞)), vs. P = O(T1/T∞) for normal Cilk.
The XCilk runtime system still works correctly in the general case, with conflicts and parallel readers.

Outline
1. Definition of Conflict in XCilk
2. Efficient XConflict Queries

Summary of XCilk Semantics
XCilk performs eager conflict detection.
Transactions in XCilk are closed-nested.
These two conditions imply a prefix race-free execution [ALS06].
If the effects of aborted transactions can be “ignored”, then prefix race-freedom ≈ serializability.

XCilk Computation Tree
XCilk builds a computation tree as a transactional program executes. A program begins with a single root node (X0).
[Figure: computation tree for 3 workers, rooted at X0, with P-nodes P1 to P6, S-nodes S1 to S6 and S8 to S13, transactions X1, X2, Y1, Y2, Z1, Z2, and leaves u1, u2, v1, v2.]

XCilk Computation Tree (spawn)
A spawn creates a P-node (P1) with two S-nodes (S1, S2) as children. The worker then starts executing the left child.

XCilk Computation Tree (steal)
In XCilk, as in Cilk, a worker can steal an S-node (S2) from the deque of another worker.

XCilk Computation Tree (xbegin)
An xbegin creates a transactional S-node (X1).

XCilk: Readsets and Writesets
Conceptually, every transaction X maintains a set of locations that the transaction has read from (the readset R(X)), and a set of locations that the transaction has written to (the writeset W(X)).*
The root of the tree represents the world; we assume the writeset of the root contains a value for all memory locations L.
When a memory operation u1 on a location L occurs, it reads the value from the closest ancestor transaction with L in its readset.
*We assume W(X) is a subset of R(X).
[Figure: the computation tree annotated with W(X0) = {L}; W(X1) and W(Y1) are shown empty.]

XCilk: write
u1: write to L.
[Figure: after the write, the tree is annotated with W(X0) = {L}, W(X1) still empty, and W(Y1) = {L}.]

XCilk: xend
An xend commits a transaction. With closed nesting, when a transaction (Y1) commits, it conceptually merges its readset and writeset into the readset/writeset of its transactional parent (X1).
[Figure: after Y1 commits, the tree is annotated with W(X0) = {L}, W(X1) = {L}, and W(Y1) = {L}.]
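The read, write, and xend rules from these slides can be sketched in a few lines of sequential C. This is only an illustration of the conceptual model, not XCilk's implementation: it assumes a tiny address space of `NUM_LOCS` locations stored as bitmasks, it performs no conflict detection, and it has no abort path. All names (`Txn`, `txn_read`, `txn_write`, `txn_commit`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LOCS 8  /* hypothetical tiny address space: locations 0..7 */

typedef struct Txn {
    struct Txn *parent;   /* enclosing transaction; NULL for the root X0 */
    uint32_t readset;     /* bit i set: location i is in R(X) */
    uint32_t writeset;    /* bit i set: location i is in W(X), a subset of R(X) */
    int value[NUM_LOCS];  /* value[i] is valid only when bit i of readset is set */
} Txn;

/* The root represents the world: it holds a value for every location. */
void txn_init_root(Txn *root) {
    root->parent = NULL;
    root->readset = root->writeset = (1u << NUM_LOCS) - 1u;
    for (int i = 0; i < NUM_LOCS; i++)
        root->value[i] = 0;
}

/* xbegin: start an empty transaction nested inside parent. */
void txn_begin(Txn *x, Txn *parent) {
    x->parent = parent;
    x->readset = 0;
    x->writeset = 0;
}

/* Read from the closest ancestor (x included) that has loc in its readset. */
int txn_read(Txn *x, int loc) {
    assert(0 <= loc && loc < NUM_LOCS);
    uint32_t bit = 1u << loc;
    const Txn *a = x;
    while (!(a->readset & bit))
        a = a->parent;            /* stops at the root at the latest */
    x->value[loc] = a->value[loc];
    x->readset |= bit;
    return x->value[loc];
}

/* Write: the location enters both W(x) and R(x). */
void txn_write(Txn *x, int loc, int v) {
    assert(0 <= loc && loc < NUM_LOCS);
    uint32_t bit = 1u << loc;
    x->value[loc] = v;
    x->readset |= bit;
    x->writeset |= bit;
}

/* xend with closed nesting: merge R(x) and W(x) into the parent. */
void txn_commit(Txn *x) {
    Txn *p = x->parent;
    assert(p != NULL);            /* the root never commits in this sketch */
    for (int i = 0; i < NUM_LOCS; i++) {
        uint32_t bit = 1u << i;
        /* Copy written values, and read values the parent does not hold yet. */
        if ((x->writeset & bit) || ((x->readset & bit) && !(p->readset & bit)))
            p->value[i] = x->value[i];
    }
    p->readset |= x->readset;
    p->writeset |= x->writeset;
}
```

Replaying the slides with this sketch: after Y1 writes to a location, only W(Y1) contains it. After `txn_commit` on Y1 the location is also in W(X1), while the root keeps its old value until X1 itself commits.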

XCilk: Conflicting write
If v1 tries to write to L, XCilk detects a conflict, because X1, which has L in its writeset, is not an ancestor of v1.
Since v1 conflicts with X1, XCilk can choose to abort Z1 immediately.
Alternatively, XCilk can signal an abort of X1, wait for worker 1 to notice, and then finish v1.
[Figure: computation tree with W(X0) = {L} and W(X1) = {L}; the write v1 is marked as a conflict.]

Invariant: Conflict-Free Execution
XCilk performs eager conflict detection, and guarantees that the execution is always conflict-free. At any time, for any given memory location L:
1. All active transactions that have L in their writeset fall along a chain.
2. All active transactions with L in their readset are either along the chain or are descendants of the end of the chain.
[Figure: computation tree with nodes marked as inactive, aborted, write to L, or read from L; one transaction is labeled X*.]
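The two conditions can be written as a small checker over parent pointers. This is a sketch under stated assumptions, not part of XCilk: it reads "along the chain" as "on the path from the root to the end of the chain" (the root always has L in its writeset, so the chain starts there), and it treats every node as its own ancestor. The names `Txn`, `is_ancestor`, and `invariant_holds` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Txn {
    struct Txn *parent;   /* enclosing transaction; NULL for the root */
} Txn;

/* True if a is b itself or a proper ancestor of b. */
bool is_ancestor(const Txn *a, const Txn *b) {
    for (; b != NULL; b = b->parent)
        if (a == b)
            return true;
    return false;
}

/*
 * writers: active transactions with L in their writeset (n >= 1).
 * readers: active transactions with L in their readset.
 * Returns true if both conditions of the invariant hold for L.
 */
bool invariant_holds(Txn **writers, size_t n, Txn **readers, size_t m) {
    assert(n >= 1);               /* the root always has L in its writeset */
    const Txn *end = writers[0];  /* deepest writer seen so far */
    for (size_t i = 1; i < n; i++) {
        if (is_ancestor(end, writers[i]))
            end = writers[i];     /* extends the chain downward */
        else if (!is_ancestor(writers[i], end))
            return false;         /* condition 1 fails: writer off the chain */
    }
    for (size_t j = 0; j < m; j++) {
        /* Condition 2: on the root-to-end path, or below the end of the chain. */
        if (!is_ancestor(readers[j], end) && !is_ancestor(end, readers[j]))
            return false;
    }
    return true;
}
```

A checker like this is mainly useful as a debugging assertion on small hand-built trees; it walks parent pointers and is not meant to be fast.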

Conflicts with the Last Writer to L
In the case where no transactions abort, XCilk reduces conflict detection to queries checking for conflicts against the id of the last transaction to write to L.
Let Y be the last transaction to write to L. X* must be an ancestor of Y, i.e., Y has “merged” into X* because of transaction commits.
If a transaction Z wants to perform a read from L, determine if X* is an ancestor of Z. If not, report a conflict.
[Figure: computation tree with nodes marked as inactive, write to L, or read from L, showing X*, Y, Z1, and Z2.]

The XConflict Oracle
For any running node Z, XConflict(Y, Z) answers this query:
    int XConflict_Oracle(Y, Z) {
        X ← Y’s closest active ancestor transaction
        if (X is an ancestor of Z)
            return “no conflict”   // Case 1
        else
            return “conflict”;     // Case 2
    }
The XConflict data structure is able to answer the query of XConflict_Oracle in O(1) time.
[Figure: example tree for the query XConflict(*, Z1), with transactions Y0 to Y7 and nodes Z1 to Z4, some inactive, labeled by case: 1 = no conflict, 2 = conflict.]
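The oracle's pseudocode translates directly into C if each tree node keeps a parent pointer. The version below is a naive reference sketch that walks the tree, so it takes time proportional to the nesting depth rather than the O(1) that the XConflict data structure achieves. It assumes that "ancestor" includes the node itself, that the root is an active transaction, and that no transaction aborts. `Node`, `closest_active_txn`, and `xconflict_oracle` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
    struct Node *parent;  /* NULL for the root X0 */
    bool is_txn;          /* true for a transactional S-node (from xbegin) */
    bool active;          /* for transactions: false once committed */
} Node;

/* X <- Y's closest active ancestor transaction (Y itself counts). */
const Node *closest_active_txn(const Node *y) {
    while (!(y->is_txn && y->active)) {
        assert(y->parent != NULL);  /* the root must be an active transaction */
        y = y->parent;
    }
    return y;
}

/* True if x is z itself or a proper ancestor of z. */
bool is_ancestor(const Node *x, const Node *z) {
    for (; z != NULL; z = z->parent)
        if (z == x)
            return true;
    return false;
}

/* Returns true for "conflict" and false for "no conflict". */
bool xconflict_oracle(const Node *y, const Node *z) {
    const Node *x = closest_active_txn(y);
    return !is_ancestor(x, z);
}
```

For example, if Y1 has committed into an active X1, a node inside X1 gets "no conflict" against Y1, while a node under a sibling transaction X2 gets "conflict", which matches the conflicting write of v1 shown earlier.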

Sources of Overhead in XCilk
Assume we keep a history of accesses to each memory location L. The overhead in XCilk comes from two sources:
1. Updates to the XConflict data structure / histories after a transaction Y commits.
2. Queries to XConflict to check for conflicts on (potentially) every memory access.

Explicit Merges on Commit
Option 1: Explicit Merge
When we commit a transaction Y into its parent X, for every location L in Y’s readset and writeset, we change the id from Y to X in L’s history.
Advantage: Faster queries. On a query, the last transaction to write to L is always active.
[Figure: chain of nested transactions X1, X2, ..., Xd-2, Xd-1, Xd.]

Slow Commit with Explicit Merging
Explicitly merging writesets immediately on transaction commits can blow up work by a factor proportional to the nesting depth. For example, consider a chain of nested transactions with depth d, with each transaction Xi accessing a different memory location Li.
Writeset of each transaction once its descendants have committed into it, and the cost of its own commit:
    X0:   W(X0)   = {L0, L1, L2, L3, ..., Ld-1, Ld}
    X1:   W(X1)   = {L1, L2, L3, ..., Ld-1, Ld}       O(d)
    X2:   W(X2)   = {L2, L3, ..., Ld-1, Ld}           O(d-1)
    ...
    Xd-2: W(Xd-2) = {Ld-2, Ld-1, Ld}
    Xd-1: W(Xd-1) = {Ld-1, Ld}                        O(2)
    Xd:   W(Xd)   = {Ld}                              O(1)
No nesting: O(d) work. Closed nesting: O(d^2) work.
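The quadratic total follows from summing the per-commit costs along the chain, assuming (as in the example) that committing Xi does work proportional to the size of its merged writeset, d - i + 1:

```latex
\[
\sum_{i=1}^{d} (d - i + 1) \;=\; \sum_{k=1}^{d} k \;=\; \frac{d(d+1)}{2} \;=\; O(d^2).
\]
```

Without nesting, each of the d locations is recorded once, which gives the O(d) figure for comparison.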

Implicit Merges
Option 2: Implicit Merge
Implicitly merge Y into its closest active transactional ancestor. On commit, do nothing to histories for L.
Advantage: Fast updates.
[Figure: computation tree with transactions Y0 to Y8 and nodes Z1 to Z4, some inactive; nodes are labeled a through i.] Sets a through i represent groups of transactions which have merged together.

Implicit Merges with Slow Queries?
Disadvantage: Slow queries? If the XConflict query must walk up the tree to determine which transaction Y has merged into, the query might require Ω(d) time.
XCilk potentially performs an XConflict query on every memory access. Therefore, we need queries to take O(1) time!
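The trade-off can be made concrete with a deliberately naive sketch: an implicit commit only flips a flag, and a query then walks parent pointers until it reaches an active transaction. On a chain of d committed transactions the walk takes d steps. This is the slow approach the slide warns about, not what XCilk does, and `Txn`, `commit_implicit`, `merged_into`, and `worst_case_steps` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Txn {
    struct Txn *parent;  /* enclosing transaction; NULL for the root */
    bool active;         /* cleared on commit */
} Txn;

/* Implicit merge: the commit touches no per-location history, so it is O(1). */
void commit_implicit(Txn *y) {
    y->active = false;
}

/* Naive query: walk up to the transaction y has merged into.
 * *steps receives the number of parent links followed. */
Txn *merged_into(Txn *y, size_t *steps) {
    *steps = 0;
    while (!y->active) {
        y = y->parent;
        (*steps)++;
    }
    return y;
}

/* Builds the chain root <- X1 <- ... <- Xd in chain[0..d] (d + 1 entries),
 * commits X1..Xd, and returns the steps needed to resolve Xd. */
size_t worst_case_steps(Txn *chain, size_t d) {
    chain[0].parent = NULL;
    chain[0].active = true;
    for (size_t i = 1; i <= d; i++) {
        chain[i].parent = &chain[i - 1];
        chain[i].active = true;
    }
    for (size_t i = d; i >= 1; i--)   /* innermost transaction commits first */
        commit_implicit(&chain[i]);
    size_t steps;
    Txn *x = merged_into(&chain[d], &steps);
    assert(x == &chain[0]);           /* everything has merged into the root */
    return steps;
}
```

With an array of 11 entries, `worst_case_steps(chain, 10)` follows 10 parent links, one per nesting level.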

The XConflict Query
For any running node Z, XConflict(Y, Z) must give the same answer as XConflict_Oracle(Y, Z): let X be Y’s closest active ancestor transaction, and report “no conflict” (Case 1) if X is an ancestor of Z, or “conflict” (Case 2) otherwise.
XConflict can run in O(1) time because it does not always need to find X to answer the oracle query.

Trace Construction: Updates
XCilk speeds up XConflict queries by dividing the computation tree into traces, with each trace executed by a single worker.*
In Cilk, traces are created by splitting on steals.
# traces = O(# steals) = O(PT∞)
Thus, we can afford to acquire a global lock on steals, and perform O(1) amortized work per trace.
Nested transactions are merged together by merging traces together when traces complete.
*Trace construction is similar to the construction in [BFGL04, Fineman05], used for parallel race detection in Cilk.

Trace Construction: Queries
An XConflict query involves a constant number of O(1)-time operations at two tiers: a global tier (queries between traces), and a local tier (between tree nodes).
When there are no transaction conflicts and no parallel readers to the same memory location, XCilk performs (at most) one XConflict query per memory access.

XCilk Performance Bound
In the special case of a computation with no transaction conflicts and no concurrent readers to a shared memory location, XCilk performs (at most) one O(1)-time XConflict query per memory access.
Maintaining the XConflict data structure introduces overhead of O(T1 / P + PT∞).
Therefore, the entire program runs in time O(T1 / P + PT∞).
Linear speedup if P = O(√(T1/T∞)), compared to P = O(T1/T∞) for normal Cilk.
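The two speedup ranges follow from asking when the work term T1/P dominates the other term in each bound. A short derivation, treating the constants hidden by the O-notation as 1 for simplicity:

```latex
\[
\frac{T_1}{P} \ge P\,T_\infty
\;\iff\; P^2 \le \frac{T_1}{T_\infty}
\;\iff\; P \le \sqrt{\frac{T_1}{T_\infty}}
\quad\Longrightarrow\quad
\frac{T_1}{P} + P\,T_\infty \le \frac{2\,T_1}{P} = O\!\left(\frac{T_1}{P}\right),
\]
\[
\frac{T_1}{P} \ge T_\infty
\;\iff\; P \le \frac{T_1}{T_\infty}
\quad\Longrightarrow\quad
\frac{T_1}{P} + T_\infty \le \frac{2\,T_1}{P} = O\!\left(\frac{T_1}{P}\right).
\]
```

The first line is the XCilk bound and the second is the bound for normal Cilk.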

Open Questions
Can we provide any performance guarantees on programs which are conflict-free but also allow parallel reads to the same location? What if there are conflicts?
Can we “garbage-collect” XCilk’s metadata (e.g., transaction ids) in a provably efficient manner?
Can we simplify the XCilk data structures in special but possibly “common” cases? For example, what if the nesting depth is bounded by d?