1 Optimizing memory transactions
Tim Harris, Maurice Herlihy, Jim Larus, Virendra Marathe, Simon Marlow, Simon Peyton Jones, Mark Plesko, Avi Shinnar, Satnam Singh, David Tarditi
2 Moore’s law
[Chart: Intel processor transistor counts, 1970–2005, from the 4004 through the 8086, 286, Intel 386, Intel 486, Pentium, Pentium II, Pentium 4, and Itanium 2, rising from roughly 1,000 to 1,000,000,000; annotated with the successive sources of performance gains: pipelined execution, superscalar processors, data-parallel extensions, increasing ILP through speculation, increases in clock frequency, increases in ILP, and parallelism. Source: http://www.intel.com/technology/mooreslaw/index.htm]
3 [Figure: source viewed 15 Nov 2006]
4 Making use of parallel hardware is hard
How do we share data between concurrent threads?
How do we find things to run in parallel?
Both of these are difficult problems; in this talk I’ll focus on the first.
Server workloads often provide a natural source of parallelism: deal with different clients concurrently.
Traditional HPC is reasonably well understood too: problems can be partitioned, or parallel building blocks composed.
Programmers aren’t good at thinking about thread interleavings… and locks provide a complicated and non-composable solution.
5 Overview
Introduction
Problems with locks
Controlling concurrent access
Research challenges
Conclusions
6 A deceptively simple problem: a double-ended queue
[Diagram: a deque holding several items]
PushRight / PopRight work on the right-hand end; PushLeft / PopLeft work on the left-hand end.
Make it thread-safe by adding a lock?
– Needlessly serializes operations when the queue is long.
One lock for each end?
– Much more complex when the queue is short (0, 1, or 2 items).
7 Locks are broken
Races: due to forgotten locks.
Deadlock: locks acquired in the “wrong” order.
Lost wakeups: forgotten notify to a condition variable.
Error recovery is tricky: need to restore invariants and release locks in exception handlers.
Simplicity vs. scalability tension.
Lack of progress guarantees.
…but worst of all, locks do not compose.
8 Atomic memory transactions
To a first approximation, just write the sequential code and wrap atomic around it.
All-or-nothing semantics: atomic commit.
An atomic block executes in isolation.
Cannot deadlock (there are no locks!).
Atomicity makes error recovery easy (e.g. an exception thrown inside the Get code).

Item Get() {
  atomic { ... sequential get code ... }
}
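For a concrete counterpart, GHC’s stm package (the STM-Haskell system mentioned near the end of this deck) offers the same model. A minimal runnable sketch, assuming for illustration that a queue is just a TVar holding a list:

import Control.Concurrent.STM

-- A queue modelled, for illustration only, as a TVar holding a list.
type Queue a = TVar [a]

-- The whole body runs atomically and in isolation; if an exception is
-- thrown inside, the transaction's effects are discarded.
getItem :: Queue a -> IO (Maybe a)
getItem q = atomically $ do
  xs <- readTVar q
  case xs of
    []       -> return Nothing
    (x:rest) -> do writeTVar q rest
                   return (Just x)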
9 Atomic blocks compose (locks do not)
Guarantees to get two consecutive items.
Get() is unchanged (the nested atomic is a no-op).
Cannot be achieved with locks (except by breaking the Get() abstraction).

void GetTwo() {
  atomic {
    i1 = Get();
    i2 = Get();
  }
  DoSomething( i1, i2 );
}
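Composition works the same way in STM-Haskell: two gets sequenced inside one atomically form a single transaction. A sketch under the same list-backed-queue assumption (the blocking retry used here is introduced properly on the next slide):

import Control.Concurrent.STM

-- Blocking get: waits, rather than failing, when the queue is empty.
get :: TVar [a] -> STM a
get q = do
  xs <- readTVar q
  case xs of
    []       -> retry
    (x:rest) -> do writeTVar q rest
                   return x

-- One transaction: the two items are consecutive, and no other thread
-- observes the intermediate state between the two gets.
getTwo :: TVar [a] -> IO (a, a)
getTwo q = atomically $ do
  i1 <- get q
  i2 <- get q
  return (i1, i2)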
10 Blocking: how does Get() wait for data?
retry means “abandon execution of the atomic block and re-run it (when there is a chance it’ll complete)”.
No lost wake-ups.
No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue.

Item Get() {
  atomic {
    if (size == 0) {
      retry;
    } else {
      ... remove item from queue ...
    }
  }
}
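retry exists under the same name in STM-Haskell, and the runtime re-runs the transaction only when one of the TVars it read has changed, so there is no busy-waiting and no lost wake-up. A small self-contained demo under the list-backed-queue assumption:

import Control.Concurrent
import Control.Concurrent.STM

main :: IO ()
main = do
  q <- newTVarIO ([] :: [Int])
  _ <- forkIO $ do
         threadDelay 100000                    -- producer arrives late
         atomically (modifyTVar q (++ [42]))
  x <- atomically $ do                         -- blocks, via retry, until q is non-empty
         xs <- readTVar q
         case xs of
           []       -> retry                   -- re-run only when q changes
           (y:rest) -> do writeTVar q rest
                          return y
  print x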
11 Choice: waiting for either of two queues
do {...this...} orelse {...that...} tries to run “this”.
If “this” retries, it runs “that” instead.
If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue.

[Diagram: queues Q1 and Q2 feeding queue R]

void GetEither() {
  atomic {
    do { i = Q1.Get(); }
    orelse { i = Q2.Get(); }
    R.Put( i );
  }
}
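STM-Haskell spells this combinator orElse, with exactly the left-biased semantics described above; a sketch of GetEither under the same assumptions:

import Control.Concurrent.STM

get :: TVar [a] -> STM a
get q = do
  xs <- readTVar q
  case xs of
    []       -> retry
    (x:rest) -> do writeTVar q rest
                   return x

getEither :: TVar [a] -> TVar [a] -> TVar [a] -> IO ()
getEither q1 q2 r = atomically $ do
  i <- get q1 `orElse` get q2   -- if 'get q1' retries, try q2; if both retry, wait
  modifyTVar r (++ [i])         -- R.Put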
12 Why is ‘orelse’ left-biased? Blocking vs. polling:
No need to write separate blocking and polling methods.
Can extend to deal with timeouts by using a ‘wakeup’ message on a shared memory buffer.

Item TryGet() {
  atomic {
    do { return Get(); }
    orelse { return null; }
  }
}
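The blocking/polling point is visible in STM-Haskell too: because orElse only tries the right branch when the left one retries, a polling TryGet is one line on top of the blocking Get:

import Control.Concurrent.STM

tryGet :: TVar [a] -> IO (Maybe a)
tryGet q = atomically ((Just <$> get q) `orElse` return Nothing)  -- poll: never blocks
  where
    get tv = do
      xs <- readTVar tv
      case xs of
        []       -> retry
        (x:rest) -> do writeTVar tv rest
                       return x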
13 Choice: a server waiting on two clients

atomic {
  Item i;
  do { i = q1.PopLeft(3); }      // try this first: if it gets an item without hitting retry, we’re done
  orelse { i = q2.PopLeft(3); }  // if q1.PopLeft can’t finish, undo its side effects and try q2.PopLeft instead
  // Process item
}

Unlike WaitAny or Select, we can compose code using retry – e.g. wait for a shutdown request, or for client requests:

atomic {
  do {
    WaitForShutdownRequest();
    // Process shutdown request
  } orelse {
    // ... as above: wait for an item from either client
  }
}
14 Invariants
Suppose a queue should always have at least one element in it; how can we check this?
– Find all the places where queues are modified and put in a check? The check is duplicated (and possibly forgotten).
– Define ‘NonEmptyQueue’, which wraps the ordinary queue operations?
The invariant may be associated with multiple data structures (“queue X contains fewer elements than queue Y”).
What if different queues have different constraints?
15 Invariants
Define invariants that must be preserved by successful atomic blocks.

Queue NewNeverEmptyQueue(Item i) {
  atomic {
    Queue q = new Queue();
    q.Put(i);
    always { !q.IsEmpty(); }   // computation to check the invariant: must be true
                               // when proposed, and preserved by top-level atomic blocks
    return q;
  }
}

Queue q = NewNeverEmptyQueue(i);
atomic { q.Get(); }   // invariant failure: would leave q empty
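STM-Haskell grew an invariant mechanism from this same line of work (always/alwaysSucceeds), but it has been removed from recent versions of GHC’s stm package, so a current sketch has to check the invariant by hand at the end of each mutating transaction; throwSTM aborts the transaction and discards its effects:

import Control.Concurrent.STM
import Control.Exception (ErrorCall(..))

-- Manual stand-in for 'always { !q.IsEmpty(); }': abort the whole
-- transaction, discarding its effects, if the queue would be left empty.
assertNonEmpty :: TVar [a] -> STM ()
assertNonEmpty q = do
  xs <- readTVar q
  if null xs
    then throwSTM (ErrorCall "invariant failure: queue would be empty")
    else return ()

getChecked :: TVar [a] -> IO a
getChecked q = atomically $ do
  xs <- readTVar q
  case xs of
    []       -> retry
    (x:rest) -> do writeTVar q rest
                   assertNonEmpty q   -- fails when this Get empties the queue
                   return x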
16 Summary so far
With locks, you think about:
– Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when?
– None of this is explicit in the source code.
With atomic blocks, you think about:
– Purely sequential reasoning within a block, which is dramatically easier.
– What are the invariants (e.g. the tree is balanced)?
– Each atomic block maintains the invariants.
– A much easier setting for static analysis tools.
17 Summary so far
Atomic blocks (atomic, retry, orElse, always) are a real step forward.
It’s like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated.
Not a silver bullet:
– you can still write buggy programs;
– concurrent programs are still harder to write than sequential ones;
– aimed at shared memory.
But the improvement remains substantial.
18 Can they be implemented with acceptable performance?
Direct-update STM (but see the programming-model implications later):
– Locking to allow transactions to make updates in place in the heap.
– Log overwritten values (a minimal sketch follows this list).
– Reads naturally see earlier writes that the transaction has made (with either optimistic or pessimistic concurrency control).
– Makes successful commit operations faster, at the cost of extra work on contention or when a transaction aborts.
Compiler integration:
– Decompose the transactional memory operations into primitives.
– Expose the primitives to compiler optimization (e.g. to hoist concurrency-control operations out of a loop).
Runtime system integration:
– Integration with the garbage collector and runtime system components, to scale to atomic blocks containing 100M accesses.
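A minimal sketch of the undo-log machinery behind the direct-update scheme, with a heap field modelled as an IORef; the names (UndoEntry, writeField, rollback) are illustrative, not the paper’s API:

import Data.IORef

-- One undo record for each field the transaction has written.
data UndoEntry = UndoEntry { slot :: IORef Int, oldValue :: Int }

-- Log the value about to be overwritten, then update in place, so later
-- reads inside the transaction naturally see the new value.
writeField :: IORef [UndoEntry] -> IORef Int -> Int -> IO ()
writeField undoLog field new = do
  old <- readIORef field
  modifyIORef undoLog (UndoEntry field old :)   -- log is kept newest-first
  writeIORef field new

-- On abort, replay the newest-first log, restoring overwritten values
-- in reverse chronological order.
rollback :: IORef [UndoEntry] -> IO ()
rollback undoLog =
  readIORef undoLog >>= mapM_ (\e -> writeIORef (slot e) (oldValue e))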
19 Example: contention between transactions

Thread T1:
int t = 0;
atomic {
  t += c1.val;
  t += c2.val;
}

Thread T2:
atomic {
  t = c1.val;
  t ++;
  c1.val = t;
}

Initially object c1 is at version 100 with val = 10, object c2 is at version 200 with val = 40, and both transaction logs are empty.

20 T1 reads from c1: it logs that it saw version 100.
21 T2 also reads from c1: it logs that it saw version 100.
22 Suppose T1 now reads from c2, seeing it at version 200.
23 Before updating c1, thread T2 must lock it, recording the old version number in its log (lock: c1, 100).
24 Before updating c1.val, T2 must log the data it is about to overwrite (c1.val = 10); after logging the old value, it makes its update in place, setting c1.val = 11.
25 T2 commits: it checks that the version it locked matches the version it previously read, then unlocks the object, installing the new version number 101.
26 T1 now attempts to commit and checks that the versions it read are still up to date. Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
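The abort in the final step comes down to a version comparison over T1’s read log; a schematic rendering of that commit-time check (illustrative names, not the paper’s API):

-- Each read-log entry pairs the version seen at OpenForRead with the
-- version found when the transaction tries to commit.
data ReadEntry = ReadEntry { versionAtRead :: Int, versionAtCommit :: Int }

-- Commit validation: every object read must still be at its logged version.
validateReadLog :: [ReadEntry] -> Bool
validateReadLog = all (\e -> versionAtRead e == versionAtCommit e)

-- T1's log above: c1 read at 100 but now at 101; c2 read and still at 200.
t1Aborts :: Bool
t1Aborts = not (validateReadLog [ReadEntry 100 101, ReadEntry 200 200])  -- True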
27 Compiler integration
We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL):
– OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples)
– OpenForUpdate – before the first time we update an object
– LogOldValue – before the first time we write to a given field

Source code:
atomic {
  ...
  t += n.value;
  n = n.next;
  ...
}

Basic intermediate code:
OpenForRead(n);
t = n.value;
OpenForRead(n);
n = n.next;

Optimized intermediate code:
OpenForRead(n);
t = n.value;
n = n.next;
28 Compiler integration – avoiding upgrades

Source code:
atomic {
  ...
  c1.val ++;
  ...
}

Compiler’s intermediate code:
OpenForRead(c1);
temp1 = c1.val;
temp1 ++;
OpenForUpdate(c1);
LogOldValue(&c1.val);
c1.val = temp1;

Optimized intermediate code:
OpenForUpdate(c1);
temp1 = c1.val;
temp1 ++;
LogOldValue(&c1.val);
c1.val = temp1;
29 Compiler integration – other optimizations
Hoist OpenFor* and Log* operations out of loops.
Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread-local).
Move OpenFor* operations from methods to their callers.
Further decompose operations to allow logging-space checks to be combined.
Expose the OpenFor* and Log* implementations to inlining and further optimization.
30 Results: concurrency control overhead
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535. Normalised execution time, sequential baseline = 1.00x:
– Coarse-grained locking: 1.13x
– Fine-grained locking: 2.57x
– Direct-update STM: 2.04x
– Direct-update STM + compiler integration: 1.46x
– Buffered memory accesses, non-blocking commit: 5.69x
[Chart annotation: “Scalable to multicore”]
31 Results: scalability
[Chart: microseconds per operation against number of threads, for coarse-grained locking, fine-grained locking, direct-update STM + compiler integration, and buffered-update non-blocking STM]
32 Results: long-running tests
[Chart: normalised execution time on tree, go, skip, merge-sort, and xlisp for the original application (no tx), direct-update STM, run-time filtering, and compile-time optimizations; off-scale bars are labelled 10.8, 73, and 162]
33 Research challenges
34 The transactional elephant
How do all these issues fit into the bigger picture? (They are just the trunk, as it were, of the concurrency elephant.)
– Semantics of atomic blocks: progress guarantees, forms of nesting, interaction outside the transaction
– Other applications of transactional memory, e.g. speculation
– Software transactional memory algorithms and APIs
– Compiling and optimizing atomic blocks to STM APIs
– Hardware transactional memory
– Hardware acceleration of software transactional memory
– Integration with other resources: transactional I/O etc.
35 Semantics of atomic blocks
What does it mean for a program using atomic blocks to be correctly synchronized?
What guarantees are lost or retained in programs which are not correctly synchronized?
What tools and analysis techniques can we develop to identify problems, or to prove correct synchronization?
36 Tempting goal: “strong” atomicity
“Atomic means atomic” – as if the block runs with other threads suspended and outside any atomic blocks.
Problems:
– Is it practical to implement?
– Is it actually useful?

atomic { x ++; }   // one thread
x ++;              // another thread, unsynchronized

The atomic action can run at any point during the unsynchronized ‘x++’.
37 One approach: single-lock semantics
Think of atomic blocks as acquiring and releasing a single process-wide lock.
If the resulting single-lock program is correctly synchronized, then so is the original.

atomic { x ++; }    =>    lock (TxLock) { x ++; }
atomic { y ++; }    =>    lock (TxLock) { y ++; }
38 One approach: single-lock semantics

Thread 1: atomic { x ++; }                 Thread 2: atomic { x ++; }
Correct: consistently protected by atomic blocks.

Thread 1: atomic { lock (x) { x ++; } }    Thread 2: lock (x) { x ++; }
Correct: consistently protected by x’s lock.

Thread 1: atomic { x ++; }                 Thread 2: x ++;
Not correctly synchronized: race on x.

Thread 1: atomic { x ++; }                 Thread 2: lock (x) { x ++; }
Not correctly synchronized: no interaction between TxLock and x’s lock.
39 Single-lock semantics & privatization
This is correctly synchronized under single-lock semantics (initially x=0, x_shared=true):

// Thread 1
atomic {   // Tx1
  x_shared = false;
}
x ++;

// Thread 2
atomic {   // Tx2
  if (x_shared) {
    x ++;
  }
}

Single-lock semantics => x=1 or x=2 after both threads’ actions.
Problems arise both with optimistic concurrency control (either in-place or deferred updates) and with non-strict pessimistic concurrency control.
Q: is this a problem in the semantics or in the implementation?
40 Segregated-heap semantics
Many problems are simplified if we distinguish tx-accessible and non-tx-accessible data:
– tx-accessible data is always and only accessed within transactions;
– non-tx-accessible data is never accessed within transactions.
This is the model we provide in STM-Haskell:
– A TVar a is a tx-mutable cell holding a value of type ‘a’.
– A computation of type STM a is only allowed to update TVars, not other mutable cells.
– A computation of type IO a can mutate other cells, and TVars only by entering atomic blocks.
…but there is a vast annotation or expressiveness burden in trying to control effects to the same extent in other languages.
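The segregation shows up directly in the types. A small runnable sketch with GHC’s stm package: the STM action below can touch only TVars, mutating the IORef inside it would be a type error, and atomically is the sole bridge from IO into STM:

import Control.Concurrent.STM
import Data.IORef

deposit :: TVar Int -> Int -> STM ()    -- STM code can mutate only TVars
deposit acct n = modifyTVar' acct (+ n)

main :: IO ()
main = do
  acct <- newTVarIO 0
  ref  <- newIORef (0 :: Int)           -- ordinary mutable cell: IO only
  atomically (deposit acct 10)          -- entering the atomic block
  writeIORef ref 1                      -- fine in IO; illegal inside 'deposit'
  atomically (readTVar acct) >>= print  -- prints 10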
41 Conclusions
42 Conclusions
Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures.
– Only one part of the concurrency landscape.
– But we believe no one solution will solve all the challenges there.
Our experience with compiler integration shows that a pure software implementation can perform well for some workloads.
– We still need a better understanding of realistic workloads.
– We need this to inform the design of the semantics of atomic blocks, as well as to guide the optimization of their implementation.