Transactional Memory: How to Perform Load Adaption in a Simple And Distributed Manner
Johannes Schneider, David Hasenfratz, Roger Wattenhofer

Without easy and efficient parallel programming methods… "computer science will become washing machine science."
- You probably know Moore's law: the transistor count doubles every 2 years.
- For about 5 years, this has mainly meant that the number of cores doubles.
- Thus every desktop PC (and in the future also every mobile phone) is basically a parallel computer.

How to handle access to shared data?
- Locks, monitors…
- Coarse-grained vs. fine-grained locking
  - Coarse grained: easy, but slow programs
  - Fine grained: demanding and time consuming, but fast programs
- Problems: difficult, error prone, composability…

Coarse-grained example (only 1 thread can execute; little or no parallelism):
  lock all data
  modify/use data
  unlock all data

Fine-grained example (lots of code, deadlocks…):
  Thread 1: lock A; lock B; modify/use A,B; lock C; modify/use A,B,C; unlock A; modify/use B,C; unlock B,C
  Thread 2: lock B; lock A; modify/use A,B; unlock A,B
  Deadlock! It can occur when Thread 1 holds A and waits for B while Thread 2 holds B and waits for A.

Transactional memory (TM): a possible solution
- The programmer writes:
    Begin transaction
    modify/use data
    End transaction
- Simple for the programmer
- Composable:
    Method A.x(): Begin transaction; B.y(); …; End transaction
    Method B.y(): Begin transaction; …; End transaction
- Idea from the database community
- Many TM systems (internally) still use locks, but the TM system (not the programmer) takes care of
  - performance
  - correctness (no deadlocks...)

Transactional memory systems
- If transactions modify
  - different data, everything is ok
  - the same data, conflicts arise that must be resolved
- Transactions might get delayed or aborted: this is the job of a contention manager
- A transaction keeps track of all modified values and restores them if it is aborted
- A transaction finishes successfully with a commit
- Only after the commit do other transactions notice its changes
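
One common way to realize the bookkeeping described above is an undo log. The following is a minimal Python sketch, not the mechanism of any particular TM system: the class name `Transaction`, the dict-based `memory`, and all method names are hypothetical, and locking (which would hide uncommitted writes from other transactions) is deliberately left out.

```python
class Transaction:
    """Toy transaction over a shared dict (hypothetical model, no locking)."""

    def __init__(self, memory):
        self.memory = memory   # shared state, assumed to be a plain dict
        self.undo_log = {}     # key -> value before this transaction's first write

    def write(self, key, value):
        # Record the original value only once, on the first write to this key.
        if key not in self.undo_log:
            self.undo_log[key] = self.memory[key]
        self.memory[key] = value

    def abort(self):
        # Restore every modified value, then forget the log.
        for key, old_value in self.undo_log.items():
            self.memory[key] = old_value
        self.undo_log.clear()

    def commit(self):
        # The changes stay in place; nothing is left to undo.
        self.undo_log.clear()


memory = {"A": 1, "B": 1}

t1 = Transaction(memory)
t1.write("A", 2)
t1.write("B", 2)
t1.abort()               # undoes both writes
after_abort = dict(memory)

t2 = Transaction(memory)
t2.write("A", 3)
t2.commit()              # the write to A stays
after_commit = dict(memory)
```

After the abort the dict is back at A=1, B=1; after the commit of the second transaction A is 3 and B is still 1.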

Conflicts: a contention manager decides
- Abort or delay a transaction, i.e. adapt load
- Distributed: each thread has its own manager
- Example, initially A=1, B=1
  - One transaction executes A:=2 and B:=2, the other executes A:=3, which causes a conflict
  - Abort (undo all changes, i.e. set A:=1) and restart (after a while)
  - Abort (set B:=1) and restart, OR wait and retry
- Delay to adapt load!
- A waiting transaction that holds resources can still cause conflicts, so waiting gives only partial load adaption
[Figure: two panels showing Trans. 1 and Trans. 2 with their managers (Manager 1, Manager 2) and the point of conflict.]

Prior work
- Contention managers [PODC03, PODC05, ISAAC09…]
  - Priority: according to work done, or independent of it
  - System load was not (explicitly) considered
- Load adaption (based on contention)
  - Estimate contention intensity CI [SPAA08]
    - If abort: $CI = a \cdot CI + (1-a)$, with parameter $a \in [0,1]$
    - If commit: $CI = a \cdot CI$
    - If $CI > b$ (parameter $b$), resort to a central scheduler
  - Keep a transaction queue per core [PODC08]
    - A central dispatcher assigns each transaction to a core, i.e. to its queue
    - Each core iteratively executes the transactions from its queue
    - If transaction A on core 1 is aborted due to B on core 2, then A is appended to the queue of core 2
- A central scheduler will become a bottleneck: the number of cores will increase and this will not be feasible
[Figure: transaction queues of Core 1 and Core 2 holding A, B, C, D, before and after B aborts A.]
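
The contention-intensity update quoted from [SPAA08] can be written out in a few lines. This is a hedged Python sketch: the function names and the default values a = 0.5 and b = 0.5 are assumptions chosen for illustration, not values taken from the slides or from [SPAA08].

```python
def update_ci(ci, aborted, a=0.5):
    """Return the new contention intensity after an abort or a commit.

    a is the smoothing parameter from the slide; 0.5 is an assumed default.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    if aborted:
        return a * ci + (1 - a)   # an abort pushes CI towards 1
    return a * ci                 # a commit pushes CI towards 0


def use_central_scheduler(ci, b=0.5):
    """True if CI exceeds the threshold b (0.5 is an assumed default)."""
    return ci > b


ci = 0.0
ci = update_ci(ci, aborted=True)    # 0.5
ci = update_ci(ci, aborted=True)    # 0.75
high = use_central_scheduler(ci)    # True, since 0.75 > 0.5
ci = update_ci(ci, aborted=False)   # 0.375
low = use_central_scheduler(ci)     # False
```

The estimate is an exponential moving average of the abort indicator, so a larger a makes it react more slowly to recent aborts and commits.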

This paper
- Theoretical analysis
- Decentralized (simple) approaches to load adaption based on contention
- Performance is hard to compare: different benchmarks, different hardware
- Contention manager
  - Resolves conflicts by using priorities of transactions
  - Adapts load by preventing transactions from running

Strategies
- Ignore: do not learn from conflicts
  - ImmediateRestart
- Stay real: remember faced conflicts
  - SerializeFacedConflicts: do not schedule prior conflicting transactions concurrently
- Be cautious: assume additional conflicts
  - SerializeAll: all transactions in a subgraph are assumed to conflict
- Example conflict graph: A conflicted with C, D conflicted with B
[Figure: conflict graphs on the transactions A, B, C, D, including the graph assumed by SerializeAll.]
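
The difference between remembering faced conflicts and assuming additional ones can be made concrete on a small conflict graph. The sketch below is illustrative Python, not the authors' implementation: the function names are hypothetical, and "subgraph" is read here as "connected component of the conflict graph", which is an assumption.

```python
from itertools import combinations


def serialize_faced_conflicts(conflicts):
    """Pairs kept apart: exactly the conflicts that were observed."""
    return {frozenset(pair) for pair in conflicts}


def serialize_all(conflicts):
    """Pairs kept apart: every pair inside a connected component (assumed reading)."""
    neighbours = {}
    for u, v in conflicts:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    seen, pairs = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search collects the component containing `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        pairs |= {frozenset(p) for p in combinations(sorted(component), 2)}
    return pairs


# Conflicts from the slide: A conflicted with C, D conflicted with B.
slide_conflicts = [("A", "C"), ("D", "B")]

# Hypothetical extra conflict C-D that links the two pairs into one chain.
chain_conflicts = [("A", "C"), ("C", "D"), ("D", "B")]

faced = serialize_faced_conflicts(chain_conflicts)   # 3 pairs
cautious = serialize_all(chain_conflicts)            # all 6 pairs of A, B, C, D
```

On the slide's two conflicts both functions return the same two pairs. On the chain, the cautious variant additionally keeps A and B apart although they never conflicted.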

Load Adaption Strategies
- AbortBackoff
  - If aborted, wait for a random time in $[0, 2^{\#\text{aborts}}]$
  - Priority = number of aborts: a transaction that was aborted more often has higher chances to commit
- Who wins a conflict? 2 strategies
  - Estimate the work done: the longer running transaction wins
  - Unrelated to work done: give a random priority (on startup)
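
The AbortBackoff rule above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class and function names are hypothetical, the waiting time is drawn uniformly (the slide only says "random"), the abort counter is incremented before the draw, and ties in priority go to the first argument.

```python
import random


class AbortBackoff:
    """Per-transaction state for the backoff rule (hypothetical sketch)."""

    def __init__(self, rng=None):
        self.aborts = 0
        self.rng = rng if rng is not None else random.Random()

    def on_abort(self):
        """Count the abort and return a waiting time from [0, 2**aborts]."""
        self.aborts += 1
        return self.rng.uniform(0, 2 ** self.aborts)  # uniform draw: an assumption

    @property
    def priority(self):
        # Priority equals the number of aborts suffered so far.
        return self.aborts


def winner(first, second):
    """The transaction with the higher priority wins; ties go to `first` (assumed)."""
    return first if first.priority >= second.priority else second


often_aborted = AbortBackoff(random.Random(0))
delay_1 = often_aborted.on_abort()   # somewhere in [0, 2]
delay_2 = often_aborted.on_abort()   # somewhere in [0, 4]

fresh = AbortBackoff()
survivor = winner(fresh, often_aborted)   # the transaction with 2 aborts wins
```

Doubling the interval after every abort spreads restarts over a longer time, which is how the waiting adapts the load.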

Theory Part - Model
- n transactions (and threads) start concurrently on n cores
- Transaction: sequence of operations
  - An operation takes 1 time unit
  - The duration (number of operations) $t_T$ is fixed (up to a constant)
  - 2 types of operations
    - Write = modify a (shared) resource and lock it until commit
    - Compute/abort/commit
- A conflict arises if transaction A wants to lock a resource that is already locked by B
- Ignore the overhead of load adaption (remembering transactions, scheduling…)
- Postponing a transaction means preempting a core
- Look at several scenarios: moderate and substantial parallelism
[Figure: transactions A, B, …, Z assigned to Core 1, Core 2, …, Core n.]

Moderate parallelism
- Shared counter: conflicts directly after transaction start
- Linked list: conflicts at arbitrary time
- Expected time span until all transactions committed
[Table: rows ImmediateRestart, AbortBackoff, SerializeFacedConflicts, SerializeAll; columns Counter, List; entries given in terms of the transaction run time $t_T$ and the number of transactions n.]
- Speed-up log n (at best)
- Counter: only sequential execution is possible, so load adaption should be very beneficial!
- List: a transaction goes through the list and remembers all read objects, so transactions cannot commit concurrently, only directly after each other
- Lower bound: $n \cdot t_T$
- Immediately restarting causes a penalty…
- Serializing everything is best
- Aborting seems not too good (an analysis is given)

Substantial parallelism
- Worst case: the conflict graph is a d-ary tree of logarithmic height
- Exponential gap in the worst case between SerializeAll and the others
[Figure: conflict tree on the transactions T1, T2, T3, T4, T5, …]
[Table: time until transactions committed; rows ImmediateRestart, AbortBackoff, SerializeFacedConflicts, SerializeAll.]
- Load adaption is not really helpful here. Can load adaption harm?
- AbortBackoff is pretty stable; serializing faced conflicts seems even better…

Practical investigation
- Remembering conflicts causes too much overhead: good for analysis but not for implementation
- QuickAdapter
  - Serializes transactions
  - Each core has a "waiting" flag
  - If aborted: set the flag and wait until the flag is unset or a fixed time has passed
  - If commit: unset some flag
- AbortBackoff
- (Also considered some variants)
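
The flag mechanism of QuickAdapter can be modelled in a few lines. The following Python sketch is a single-threaded toy model, not the evaluated implementation: the class and method names are hypothetical, the timeout ("after a fixed time") is omitted, and since the slide only says "some flag", releasing the lowest-numbered waiting core is an assumption.

```python
class QuickAdapter:
    """Toy model of per-core waiting flags (hypothetical sketch, no timeout)."""

    def __init__(self, cores):
        self.waiting = [False] * cores   # one "waiting" flag per core

    def on_abort(self, core):
        # The aborted core sets its flag and stops running transactions.
        self.waiting[core] = True

    def on_commit(self, core):
        """Unset one set flag and return that core's index, or None if no core waits."""
        # Assumption: release the lowest-numbered waiting core.
        for other, flag in enumerate(self.waiting):
            if flag:
                self.waiting[other] = False
                return other
        return None

    def may_run(self, core):
        return not self.waiting[core]


adapter = QuickAdapter(cores=4)
adapter.on_abort(2)
blocked = not adapter.may_run(2)   # core 2 waits
released = adapter.on_commit(0)    # a commit on core 0 releases core 2
nothing = adapter.on_commit(0)     # no core is waiting any more
```

Each abort removes one core from the running set and each commit returns at most one, so the number of active cores follows the observed contention without any stored conflict history.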

Practical investigation
- Evaluation on a 16-core machine (4 × quad-core AMD, 2 GHz)
- DSTM2 system with visible readers
- Six benchmarks
  - Little parallelism: Shared counter, Sorted List (accessed objects not released), Listcounter
  - Considerable parallelism: Red Black Tree, LFUCache, RandomAccessArray
- Compare new load adaption policies to existing contention managers

Discussion
- Hard to keep maximum throughput, also in [SPAA08, PODC08]
  - Even without conflicts
- Improvement for 1 benchmark worsens another
- On average better than schemes without load adaption
Notes:
- Visible readers cause cache misses
- Load on the memory system
- [PODC08] mainly uses the list benchmark; one cannot see much due to the scaling of the figures…
- The (praised) Polka contention manager [PODC05] performs badly, also in [Ansari09]

Conclusion
- Simple and distributed load adaption strategies
- Theory: (for now) constants and parameters matter a lot
- Practice: hard to keep load at peak for all usage patterns
  - State-of-the-art desktops have only 4 cores
  - Parameters, e.g. the initial backoff, can be tuned…
  - Effects beyond conflicts seem to matter, e.g. cache line misses
- Load adaption is more beneficial than the decision whom to abort

Thanks for your attention!
Questions?

Analysis: AbortBackoff for counter
- Recall: if aborted, wait for a random time in $[0, 2^{\#\text{aborts}}]$
- Assume $\#\text{aborts} \sim \log(n t_T) + x$ (for some $x$)
- Define $a(x)$ := fraction of active nodes
  - $a(0) = 1$ (after time $\sim 2^{\log(n t_T)} = n t_T$ a constant fraction is still active)
- Chance of a conflict for the interval $[0, 2^{\#\text{aborts}}] = [0, 2^{\log(n t_T) + x}]$: $\sim a(x)\, n t_T / 2^{\log(n t_T) + x} = a(x)/2^{x}$
- $a(x+1) = a(x)/2^{x} = 1/2^{\sum_{i=0}^{x} i} \sim 1/2^{x^2}$
- $a(\sqrt{\log n}) = 1/2^{(\sqrt{\log n})^2} = 1/n$
- $\sum_{i=0}^{\log(n t_T) + \sqrt{\log n}} (\text{length of interval } i) = \sum_{i=0}^{\log(n t_T) + \sqrt{\log n}} 2^{i} \approx n t_T \, 2^{\sqrt{\log n} + 1}$
[Figure: transactions T1, T2, T3 with $a(x)\, n t_T = (3/n)\, n t_T = 3 t_T$.]

- Wait freedom (strongest guarantee): all threads (transactions) make progress in a finite number of steps
- Lock freedom: one thread makes progress in a finite number of steps
- Obstruction freedom (weakest): a thread makes progress in a finite number of steps in the absence of contention (no conflicts, no shared data)
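
The recurrence from the AbortBackoff analysis, a(x+1) = a(x)/2^x with a(0) = 1, can be checked numerically against its closed form 1/2^(0+1+…+(x-1)). The short Python sketch below does only that; the function names are hypothetical, and it illustrates the recurrence, not the full probabilistic argument.

```python
def active_fraction(x):
    """Iterate a(i+1) = a(i) / 2**i starting from a(0) = 1."""
    a = 1.0
    for i in range(x):
        a /= 2 ** i
    return a


def closed_form(x):
    """1 / 2**(0 + 1 + ... + (x - 1)) = 2**(-x * (x - 1) / 2)."""
    return 2.0 ** (-(x * (x - 1) // 2))


values = [active_fraction(x) for x in range(6)]
# a(0)..a(5) = 1, 1, 1/2, 1/8, 1/64, 1/1024
```

The exponent grows quadratically in x, which is why about sqrt(log n) further doublings of the backoff interval are enough to bring the active fraction down to roughly 1/n.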