Transactional Memory: How to Perform Load Adaption

Presentation transcript:

Transactional Memory: How to Perform Load Adaption in a Simple And Distributed Manner
Johannes Schneider, David Hasenfratz, Roger Wattenhofer

Without easy and efficient parallel programming methods… “computer science will become washing machine science.”
You probably know Moore’s law: the transistor count doubles every 2 years.
For about 5 years, this has mainly meant that the number of cores is doubling.
Thus every desktop PC (and in the future also every mobile phone) is basically a parallel computer.

How to handle access to shared data?
Locks, monitors…
Coarse-grained vs. fine-grained locking:
  Coarse-grained: easy to program, but slow programs.
  Fine-grained: demanding and time-consuming to program, but fast programs.
Problems: difficult, error-prone, composability…
Coarse-grained example (both threads): lock all data; modify/use data; unlock all data.
  Only 1 thread can execute: little (no) parallelism.
Fine-grained example:
  Thread 1: lock A; lock B; modify/use A,B; lock C; modify/use A,B,C; unlock A; modify/use B,C; unlock B,C.
  Thread 2: lock B; lock A; modify/use A,B; unlock A,B.
  Deadlock! Lots of code, deadlocks…
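The deadlock in the fine-grained example comes from the two threads taking A and B in opposite order. A minimal Python sketch of that situation follows; the function name, the events, and the 0.1 s timeout are illustrative assumptions, and timeouts are used only so that the demonstration terminates instead of hanging.

```python
import threading


def demo_opposite_lock_order():
    """Thread 1 holds A and wants B; thread 2 holds B and wants A."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    b_held = threading.Event()    # thread 2 signals that it holds B
    t1_tried = threading.Event()  # thread 1 signals that it tried to take B
    outcome = {}

    def thread_2():
        with lock_b:
            b_held.set()
            # Without the timeout this acquire would block forever.
            got_a = lock_a.acquire(timeout=0.1)
            outcome["t2_got_a"] = got_a
            if got_a:
                lock_a.release()
            t1_tried.wait()  # keep holding B until thread 1 has tried it

    with lock_a:  # the calling thread plays the role of thread 1
        worker = threading.Thread(target=thread_2)
        worker.start()
        b_held.wait()
        got_b = lock_b.acquire(blocking=False)
        outcome["t1_got_b"] = got_b
        if got_b:
            lock_b.release()
        t1_tried.set()
        worker.join()
    return outcome
```

Neither thread obtains its second lock, so with blocking acquires both would wait on each other indefinitely.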

Transactional memory (TM): a possible solution
Begin transaction; modify/use data; End transaction.
Simple for the programmer.
Composable.
Idea from the database community.
Many TM systems (internally) still use locks, but the TM system (not the programmer) takes care of:
  Performance
  Correctness (no deadlocks...)
Composability example:
  Method A.x(): Begin transaction; B.y(); …; End transaction.
  Method B.y(): Begin transaction; …; End transaction.

Transactional memory systems
If transactions modify
  different data, everything is OK;
  the same data, conflicts arise that must be resolved.
Transactions might get delayed or aborted: this is the job of a contention manager.
A transaction keeps track of all modified values.
It restores all values if it is aborted.
A transaction successfully finishes with a commit.
Only after the commit do other transactions notice its changes.
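The bookkeeping described on this slide (track modified values, restore them on abort, finish with a commit) can be sketched with an undo log. This is a single-threaded illustration under stated assumptions: the class name `UndoLogTransaction` and the dictionary used as shared memory are hypothetical, and the locking that hides uncommitted writes from other transactions is left out.

```python
class UndoLogTransaction:
    """Toy transaction that writes in place and remembers old values."""

    def __init__(self, memory):
        self.memory = memory  # shared data, modeled as a dict (assumption)
        self.undo_log = {}    # key -> value before the first write

    def write(self, key, value):
        # Record the original value only once per key.
        if key not in self.undo_log:
            self.undo_log[key] = self.memory[key]
        self.memory[key] = value

    def abort(self):
        # Restore every value the transaction modified.
        for key, old_value in self.undo_log.items():
            self.memory[key] = old_value
        self.undo_log.clear()

    def commit(self):
        # Changes become permanent; nothing is left to undo.
        self.undo_log.clear()


memory = {"A": 1, "B": 1}
t1 = UndoLogTransaction(memory)
t1.write("A", 2)
t1.abort()       # A is set back to 1
t2 = UndoLogTransaction(memory)
t2.write("B", 2)
t2.commit()      # B stays 2
```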

Conflicts: a contention manager decides
Abort or delay a transaction, i.e. adapt load.
Distributed: each thread has its own manager.
Example. Initially: A=1, B=1.
  Trans. 1 (Manager 1): … A:=2 …
  Trans. 2 (Manager 2): B:=2 … A:=3
  Both write A: conflict.
Options:
  Trans. 1: abort (undo all changes, i.e. set A:=1) and restart (after a while).
  Trans. 2: abort (set B:=1) and restart, OR wait and retry.
Delay to adapt load!
If a waiting transaction holds resources, it can still cause conflicts -> only partial load adaption.

Prior work
Contention managers [PODC03, PODC05, ISAAC09…]
  Priority: according to work done, or independent of it.
  System load was not (explicitly) considered.
Load adaption (based on contention)
  Estimate contention intensity CI [SPAA08]
    If abort: CI = a * CI + (1 - a), with parameter a in [0,1].
    If commit: CI = a * CI.
    If CI > parameter b, then resort to a central scheduler.
    The central scheduler will become a bottleneck.
  Keep a transaction queue per core [PODC08]
    A central dispatcher assigns transactions to a core, i.e. its queue.
    Each core iteratively executes transactions from its queue.
    If transaction A on core 1 is aborted due to B on core 2, then A is appended to the queue of core 2.
    The number of cores will increase and this will not be feasible.
[Figure: transaction queues of Core 1 and Core 2 holding A, B, C, D, before and after B aborts A.]
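The contention-intensity rule from [SPAA08] as quoted on this slide is an exponential moving average over abort/commit events. A small sketch follows; the function names are hypothetical and the values chosen for a and b are arbitrary illustration values, not taken from the cited work.

```python
def update_contention_intensity(ci, aborted, a):
    """Apply CI = a*CI + (1 - a) on abort and CI = a*CI on commit."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("parameter a must lie in [0, 1]")
    return a * ci + (1.0 - a) if aborted else a * ci


def use_central_scheduler(ci, b):
    """A thread resorts to the central scheduler once CI exceeds b."""
    return ci > b


ci = 0.0
history = []
for aborted in (True, True, False):  # two aborts, then one commit
    ci = update_contention_intensity(ci, aborted, a=0.5)
    history.append(ci)
# history is [0.5, 0.75, 0.375]
```

With a = 0.5 and b = 0.7, only the state after the second abort (CI = 0.75) would send the thread to the central scheduler.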

This paper
Theoretical analysis
  Performance is hard to compare in practice: different benchmarks, different hardware.
Decentralized (simple) approaches to load adaption based on contention
Contention manager
  Resolves conflicts by using priorities of transactions.
  Adapts load by preventing transactions from running.

Strategies
Conflict graph example: A conflicted with C, D conflicted with B.
Ignore: do not learn from conflicts.
  ImmediateRestart
Stay real: remember faced conflicts.
  SerializeFacedConflicts: do not schedule prior conflicting transactions concurrently.
Be cautious: assume additional conflicts.
  SerializeAll: all transactions in a subgraph are assumed to conflict.
[Figure: conflict graphs over A, B, C, D illustrating SerializeAll.]
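The three strategies differ in how much of the observed conflict graph they use when deciding whether two transactions may run concurrently. The sketch below is one possible reading, not the authors' implementation: it assumes that "subgraph" means a connected component of the conflict graph, and all function names are hypothetical.

```python
def build_conflict_graph(conflicts):
    """Undirected adjacency sets from a list of observed conflict pairs."""
    graph = {}
    for u, v in conflicts:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def connected_component(graph, start):
    """All transactions reachable from start via observed conflicts."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def may_run_concurrently(strategy, t1, t2, conflicts):
    graph = build_conflict_graph(conflicts)
    if strategy == "ImmediateRestart":         # ignore past conflicts
        return True
    if strategy == "SerializeFacedConflicts":  # only direct past conflicts
        return t2 not in graph.get(t1, set())
    if strategy == "SerializeAll":             # whole component (assumption)
        return t2 not in connected_component(graph, t1)
    raise ValueError(f"unknown strategy: {strategy}")


observed = [("A", "C"), ("C", "D")]
faced = may_run_concurrently("SerializeFacedConflicts", "A", "D", observed)
cautious = may_run_concurrently("SerializeAll", "A", "D", observed)
```

Here A and D never conflicted directly, so SerializeFacedConflicts lets them run together, while SerializeAll serializes them because both are linked through C.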

Load Adaption Strategies
AbortBackoff
  If aborted, wait for a random time in $[0, 2^{\#\text{aborts}}]$.
  Priority = number of aborts (#aborts): a transaction that was aborted more often has higher chances to commit.
Who wins a conflict? 2 strategies:
  Estimate the work done: the longer-running transaction wins.
  Unrelated to work done: give a random priority (on startup).
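A minimal sketch of the AbortBackoff rule, assuming a uniform draw from the interval and assuming that ties in priority go to the first transaction (the slide does not specify tie-breaking). The class and function names are hypothetical.

```python
import random


class AbortBackoffState:
    """Per-transaction state: the abort count doubles as the priority."""

    def __init__(self):
        self.aborts = 0

    @property
    def priority(self):
        return self.aborts

    def on_abort(self, rng):
        """Count the abort and draw a waiting time from [0, 2**aborts]."""
        self.aborts += 1
        return rng.uniform(0.0, 2.0 ** self.aborts)


def loser_of_conflict(first, second):
    """The transaction with fewer aborts loses; ties go to `first`."""
    return second if first.priority >= second.priority else first


rng = random.Random(42)  # seeded only to make the sketch reproducible
t1, t2 = AbortBackoffState(), AbortBackoffState()
waits = [t1.on_abort(rng) for _ in range(3)]  # upper bounds 2, 4, 8
loser = loser_of_conflict(t1, t2)             # t2 has fewer aborts
```

Each further abort doubles the upper bound of the waiting interval, which spreads restarts out in time and thereby reduces load.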

Theory Part: Model
n transactions (and threads) start concurrently on n cores.
Transaction: a sequence of operations.
  An operation takes 1 time unit.
  The duration (number of operations) $t_T$ is fixed (up to a constant).
2 types of operations:
  Write = modify a (shared) resource and lock it until commit.
  Compute/abort/commit.
A conflict arises if transaction A wants to lock a resource that is already locked by B.
Ignore the overhead of load adaption (remembering transactions, scheduling…).
Postponing a transaction means preempting a core.
Look at several scenarios: moderate and substantial parallelism.
[Figure: transactions A, B, …, Z assigned to Core 1, Core 2, …, Core n.]

Moderate parallelism
Shared counter: conflicts directly after transaction start.
Linked list: conflicts at arbitrary time.
Expected time span until all transactions committed (table values not preserved in this transcript):
  Columns: Policy, Counter, List.
  Rows: ImmediateRestart, AbortBackoff, SerializeFacedConflicts, SerializeAll.
Speed-up log n (at best).
Notes:
  Only sequential execution is possible for the counter, so load adaption should be very beneficial!
  List: transactions go through the list and remember all read objects => they cannot commit concurrently, only directly after each other.
  $n \cdot t_T$ is a lower bound ($t_T$: transaction run time, n: #transactions).
  Immediately restarting causes a penalty…
  Serializing everything is best.
  Aborting seems not too good: give an analysis.

Substantial parallelism
Worst case: the conflict graph is a d-ary tree of logarithmic height.
Exponential gap in the worst case between SerializeAll and the others.
Time until transactions committed (table values not preserved in this transcript):
  Rows: ImmediateRestart, AbortBackoff, SerializeFacedConflicts, SerializeAll.
Notes:
  Load adaption is not really helpful here. Can load adaption harm?
  AbortBackoff is pretty stable; serializing faced conflicts seems even better…
[Figure: conflict tree over transactions T1, T2, T3, T4, T5, …]

Practical investigation
Remembering conflicts causes too much overhead: good for analysis but not for implementation.
Quickadapter
  Serializes transactions.
  Each core has a “waiting” flag.
  If aborted, set the flag and wait until the flag is unset or a fixed time has passed.
  If commit, unset some flag.
AbortBackoff
(Also considered some variants.)
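The flag protocol can be sketched without real threads by passing a logical clock explicitly. This is an illustration under assumptions, not the evaluated implementation: the class name, the `max_wait` parameter, and the choice to unset the lowest-numbered waiting flag on commit ("some flag" on the slide) are all hypothetical.

```python
class QuickadapterSketch:
    """One 'waiting' flag per core, driven by a logical clock `now`."""

    def __init__(self, n_cores, max_wait):
        self.waiting = [False] * n_cores
        self.wait_started = [0] * n_cores
        self.max_wait = max_wait  # fixed time after which waiting ends

    def on_abort(self, core, now):
        # The aborted transaction's core sets its flag and starts waiting.
        self.waiting[core] = True
        self.wait_started[core] = now

    def on_commit(self):
        # A committing transaction unsets some flag (here: the first set one).
        for core, flag in enumerate(self.waiting):
            if flag:
                self.waiting[core] = False
                return core
        return None

    def may_run(self, core, now):
        # Waiting ends when the flag was unset or the fixed time has passed.
        if self.waiting[core] and now - self.wait_started[core] >= self.max_wait:
            self.waiting[core] = False
        return not self.waiting[core]


qa = QuickadapterSketch(n_cores=3, max_wait=10)
qa.on_abort(core=1, now=0)
qa.on_abort(core=2, now=0)
blocked = qa.may_run(core=1, now=5)     # still waiting
released = qa.on_commit()               # unsets the flag of core 1
timed_out = qa.may_run(core=2, now=10)  # fixed time has passed
```

Every commit admits at most one waiting core, so the number of running transactions tracks the rate of successful commits.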

Practical investigation
Evaluation on a 16-core machine (4 x 4 quad-core AMD, 2 GHz).
DSTM2 system with visible readers.
Six benchmarks:
  Little parallelism: shared counter, sorted list (accessed objects not released), Listcounter.
  Considerable parallelism: Red Black Tree, LFUCache, RandomAccessArray.
Compare the new load adaption policies to existing contention managers.

Discussion
Hard to keep maximum throughput, also in [SPAA08, PODC08].
  Even without conflicts:
    Visible readers cause cache misses.
    Load on the memory system.
  [PODC08] uses mainly the list benchmark; not much can be seen due to the scaling of the figures…
Improvement for 1 benchmark worsens another.
On average better than schemes without load adaption.
  The (praised) Polka contention manager [PODC05] performs badly, as also reported in [Ansari09].

Conclusion
Simple and distributed load adaption strategies.
Theory
  (For now) constants and parameters matter a lot: state-of-the-art desktops have only 4 cores.
Practice
  Hard to keep load at peak for all usage patterns; parameters, e.g. the initial backoff, can be tuned…
  Effects beyond conflicts seem to matter, e.g. cache line misses.
  Load adaption is more beneficial than the decision whom to abort.

Thanks for your attention! Questions?

Analysis: AbortBackoff for the counter
Recall: if aborted, wait for a random time in $[0, 2^{\#\text{aborts}}]$.
Assume $\#\text{aborts} \sim \log(n t_T) + x$ (for some $x$).
Define: $a(x)$ := fraction of active nodes.
$a(0) = 1$ (after time $\sim 2^{\log(n t_T)} = n t_T$ a constant fraction is still active).
Chance of a conflict for the interval $[0, 2^{\#\text{aborts}}] = [0, 2^{\log(n t_T) + x}]$:
  $\sim a(x)\, n t_T / 2^{\log(n t_T) + x} = a(x) / 2^{x}$
  $a(x+1) = a(x) / 2^{x} = 1 / 2^{\sum_{i=0}^{x} i} \sim 1 / 2^{x^2}$
  $a(\sqrt{\log n}) = 1 / 2^{(\sqrt{\log n})^2} = 1/n$
Total waiting time:
  $\sum_{i=0}^{\log(n t_T) + \sqrt{\log n}} (\text{length of interval } i) = \sum_{i=0}^{\log(n t_T) + \sqrt{\log n}} 2^{i} \approx n t_T\, 2^{\sqrt{\log n} + 1}$
Example with T1, T2, T3 still active: $a(x)\, n t_T = (3/n)\, n t_T = 3 t_T$.

Progress guarantees
Wait freedom (strongest guarantee): all threads (transactions) make progress in a finite number of steps.
Lock freedom: one thread makes progress in a finite number of steps.
Obstruction freedom (weakest): a thread makes progress in a finite number of steps in the absence of contention (no conflicts, no shared data).