Discovering and Understanding Performance Bottlenecks in Transactional Applications Ferad Zyulkyarov 1,2, Srdjan Stipic 1,2, Tim Harris 3, Osman S. Unsal.


Discovering and Understanding Performance Bottlenecks in Transactional Applications

Ferad Zyulkyarov 1,2, Srdjan Stipic 1,2, Tim Harris 3, Osman S. Unsal 1, Adrián Cristal 1,4, Ibrahim Hur 1, Mateo Valero 1,2

1 BSC-Microsoft Research Centre
2 Universitat Politècnica de Catalunya
3 Microsoft Research Cambridge
4 IIIA - Artificial Intelligence Research Institute, CSIC - Spanish National Research Council

19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, Vienna

Abstract the TM Implementation

Thread 1:

    for (i = 0; i < N; i++) {
        atomic {
            x[i]++;
        }
    }

Thread 2:

    for (i = 0; i < N; i++) {
        atomic {
            y[i]++;
        }
    }

The threads access different arrays. Any overhead we observe is inherent to the TM implementation. We are not interested in such bottlenecks.

Abstract the TM Implementation

Thread 1:

    for (i = 0; i < N; i++) {
        atomic {
            x[i]++;
        }
    }

Thread 2:

    for (i = 0; i < N; i++) {
        atomic {
            x[i]++;
        }
    }

The threads access the same array. Contention: a bottleneck common to all implementations of the TM programming model. We are interested in this kind of bottleneck.

Can We Find This Kind of Bottleneck?

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }

Abort rate: 80%. Where do aborts happen? Which variables conflict? Are there false conflicts?

Can We Find This Kind of Bottleneck?

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }

    counter1=0; counter2=0; counter3=0; counter4=0;

Can We Find This Kind of Bottleneck?

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }

    counter1=1; counter2=0; counter3=0; counter4=0;

Can We Find This Kind of Bottleneck?

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }

    counter1=1; counter2=1; counter3=0; counter4=0;

Conflict between statement2 and statement4.

Goal: profiling techniques that find bottlenecks (important conflicting locations) and explain why these conflicts happen.

Outline

Profiling Techniques
Implementation
Case Studies

Profiling Techniques

Visualizing transactions
Conflict point discovery
Identifying conflicting data structures

Transaction Visualizer (Genome)

[Figure: transaction visualizer output for Genome, with regions labeled "Garbage Collection", "14% Aborts", and "Wait on barrier".]

Aborts occur at the first and last atomic blocks in program order.

When do these aborts happen?

Aborts Graph (Bayes)

[Figure: aborts graph over atomic blocks AB1 to AB15, with labels "93% Aborts", "73%", and "20%".]

Number of Aborts vs Wasted Work

    atomic { counter++ }

Aborts = 9, Wasted Work = 10%

    atomic { hashtable.Rehash(); }

Aborts = 1, Wasted Work = 90%
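The difference between counting aborts and measuring wasted work can be made concrete with a small calculation. The sketch below is illustrative only: the `Attempt` record, its field names, and the durations are hypothetical, chosen so that nine cheap aborts and one expensive abort reproduce the 10% / 90% split. It also assumes that wasted work is reported as each block's share of the time lost to aborted attempts, which the slide does not state.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    block: str       # label of the atomic block (hypothetical naming)
    duration: int    # time spent in this attempt, arbitrary units
    committed: bool  # False if the attempt aborted and was rolled back


def abort_stats(attempts):
    """Return {block: (abort_count, share_of_total_wasted_time)}."""
    aborts = defaultdict(int)
    wasted = defaultdict(int)
    for a in attempts:
        if not a.committed:
            aborts[a.block] += 1
            wasted[a.block] += a.duration
    total = sum(wasted.values())
    return {b: (aborts[b], wasted[b] / total if total else 0.0) for b in aborts}


# Made-up durations: nine aborts of a short block, one abort of a long block.
attempts = [Attempt("counter++", 10, False) for _ in range(9)]
attempts.append(Attempt("Rehash", 810, False))
attempts.append(Attempt("counter++", 10, True))
attempts.append(Attempt("Rehash", 810, True))

stats = abort_stats(attempts)
for block, (count, share) in stats.items():
    print(f"{block}: aborts={count}, wasted work={share:.0%}")
```

Ranking blocks by abort count alone would put the counter increment first, even though the rehash accounts for most of the lost time.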

Conflict Point Discovery

File:Line        #Conf.  Method    Line
Hashtable.cs:51  152     Add       if (_container[hashCode]…
Hashtable.cs:48  62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53  5       Add       _container[hashCode] = n …
Hashtable.cs:83  5       Add       while (entry != null) …
ArrayList.cs:79  3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52  1       Add       if (count == capacity - 1) …
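A listing like the one above can be produced by grouping abort records by source location and sorting by count. The following is a minimal sketch, not the profiler's actual code: the record layout (`file`, `line`, `method` keys) is an assumption, and the sample counts are made up.

```python
from collections import Counter


def conflict_points(abort_records):
    """Count aborts per (file, line, method), most frequent first."""
    counts = Counter(
        (r["file"], r["line"], r["method"]) for r in abort_records
    )
    return counts.most_common()


# Hypothetical abort records; the real profiler's data format is not shown
# in the slides.
records = (
    [{"file": "Hashtable.cs", "line": 51, "method": "Add"}] * 3
    + [{"file": "Hashtable.cs", "line": 48, "method": "Add"}] * 2
    + [{"file": "ArrayList.cs", "line": 79, "method": "Contains"}]
)

table = conflict_points(records)
for (file, line, method), n in table:
    print(f"{file}:{line:<4} {n:>4}  {method}")
```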

Conflicts Context

    increment() {
        counter++;
    }

    probability80() {
        probability = random() % 100;
        if (probability < 80) {
            atomic {
                increment();
            }
        }
    }

    probability20() {
        probability = random() % 100;
        if (probability >= 80) {
            atomic {
                increment();
            }
        }
    }

Thread 1 and Thread 2 each run:

    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }

All conflicts happen at counter++ in increment().

Bottom-up view
+ increment (100%)
|---- probability80 (80%)
|---- probability20 (20%)

Top-down view
+ main (100%)
|---- probability80 (80%)
      |---- increment (80%)
|---- probability20 (20%)
      |---- increment (20%)
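Both views can be derived from the call stacks recorded at abort time. The sketch below assumes each conflict comes with a root-first call stack as a list of function names; the functions `top_down` and `bottom_up` and the 80/20 sample data are illustrative, not the profiler's implementation.

```python
from collections import Counter


def top_down(stacks):
    """Attribute each conflict to every call-path prefix, starting at the root."""
    view = Counter()
    for stack in stacks:
        for depth in range(1, len(stack) + 1):
            view[tuple(stack[:depth])] += 1
    return view


def bottom_up(stacks):
    """Attribute each conflict to every call-path suffix, starting at the conflict point."""
    view = Counter()
    for stack in stacks:
        reversed_stack = stack[::-1]
        for depth in range(1, len(reversed_stack) + 1):
            view[tuple(reversed_stack[:depth])] += 1
    return view


# Made-up sample: 80 conflicts reached via probability80, 20 via probability20.
stacks = (
    [["main", "probability80", "increment"]] * 80
    + [["main", "probability20", "increment"]] * 20
)

td = top_down(stacks)
bu = bottom_up(stacks)
print(bu[("increment",)], bu[("increment", "probability80")])
print(td[("main",)], td[("main", "probability20", "increment")])
```

The bottom-up view starts from the conflicting function and splits its count by caller; the top-down view starts from the entry point and splits by callee.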

Identifying multiple conflicts from a single run

Thread 1:

    atomic {
        obj1.x = t1;
        obj2.x = t2;
        obj3.x = t3;
        ...
    }

Thread 2:

    atomic {
        ...
        obj1.x = t1;
        obj2.x = t2;
        obj3.x = t3;
    }

Conflict detected at 1st iteration
Conflict detected at 2nd iteration
Conflict detected at 3rd iteration

Identifying Conflicting Objects

    List list = new List();
    list.Add(1);
    list.Add(2);
    list.Add(3);
    ...
    atomic {
        list.Replace(3, 33);
    }

[Figure: the list at 0x08 with nodes 1, 2, 3 at 0x10, 0x18, 0x20. Labels: Object Addr 0x20, GC, GC Root 0x08, DbgEng, Variable Name (list); Memory Allocator, Instr Addr 0x, DbgEng, List.cs:1.]

Per-Object View
+ List.cs:1 "list" (42%)
|--- ChangeNode (20%)
     |--- Replace (12%)
     |--- Add (8%)

Outline

Profiling Techniques
Implementation
  - Bartok
  - The data that we collect
  - Probe effect and profiling
Case Studies

Bartok

C# to x86 research compiler with language-level support for TM
STM
  - Eager versioning (i.e., in-place update)
  - Detects write-write conflicts eagerly (i.e., immediately)
  - Detects read-write conflicts lazily (i.e., at commit)
  - Detects conflicts at object granularity

Profiling Data That We Collect

Timestamp
  - TX start
  - TX commit or TX abort
Read and write set size
On abort
  - The instructions of the read and write operations involved in the conflict
  - The conflicting memory address
  - The call stack
Process data offline or during GC
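The collected data could be represented as one record per transaction attempt, as in the sketch below. All type and field names (`TxRecord`, `AbortInfo`, and so on) are hypothetical; the slides list what is collected but not how it is stored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AbortInfo:
    read_instr: int         # address of the read instruction in the conflict
    write_instr: int        # address of the write instruction in the conflict
    address: int            # conflicting memory address
    call_stack: List[str]   # call stack at the time of the abort


@dataclass
class TxRecord:
    start: int                         # timestamp of TX start
    end: int                           # timestamp of TX commit or TX abort
    committed: bool
    read_set_size: int
    write_set_size: int
    abort: Optional[AbortInfo] = None  # present only for aborted attempts


def summarize(records):
    """Offline pass: abort rate and total time spent in aborted attempts."""
    aborted = [r for r in records if not r.committed]
    return {
        "abort_rate": len(aborted) / len(records) if records else 0.0,
        "wasted_time": sum(r.end - r.start for r in aborted),
    }


# Made-up log: one aborted attempt followed by a successful retry.
log = [
    TxRecord(0, 40, False, 3, 1,
             AbortInfo(0x401000, 0x401020, 0x20, ["main", "Add"])),
    TxRecord(40, 70, True, 3, 1),
]
summary = summarize(log)
print(summary)
```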

Probe Effect and Overheads

[Tables: normalized abort rates and normalized execution time by thread count for Bayes, Genome, Intruder, Labyrinth, Vacation, and WormBench, each with an average row. The only surviving value is an average of 0.25.]

Outline

Profiling Techniques
Implementation
Case Studies

Case Studies

Bayes
Intruder
Labyrinth

Bayes

Wrapper object for function arguments:

    public class FindBestTaskArg {
        public int toId;
        public Learner learnerPtr;
        public Query[] queries;
        public Vector queryVectorPtr;
        public Vector parentQueryVectorPtr;
        public int numTotalParent;
        public float basePenalty;
        public float baseLogLikelihood;
        public Bitmap bitmapPtr;
        public Queue workQueuePtr;
        public Vector aQueryVectorPtr;
        public Vector bQueryVectorPtr;
    }

Create the wrapper object:

    FindBestTaskArg arg = new FindBestTaskArg();
    arg.learnerPtr = learnerPtr;
    arg.queries = queries;
    arg.queryVectorPtr = queryVectorPtr;
    arg.parentQueryVectorPtr = parentQueryVectorPtr;
    arg.bitmapPtr = visitedBitmapPtr;
    arg.workQueuePtr = workQueuePtr;
    arg.aQueryVectorPtr = aQueryVectorPtr;
    arg.bQueryVectorPtr = bQueryVectorPtr;

Call the function using the wrapper object:

    atomic {
        FindBestInsertTask(arg);
    }

98% of wasted work is due to the wrapper object.
2 threads: 24% execution time
4 threads: 80% execution time

Bayes – Solution

    atomic {
        FindBestInsertTaskArg(
            toId,
            learnerPtr,
            queries,
            queryVectorPtr,
            parentQueryVectorPtr,
            numTotalParent,
            basePenalty,
            baseLogLikelihood,
            bitmapPtr,
            workQueuePtr,
            aQueryVectorPtr,
            bQueryVectorPtr
        );
    }

Pass the arguments directly and avoid the wrapper object.

Intruder – Map Data Structure

[Figure: a network stream and a map of assembled packet fragments; surviving node labels: /3, 3/1, 6/2, 4/3, 6/3, 2/4, 6/4.]

Aborts caused 68% wasted work. Replaced with a chaining hashtable.
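For readers unfamiliar with the replacement structure, here is a minimal sketch of a chaining hashtable. It is a generic textbook version, not the authors' implementation; the class name, bucket count, and the flow-id keys in the usage lines are assumptions. The relevant property is that an insert touches only the bucket its key hashes to, so operations on keys in different buckets do not share data.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a list per bucket."""

    def __init__(self, num_buckets=64):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Only this one bucket is read or written for a given key.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite the existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default


# Hypothetical usage: map a flow id to the fragments received so far.
fragments = ChainedHashTable()
fragments.put(6, ["frag2"])
fragments.put(6, fragments.get(6) + ["frag3"])
fragments.put(70, ["frag1"])  # 70 % 64 == 6: same bucket, separate entry
print(fragments.get(6), fragments.get(70), fragments.get(3))
```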

Intruder – Moving Code

Write-write conflicts are detected eagerly.

    atomic {
        Decoded decodedPtr = new Decoded();
        char[] data = new char[length];
        Array.Copy(packetPtr.Data, data, length);
        decodedPtr.flowId = flowId;
        decodedPtr.data = data;
    }
    this.decodedQueuePtr.Push(decodedPtr);

More to roll back: more wasted work.
Little to roll back: less wasted work.

Labyrinth

    atomic {
        localGrid.CopyFrom(globalGrid);
        if (this.PdoExpansion(myGrid, myExpansionQueue, src, dst)) {
            pointVector = PdoTraceback(grid, myGrid, dst, bendCost);
            success = true;
            raced = grid.addPathOfOffsets(pointVector);
        }
    }

2 threads: 80% wasted work
4 threads: 98% wasted work

Watson PACT’07: it is safe if localGrid is not up to date. Don’t instrument CopyFrom with transactional reads and writes.

Summary

Design principles
  - Abstract the underlying TM system
  - Report results at the source language constructs
  - Low instrumentation probe effect and overhead
Profiling techniques
  - Visualizing transactions
  - Conflict point discovery
  - Identifying conflicting data structures

PPoPP’2010: Debugging Programs that use Atomic Blocks and Transactional Memory
ICS’2009: QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory
PPoPP’2008: Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server

End