Presentation transcript:

Discovering and Understanding Performance Bottlenecks in Transactional Applications
Ferad Zyulkyarov 1,2, Srdjan Stipic 1,2, Tim Harris 3, Osman S. Unsal 1, Adrián Cristal 1,4, Ibrahim Hur 1, Mateo Valero 1,2
1 BSC-Microsoft Research Centre  2 Universitat Politècnica de Catalunya  3 Microsoft Research Cambridge  4 IIIA - Artificial Intelligence Research Institute, CSIC - Spanish National Research Council
19th International Conference on Parallel Architectures and Compilation Techniques, 11-15 September 2010 - Vienna

In this presentation I will introduce techniques for profiling transactional memory applications. These profiling techniques provide comprehensive, detailed information that helps find the bottlenecks in TM applications and understand why those bottlenecks exist.

Abstract the TM Implementation
Accesses to different arrays. We can observe overheads inherent to the TM implementation. We are not interested in such bottlenecks.

Thread 1: for (i = 0; i < N; i++) { atomic x[i]++; }
Thread 2: for (i = 0; i < N; i++) { atomic y[i]++; }

First I want to explain what kind of bottlenecks we target. In this example, the two threads access different data, so there will be no conflicts. However, to guarantee atomicity, TM implementations incur overhead because they implicitly track transactionally accessed memory locations. Depending on the design assumptions and internal implementation of the TM system, this code would perform differently on different TM implementations. We are not interested in studying this kind of overhead, which is specific to the TM implementation.

Abstract the TM Implementation
Accesses to the same array. Contention: a bottleneck common to all implementations of the TM programming model. We are interested in this kind of bottleneck.

Thread 1: for (i = 0; i < N; i++) { atomic x[i]++; }
Thread 2: for (i = 0; i < N; i++) { atomic x[i]++; }

This example is the same as the previous one, but this time the threads access the same data, which causes contention. Contention is a bottleneck specific to the TM programming model: no matter the underlying TM implementation, code with contention will always perform poorly. This is the kind of bottleneck we are interested in. To improve the program's performance, the programmer must find and understand the contention in the program.

Can We Find This Kind of Bottleneck?

atomic {
  statement1;
  statement2;
  statement3;
  statement4;
}

Abort rate 80%. Where do the aborts happen? Which variables conflict? Are there false conflicts?

To motivate our work I will continue with an example of profiling a transactional application without profiling techniques available. In earlier work, published at ICS 2009, we developed QuakeTM, a transactional memory version of the Quake game server. Initially we had one very large atomic block which executed the move operation in the game. This atomic block aborted 80% of the time. We wanted to know many things about the aborts: the places where conflicts happen, the variables involved in conflicts, false conflicts, and so on. However, we did not have access to the source code of the underlying TM implementation and could not obtain such information from it.

Can We Find This Kind of Bottleneck?

atomic {
  statement1;
  statement2;
  statement3;
  statement4;
}
counter1=0; counter2=0; counter3=0; counter4=0;

We came up with an ad hoc solution which gave us approximate results; we called this approach reach points. To understand the places where the transaction conflicts, we inserted inside the atomic block code that executes non-transactionally (i.e., its variables are not monitored for conflicts and updates to them are not rolled back on abort).

Can We Find This Kind of Bottleneck?

atomic {
  statement1;
  statement2;
  statement3;
  statement4;
}
counter1=1; counter2=0; counter3=0; counter4=0;

If no abort has happened by the time control reaches a given counter, that counter is incremented.

Can We Find This Kind of Bottleneck?

atomic {
  statement1;
  statement2;
  statement3;
  statement4;
}
counter1=1; counter2=1; counter3=0; counter4=0;

Conflict between statement2 and statement4.

Goal: profiling techniques that find bottlenecks (important conflicting locations) and explain why these conflicts happen.

If an abort happens, the transaction rolls back, but the counter values stay untouched because they are not monitored by the TM system. By looking at the difference between two counters we can tell in which regions transactions aborted; in this case, the conflict happens between statements 2 and 4, that is, while executing statement 3. However, finding such conflicting regions does not tell the whole story about the bottlenecks in TM applications. The goal of our work was profiling techniques that show not only the conflicting locations but also give detailed, comprehensive information for understanding the reasons behind the conflicts.
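The reach-points idea can be sketched outside any TM system: the counters live outside the transaction's rollback, so comparing adjacent counters after a run localizes where aborts struck. A minimal Python sketch; the retry loop, abort probability, and statement layout are invented purely for illustration:

```python
import random

random.seed(0)

counters = [0, 0, 0, 0]  # one reach-point counter per statement

def run_transaction():
    """Simulate one atomic block; counters are NOT rolled back on abort."""
    while True:  # retry loop, as the TM runtime re-executes aborted transactions
        for point in range(4):
            counters[point] += 1                  # reach point: control got here
            if point == 2 and random.random() < 0.3:
                break                             # simulated conflict: abort, retry
        else:
            return                                # all four statements committed

for _ in range(1000):
    run_transaction()

# Adjacent counters differ exactly where aborts struck: counters[2] > counters[3]
# means transactions died between reach points 3 and 4.
for i in range(3):
    print(f"aborts between point {i + 1} and {i + 2}: {counters[i] - counters[i + 1]}")
```

Because the counters survive rollback, the gap between counter3 and counter4 approximates how many transactions aborted in that region, which is exactly the information the ad hoc approach recovered.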

Outline Profiling Techniques Implementation Case Studies

Profiling Techniques
Visualizing transactions; conflict point discovery; identifying conflicting data structures.

We have developed a series of profiling techniques which we classify into three categories, each described on the next slides.

Transaction Visualizer (Genome)
14% aborts. When do these aborts happen? There is a garbage collection during which all threads are suspended, and there are two barriers where threads wait. Aborts occur at the first and last atomic blocks in program order.

The transaction visualizer shows how transactions execute over time. It is particularly useful for finding in which parts of the program execution the conflicts (aborts) happen. Simply reporting an average abort rate does not tell us whether those aborts were uniformly distributed throughout the execution or happened only at the beginning or the end. In this example from Genome, we can see that most of the aborts happen at the beginning and the end of the program execution.

Aborts Graph (Bayes)
[Figure: aborts graph over atomic blocks AB1-AB15; AB3 is aborted by AB1 (73%) and AB2 (20%), and accounts for 93% of the aborts.]

This example is from Bayes. Bayes has 15 atomic blocks and only one of them, AB3, aborts frequently. To better understand the reason for the aborts and then optimize the code, the programmer needs to know which atomic blocks cause AB3 to abort. Eventually we can see that AB3 is a read-only transaction and the longest-running one.

Number of Aborts vs Wasted Work

atomic { counter++; }           // 9 aborts, little wasted work
atomic { hashtable.Rehash(); }  // 1 abort, 90% of the wasted work

We want to quantify how important the impact of each bottleneck is. This slide explains the two metrics we use, number of aborts and wasted work, and how they differ. On the left is an atomic block that increments a counter; on the right, an atomic block that rehashes a hashtable. The left atomic block aborts 9 times and the right one once. Incrementing a counter is a fast operation and causes very little wasted work despite the many aborts. Rehashing the hashtable, on the other hand, is a long operation: despite the single abort, it causes 90% of the wasted work. Our profiling techniques emphasize reporting results by wasted work while also reporting the number of aborts.
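The difference between the two metrics can be made concrete by weighting each abort by the work the transaction had already performed. A toy Python calculation; the per-attempt costs are invented units chosen only to mirror the slide's two blocks:

```python
# (aborts, cost per aborted attempt) for the two atomic blocks, invented units
blocks = {
    "counter++":          {"aborts": 9, "cost": 1},   # cheap, aborts often
    "hashtable.Rehash()": {"aborts": 1, "cost": 90},  # expensive, aborts once
}

# Wasted work = aborts x work thrown away per abort.
total_wasted = sum(b["aborts"] * b["cost"] for b in blocks.values())
for name, b in blocks.items():
    wasted = b["aborts"] * b["cost"]
    print(f"{name}: {b['aborts']} aborts, {wasted / total_wasted:.0%} of wasted work")
```

Ranking bottlenecks by abort count would put the counter first; ranking by wasted work correctly puts the rehash first, which is why the profiler reports both.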

Conflict Point Discovery

File:Line         #Conf.  Method    Line
Hashtable.cs:51   152     Add       if (_container[hashCode]…
Hashtable.cs:48   62                uint hashCode = HashSdbm(…
Hashtable.cs:53   5                 _container[hashCode] = n…
Hashtable.cs:83                     while (entry != null)…
ArrayList.cs:79   3       Contains  for (int i = 0; i < count; i++)…
ArrayList.cs:52   1                 if (count == capacity – 1)…

In earlier work published at PPoPP 2010 we introduced a mechanism to find conflicting statements within an atomic block, including in functions called inside atomic blocks. Unlike the motivating example from the first slides, this approach does not require modifying the program code, is more precise, reports results at statement granularity, and also reports conflicting statements in functions called directly or indirectly from an atomic block. This is a simple example of conflict point discovery; there is no need to discuss details here such as multiple-conflict discovery, wasted work, or contextual information.

Conflicts Context

increment() {
  counter++;            // all conflicts happen here
}

probability80() {
  probability = random() % 100;
  if (probability < 80) {
    atomic { increment(); }
  }
}

probability20() {
  probability = random() % 100;
  if (probability >= 80) {
    atomic { increment(); }
  }
}

Thread 1 and Thread 2:
for (int i = 0; i < 100; i++) {
  probability80();
  probability20();
}

Bottom-up view
+ increment (100%)
|---- probability80 (80%)
|---- probability20 (20%)

Top-down view
+ main (100%)
|---- probability80 (80%)
|     |---- increment (80%)
|---- probability20 (20%)
|     |---- increment (20%)

Seeing the places where transactions conflict is not enough. To understand why conflicts happen in large applications with many function calls inside atomic blocks, it is essential to provide contextual information for each conflict. In this example we have a method increment, which increments a shared counter; probability20, which calls increment inside an atomic block with probability 20%; probability80, which does the same with probability 80%; and two threads which call both. Conflict point discovery alone would report that all conflicts happen in increment. Adding contextual information about the methods that called increment gives much better insight into the "hot" control flows the programmer should focus on.
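The bottom-up view above is just an aggregation of conflict records by call stack. A toy Python sketch; the conflict records and their 80/20 split are fabricated to match the example:

```python
from collections import Counter

# Each record is the call stack captured at conflict time, innermost frame first.
conflicts = ([("increment", "probability80", "main")] * 80 +
             [("increment", "probability20", "main")] * 20)

# Bottom-up view: group by conflicting leaf, then by its immediate caller.
leaf_total = Counter(stack[0] for stack in conflicts)
by_caller = Counter((stack[0], stack[1]) for stack in conflicts)

print(f"+ increment ({leaf_total['increment'] / len(conflicts):.0%})")
for (_leaf, caller), n in by_caller.most_common():
    print(f"|---- {caller} ({n / len(conflicts):.0%})")
```

A top-down view would group the same records starting from the outermost frame instead; the data collected per conflict is identical.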

Identifying Multiple Conflicts from a Single Run

Thread 1                 Thread 2
atomic {                 atomic {
  obj1.x = t1;             ...
  obj2.x = t2;             obj1.x = t1;
  obj3.x = t3;             obj2.x = t2;
  ...                      obj3.x = t3;
}                        }

Conflicts detected at the 1st, 2nd, and 3rd iterations.

In our profiling framework we implement a mechanism to detect multiple conflicts in a single run. Some TM implementations abort the transaction immediately when a conflict is detected; in that case, the profiler may miss conflicts that exist and would appear later in the program's execution. In this example, the two threads update the same set of objects, and accesses to all of these objects conflict. A naive implementation would report conflicts only on obj1; when the programmer eliminates that conflict, conflicts would then show up on obj2, and then on obj3. Our profiling framework reports all of these conflicts from a single run.
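The idea can be sketched as: rather than aborting at the first conflicting access, keep validating the rest of the transaction's write set and log every conflict before rolling back once. A minimal sketch; the write sets and the set-membership conflict check are stand-ins for the real STM metadata:

```python
def find_all_conflicts(my_writes, other_writes):
    """Log every conflicting object instead of stopping at the first one."""
    conflicts = []
    for obj in my_writes:
        if obj in other_writes:     # conflict detected...
            conflicts.append(obj)   # ...but keep scanning instead of aborting now
    return conflicts                # roll back once, reporting the full list

t1_writes = ["obj1", "obj2", "obj3"]
t2_writes = {"obj1", "obj2", "obj3"}  # the second thread's write set

print(find_all_conflicts(t1_writes, t2_writes))  # → ['obj1', 'obj2', 'obj3']
```

A first-conflict-aborts profiler would return only obj1 here, forcing the programmer through three profile-fix-rerun cycles instead of one.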

Identifying Conflicting Objects

List list = new List();
list.Add(1);
list.Add(2);
list.Add(3);
...
atomic {
  list.Replace(3, 33);
}

[Figure: list nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20.]

Per-Object View
+ List.cs:1 "list" (42%)
|--- ChangeNode (20%)
+---- Replace (12%)
+---- Add (8%)

We identify conflicting objects from the different places in the code where they have been involved in conflicts. 1) In this example, we replace the third entry of the linked list with 33. 2) When a conflict occurs, we log the address of the object involved in the conflict (0x20). 3) We pass this address to the GC and find the GC root (0x08). If the root is a static object, we can immediately translate it to a variable name using the debugger engine ("list"). 4) If the root is not a static object, we pass its address to the memory allocator and find the instruction that allocated the memory (0x446290). 5) Then, using the debugger engine, we translate that instruction to a source line (List.cs:1). 6) We summarize the conflicts in a per-object tree view that includes information about the control flow which led to each conflict.
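The address-to-name translation above is essentially a chain of lookups. A toy Python sketch; the three dictionaries are stand-ins for the GC's root-finding, the allocator's log, and the debugger engine's symbol tables, and the addresses are taken from the example:

```python
# Stand-ins for runtime metadata:
gc_roots   = {0x20: 0x08}                     # conflicting node -> reachable GC root
alloc_site = {0x08: 0x446290}                 # root address -> allocating instruction
debug_info = {0x446290: 'List.cs:1 "list"'}   # instruction -> source line / name

def name_conflicting_object(addr):
    root = gc_roots[addr]        # 1. walk up from the conflicting node to its root
    instr = alloc_site[root]     # 2. ask the allocator which instruction made it
    return debug_info[instr]     # 3. translate the instruction to a source line

print(name_conflicting_object(0x20))  # → List.cs:1 "list"
```

The key point is that an internal list node (0x20) by itself is meaningless to the programmer; only after mapping it back to its allocation site does the report name a source-level variable.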

Outline
Profiling Techniques
Implementation: Bartok; the data that we collect; probe effect and profiling overhead
Case Studies

Bartok
C# to x86 research compiler with language-level support for TM. Its STM:
- uses eager versioning (i.e., in-place update)
- detects write-write conflicts eagerly (i.e., immediately)
- detects read-write conflicts lazily (i.e., at commit)
- detects conflicts at object granularity

The STM uses eager versioning: it updates variables in place and logs the original values so that it can roll back if an abort happens. It detects write-write conflicts immediately and read-write conflicts at commit time. Conflicts are detected at object granularity: if two threads access different fields of the same object and at least one access is a write, this is considered a conflict.
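The eager-versioning scheme described here (update in place, log the old value, undo on abort) can be sketched as a toy model; this is an illustrative simplification, not Bartok's actual implementation:

```python
class EagerTx:
    """Toy eager-versioning transaction: write in place, keep an undo log."""

    def __init__(self, heap):
        self.heap = heap
        self.undo = []   # (address, old value) pairs, in write order

    def write(self, addr, value):
        self.undo.append((addr, self.heap[addr]))  # log the original value first
        self.heap[addr] = value                    # then update in place

    def abort(self):
        for addr, old in reversed(self.undo):      # restore in reverse write order
            self.heap[addr] = old
        self.undo.clear()

heap = {"x": 1, "y": 2}
tx = EagerTx(heap)
tx.write("x", 10)
tx.write("y", 20)
print(heap)   # in-place updates are visible: {'x': 10, 'y': 20}
tx.abort()
print(heap)   # undo log restores the original state: {'x': 1, 'y': 2}
```

The model also shows why aborts are expensive under eager versioning: the longer the transaction ran before conflicting, the longer the undo log that must be replayed, which is exactly the "wasted work" the profiler measures.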

Profiling Data That We Collect
- Timestamp of TX start, TX commit, or TX abort
- Read and write set sizes
- On abort: the instructions of the read and write operations involved in the conflict, the conflicting memory address, and the call stack
- Data is processed offline or during GC

To keep the overhead and probe effect low, we process the data offline or during GC, when the caches are already perturbed.

Probe Effect and Overheads

Normalized abort rates with profiling enabled, relative to profiling disabled (benchmarks: Bayes, Genome, Intruder, Labyrinth, Vacation, WormBench):
2 threads: 0.00; 4 threads: 0.11, 0.01; 8 threads: 0.12, 0.02. Average: 0.016.

The first table displays the abort rates of our benchmarks when profiling is enabled, normalized to the version with profiling disabled. We use the abort rate as an indication of the probe effect, i.e., whether the profiling changes the conflict patterns in the program. The average change across all benchmarks is within 0.016. These findings suggest that the profiling does not perturb the programs' behavior.

Normalized execution time with profiling enabled versus disabled:
1 thread: 0.59, 0.27, 0.29, 0.07, 0.26; 2 threads: 0.45, 0.30, 0.39, 0.03, 0.24, 0.05; 4 threads: 0.01, 0.21, 0.55, 0.18, 0.08; 8 threads: 0.02, 1.19, 0.16, 0.19, 0.11. Average: 0.25.

The second table displays the normalized execution time of our benchmarks with profiling enabled versus disabled, showing the overhead of the profiling instrumentation. Because we use per-thread local memory to store the statistical data, the profiling is not a scalability bottleneck for the applications. On average across all benchmarks, the version with profiling is within 25% of the version without profiling.

Outline Profiling Techniques Implementation Case Studies

Case Studies
Bayes, Intruder, Labyrinth

In these experiments we used the C# versions of the STAMP applications. These applications were ported from C to C# in a direct manner, replacing TX-START and TX-END with the available language constructs. In the original C implementation, the memory operations inside atomic blocks were manually instrumented with calls to the STM library; in our case, Bartok instruments these calls automatically.

Bayes
Wrapper object for function arguments.

public class FindBestTaskArg {
  public int toId;
  public Learner learnerPtr;
  public Query[] queries;
  public Vector queryVectorPtr;
  public Vector parentQueryVectorPtr;
  public int numTotalParent;
  public float basePenalty;
  public float baseLogLikelihood;
  public Bitmap bitmapPtr;
  public Queue workQueuePtr;
  public Vector aQueryVectorPtr;
  public Vector bQueryVectorPtr;
}

FindBestTaskArg arg = new FindBestTaskArg();
arg.learnerPtr = learnerPtr;
arg.queries = queries;
arg.queryVectorPtr = queryVectorPtr;
arg.parentQueryVectorPtr = parentQueryVectorPtr;
arg.bitmapPtr = visitedBitmapPtr;
arg.workQueuePtr = workQueuePtr;
arg.aQueryVectorPtr = aQueryVectorPtr;
arg.bQueryVectorPtr = bQueryVectorPtr;

Bayes uses a wrapper object to encapsulate a number of arguments that are passed to a function. The user first creates the wrapper object, initializes its fields, and passes it as an argument to a function which is executed atomically.

Bayes
98% of the wasted work is due to the wrapper object: 24% of execution time with 2 threads, 80% with 4 threads.

atomic {
  FindBestInsertTask(arg);   // call the function using the wrapper object
}

Because our STM detects conflicts at object granularity, any write access to one of the fields of this object caused a conflict. From a single profiling run we found that this wrapper object was involved in almost all conflicts (98%), and that the wasted work caused by these conflicts amounted to 24% and 80% of the program execution with 2 and 4 threads, respectively.
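Why the wrapper object hurts under object-granularity detection can be sketched as follows: the STM compares object identities only, so writes to different fields of the same object still collide, while writes to distinct objects do not. A toy Python model; object identity stands in for the STM's per-object metadata, and the field names echo the Bayes example:

```python
def conflicts(writes1, writes2):
    """Object-granularity detection: compare object identities only,
    ignoring which field each thread actually touched."""
    objs1 = {id(obj) for obj, _field in writes1}
    objs2 = {id(obj) for obj, _field in writes2}
    return bool(objs1 & objs2)

class FindBestTaskArg:
    pass

shared_arg = FindBestTaskArg()
# Both threads write different fields of the one wrapper object: a conflict,
# even though the fields are logically independent.
print(conflicts([(shared_arg, "toId")], [(shared_arg, "basePenalty")]))  # True

# Passing arguments directly: each thread only touches its own objects.
a, b = object(), object()
print(conflicts([(a, "toId")], [(b, "basePenalty")]))                    # False
```

This is the false-sharing-style effect the fix on the next slide removes: dropping the shared wrapper removes the single object whose identity every thread's write set contained.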

Bayes – Solution
Pass the arguments directly and avoid the wrapper object.

atomic {
  FindBestInsertTask(
    toId, learnerPtr, queries, queryVectorPtr,
    parentQueryVectorPtr, numTotalParent, basePenalty,
    baseLogLikelihood, bitmapPtr, workQueuePtr,
    aQueryVectorPtr, bQueryVectorPtr);
}

We avoided using a wrapper object and instead passed the function arguments explicitly. The STAMP authors had used alignment for the BestTaskArg struct to avoid false conflicts.

Intruder – Map Data Structure
[Figure: packet fragments (e.g., 6/4, 2/4, 6/3, 3/1, 4/3, 1/3) taken from a network stream and assembled into complete packets in a map data structure.]

Intruder implements an intrusion detection algorithm. Packet fragments are taken from a network stream and, using a map data structure, assembled into complete packets.

Intruder – Map Data Structure
Aborts on the map caused 68% of the wasted work. We replaced it with a chaining hashtable.

The map data structure is implemented with a red-black tree. Using our techniques we identified that the object causing the most conflicts was the red-black tree, accounting for 67.6% of the wasted work. We replaced it with a hashtable. Because the red-black tree is traversed from the top, and additional operations are performed to preserve its properties, atomic operations on it have larger read and write sets than on the hashtable.

Intruder – Moving Code
Push at the end: more to roll back, more wasted work. Push at the beginning: little to roll back, less wasted work. Write-write conflicts are detected eagerly.

atomic {
  Decoded decodedPtr = new Decoded();
  char[] data = new char[length];
  Array.Copy(packetPtr.Data, data, length);
  decodedPtr.flowId = flowId;
  decodedPtr.data = data;
  this.decodedQueuePtr.Push(decodedPtr);
}

This code is part of an atomic block in Intruder. The last statement inserts an assembled packet into a queue to be examined for malicious patterns. Using conflict point discovery, we saw that the many conflicts happening at this statement caused significant wasted work. In our STM, write-write conflicts are detected eagerly, at the time they appear. When a conflict is detected at the last line of the atomic block, there is a lot of speculative state to roll back, and rollbacks in an eager-versioning STM such as ours are expensive. If the same operation is moved to the beginning of the atomic block, there is no speculative state to roll back and the aborts are cheap.
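The effect of moving the conflicting statement can be quantified with a toy model: under eager write-write detection, the work wasted per abort is everything executed up to and including the conflicting write. The statement costs below are invented units, chosen only to mirror the Intruder block:

```python
def wasted_per_abort(statements, conflicting):
    """Work rolled back when the named statement conflicts eagerly:
    everything executed up to and including that statement."""
    wasted = 0
    for name, cost in statements:
        wasted += cost
        if name == conflicting:
            return wasted

block = [("allocate Decoded", 5), ("copy packet data", 50),
         ("set fields", 2), ("queue.Push", 1)]

print(wasted_per_abort(block, "queue.Push"))      # Push last: 58 units lost per abort

reordered = [block[-1]] + block[:-1]              # move the Push to the front
print(wasted_per_abort(reordered, "queue.Push"))  # Push first: 1 unit lost per abort
```

Same statements, same number of conflicts, but the reordered block throws away a fraction of the work per abort, which is why the optimization helps even though it does not reduce the abort count.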

Labyrinth
80% of the wasted work with 2 threads, 98% with 4 threads. Watson [PACT'07]: it is safe even if localGrid is not up to date. Do not instrument CopyFrom with transactional reads and writes.

atomic {
  localGrid.CopyFrom(globalGrid);
  if (this.PdoExpansion(myGrid, myExpansionQueue, src, dst))
    pointVector = PdoTraceback(grid, myGrid, dst, bendCost);
  success = true;
  raced = grid.addPathOfOffsets(pointVector);
}

Labyrinth implements a version of Lee's path routing algorithm. Conflict point discovery showed that 80% of the wasted work with 2 threads, and 98% with 4 threads, happens when the global shared grid is copied into a local grid. Watson et al. [PACT'07] studied the TM implementation of this algorithm and observed that it is safe to perform the grid copy operation non-transactionally. Knowing this, we instructed our compiler not to instrument the grid copy method with transactional reads and writes.

Summary
Design principles: abstract the underlying TM system; report results in terms of source-language constructs; low instrumentation probe effect and overhead.
Profiling techniques: visualizing transactions; conflict point discovery; identifying conflicting data structures.

PPoPP 2010 – Debugging Programs that use Atomic Blocks and Transactional Memory
ICS 2009 – QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory
PPoPP 2009 – Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server

The End

For those who might be interested, I have included references to my earlier work. My fourth PhD year has just finished; after returning from the conference I am about to start writing my thesis, and I am also looking for what to do next. Any offline discussion about that would be very useful for me.