Download presentation
Presentation is loading. Please wait.
Published byAmanda Rodgers Modified over 6 years ago
1
Tuning Threading Code with Intel® Thread Profiler for Explicit Threads
Intel Software College
2
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Objectives After successful completion of this module you will be able to… Use Thread Profiler to recognize and fix common performance problems in applications using Windows* threads Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Standard objectives slide. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
3
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Agenda Look at Intel® Thread Profiler features Define Critical Path Analysis Examine Thread Profiler data views available Review common performance issues of multithreaded applications Focus on Load imbalance Focus on Synchronization contention Describe general optimizations to gain better performance Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Put forth the agenda of topics to be covered in this module. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
4
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Motivation Developing efficient multithreaded applications is hard New performance problems are caused by the interaction between concurrent threads Load imbalance Contention on synchronization objects Threading overhead Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide some motivation to studying the Intel Thread Profiler. Details Not only is it hard to get correctly threaded applications, but efficient applications are hard to produce by hand without help. The list is some of the performance problems posed by multithreded apps that are not found in serial codes. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
5
Intel® Thread Profiler
Plugs in to the VTune™ performance environment Instrumentation-based data collector in VTune Identifies performance issues in OpenMP* or threaded applications using the Win32* API and POSIX* threads Pinpoints performance bottlenecks that directly affect execution time Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide some background and introduction to the Intel Thread Profiler. Details Works on the three most prevalent threading models. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
6
Intel® Thread Profiler Features
Supports several different compilers Intel® C++ and Fortran Compilers, v7 and higher Microsoft* Visual* C++, v6 Microsoft* Visual* C++ .NET* 2002, 2003 & 2005 Editions Integrated into Microsoft Visual Studio .NET* IDE Binary instrumentation of applications Different views and filters available to assist and organize analysis Uses critical path analysis Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide information on supported compilers and some of the features of the tool. Details The critical path concept used within Thread Profiler is not the same as it is used within VTune. Critical path in TP will be defined and described in the next few slides. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
7
What is the Critical Path?
Threaded applications contain multiple execution flows A new flow is created when a thread is created or resumes Flow ends when a thread terminates or blocks on a synchronization primitive Thread 3 terminates Thread 1 Thread 2 Thread 3 T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 Acquire lock L Release L Wait for L Acquire L Thread 2 terminates Wait for L Acquire L Release L Thread 1 terminates Wait for Threads 2 & 3 Threads 2 & 3 Done Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Define the concept of critical path as used within Thread Profiler. Details With OpenMP programs, there is a very controlled shift between serial and parallel regions within the app. Thus, it is very straightforward to focus on the parallel portions of a code and zero-in on performance problems. Code that has been threaded with an explicit model will likely not have a very regular or predictable pattern of parallelism. Thus, a new method of analysis for performance of explicitly threaded apps is needed. The designers of TP came up with Critical Path Analysis. To understand critical path, students must understand an “execution flow” Build points: 1. All labels pop up. Describe what the graph is showing about the three threads execution and their interactions with each other, especially with respect to the lock. 2. Labels disappear and first execution flow is drawn. Explain why this is an execution flow. 3-7. All of the other exection flows are drawn one at time. Explain what causes the termination of the flow. Box pops up at the endpoint of each flow. 8. Definition box of “critical path” is displayed. The critical path is the longest execution flow Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
8
Critical Path Analysis
System Utilization Relative to the system executing the application Idle: no threads Serial: a single thread Under Utilized: more than one thread, less than cores Fully Utilized: # threads == # cores Over Utilized: # threads > # cores Thread interaction categories Cruise: threads running without interference Overhead: thread operation overhead Blocking: thread waiting on external event Impact: thread preventing some other thread from executing Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Define and describe the two types of information that are kept during a threaded run and associated with the critical path. Details Build points: 1. System utilization is the number of threads that are active. 2. System utilization categories 3. Second type of information kept is how the threads are interacting with each, especially with respect to synchronization objects and waiting for thread termination. 4. Thread interaction categories 5. This is why we worry about the critical path. Background (Include any support material, stories, anecdotes, or references that might be helpful to the instructor.) <Remove these lines and start typing here; retain the sub-heading.> Questions to Ask Students (Include any specific questions (and answers) that the instructor could/should ask to enhance understanding.) Transition Quote (Suggested dialog that can be used by the instructor to segue into the next slide.) If the critical path is shortened, the application will run in less time Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
9
System Utilization Examines processor utilization to determine concurrency level of the application Concurrency is the number of active threads Categorization shown for a system configuration with 2 processors Idle Serial Fully Utilized Under Utilized Over Utilized Concurrency Level 15 5 10 Time Thread 1 Thread 2 Thread 3 T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 Acquire lock L Wait for Threads 2 & 3 Wait for L Release L Acquire L Threads 2 & 3Done Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Demonstrate how Thread Profiler gathers system utilization (concurrency level) data along the critical path. Details A thread is “active” if it is ready to be run, but may not have sufficient processor resources. This will account for the over utilized case when there are more threads ready to run, but not enough cores. Build Points: 2-10. Another portion of the critical path will be marked with the concurrency level. Slightly delayed, the histogram on the right will be build up for each point in the build process. Describe how each concurrency level is determined on the 2 core system being simulated. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
10
Execution Time Categories
Analyze thread interaction and behavior along critical path Record objects that cause CP transitions Categorization shown for a system configuration with 2 processors Cruise time Overhead Blocking time Impact time Thread Interaction 15 5 10 Time Thread 1 Thread 2 Thread 3 T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 Acquire lock L Wait for Threads 2 & 3 Wait for L Release L Acquire L Threads 2 & 3 Done Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Demonstrate how Thread Profiler gathers thread interaction data along the critical path. Details A thread is “active” if it is ready to be run, but may not have sufficient processor resources. This will account for the oversubscribed case when there are more threads ready to run, but not enough cores. Build Points: 2-10. Another portion of the critical path will be marked with the thread interaction category. Slightly delayed, the histogram on the right will be build up for each point in the build process. Describe how each interaction category is determined on the 2 core system being simulated. Note also, that the object involved as the critical path transition from one thread to the other is recorded. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
11
Merging Concurrency and Behavior
Concurrency Level Critical Path Thread Behavior 15 5 10 Time Start with system utilization Further categorize by behavior Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Demonstrate how Thread Profiler combines system utilization and thread interaction. Details The legend offers categories that are a cross product of concurrency and behavior. Thus, the two data sets are merged to come up with the profile (summary) historgram. If you go back to compare the two “timeline” graphs in the previuos two slides, you will find that all Impact time was during serial execution (4 & 5). Also, all overhead time was also during serial execution (6). Thus, the rest of the cruise time happened during the rest of the concurrency levels. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
12
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Thread Profiler Views Critical Path View Shows breakdown of the critical path Profile View Shows the breakdown of selected critical paths User can select other views of the selected profile Concurrency level, threads, objects Timeline View Shows thread activity and critical path transitions for the entire application Source View Transition source view, creation source view Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Deliver a quick preview and description of the different views that are available to examine critical path data. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
13
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Activity 1a Threaded version of potential code Is there a performance issue? Goal Run application through Thread Profiler Examine thread activities by reviewing different views Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Introduce the first part of the First Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
14
Thread Profiler Proflie View
Profile Pane Timeline Pane Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe the panes displayed in the Profile View within Thread Profiler. Details This slide and the next five slides show the views that students should have seen in the previous activity. This is the chance to delve into those views and explain what data is being displayed and how to use each view. This is the default view after Thread Profiler has finished execution of the application. Builds show the names of each pane to point out that both Profile and Timeline Views are seen initially. Then, the expansion buttons in the upper right of each pane are shown. Transition Quote “Let’s take a look at the Profile View and groupings first.” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
15
Profile Pane – Concurrency Level View
Let’s look at the Thread View Ran single threaded ~65% of the time Two threads ran in parallel ~33% of the time Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe and show features of the Profile Pane – Concurrency Level filter. Details This is the default filter applied to the Critical Path at the initial visit to the Profile View. This filter shows the concurrency level data of the critical path. Build points: 1. Tab of Profile View to jump back at any time 2. Shows icon to get Concurrency Level filter if this is not displayed. 3-6 Tooltips for each different segment of the histograms appear when hovering the mouse. 7-8 Visual inspection of the histogram sizes can be used to approximate the amount of serial time versus parallel time. 9. Icon to use to get to the Threads View Transition Quote “Let’s look at the Threads View” (final build on slide) Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
16
Profile Pane – Thread View
Let’s look at the Object View Lifetime of the thread Active time of the thread Time on the Critical Path Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe and show features of the Profile Pane – Threads filter. Details The Threads View filter shows life time (creation to termination), active time (ready or actively running), and amount of time spent on the critical by each thread used in the application. Build Points: 1. Tooltip appears when hovering over label at bottom of histogram. Information is about the thread. Since threads are arbitrarily numbered by TP, you can tell where the thread was created from tooltip info. Students should notice three “shadows” shown in the display pane. 2. The lighter shaow is the life time of the thread from creation to termination. 3. The darker shadow is the (cumulative) active time of the thread 4. The multi-colored part is the amount of time the thread spent on the Critical Path with categories of concurrency X thread interaction Note that the “shadows” all start from 0. That is, imagine that there are three different histograms stacked one in front of the other. The Thread 1 Active time was spent all on the Critical Path, thus, there appears to be no Active time shadow. The Active Time of Thread 1 is all within the Critical Path since this was the only thread available to run when it was running. 5. Clear last box 6. Point to Objects Icon. Transition Quote “Let’s look at the Critical Path organized by synchronization objects used within the application.” (build 6) Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
17
Profile Pane – Object View
Let’s look at Timeline View This object caused all of the impact Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe and show features of the Profile Pane – Object View filter. Details The Objects View filter shows the synchronization objects that were used in the execution of the application. The histogram bars for each object detail the amount of time the object was involved on the Critical Path. Look for those objects that had impact time associated with them. These objects are the ones that could, if modified, reduce the overall critical path time and lower the total execution time of the application. Build Points: 1. All of the impact time along the Critical Path is associated with a single object. “What is this object?” 2. Tooltip shows information about object. Creation location within code is given to better identify the specific sync object (assuming there is more than one of the same type). 3. Point to the Timeline View Tab to get to that display Transition Quote “Let’s look at the Timeline view in Thread Profiler” (Build 3) Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
18
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Timeline Pane Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe and show features of the Timeline View. Details This view shows threads on the left and how they executed across the Y-axis from start to finish. Build Points: 0: Timeline Tab is circled. This is how to navigate to this view. 1: Remove circle. 2: Tooltip on threads to get more info on which thread is which. 3: remove tooltip The view here is densely packed and is hard to see any detail, so there is a way to zoom in to get finer detail of the execution. 4: Zoom in by clicking and dragging 5: Can keep zooming in until a level with the appropriate level of detail is reached. From here the legend on the right can be seen and the different colors and their meaning pointed out. The yellow lines are the transition of the critical path from one thread to another. 6: Hover mouse on transition to get tooltip details about the transition. 7: Right click on transition line to get context-sensitive menu and access to the “Transition Source View” command (results shown on next slide) Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
19
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Source View Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Describe and show features of the (Transition) Source View. Details This is the result of the “Transition Source View” command from the previous slide. It shows the point in the source code where the Critical Path was transition from one thread to another. The circled tab can be used to navigate back to this view in future. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
20
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Activity 1b Threaded version of potential code Is there a performance issue? Goal Examine thread activities by reviewing different views Determine system utilization Identify any performance issues Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Introduce the second part of the First Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
21
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Review Activity 1 Concurrency Level view can be used to determine system utilization by the application Timeline view enables you to understand the thread activity in your application Instrumentation time will be included in first run results; thus, for applications running in a short amount of time, a second run may produce more realistic timings. Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Review the salient points that were demonstrated in the first lab activity. Details Note that for large applications, the time to instrument the code will be added to the results the first time. Thus, the user may wish to ignore the first set of results, run the code through TP again and use the second results. The instrumentation will have already been don, so there will be no time lost to instrumentation in the second run. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
22
Common Performance Issues
Load balance Improper distribution of parallel work Synchronization Excessive use of global data, contention for the same synchronization object Parallel Overhead Due to thread creation, scheduling.. Granularity No sufficient parallel work Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Introduce (review) some of the common performance issues that can be found in multithreaded applications. Transition Quote “So let’s take a more detailed look at the first two of these issues. First up will be Load Imbalance.” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
23
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Load Imbalance Unequal work loads lead to idle threads and wasted time Busy Idle Time Thread 0 Thread 1 Thread 2 Thread 3 Start threads Join threads Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Give a graphical example of why load imbalance is a bad thing. Transition Quote “If you know that you have a load balance problem, how can you fix it?” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
24
Redistribute Work to Threads
Static assignment Are the same number of tasks assigned to each thread? Do tasks take different processing time? Do tasks change in a predictable pattern? Rearrange (static) order of assignment to threads Use dynamic assignment of tasks Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Give some ideas about how to fix a load imbalance when the work is statically assigned within the application. Details The obvious fix to a load imbalance is to redistribute the work. What if a static assignment of tasks is used? The questions on the slide should be asked to guide the programmer to finding a better distribution. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
25
Redistribute Work to Threads
Dynamic assignment Is there one big task being assigned? Break up large task to smaller parts Are small computations agglomerated into larger task? Adjust number of computations in a task More small computations into single task? Fewer small computations into single task? Bin packing heuristics Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Give some ideas about how to fix a load imbalance when the work is dynamically assigned within the application. Details If a dynamic distribution of tasks is used in the code, how should the work be redistributed? The questions and suggestions on the slide should be asked to guide the programmer to finding a better distribution. Bin packing heuristics try to take a set of objects and fill fixed size bins with those objects such that each bin contains roughly the same amount of things (by weight, by size, etc.) Transition Quote “Now that we understand what modifications to the code we might need to do, how does Thread Profiler show us when an application has a load imbalance?” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
26
Threads are unbalanced
Unbalanced Workloads Threads are unbalanced Active Times not equal Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Show how Load Imbalance between thread workloads can be identified from Thread Profiler. Details Build Details: 0: Flashing circle on the Threads View icon will blink several times. This is the View used to identify if there is a load imbalance. 1: We assert that the four worker threads have differing amounts of work to perform. Ask QUESTION from below. 2. Draw “stair step” pattern across the Active “shadow” in this view. Different amounts of active time == different amounts of work assigned to threads. Questions to Ask Students Q: Does the amount of time a thread has spent on the Critical Path demonstrate a load imbalance? A: No, the amount of time on the critical path by any thread for any given run is no indication of the amount of work done by the thread. It may be just dumb luck as to how much time a thread is found to have spent on the Critical Path. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
27
Activity 2 – Load Imbalance
Threaded version of potential code with thread pools Has a load balance performance issue Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Introduce the Second Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
28
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Review Activity 2 Threads view can be used to determine activity levels of each thread within the application Timeline view enables you to understand the thread activity in your application Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Review the lessons of the Second Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
29
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Synchronization By definition, synchronization serializes execution Lock contention means more idle time for threads Busy Idle In Critical Thread 0 Thread 1 Thread 2 Thread 3 Time Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Give a graphical example of why synchronization contention is a bad thing. Transition Quote “If you know that you have a synchronization problem, how can you fix it?” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
30
Synchronization Fixes
Eliminate synchronization Expensive but necessary “evil” Use storage local to threads Use local variable for partial results, update global after local computations Allocate space on thread stack (alloca) Use thread-local storage API (TlsAlloc) Use atomic updates whenever possible Some global data updates can use atomic operations (Interlocked API family) Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide some potential solutions to eliminating the need for synchronization objects. Details The best solution would be to eliminate all synchronization, but this is likely not going to be possible in order to keep the application correct. Alternatives to explicit synchronization are use of local storage or replace with atomic operations. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
31
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Atomic Updates Use Win32 Interlocked* intrinsics in place of synchronization object static long counter; // Fast InterlockedIncrement (&counter); // Slower EnterCriticalSection (&cs); counter++; LeaveCriticalSection (&cs); Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Review use of Interlocked* operations as atomic operations Details The code segments are equivalent, but the first will execute much faster than the second. Background The basic Interlocked* intrinsics includes: InterlockedAdd, InterlockedAnd, InterlockedBitTestAndReset, InterlockedCompareExchange, InterlockedDecrement, InterlockedExchange, InterlockedIncrement, InterlockedOr, and InterlockedOr. Some of these have several variations on bit-size for operands and memory access semantics. See for more details. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
32
Synchronization Fixes
Reduce size of critical regions protected by synchronization object Larger critical regions tie up sync objects longer; other threads sit idle longer waiting to acquire objects Only accesses to shared variables need to be protected Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide potential solutions to reducing the time spent in critical regions of code. Details If there are only 2 lines that need mutual exclusion protection within a set of 25 lines, rather than protect the entire block, focus the synchronization on the two required lines. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
33
Synchronization Fixes
Use best synchronization object for job Critical Section Local object Available to threads within the same process Lower overhead (~8X faster than mutex) Mutex Kernel object Accessible to threads within different processes Deadlock safety (can only be released by owner) Other objects are available Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Provide advice on being sure to use the appropriate synchronization object. Details This slide gives a comparison of two similar objects. Each has advantages and disadvantages, depending upon the requirements and the use. Transition Quote “Now that we know how we might fix a synchronization contention, let’s see how to use Thread Profiler to identify when synchronization object contention may be a performance bottleneck.” Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
34
Object Contention What is all this? These four threads…
This object caused all of the impact Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Demonstrate how synchronization object contention is identified in TP; and how to use the second level filtering can be done. Details Build Points: 0: From the Objects View 1: We see that all of the impact time has been attributed to one CRITICAL_SECTION object. 2: We can get to the object view from the blobby icon; and we see that the grouping tells us what filter is being displayed. 3-4: Besides the blobby icon, we can click on the [1 icon and use the pull down menu to get to any view under the Profile tab. Q: We may have the case where one or a few threads are causing all the impact time. If so, we can focus on what those threads are doing and make any fixes there. How can we tell which threads were involved with using the given sync object to impact other threads? 5-6: Click the [[2 icon to set up a second level grouping within the first grouping, for example choosing threads shows… 7:Object use subdivided by each thread. (Note the Grouping indicator at the top of the pane now shows both filters and the order that they are applied. 8-9: All four of the threads are impacting other threads (each other in this case), with CS 11. …are impacting threads by this object Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
35
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Activity 3 Threaded version of numerical integration Has serious performance issues Goal Understand thread activity Use the Thread Profiler groupings Examine synchronization and its effect on performance Fix performance issue Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Introduce the Third Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
36
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Review Activity 3 Grouping objects and threads provides the information on which objects impact what threads Apply the heuristics from labs for locating bottlenecks in the source code For longer running applications, the difference in first and second run- times is negligible Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Review the lessons of the Third Lab Activity. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
37
General Optimizations
Serial Optimizations Serial optimizations along the critical path should affect execution time Parallel Optimizations Reduce synchronization object contention Balance workload Functional parallelism Analyze benefit of increasing number of processors Analyze the effect of increasing the number of threads on scaling performance Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Present advice on general tuning and some specific tuning advice for parallel applications. Details Always do serial optimizations and library substitution before starting to thread applications. It is embarrassing to have spent several months threading a code only to make some serial modifications that then make all the threading effort slow the code down because the granularity of the computations has been severely reduced. What about scaling? How will more cores affect execution time, especially in the face of increased data set sizes? Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
38
Intel® Thread Profiler for Explicit Threads What’s Been Covered
Identifying performance issues can be time consuming without tools Tools are required to understand and to optimize parallel efficiency and hardware utilization Thread Profiler helps you understand your applications thread activity, system utilization, and scaling performance Multi-core Programming: Intel Thread Profiler for Explicit Threads Speaker’s Notes Purpose of the Slide Summary slide of what has been covered in modlue. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
39
Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
This should always be the last slide of all presentations. Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.