Tuning Threaded Code with Intel® Parallel Amplifier.


1 Tuning Threaded Code with Intel® Parallel Amplifier

Copyright © 2009, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

2 Objectives: After successful completion of this module you will be able to use Parallel Amplifier to recognize and fix common performance problems in threaded applications.

3 Agenda: Look at Intel® Parallel Amplifier features. Examine the Parallel Amplifier data views available. Review common performance issues of multithreaded applications, with a focus on load imbalance and on synchronization contention. Describe general optimizations to gain better performance.

4 Motivation: Developing efficient multithreaded applications is hard. New performance problems are caused by the interaction between concurrent threads: load imbalance, contention on synchronization objects, and threading overhead.

5 Intel® Parallel Amplifier: Performance analysis tool for threaded software. Plug-in to Microsoft* Visual Studio*. Identifies performance issues in OpenMP*, Intel® Threading Building Blocks, and Win32* threaded software. Pinpoints performance bottlenecks that directly affect execution time.

6 Intel® Parallel Amplifier Features: Integrated into the Microsoft Visual Studio.NET* IDE (2005 and 2008 editions). Supports different compilers: Microsoft* Visual* C++.NET* and Intel Parallel Composer. Binary instrumentation of applications. Different views and filters are available to assist and organize analysis. Concurrency Analyzer: how well are core resources utilized? Locks and Waits Analyzer: where is the program waiting?

7 Getting Started with Parallel Amplifier: Build the application under the Release configuration. Have debugging options set (/Zi /DEBUG). Use Multithreaded DLL for the Runtime Library (/MD). Ensure code is relocatable (/FIXED:NO). Use a smaller but representative workload: not too big, since run time will increase, and not too small, so that performance problems become evident.

8 Getting Started with Parallel Amplifier: After a successful build, choose the type of analysis from the Tools -> Parallel Amplifier menu.

9 Activity 1a: Threaded version of number theory code. Is there a performance issue? Goal: run the application through the Parallel Amplifier Concurrency and Locks and Waits analyses, and examine thread activities by using the different analysis tools and reviewing the different views available from the tool.

10 Parallel Amplifier: Concurrency Views. The Concurrency Analysis pane defaults to the Bottom-up view, organized by function. A histogram displays the amount of concurrency within a function during execution. Colors distinguish the level of concurrency for the platform the application was run on.

11 Parallel Amplifier: Concurrency Views. Top-down Tree: shows the call tree path to points within the execution and the concurrency levels of those functions. Note that the highlighted section (blue) corresponds to the single function highlighted on the Bottom-up display.

12 Parallel Amplifier: Concurrency Views. Source code tab: shows source code from the selected function. Concurrency from executed lines is shown in the “CPU Time” column.

13 Parallel Amplifier: Concurrency Views. Summary Graph pane: the text section at the top of the pane summarizes the Concurrency data. The graph section displays the amount of time the application spent executing at given levels of concurrency. Average CPU Usage is a measure of parallel efficiency. Tooltips give details when the pointer hovers over the squares along the graph.

14 Parallel Amplifier: Locks and Waits Views. The Locks and Waits Analysis pane defaults to the Bottom-up view, organized by object. A histogram displays the amount of time threads spent waiting on each object during execution. Colors distinguish the level of concurrency.

15 Parallel Amplifier: Locks and Waits Views. Top-down Tree: shows the call tree path to points within the execution and the wait time caused by synchronization objects or I/O functions.

16 Parallel Amplifier: Locks and Waits Views. Source code tab: shows the source code line(s) involved in critical regions of code protected by synchronization objects identified by Amplifier.

17 Activity 1b: Threaded version of number theory code. Is there a performance issue? Goal: examine thread activities by reviewing the different views, determine system utilization, and identify any performance issues.

18 Common Performance Issues. Load balance: improper distribution of parallel work. Synchronization: excessive use of global data, contention for the same synchronization object. Parallel overhead: due to thread creation, scheduling, and so on. Granularity: not sufficient parallel work.

19 Load Imbalance: Unequal work loads lead to idle threads and wasted time. [Figure: timeline of Thread 0 through Thread 3 between “Start threads” and “Join threads”, with each thread's time marked as Busy or Idle.]
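To put a number on the idle time in the load imbalance picture, the following is a minimal sketch in C. It assumes you already have a busy time per thread between start and join (for example, read off a profiler timeline); the function names `total_idle_time` and `load_balance_efficiency` are hypothetical and are not part of Parallel Amplifier.

```c
#include <assert.h>
#include <stddef.h>

/* Total time threads spend idle while waiting for the slowest thread.
   busy[i] is the (hypothetical) busy time of thread i between start and join. */
double total_idle_time(const double *busy, size_t nthreads)
{
    assert(busy != NULL && nthreads > 0);
    double longest = busy[0];
    for (size_t i = 1; i < nthreads; ++i)
        if (busy[i] > longest)
            longest = busy[i];

    double idle = 0.0;
    for (size_t i = 0; i < nthreads; ++i)
        idle += longest - busy[i];   /* each thread waits until the slowest finishes */
    return idle;
}

/* Fraction of the available thread time spent on useful work.
   1.0 means perfectly balanced; lower values mean more wasted time. */
double load_balance_efficiency(const double *busy, size_t nthreads)
{
    assert(busy != NULL && nthreads > 0);
    double longest = busy[0];
    double total = busy[0];
    for (size_t i = 1; i < nthreads; ++i) {
        total += busy[i];
        if (busy[i] > longest)
            longest = busy[i];
    }
    return longest > 0.0 ? total / (longest * (double)nthreads) : 1.0;
}
```

With four threads busy for 4, 2, 3, and 1 time units, the threads wait a combined 6 units for the slowest thread, and only 10 of the 16 available units are spent working.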

20 Redistribute Work to Threads: Static assignment. Are the same number of tasks assigned to each thread? Do tasks take different processing time? Do tasks change in a predictable pattern? Options: rearrange the (static) order of assignment to threads, or use dynamic assignment of tasks.
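One way to rearrange a static assignment can be sketched with OpenMP loop schedules. This is an illustration under assumptions, not the code used in the activities: the trial-division prime counter is a hypothetical workload whose cost per iteration grows predictably with `n`, and the function names are made up for this example. The code also compiles and gives the same result without OpenMP enabled, since the pragmas are then ignored.

```c
#include <assert.h>

/* Trial-division primality test. The cost grows with n, so later
   iterations of the counting loops below are more expensive. */
static int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (long long d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Block assignment: schedule(static) gives each thread one contiguous
   range, so the thread holding the largest numbers does the most work. */
int count_primes_block(int limit)
{
    int count = 0;
    #pragma omp parallel for schedule(static) reduction(+:count)
    for (int n = 2; n <= limit; ++n)
        count += is_prime(n);
    return count;
}

/* Cyclic assignment: schedule(static, 1) deals iterations out round-robin,
   so cheap and expensive iterations are spread across all threads. */
int count_primes_cyclic(int limit)
{
    int count = 0;
    #pragma omp parallel for schedule(static, 1) reduction(+:count)
    for (int n = 2; n <= limit; ++n)
        count += is_prime(n);
    return count;
}
```

Both functions return the same count; only the mapping of iterations to threads differs, which is what a concurrency profile would show as a change in idle time.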

21 Redistribute Work to Threads: Dynamic assignment. Is there one big task being assigned? Break up the large task into smaller parts. Are small computations agglomerated into a larger task? Adjust the number of computations in a task: more small computations in a single task, or fewer? Consider bin packing heuristics.
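Adjusting how many small computations go into one dynamically assigned task can be sketched with the chunk size of an OpenMP dynamic schedule. This is a minimal sketch under assumptions: the Collatz step count is a hypothetical stand-in for tasks with irregular cost, and `total_collatz_steps` and its `chunk` parameter are names invented for this example.

```c
#include <assert.h>

/* Number of Collatz steps needed to reach 1 from n. The cost varies
   irregularly with n, standing in for tasks of unpredictable length. */
static int collatz_steps(unsigned long long n)
{
    int steps = 0;
    while (n > 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

/* Dynamic assignment: idle threads grab the next `chunk` iterations.
   A small chunk balances better but costs more scheduling overhead;
   a large chunk lowers overhead but can leave threads idle at the end. */
long total_collatz_steps(int limit, int chunk)
{
    assert(limit >= 1 && chunk >= 1);
    long total = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int n = 1; n <= limit; ++n)
        total += collatz_steps((unsigned long long)n);
    return total;
}
```

The result does not depend on `chunk`; trying a few chunk sizes and comparing the concurrency profile of each run is one way to find a reasonable task size for a given workload.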

22 Unbalanced Workloads

23 Unbalanced Workloads

24 Activity 2: Load Imbalance. Threaded version of potential code with thread pools. Has a load balance performance issue.

25 Synchronization: By definition, synchronization serializes execution. Lock contention means more idle time for threads. [Figure: timeline of Thread 0 through Thread 3 over time, with each thread's time marked as Busy, Idle, or In Critical.]

26 Synchronization Fixes. Eliminate synchronization: expensive but necessary “evil”. Use storage local to threads: use a local variable for partial results and update the global after local computations; allocate space on the thread stack (alloca); use the thread-local storage API (TlsAlloc). Use atomic updates whenever possible: some global data updates can use atomic operations (Interlocked API family).
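The "local variable for partial results" fix can be sketched as follows. This is a minimal OpenMP illustration rather than the Win32 code the slide alludes to; the function name `sum_of_squares` and its arguments are hypothetical. Each thread accumulates into a private variable and touches the shared total exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares with one synchronized update per thread instead of
   one per element. */
long sum_of_squares(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;                      /* shared result */

    #pragma omp parallel
    {
        long partial = 0;                /* private: declared inside the region */

        #pragma omp for nowait
        for (long i = 0; i < (long)n; ++i)
            partial += (long)data[i] * data[i];

        /* Single synchronized update of the global per thread. */
        #pragma omp atomic
        total += partial;
    }
    return total;
}
```

Compared with updating `total` inside the loop under a lock, this reduces the number of synchronized operations from one per element to one per thread.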

27 Atomic Updates: Use OpenMP atomic constructs in place of critical regions, if possible.

    static int counter;

    // Fast
    #pragma omp atomic
    counter++;

    // Slower
    #pragma omp critical (cLock)
    counter++;
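For context, the two constructs might be used inside a parallel loop like this. It is a sketch under assumptions: counting even values is a hypothetical task chosen only to give the counter something to count, and the function names are invented. Both versions produce the same result; they differ only in how the shared update is protected.

```c
#include <assert.h>
#include <stddef.h>

/* Counts even values; the shared counter is protected by an atomic update. */
int count_even_atomic(const int *data, int n)
{
    assert(data != NULL || n == 0);
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (data[i] % 2 == 0) {
            #pragma omp atomic
            counter++;
        }
    }
    return counter;
}

/* Same count, but the update goes through a named critical region,
   which is typically the heavier mechanism. */
int count_even_critical(const int *data, int n)
{
    assert(data != NULL || n == 0);
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (data[i] % 2 == 0) {
            #pragma omp critical (cLock)
            counter++;
        }
    }
    return counter;
}
```

How much faster the atomic version runs depends on the platform and on how often the update is reached, so it is worth confirming with a Locks and Waits run rather than assuming a particular gain.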

28 Synchronization Fixes: Reduce the size of critical regions protected by a synchronization object. Larger critical regions tie up sync objects longer, so other threads sit idle longer waiting to acquire them. Only accesses to shared variables need to be protected.
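Shrinking a critical region can be sketched like this. The example is hypothetical: `expensive_score` stands in for any costly per-item computation, and `index_of_best` and the lock name `best_lock` are invented for illustration. The scoring runs outside the critical region, and only the comparison and update of the shared variables are protected.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a costly per-item computation (hypothetical). */
static double expensive_score(double v)
{
    return v * v - 3.0 * v;
}

/* Returns the index of the value with the highest score.
   Ties go to the lower index so the result is deterministic. */
int index_of_best(const double *values, int n)
{
    assert(values != NULL && n > 0);
    int best_index = 0;
    double best_score = expensive_score(values[0]);

    #pragma omp parallel for
    for (int i = 1; i < n; ++i) {
        /* Outside the critical region: threads compute scores concurrently. */
        double s = expensive_score(values[i]);

        /* Inside: only the accesses to the shared best_score/best_index. */
        #pragma omp critical (best_lock)
        {
            if (s > best_score || (s == best_score && i < best_index)) {
                best_score = s;
                best_index = i;
            }
        }
    }
    return best_index;
}
```

Placing the call to `expensive_score` inside the critical region would give the same answer but would serialize the whole loop body, which is the pattern the slide warns against.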

29 Object Contention: From the Locks and Waits analysis, choose the object with the poorest CPU utilization. Double-click the Sync Object Name to go to the source code line.

30 Activity 3: Threaded version of numerical integration. Has serious performance issues. Goal: understand thread activity, use the Parallel Amplifier to examine synchronization and its effect on performance, and fix the performance issue.

31 General Optimizations. Serial optimizations: serial optimizations along the critical path should affect execution time. Parallel optimizations: reduce synchronization object contention, balance the workload, and use functional parallelism. Analyze the benefit of increasing the number of processors. Analyze the effect of increasing the number of threads on scaling performance.

32 Intel® Parallel Amplifier: What’s Been Covered. Identifying performance issues can be time consuming without tools. Tools are required to understand and to optimize parallel efficiency and hardware utilization. Parallel Amplifier helps you understand your application's thread activity, system utilization, and scaling performance.

33 Tuning Threaded Code: Intel® Parallel Amplifier for Explicit Threads

