Multi-core Programming Thread Profiler. 2 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Topics Look at Intel® Thread Profiler features.

Slides:



Advertisements
Similar presentations
Intel Software College Tuning Threading Code with Intel® Thread Profiler for Explicit Threads.
Advertisements

Intel® performance analyze tools Nikita Panov Idrisov Renat.
Bilgisayar Mühendisliği Bölümü GYTE - Bilgisayar Mühendisliği Bölümü Multithreading the SunOS Kernel J. R. Eykholt, S. R. Kleiman, S. Barton, R. Faulkner,
CHAPTER 5 THREADS & MULTITHREADING 1. Single and Multithreaded Processes 2.
Toward Efficient Support for Multithreaded MPI Communication Pavan Balaji 1, Darius Buntinas 1, David Goodell 1, William Gropp 2, and Rajeev Thakur 1 1.
The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
Multi-core Programming Thread Checker. 2 Topics What is Intel® Thread Checker? Detecting race conditions Thread Checker as threading assistant Some other.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 5: Threads Overview Multithreading Models Threading Issues Pthreads Solaris.
Copyright © 2002, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
Based on Silberschatz, Galvin and Gagne  2009 Threads Definition and motivation Multithreading Models Threading Issues Examples.
Process Concept An operating system executes a variety of programs
A. Frank - P. Weisberg Operating Systems Introduction to Tasks/Threads.
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Multithreading Allows application to split itself into multiple “threads” of execution (“threads of execution”). OS support for creating threads, terminating.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Multi-core Programming Threading Methodology. 2 Topics A Generic Development Cycle.
Processes and Threads CS550 Operating Systems. Processes and Threads These exist only at execution time They have fast state changes -> in memory and.
1. 10/24/ Upon completion of this module, you will be able to: Use Thread Checker to detect and identify a variety of threading correctness issues.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
Threads G.Anuradha (Reference : William Stallings)
Source: Operating System Concepts by Silberschatz, Galvin and Gagne.
Chapter 2 Processes and Threads Introduction 2.2 Processes A Process is the execution of a Program More specifically… – A process is a program.
Multithreaded Programing. Outline Overview of threads Threads Multithreaded Models  Many-to-One  One-to-One  Many-to-Many Thread Libraries  Pthread.
CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.
Department of Computer Science and Software Engineering
Threads, Thread management & Resource Management.
1 OS Review Processes and Threads Chi Zhang
Threads. Readings r Silberschatz et al : Chapter 4.
Windows CE Overview and Scheduling Presented by Dai Kawano.
4.1 Introduction to Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads.
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 4: Threads.
3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
Operating System Concepts
Where Testing Fails …. Problem Areas Stack Overflow Race Conditions Deadlock Timing Reentrancy.
Tuning Threaded Code with Intel® Parallel Amplifier.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Contents 1.Overview 2.Multithreading Model 3.Thread Libraries 4.Threading Issues 5.Operating-system Example 2 OS Lab Sun Suk Kim.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Advanced Operating Systems CS6025 Spring 2016 Processes and Threads (Chapter 2)
7/9/ Realizing Concurrency using Posix Threads (pthreads) B. Ramamurthy.
1 Chapter 5: Threads Overview Multithreading Models & Issues Read Chapter 5 pages
Lecture 5. Example for periority The average waiting time : = 41/5= 8.2.
Chapter 4 – Thread Concepts
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to threads
Using the VTune Analyzer on Multithreaded Applications
Processes and threads.
Processes and Threads Processes and their scheduling
Parallel Programming By J. H. Wang May 2, 2017.
Chapter 4 – Thread Concepts
Computer Engg, IIT(BHU)
Parallel Algorithm Design
Operating System Concepts
Prof. Chih-Hung Wu Dept. of Electrical Engineering
Chapter 4 Multithreading programming
Intel® Parallel Studio and Advisor
Chapter 4: Threads.
Modified by H. Schulzrinne 02/15/10 Chapter 4: Threads.
Tuning Threading Code with Intel® Thread Profiler for Explicit Threads
Multithreaded Programming
Operating System Introduction.
Chapter 4: Threads & Concurrency
Lecture 2 The Art of Concurrency
Foundations and Definitions
Chapter 4: Threads.
CSC Multiprocessor Programming, Spring, 2011
Presentation transcript:

Multi-core Programming Thread Profiler

2 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Topics Look at Intel® Thread Profiler features Define Critical Path Analysis Examine Thread Profiler data views available Review common performance issues of multithreaded applications – Focus on Load imbalance – Focus on Synchronization contention Describe general optimizations to gain better performance

3 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Motivation Developing efficient multithreaded applications is hard New performance problems are caused by the interaction between concurrent threads – Load imbalance – Contention on synchronization objects – Threading overhead

4 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Intel® Thread Profiler Plugs in to the VTune™ performance environment – Instrumentation-based data collector in VTune Identifies performance issues in OpenMP* or threaded applications using the Win32* API, POSIX* threads, and Intel® Threading Building Blocks Pinpoints performance bottlenecks that directly affect execution time

5 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Intel® Thread Profiler Features Supports several different compilers – Intel® C++ and Fortran Compilers, v7 and higher – Microsoft* Visual* C++.NET* 2002, 2003 & 2005 Editions Integrated into Microsoft Visual Studio.NET* IDE Binary instrumentation of applications Different views and filters available to assist and organize analysis Uses critical path analysis

6 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads critical path is the longest execution flow The critical path is the longest execution flow What is the Critical Path? Threaded applications contain multiple execution flows A new flow is created when a thread is created or resumes Flow ends when a thread terminates or blocks on a synchronization primitive Thread 1 Thread 2 Thread 3 T0T0 T1T1 T2T2 T3T3 T4T4 T5T5 T6T6 T7T7 T8T8 T9T9 T 10 T 11 T 12 T 13 T 14 T 15 Acquire L Threads 2 & 3 Done Acquire L Wait for Threads 2 & 3 Release L Acquire lock L Wait for L Release LWait for L Thread 2 terminates Thread 3 terminates Thread 1 terminates

System Utilization – Relative to the system executing the application Idle: no threads Serial: a single thread Under Utilized: more than one thread, less than cores Fully Utilized: # threads == # cores Over Utilized: # threads > # cores Thread interaction categories Cruise: threads running without interference Overhead: thread operation overhead Blocking: thread waiting on external event Impact: thread preventing some other thread from executing 7 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Critical Path Analysis If the critical path is shortened, the application will run in less time

8 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Thread 1 Thread 2 Thread 3 T0T0 T1T1 T2T2 T3T3 T4T4 T5T5 T6T6 T7T7 T8T8 T9T9 T 10 T 11 T 12 T 13 T 14 T 15 Acquire lock L Wait for Threads 2 & 3 Wait for L Release LWait for L Release L Acquire L Threads 2 & 3Done System Utilization Examines processor utilization to determine concurrency level of the application Concurrency is the number of active threads Categorization shown for a system configuration with 2 processors IdleSerialFully UtilizedUnder UtilizedOver Utilized Concurrency Level Time

9 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Execution Time Categories Analyze thread interaction and behavior along critical path Record objects that cause CP transitions Cruise timeOverheadBlocking timeImpact time Categorization shown for a system configuration with 2 processors Thread Interaction Time Thread 1 Thread 2 Thread 3 T0T0 T1T1 T2T2 T3T3 T4T4 T5T5 T6T6 T7T7 T8T8 T9T9 T 10 T 11 T 12 T 13 T 14 T 15 Acquire lock L Wait for Threads 2 & 3 Wait for L Release LWait for L Release L Acquire L Threads 2 & 3 Done

10 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Merging Concurrency and Behavior Concurrency Level Critical Path Thread Behavior Time Start with system utilization Further categorize by behavior

11 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Thread Profiler Views Critical Path View – Shows breakdown of the critical path Profile View – Shows the breakdown of selected critical paths – User can select other views of the selected profile – Concurrency level, threads, objects Timeline View – Shows thread activity and critical path transitions for the entire application Source View – Transition source view, creation source view

12 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Thread Profiler Proflie View Profile Pane Timeline Pane

13 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Profile Pane – Concurrency Level View Concurrency Level View Two threads ran in parallel ~33% of the time Ran single threaded ~65% of the time Let’s look at the Thread View

14 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Profile Pane – Thread View Time on the Critical Path Active time of the thread Lifetime of the thread Let’s look at the Object View

15 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Profile Pane – Object View This object caused all of the impact Let’s look at Timeline View

16 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Timeline Pane

17 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Source View

18 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Common Performance Issues Load balance – Improper distribution of parallel work Synchronization – Excessive use of global data, contention for the same synchronization object Parallel Overhead – Due to thread creation, scheduling.. Granularity – No sufficient parallel work

19 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Load Imbalance Unequal work loads lead to idle threads and wasted time Busy Idle Time Thread 0 Thread 1 Thread 2 Thread 3 Start threads Join threads

20 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Redistribute Work to Threads Static assignment – Are the same number of tasks assigned to each thread? – Do tasks take different processing time? Do tasks change in a predictable pattern? – Rearrange (static) order of assignment to threads Use dynamic assignment of tasks

21 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Redistribute Work to Threads Dynamic assignment – Is there one big task being assigned? Break up large task to smaller parts – Are small computations agglomerated into larger task? Adjust number of computations in a task More small computations into single task? Fewer small computations into single task? Bin packing heuristics

22 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Unbalanced Workloads Threads are unbalanced Active Times not equal

23 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Synchronization By definition, synchronization serializes execution Lock contention means more idle time for threads Busy Idle In Critical Thread 0 Thread 1 Thread 2 Thread 3 Time

24 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Synchronization Fixes Eliminate synchronization – Expensive but necessary “evil” – Use storage local to threads Use local variable for partial results, update global after local computations Allocate space on thread stack ( alloca ) Use thread-local storage API (TlsAlloc) – Use atomic updates whenever possible Some global data updates can use atomic operations (Interlocked API family)

25 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Atomic Updates Use Win32 Interlocked* intrinsics in place of synchronization object static long counter; // Fast InterlockedIncrement (&counter); // Slower EnterCriticalSection (&cs); counter++; LeaveCriticalSection (&cs);

26 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Synchronization Fixes Reduce size of critical regions protected by synchronization object – Larger critical regions tie up sync objects longer; other threads sit idle longer waiting to acquire objects – Only accesses to shared variables need to be protected

27 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Synchronization Fixes Use best synchronization object for job – Critical Section Local object Available to threads within the same process Lower overhead (~8X faster than mutex) – Mutex Kernel object Accessible to threads within different processes Deadlock safety (can only be released by owner) Other objects are available

28 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Object Contention This object caused all of the impact What is all this? These four threads… …are impacting threads by this object

29 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads General Optimizations Serial Optimizations – Serial optimizations along the critical path should affect execution time Parallel Optimizations – Reduce synchronization object contention – Balance workload – Functional parallelism Analyze benefit of increasing number of processors Analyze the effect of increasing the number of threads on scaling performance

30 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Intel® Thread Profiler for Explicit Threads What’s Been Covered Identifying performance issues can be time consuming without tools Tools are required to understand and to optimize parallel efficiency and hardware utilization Thread Profiler helps you understand your applications thread activity, system utilization, and scaling performance