1
Parallel Processing and Multi-Core Programming in the Cloud-Computing Age
Prof. Chih-Hung Wu Dept. of Electrical Engineering National University of Kaohsiung URL: Multi-Core Programming for Windows. Note: Part of this PPT file is from Intel Software College (Intel.com) 2018/9/20
2
Agenda Course Introduction Multithreaded Programming Concepts
Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the topics to be covered in this chapter. 2018/9/20
3
Intel® Thread Profiler – A Detailed Study
Chapter 6 2018/9/20
4
Objectives At the end of the chapter, you will be able to:
Use Intel® Thread Profiler to recognize common performance problems in applications while working with Windows threads. Fix common performance problems identified by Intel Thread Profiler in applications that use Windows threads. Ask the participants, what they expect from the session. After the participants share their expectations, introduce the following objectives of the session: As a programmer, you need to create code that reduces the execution time of your applications, increases processor utilization, and enhances the performance of your applications. To reduce execution time, you may convert serial applications to threaded applications. With increase in the complexity of code, new performance problems, such as stalled threads, synchronization delays between multiple threads, parallelization and threading overheads, and under- or over-utilization of processors, may get introduced. Therefore, you need a tool that helps you analyze threading patterns and thread interaction and behavior in the multithreaded applications. Intel has developed a threading tool that can help you pinpoint the locations in the code that directly affect the execution time. It also identifies synchronization issues and excessive blocking time on resources by threads. In this chapter, you will learn about the features and views available with Intel Thread Profiler. In addition, you will learn how to best use this threading performance analysis tool. Explain the objectives of the session. Clarify their queries (if any) related to these objectives. 2018/9/20
5
Using Intel® Thread Profiler
Definition: Intel® Thread Profiler is a threading tool that you can use to analyze the performance of applications threaded with Windows threads API, OpenMP, and POSIX threads. In Microsoft Windows, Intel Thread Profiler is a plug-in for the Intel® VTuneTM Performance Analyzer. You can use Intel Thread Profiler to perform the following functions: Group performance data by categories such as thread and source location. Sort data according to various criteria such as type of activity time, concurrency level, or object type. Identify source locations in your code that cause performance problems. Intel Thread Profiler helps you analyze the performance of multithreaded applications by using one of the following threading methods: OpenMP Windows API or POSIX Threads (Pthreads) Provide background and introduction to the Intel Thread Profiler. Intel Thread Profiler is a threading tool that you can use to analyze the performance of applications threaded with Windows threads API, OpenMP, and POSIX threads. In Microsoft Windows, Intel Thread Profiler is a plug-in for the VTuneTM Performance Analyzer. Intel Thread Profiler is implemented as a view within the VTune Performance Analyzer. You can use Intel Thread Profiler to perform the following functions: Group performance data by categories such as thread, synchronization object, concurrency level, or source location. Sort data according to various criteria such as type of activity time, concurrency level, or object type. Identify source locations in your code that cause performance problems. Intel Thread Profiler helps you analyze the performance of multithreaded applications by using one of the following threading methods: OpenMP: Intel Thread Profiler graphically displays the performance results of a parallel application. You can use Intel Thread Profiler to: Analyze performance by using different configuration options when your code is run, such as changing the Runtime Engine, the thread scheduling method, or the number(s) of threads used to run your application. Locate sections of the code that adversely affect the parallel running of different threads. Identify the set of directives best suited for parallelizing your code. Estimate the scalability of your code by using multiple processors. Windows API or POSIX Threads (Pthreads): You can use Intel Thread Profiler to analyze the performance of your Windows API or POSIX threaded applications. Intel Thread Profiler helps you: Visualize the thread interaction and behavior in your multithreaded applications. Analyze the impact of the performance of your application by using different synchronization methods, numbers of threads, or algorithms. Compare the scalability of your application across different number of processors. Locate synchronization constructs that delay the execution of your application. Identify sections of code that can be tuned to optimize the sequential and threaded performance. Pthreads is a set of threading interfaces developed by the IEEE (Institute of Electrical and Electronics Engineers). Pthreads specifies the API to handle most of the actions required by threads. Clarify their queries (if any) related to this slide. 2018/9/20
6
Common Performance Issues
Intel® Thread Profiler helps you identify the following performance issues: Load imbalance: Intel Thread Profiler can collect load balancing data by monitoring activity time of threads during execution of the application. Contention on synchronization objects: You can use Intel Thread Profiler to locate areas of contention in your application, for which the search can otherwise be tedious and time-consuming. Threading overhead: Intel Thread Profiler tracks the time spent in threading API calls to collect data about the amount of time an application spends in overhead. Provide some motivation to study the Intel Thread Profiler. In multithreaded applications, continuous interactions between concurrent threads cause performance problems. Intel Thread Profiler helps you identify the following performance issues: Load imbalance Contention on synchronization objects Threading overhead Explain how can you use Intel Thread Profiler to resolve the common performance issues identified by Intel Thread Profiler. Clarify their queries (if any) related to this slide. 2018/9/20
7
Intel® Thread Profiler Basics
To identify and resolve performance problems, Intel® Thread Profiler: Supports several compilers Performs binary instrumentation on applications Uses critical path analysis Introduce some of the features of Intel Thread Profiler. Some of the features of Intel Thread Profiler are: It supports several different compilers. It performs binary instrumentation on 32- and 64-bit applications. It uses critical path analysis. Clarify their queries (if any) related to this slide. 2018/9/20
8
Supported Compilers Intel® Thread Profiler supports the following compilers: Microsoft Windows Systems: Microsoft Visual C++ .NET 2002 Edition Microsoft Visual C++ .NET 2003 Edition Microsoft Visual C++ .NET 2005 Edition Microsoft Visual C++ v6.0 or Higher Required Software for OpenMP Analysis or Source Instrumentation on Windows: Intel® C++ Compiler for Windows 8.1, Package ID: w_cc_pc_ or Higher Intel® Fortran Compiler for Windows 8.1, Package ID: w_fc_pc_ or Higher Required Software for OpenMP Analysis or Source Instrumentation on Linux: Intel® C++ Compiler for Linux 8.1, Package ID: l_cc_p_ or Higher Intel® Fortran Compiler for Linux 8.1, Package ID: l_fc_p_ or Higher Provide information on supported compilers. You can use Intel Thread Profiler to visualize concurrency, kernel overhead, and sequencing dependencies within the threaded code. Although OpenMP is supported in the Microsoft Visual C++ .NET 2005 Edition compiler, only OpenMP applications compiled with Intel compilers can be analyzed with the OpenMP facilities within Intel Thread Profiler. Clarify their queries (if any) related to this slide. 2018/9/20
9
Binary Instrumentation of Applications
Binary Instrumentation inserts code into an executable file that collects data about: How threads execute on the hardware within the target system. How threads interact with each other. How much time is spent in threading API calls. You must use binary instrumentation in the following situations: Inaccessibility of an appropriate Intel® compiler. Inability to rebuild the application due to lack of time. Inaccessibility of the source code to rebuild the application. Explain how Intel Thread Profiler performs binary instrumentation on 32-bit applications. Define binary instrumentation. Binary instrumentation analysis includes the inspection of the following operations: Thread synchronization Thread creation, termination, suspension, and continuation Blocking APIs such as sleep and I/O operations Windows messaging APIs Explain the utility of binary instrumentation in different situations. Clarify their queries (if any) related to this slide. 2018/9/20
10
Instrumentation Level
Levels of Instrumentation
Intel® Thread Profiler provides the following levels of instrumentation:
All Functions: Instruments dynamically linked application program interfaces (APIs) and statically linked C-runtime APIs. To obtain symbolic information about statically linked C-runtime APIs, build with debug information.
API Imports (default): Instruments dynamically linked APIs. This is the recommended setting; it takes considerably less time to instrument than the All Functions setting.
Module Imports: Does not instrument any APIs. Use this level if you do not want to analyze the module or if instrumentation at the other levels fails.
Present the several levels of instrumentation provided by Intel Thread Profiler. Intel Thread Profiler provides several instrumentation-level settings, from the most advanced to the most basic. If you are unable to run your application from within the VTune Performance environment and you can use the Intel compilers to build your application, you may use source instrumentation instead. Add the /Qtprofile switch when you compile the application with the Intel compiler and run the application as normal. The compiler adds the required instrumentation to collect data as the code executes. Running the instrumented version of the application generates a data file, tprofile.tp. To analyze the data, import the data file into the VTune Performance environment. Clarify their queries (if any) related to this slide. 2018/9/20
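As a hedged illustration of the source-instrumentation workflow described above: /Qtprofile is the switch named in the notes, while the source file name and the /Zi debug-information flag are assumptions added for this example. A build with the Intel C++ compiler on Windows might look like:

icl /Qtprofile /Zi my_threaded_app.cpp

Running the resulting executable then produces the tprofile.tp data file, which can be imported into the VTune Performance environment for analysis.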
11
Introduction to Critical Path Analysis
Multithreaded applications contain multiple execution flows. Intel® Thread Profiler for Windows Threads defines an execution flow as the time that a thread runs during the execution of the application. An execution flow: Starts from the beginning of the application. May weave through multiple threads during the lifetime of the run. Ends when a thread waits on synchronization objects or terminates. Splits into an additional flow when a thread creates a new thread or releases a synchronization object to allow another thread to resume. Introduce the concept of critical path as used within Intel Thread Profiler. Define an execution flow. Multithreaded applications contain multiple execution flows. Intel Thread Profiler for Windows Threads defines execution flow as the time that a thread runs during execution of the application. All execution flows start from the beginning of the application and may weave through multiple threads during the lifetime of the run. Two other major characteristics about defining an execution flow are: An execution flow ends when a thread waits on synchronization objects or terminates. An execution flow splits into an additional flow when a thread creates a new thread or releases a synchronization object to allow another thread to resume. Therefore, there can be multiple flows at a time that overlap within an execution, and these execution flows split across threads when the threads interact with each other and the synchronization objects. While analyzing explicitly threaded applications, Intel Thread Profiler keeps track of program execution flow data. It also gathers data about synchronization objects used during the execution. Clarify their queries (if any) related to this slide. 2018/9/20
12
The critical path is the longest execution flow.
What is Critical Path?
[Figure: execution flow of three threads in a multithreaded application over time stamps E0–E15 — Thread 1 creates Threads 2 and 3 and waits for them ("Wait for Threads 2 & 3" … "Threads 2 & 3 Done"), while Threads 2 and 3 each wait for, acquire, and release the shared lock L before terminating.]
Define a critical path as used within Intel Thread Profiler. Consider a multithreaded application with three threads and five execution flows. The first figure illustrates Thread 1, Thread 2, and Thread 3 in the application and a shared lock, L. Thread 2 and Thread 3 use the shared lock. Thread 1 creates Thread 2 and Thread 3 and waits for them to terminate. E0–E15 represent the execution time stamps.
The second figure illustrates the first execution flow for Thread 1 in the application in red. The flow is from E0 to E3. The flow terminates when Thread 1 begins to wait for the termination of Thread 2 and Thread 3. Notice that this flow splits into two additional flows at E2 on creation of Thread 2 and Thread 3.
The third figure illustrates the second execution flow in the application in orange. The flow is from E0 to E5 and is created when the first flow for Thread 1 splits at E2.
The fourth figure illustrates the third execution flow in the application in blue. This flow follows the split from the original flow to Thread 3. This thread acquires lock L at E3 and releases it at E6. The flow terminates at E8 when Thread 3 must block waiting to acquire lock L being held by Thread 2. Notice that this flow will split at E6 when the lock is released and acquired by Thread 2.
The fifth figure illustrates the fourth execution flow in the application in yellow. The flow is from E0 to E12. This flow terminates at E12 when Thread 2 terminates.
The sixth figure illustrates the fifth execution flow in the application in green. The flow is from E0 to E13. This flow terminates at E13 when Thread 3 terminates.
The last figure illustrates the longest flow of execution for the above example. This longest execution flow: Starts on Thread 1. Moves to Thread 3 at E3. Moves to Thread 2 at E7. Moves back to Thread 3 at E11. Moves finally to Thread 1 at E14. Terminates when Thread 1 finishes at E15.
The longest execution flow is known as the Critical Path in Intel Thread Profiler for Windows. Clarify their queries (if any) related to this slide. The critical path is the longest execution flow. 2018/9/20
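A minimal Windows-threads sketch of the scenario in the figure: Thread 1 (the main thread) creates Threads 2 and 3, which contend for a shared lock L, and then waits for both to terminate. This is not taken from the course labs; the CRITICAL_SECTION standing in for lock L, the sleep times, and the simulated work are illustrative.

// Thread 1 (main) creates Threads 2 and 3, which contend for the shared lock L,
// then waits for both of them to terminate.
#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION L;                     // stands in for the shared lock "L"

DWORD WINAPI Worker(LPVOID arg)
{
    int id = (int)(INT_PTR)arg;
    EnterCriticalSection(&L);           // "Acquire L" -- blocks if the other worker holds it
    printf("Thread %d inside the protected region\n", id);
    Sleep(50);                          // stand-in for protected work
    LeaveCriticalSection(&L);           // "Release L" -- the critical path may transition here
    Sleep(20);                          // stand-in for unprotected work
    return 0;                           // the thread terminates and its execution flow ends
}

int main(void)
{
    InitializeCriticalSection(&L);

    HANDLE h[2];                        // the flow splits when Threads 2 and 3 are created
    h[0] = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)2, 0, NULL);
    h[1] = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)3, 0, NULL);

    // "Wait for Threads 2 & 3": Thread 1 blocks until both workers are done.
    WaitForMultipleObjects(2, h, TRUE, INFINITE);

    CloseHandle(h[0]);
    CloseHandle(h[1]);
    DeleteCriticalSection(&L);
    return 0;                           // the critical path ends when Thread 1 finishes
}

Profiling a run of a sketch like this with Intel Thread Profiler would be expected to show the critical path weaving between the three threads, with transitions around the points where the lock is released and reacquired.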
13
What is Critical Path Analysis?
Notes on Critical Path Analysis: Critical Path Analysis shows how threads interact with each other during execution. The critical path for any execution cannot be determined within Intel® Thread Profiler for Windows until the application stops executing. The usage of the term critical path is different in Intel® VTuneTM Performance Analyzer, where it refers to the set of branches through the call graph tree that accounted for the most execution time. Intel Thread Profiler provides information during the thread run that is helpful in efficient utilization of processors. Intel Thread Profiler also provides information on how the execution of threads is impacted by other threads and events. Explain the concept of critical path analysis. Provide the different usage of the term − critical path. The usage of the term critical path is different in Intel VTune Performance Analyzer, where it refers to the set of branches through the call graph tree that accounted for the most execution time. Critical path analysis shows how threads interact with each other during execution. When the application moves from one thread to another, Intel Thread Profiler displays the critical path of the application. You can use the critical path to decide how to use and deploy threads efficiently. If the execution time of the critical path is shortened, the entire application execution time is shortened. Intel Thread Profiler for Windows Threads identifies the threads on the critical path and the objects that cause transitions of the critical path to other threads. Therefore, Intel Thread Profiler can identify which threads block other threads by holding the shared synchronization objects. Intel Thread Profiler identifies synchronization issues and excessive blocking time that cause delays for the programs threaded by Windows Threads and OpenMP threads. It shows thread workload imbalances. Increasing parallelization will help you enhance the performance of the threaded application. To maximize parallelization, Intel Thread Profiler shows where the code executes in serial or with fewer threads than cores so that you can increase the time for which the application executes in parallel regions. Clarify their queries (if any) related to this slide. If the critical path is shortened, the application will run in less time. 2018/9/20
14
System Utilization A few things to remember about system utilization are: While profiling the execution of a multithreaded application, Intel® Thread Profiler takes snapshots of how the application utilizes the cores in the system. Core utilization is measured as the concurrency level at each stage of the threaded execution. Under Intel Thread Profiler for Windows, the concurrency level is defined to be the number of threads that are active at any given time. A thread is considered active if it is executing or available for execution. Intel Thread Profiler for Windows Threads defines five classifications of concurrency levels: Idle : No active threads Serial : A single active thread Under Utilized : More than one thread, less than cores Fully Utilized : # threads == # cores Over Utilized : # threads > # cores Define and describe the information that is kept during a threaded run and associated with the critical path − system utilization. Intel Thread Profiler collects concurrency level data along the critical path. The five classifications of concurrency level that Intel Thread Profiler for Windows defines are: Idle: This indicates the time when no threads were active on the critical path and all threads are blocked or waiting for external events. Serial: This indicates the time when a single thread was active on the critical path. It is likely that some serial execution is required for particular portions of the application, such as at startup, shutdown, and initialization of global data. Unexpected or excessive time in serial execution may indicate the need for serial tuning or that the parallelism of the algorithm is not being effectively exploited. Under utilized: This indicates the time when the number of active threads is less than the number of available active cores, but more than one. If more threads were active, the processor resources could be better utilized and the execution time of the application could be reduced. In this class, time may also indicate load imbalance between threads. Fully utilized: This indicates the time when the number of active threads is equal to the number of cores. This is the ideal situation because all processing resources are utilized by the application. The primary goal of threaded performance tuning is to maximize the fully utilized time and minimize the idle, serial, and under utilized time. Over utilized: This indicates the time when the number of active threads is greater than the number of cores. The time spent in this class may indicate that the application could execute with fewer threads and maintain current performance levels. Clarify their queries (if any) related to this slide. 2018/9/20
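The five classes reduce to a simple comparison of the number of active threads against the number of cores. The helper below only illustrates that rule; it is not part of Intel Thread Profiler, and the function and parameter names are made up.

#include <string>

// Maps a snapshot of active threads versus available cores to the five
// concurrency-level classes described above.
std::string ClassifyConcurrency(int activeThreads, int cores)
{
    if (activeThreads == 0)      return "Idle";            // no active threads
    if (activeThreads == 1)      return "Serial";          // a single active thread
    if (activeThreads < cores)   return "Under Utilized";  // more than one thread, fewer than cores
    if (activeThreads == cores)  return "Fully Utilized";  // # threads == # cores
    return "Over Utilized";                                // # threads > # cores
}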
15
Example: Concurrency Level Classifications
System Utilization (Example)
[Chart: concurrency level over time for the three-thread example (E0–E15), classified as Idle, Serial, Under-subscribed, Parallel, or Over-subscribed, while Thread 1 waits for Threads 2 & 3 and Threads 2 and 3 wait for, acquire, and release lock L.]
Present the concept of system utilization and five classifications of concurrency level through an illustration of categorization for a system configuration with two processors. Demonstrate how Intel Thread Profiler gathers system utilization (concurrency level) data along the critical path. In the example, the different concurrency level states are:
Serial State: Appears when Thread 1 starts from E0 and continues to and beyond E2. Appears when Thread 3 executes alone between E5 and E7. Appears when Thread 3 tries to acquire the lock L between E8 and E11. Appears from E12 to E13 before Thread 3 merges into Thread 1.
Parallel State: Appears when Threads 2 and 3 are running from E3 to E5, and before Thread 2 is waiting to acquire L. Appears when Thread 2 acquires the lock L before E7 and continues until E8 when Thread 3 waits for L to be released by Thread 2. Appears when Thread 3 acquires L before E11 and continues until E12 when Thread 2 terminates.
Over-subscribed State: Appears after Thread 1 has spawned Threads 2 and 3, and before Thread 1 waits for these two threads to terminate.
Clarify their queries (if any) related to this slide. Example: Concurrency Level Classifications 2018/9/20
16
Execution Time Categories
To highlight potential performance problems, Intel® Thread Profiler classifies the interactions of threads along critical path into the following categories: Execution time categories provide information about how threads interact with each other with respect to synchronization objects and time waiting for thread termination or other events. Cruise Time : The time in which threads run without interference on critical path Overhead Time : Delay of a transition of the critical path from one thread to the next Blocking Time : Time the current thread spends on the critical path waiting on an external event Impact Time : Time the current thread on critical path delays the next thread on the critical path by holding some synchronization resource Define and describe the information that is kept during a threaded run and associated with the critical path — execution time categories. In addition to system utilization during a thread run, Intel Thread Profiler also gathers data about how threads interact with each other with respect to synchronization objects and time waiting for thread termination or other events. To highlight potential performance problems, Intel Thread Profiler classifies the interactions of threads along critical path into the following categories: Cruise Time: This class refers to the time in which no thread delays the execution of any other active thread by holding a synchronization resource. During this time, threads run without any interference and there is no thread or synchronization interaction that affects the execution time. Overhead Time: This class is the time lapse between the: Release of a synchronization resource, such as a lock, by one thread and acquisition of that synchronization resource by another thread awaiting the resource. Creation and start of a thread. Termination and a join of a thread. This time captures the delay of a transition of the critical path from one thread to the next. It is useful for estimating synchronization, signaling, or system overhead. It is also useful for indicating over-utilization of processor resources with a large number of active threads in the system. Blocking Time: This class is the time the current thread spends on the critical path waiting for an external event or blocking, including timeouts. Although this blocked thread does not delay the next thread on the critical path with a synchronization resource, the blocking APIs delay the transition of the critical path to the next thread. Impact Time: This class is the time that the current thread on the critical path delays the next thread on the critical path by holding some synchronization resource. Therefore, the impact time begins when the next thread starts waiting for a shared resource and ends when the current thread releases that resource. Clarify their queries (if any) related to this slide. 2018/9/20
17
Execution Time Categories (Example)
[Chart: thread interaction over time for the three-thread example (E0–E15), classified as Cruise Time, Overhead Time, Blocking Time, or Impact Time, while Thread 1 waits for Threads 2 & 3 and Threads 2 and 3 wait for, acquire, and release lock L.]
Present the concept of execution time categories and four classifications of thread interactions through an illustration of categorization for a system configuration with two processors. Demonstrate how Intel Thread Profiler gathers thread interaction data along the critical path. In the example, the different states along the critical path are:
Cruise Time: Appears from E0 to E2 when Thread 1 executes alone. Appears after E2 when Threads 2 and 3 are running and before Thread 2 requests L at E5. Appears when Thread 2 acquires L before E7 until Thread 3 waits for access to lock L at E8. Appears when Thread 3 acquires L before E11 until Thread 2 terminates at E12. Appears when Thread 1 resumes serial execution before E14 and terminates at E15.
Overhead Time: Starts at E2 and continues until Threads 2 and 3 are created before E3. Starts at E6 with the release of lock L and continues until Thread 2 acquires L before E7. Starts at E10 with the release of lock L and continues until Thread 3 acquires L before E11. Starts at E13 and continues until the flow merges to Thread 1 before E14.
Impact Time: Appears when Thread 3 delays Thread 2 from E5 to E6. Appears when Thread 2 delays Thread 3 from E8 to E10. Appears when Thread 3 delays Thread 1 while executing from E12 to E13.
Clarify their queries (if any) related to this slide. Example: Execution Time Categories 2018/9/20
18
Merging Concurrency and Behavior
Start with system utilization Further categorize by behavior Concurrency Level Critical Path Thread Interaction 15 5 10 Time Demonstrate how Intel Thread Profiler combines system utilization and thread interaction. You can determine the utilization of the processing resources of a system during the run of any threaded code by determining the concurrency level. The ideal performance of an application should have the same number of threads actively executing as the number of available processors. Fewer threads than processors lead to idle processing cycles. Over-subscription of resources can be an indication that the same performance may be achieved with fewer threads. Critical path analysis also shows how threads interact with each other during execution. This analysis focuses on the execution flow of the thread whose performance would have the most impact on the overall performance of the application. When you tune the code, you need to concentrate on the critical path. If you compare the two timeline graphs from the two figures discussed on the previous slides, you will notice that all impact time happened during serial execution. In addition, all overhead time was also during serial execution. Therefore, the remaining cruise time happened during the rest of the concurrency levels. Clarify their queries (if any) related to this slide. Histogram After Merging System Utilization and Execution Time Categories 2018/9/20
19
Intel® Thread Profiler Color Legends
The following table lists the execution time categories and the color legend that Intel® Thread Profiler uses on the critical path. Its columns classify processor utilization — Bad (Idle; Serial, n = 1; Under, n < p), Good (Full, n = p), and Over Utilized (n > p) — and its rows classify behavior on the critical path: Impact, Blocking, and Overhead.
Provide a preview of the color coding that Intel Thread Profiler uses. The color scheme chosen for displaying profiled data conveys two orthogonal concepts, which are based on the relationship between the concurrency level (n) and the number of cores (p) on your system. The concepts that define the color scheme are as follows:
Processor Utilization: You can categorize the processor utilization as good, bad, or oversubscribed. This concept describes how a program utilizes the processors in the system. It is important that a program containing threaded code runs at the appropriate level of parallelism. The three categories of processor utilization are: Good processor utilization (n = p): Efficient utilization of processors indicates that the concurrency level n equals the number of cores p of the machine. Bad processor utilization (n < p): Bad processor utilization indicates that the concurrency level n is less than the number of cores p. Bad processor utilization can be classified as idle, serial, or undersubscribed. Oversubscribed utilization (n > p): Over-subscribed processor utilization indicates that the concurrency level n is greater than the number of cores p. This situation is acceptable but not as good as when the concurrency level matches the number of processors.
Program Behavior: Behavior adds another dimension to the way time is spent on the critical path. Different intensities of the processor utilization colors indicate the program behavior. The more detrimental the behavior of a thread on the critical path, the bolder the color. The program's behavior during the different execution time categories is as follows: Impact time is displayed when a current thread on the critical path impacts the next thread on the critical path by preventing the latter from running. Impact time implies that there may be some code changes that could reduce the impact time and boost the threaded performance of the program. Thus, impact time is shown with the most intense colors. Blocking time is displayed when a thread on the critical path waits on some external or unknown event. The colors displayed for this category are of medium intensity to denote that nothing may affect blocking time. However, excessive blocking time should be investigated for possible tuning opportunities. Overhead time is the time spent in managing threads through threading APIs. For example, this could be the time between one thread sending a signal and a waiting thread receiving that signal, or the time between one thread releasing a lock and a waiting thread acquiring that lock. A bright yellow color is used for overhead time to denote that overhead is often unavoidable but is desirable to reduce in an application. Cruise time is the time when no threads delay other threads. Cruise time is displayed with the most muted colors. You may still want to explore the possibilities of utilizing more threads when the concurrency level is not optimal. However, cruise time itself does not point to coding problems that can be tuned.
Clarify their queries (if any) related to this slide. 2018/9/20
20
Intel® Thread Profiler Views
The various views that are available with Intel® Thread Profiler are: Profile View Concurrency Level View Thread View Objects View Timeline View Source View Transition Source View Creation Source View Summary View Provide a quick preview and description of the different views that are available to examine the critical path data. There are different views available with Intel Thread Profiler to examine the critical path data. You can use Intel Thread Profiler to: Analyze threading issues that affect performance. Understand your program's threading and synchronization behavior. Identify performance issues such as threading overhead, blocking calls that delay threads, and under or over-utilizing processors. Clarify their queries (if any) related to this slide. 2018/9/20
21
Activity 1A Objectives:
Run the application through Intel® Thread Profiler. Examine thread activities by reviewing the different views it offers. Question for Discussion: Are there any performance issues? Introduction to the first part of the First Lab Activity. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objectives. 2018/9/20
22
Profile View 2018/9/20 Profile View Profile Pane Timeline Pane
Describe the features of the Profile View within Intel Thread Profiler. When you first view Activity results in Intel Thread Profiler, the Timeline and Profile Views appear in a tiled format as the default view. These views represent the behavior of the threads that can help you understand the program execution. They also summarize the concurrency to help you analyze efficient utilization of parallel processing resources. Profile View is the default view for the Intel Thread Profiler and is displayed when you open an Activity result window. It displays a high-level summary of the time along the critical path of an execution. The default histogram critical path data is divided into concurrency categories, such as serial, under utilized, fully utilized, over utilized, and overhead. You can choose to display data by thread behavior—cruise, blocking, impact, and overhead—or a combination of concurrency and behavior by selecting check boxes located in the right side of the Profile View Legend. The Profile View displays the time spent along at least one critical path. You can see the time threads spend waiting for objects and the objects that cause contention. You can use the Profile View to: Filter out unwanted data so that you can focus on specific data. Filter and group data according to critical path, thread, synchronization object, concurrency level, or source location to analyze the performance of the threads within the application. Show or hide detailed data according to concurrency level—the number of active threads—and behavior. View the Source View to locate performance problem regions in a code. Clarify their queries (if any) related to this slide. Profile View 2018/9/20
23
Profile View Toolbar − Grouping Controls
Grouping Controls available in Profile View are: Primary Grouping Secondary Grouping Name Purpose No grouping Remove groupings Concurrency level Group by concurrency level Thread Group by thread Object type Group by object type Describe the features of the Profile View within Intel Thread Profiler. You can use toolbar items, such as grouping controls, zoom controls, and navigation controls in the Profile View to access tools and information in Intel Thread Profiler for Windows. Describe the grouping controls available in Profile View. Clarify their queries (if any) related to this slide. Object Group by object Source Group by source Remove filters Remove all currently applied filters 2018/9/20
24
Profile View Toolbar − Zoom Controls
Zoom Controls available in Profile View are: Item Name Purpose Decrease zoom factor Zoom out on the current view. Increase zoom factor Zoom in on the current view. Restore zoom factor to 1x Return to the unzoomed view. Zoom factor selector Select a zoom factor to apply to the current view. Describe the zoom controls available in Profile View. Clarify their queries (if any) related to this slide. 2018/9/20
25
Profile View Toolbar − Navigation Controls
Navigation Controls available in Profile View are: Item Name Purpose Forward Go forward to next state of grouping, filtering, and/or sorting data. Back Return to previous state of grouping, filtering, and/or sorting. Expand\Hide Legend Controls Expand or hide the legend for time categories in Profile or Timeline View. Expand\Hide Controls Expand or contract the current pane. Describe the navigation controls available in Profile View. Clarify their queries (if any) related to this slide. 2018/9/20
26
Concurrency Level View
Profile View − Concurrency Level View Concurrency Level View Let us look at the Thread View Ran single threaded ~65% of the time Two threads ran in parallel ~33% of the time Describe and show features of the Profile View – Concurrency Level filter. The Concurrency Level View is the default filter applied to the critical path at the initial display of the Profile View. The filter shows the concurrency level data of the critical path. Tooltips for each segment of the histograms appear when you position the pointer on a histogram bar segment. Even without use of the tooltip data, you can visually assess the approximate amount of serial time versus parallel time. Clarify their queries (if any) related to this slide. Concurrency Level View 2018/9/20
27
Profile View − Threads View
Let us look at the Object View Lifetime of the thread Active time of the thread Time on the critical path Describe and show features of the Profile View – Threads filter. You can click the Thread button to group the view by threads and use the Thread Level Profile View to analyze how each thread contributed to the time on the critical path. The Threads View filter displays all the threads in an execution and shows the following: Life time of the thread from creation to termination. Active time of the thread for which the thread is ready or actively running. Amount of time each thread used in the application spends on the critical path. When you position the pointer on the label below a bar, information about the ID of the corresponding thread, its start routine, its lifetime, the ID of the parent thread, and the source code location at which the former thread is created by the parent thread is displayed. Clarify their queries (if any) related to this slide. Threads View 2018/9/20
28
Profile View − Objects View
Let us look at Timeline View This object caused all of the impact Describe and show features of the Profile View – Objects View filter. The Objects View filter appears when you click the Objects button. It shows synchronization objects that were used in the execution of the application. The histogram bars for each object describes the amount of time the object was involved on the critical path. Look for objects with impact time associated with them. If you modify the use of these objects, you may reduce the overall critical path time and lower the total execution time. The Objects View displays the time associated with software objects that caused synchronization and potential threading delays. The ToolTip shows information about the object. Synchronization objects also have lifetimes within an application. After grouping by object, the halo behind a critical path histogram bar reflects the amount of time the given synchronization object is available from initialization to destruction. The synchronization object lifetime halos are visible only in the object grouping in the same way as the thread lifetime halo is visible only under the Thread grouping. Another view that Intel Thread Profiler displays is the Timeline View. You can expand the Timeline to open this view. Clarify their queries (if any) related to this slide. Objects View 2018/9/20
29
Timeline View 2018/9/20 Timeline View
Describe and show features of the Timeline View. The Timeline View represents the lifetime of a thread, showing its start time and end time, and helps you analyze the thread behavior and thread interactivity. It can help you understand reasons for threads being inactive and whether they are on the critical path at that time. In Timeline View, you can show transition lines for all transitions. You can use this view to pinpoint the exact location of performance issues in call stacks and source code. The Timeline View displays a chart with one horizontal bar for each thread in a program. The x-axis of the chart represents time, and the y-axis represents all the threads that were created within the application. A bar describes the lifetime of a thread, including the time that a thread spent on and off a critical path. The Timeline View lists threads on the left and how they are executed from start to finish. The active and wait halos are covered with a simplified color coding of the critical path activity. This simplified scheme uses orange for serial time, red for under utilized time, bright green for fully utilized time, blue for over utilized time, and yellow for overhead time. If you know the thread hierarchy, you can use Timeline View to determine the numbering of threads in Intel Thread Profiler. All threads join the master thread at termination. The view also shows inter-thread relationships using lines between the bars as follows: Fork: When a parent thread creates a child thread, known as a fork, the Timeline View draws a magenta arrow from the bar that represents the parent thread to the left end of the bar that represents the child thread. The positions of the tail end and the head of the arrow reflect the time that the parent thread creates the child thread and the time that the child thread begins execution, respectively. Join: When a thread waits for another thread to terminate, the Timeline View shows a magenta line connecting the right end of the terminating thread's bar and the waiting thread's bar at the point where the waiting thread performs the join operation. Transition: When the critical path moves from one thread to another, the Timeline View shows this transition with a yellow line. When you position the pointer on the transition line, a ToolTip appears to show the synchronization object involved in the transition. Although the involved source code line numbers are also displayed in the ToolTip, in most cases, a context sensitive menu can be used to locate the corresponding source code. Clarify their queries (if any) related to this slide. Timeline View 2018/9/20
30
Source View Notes on Source View:
To link the Intel® Thread Profiler identification back to the specific object within the source code, right-click the desired histogram bar or critical path transition and choose one of the available Source View commands from the pop-up window. If the application is compiled with debug symbols, a Source View window is generated. Describe features of the Source View. Clarify their queries (if any) related to this slide. 2018/9/20
31
Source View − Transition Source View
Describe features of the Source View – Transition Source View. Transition Source View shows where an event changed. Within Intel Thread Profiler for Windows, a transition occurs when the critical path moves from one thread to another. At such transitions, Intel Thread Profiler tracks the call sites involved within the application. The Signal location is the point that the current thread on the critical path impacts the next thread on the critical path. If this point occurs earlier, the next thread resumes execution sooner. Consequently, the critical path will be shorter. The Receive location is the point that the next thread is impacted by the Signal location – it is most likely a wait operation. Release and acquisition of synchronization objects cause transitions on the critical path. To bring up a Source View window for finding transition points caused by the synchronization object, right-click an object’s histogram and choose Transition Source View. Intel Thread Profiler associates two source locations with each transition event along a critical path of the execution. Associated threads include a previous thread, a current thread, and the next thread on the critical path. The two source locations are: Signal: The call site of the transition of the critical path out of the current thread. Receive: The call site of the transition on the critical path into the next thread. Clarify their queries (if any) related to this slide. 2018/9/20
32
Source View − Creation Source View
Notes on Creation Source View: Creation Source view shows where an event began. You can use the Creation Source View to see the location in the source code where an object was created or where an event began. To view Thread Create, Close, or Entry source code, go to the Profile View and group by thread or object. You can right-click any thread or object and then select the Creation/Entry Source View option. If the Creation/Entry Source View option is not available, the source code is unknown. Describe features of the Source View – Creation Source View. Clarify their queries (if any) related to this slide. 2018/9/20
33
Summary View Describe features of the Summary View. 2018/9/20
The Summary View provides summary information of profiling data in a table format. This view is primarily useful for debugging purposes. Summary View includes data about threads, transitions, transitions per second, APIs, and occurrences. You can click the Summary tab from the default view to open the Summary View. Summary View includes the following data: Threads: The total number of threads that ever existed during your program's execution. Transitions: The total number of transitions. Transitions per second: Total transitions divided by total execution run length (in seconds). APIs: All the APIs called during this run of your program; they may include C runtime APIs. Occurrence: How many times an API was called during the program run. Contended: How many of those occurrences were contended. Clarify their queries (if any) related to this slide. Summary View 2018/9/20
34
Activity 1B Objectives:
Examine the different analysis views offered by Intel® Thread Profiler. Determine if a performance issue is evident from the information presented. Introduction to the second part of the First Lab Activity. Explain the participants the objective for the activity. Review the salient points that were demonstrated in the first lab activity. Note that for large applications, the time to instrument the code will be added to the results the first time. Thus, the user may wish to ignore the first set of results, run the code through Thread Profiler again and use the second results. The instrumentation will have already been done, so there will be no time lost to instrumentation in the second run. Clarify their queries (if any) related to the objectives. 2018/9/20
35
Common Performance Issues
Common performance issues identified by Intel® Thread Profiler are: Load imbalance Contention on synchronization objects Parallel overhead Granularity Introduce (review) some of the common performance issues that can be found in multithreaded applications. The advent of multi-core technology has enhanced the usage of parallel programming in computing environments. Multithreaded programming has presented programmers with new performance problems that do not occur with serial applications. Most of the problems arise due to the decomposition of the program. In this topic, you will learn to use Intel Thread Profiler to recognize the common issues that affect the performance of multithreaded applications. Clarify their queries (if any) related to this slide. 2018/9/20
36
Load Imbalance between Threads in a Multithreaded Application
Unequal workloads lead to idle threads, time wastage, and inefficient utilization of processor resources. Thread 0 Busy Thread 1 Idle Thread 2 Thread 3 Give a graphical example of why load imbalance in a multithreaded application is harmful. An idle core and idle threads during parallel execution are considered wasted resources. They can adversely affect the overall run time of parallel execution. When unequal amounts of computation are assigned to threads, the threads with fewer tasks to execute remain idle at barrier synchronization points until threads with more tasks finish execution. Improper distribution of parallel work results in load imbalance. Unequal workloads lead to idle threads, time wastage, and ineffective utilization of processor resources. To maximize the performance of your application, you need to ensure that the work is distributed across multiple threads in a way that they all perform approximately the same amount of work and the processor is kept busy throughout the running of the application. Consider that you spawn four threads, such as Thread 0, Thread 1, Thread 2, and Thread 3 in a program. Assume Thread 0 and Thread 3 execute small computations and Thread 1 executes the complex computations. Therefore, Thread 1 may take more time to complete. The other threads, such as Thread 0 and Thread 3 executing small computations will finish their job and wait idle. Distribution of unequal amount of work among the threads leads to idle threads. Thread 1 is the busiest thread and Thread 0, Thread 3, and Thread 2 are assigned less work and sit idle after completion. This will result in inefficient utilization of the processor resources. To correct this, distribute the work to use four threads with same load. This will increase the utilization of each thread and reduce the program execution time. Clarify their queries (if any) related to this slide. Time Start Threads Join Threads Load Imbalance between Threads in a Multithreaded Application 2018/9/20
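A small sketch of the imbalance pictured above, assuming work is handed out statically in fixed chunks; the chunk sizes, thread count, and simulated computation are made up for illustration. Thread 1 receives roughly ten times the work of the others, so the remaining threads finish early and sit idle at the join.

#include <windows.h>

static void DoWork(long iterations)
{
    volatile double x = 0.0;
    for (long i = 0; i < iterations; ++i)
        x += i * 0.5;                   // stand-in for real computation
}

DWORD WINAPI Worker(LPVOID arg)
{
    DoWork((long)(INT_PTR)arg);         // each thread gets a fixed, unequal amount of work
    return 0;
}

int main(void)
{
    // Thread 1 gets ~10x the work of the others, so the others idle at the join.
    long chunks[4] = { 1000000, 10000000, 1000000, 1000000 };
    HANDLE h[4];
    for (int t = 0; t < 4; ++t)
        h[t] = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)chunks[t], 0, NULL);

    WaitForMultipleObjects(4, h, TRUE, INFINITE);   // early finishers wait here, idle
    for (int t = 0; t < 4; ++t)
        CloseHandle(h[t]);
    return 0;
}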
37
Redistribute Work To Threads
You can resolve load imbalance between threads by redistributing work among threads either statically or dynamically: Static Assignment Dynamic Assignment Redistribute Work to Threads Introduce the concept of redistribution of work among threads that can help you resolve common performance issues that can be found in multithreaded applications. You can resolve load balancing issues by first determining the reason for one thread performing more computations than other threads. In most cases, code inspection and reimplementing a more balanced threading model in the application are sufficient. In extreme cases, you may need to restructure the data and do modifications to the code to process the new data formats. Consider an application that implements a functional decomposition threading model with two threads. During application development, you make specific design decisions based on a representative workload that shows good load balance across all tests. Imagine that both threads take almost equal time. Later, you observe that when the workload size reduces, one of the threads finishes faster than the other thread. This example illustrates a common occurrence where load balancing changes dynamically with workload size and explains the need to study a representative workload set that covers a wide range of workload sizes. Clarify their queries (if any) related to this slide. 2018/9/20
38
Static Assignment Static assignment implies that you assign the work to threads in an application when the computation begins. Consider the following points to ensure a better distribution of work: Check whether the same number of task is assigned to each thread. Check whether different tasks take different processing time during thread computation of the code. If you are able to determine a particular trend, rearrange the static order of assignment of work to the threads. If data is changing from one run to another run, rearrange the work among threads dynamically. Give some ideas about how to fix a load imbalance when the work is statically assigned within the application. Static assignment implies that you assign the work to threads in an application when the computation begins. Consider the following points to ensure a better distribution of work: Check whether the same number of tasks is assigned to each thread. If some threads are assigned less work, rearrange the distribution of work to avoid any idle time. Check whether different tasks take different processing time during thread computation of the code. You may assign an equal number of tasks to threads, but each task may take a different processing time. If some of the tasks take longer to execute, check whether the tasks change in a predictable manner. If you are able to determine a particular trend, rearrange the static order of assignment of work to the threads. If data is changing from one run to another run, rearrange the work among threads dynamically. If the work is unevenly spread across the threads, especially for regions that are performed multiple times during the course of execution, the load is likely to change from one pass to the next. There are some larger tasks being assigned to threads that take a longer time to execute the assigned task. To predict the tasks and distribute the work evenly across the threads, assign smaller tasks to threads dynamically. Clarify their queries (if any) related to this slide. 2018/9/20
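As a sketch of static assignment under the assumption that iterations cost roughly the same, the loop below is split into equal contiguous ranges before any thread starts; the names and sizes are illustrative, not from the course labs.

#include <windows.h>

#define NUM_THREADS 4
#define N 1000000L

struct Range { long begin; long end; };  // the iteration range assigned to one thread

DWORD WINAPI Worker(LPVOID arg)
{
    Range* r = (Range*)arg;
    volatile double x = 0.0;
    for (long i = r->begin; i < r->end; ++i)
        x += i * 0.5;                    // equal-cost iterations balance well statically
    return 0;
}

int main(void)
{
    Range  ranges[NUM_THREADS];
    HANDLE h[NUM_THREADS];
    long   chunk = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; ++t) {
        ranges[t].begin = t * chunk;     // work is fixed before the computation begins
        ranges[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * chunk;
        h[t] = CreateThread(NULL, 0, Worker, &ranges[t], 0, NULL);
    }
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; ++t)
        CloseHandle(h[t]);
    return 0;
}

If per-iteration cost varies in a predictable way, the static ranges can be rearranged accordingly; if it varies unpredictably from run to run, dynamic assignment (next slide) is the better fit.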
39
Dynamic Assignment Dynamic assignment implies that you assign the work to threads as the computation proceeds. Consider the following points to ensure a better distribution of work (see the sketch after this slide): Check whether one big task is assigned to a thread. Check whether small computations accumulate into a large task; adjust the number of computations in a task. Check whether there are too many or too few small computations in a single task. Try to use bin-packing heuristics. Give some ideas about how to fix a load imbalance when the work is dynamically assigned within the application. Dynamic assignment implies that you assign work to threads as the computation proceeds. Consider the following points to ensure better distribution of work: Check whether one big task is assigned to a thread. If a thread is assigned a big task, break the task into smaller subtasks and assign these subtasks to threads. Check whether small computations accumulate into a large task. In such a case, adjust the number of computations in a task. Check whether there are too many or too few small computations in a single task. Try to use bin-packing heuristics. This is an extensive area of computer science research. Bin-packing heuristics take a set of objects and fill fixed-size bins with the objects such that each bin contains roughly the same amount by weight, size, or other measures. Clarify their queries (if any) related to this slide. 2018/9/20
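A sketch of dynamic assignment using an atomic shared counter as a simple task queue: each thread repeatedly claims the next small task, so faster threads naturally pick up more work. The task count and the varying per-task cost are made up for illustration; this is one common way to implement the idea, not the only one.

#include <windows.h>

#define NUM_THREADS 4
#define NUM_TASKS   1024

static volatile LONG nextTask = -1;      // InterlockedIncrement returns the new value

static void DoTask(long task)
{
    volatile double x = 0.0;
    long cost = 1000 + (task % 7) * 500; // tasks of varying, unpredictable cost
    for (long i = 0; i < cost; ++i)
        x += i * 0.5;
}

DWORD WINAPI Worker(LPVOID)
{
    for (;;) {
        LONG task = InterlockedIncrement(&nextTask);  // atomically claim the next task
        if (task >= NUM_TASKS)
            break;                                    // no tasks left
        DoTask(task);
    }
    return 0;
}

int main(void)
{
    HANDLE h[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; ++t)
        h[t] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; ++t)
        CloseHandle(h[t]);
    return 0;
}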
40
Intel® Thread Profiler View to Identify Load Imbalance
Unbalanced Workloads Threads are unbalanced Active Times not equal Show how load imbalance between thread workloads can be identified from Intel Thread Profiler. Consider a multithreaded application that has four threads with different amounts of work assigned to these threads. Using the Intel Thread Profiler view, identify the idle and active time for each thread. This helps you identify load imbalances. Varying active time for each thread indicates that different amounts of work were assigned to threads. Question for Discussion: Does the amount of time a thread spends on the critical path demonstrate a load imbalance? Clarify their queries (if any) related to this slide. Intel® Thread Profiler View to Identify Load Imbalance 2018/9/20
41
Activity 2 Objective: Use Intel® Thread Profiler to find a load imbalance performance problem within a threaded application. Review Points: Thread View can be used to determine activity levels of each thread within the application. Timeline View enables you to understand the thread activity in your application. Introduction to the Second Lab Activity. Explain to the participants the objective of the activity. Review the lessons of the Second Lab Activity. Clarify their queries (if any) related to the objectives. 2018/9/20
42
Intel® Thread Profiler View to Identify Thread Contention
Reducing Contention on Synchronization Objects Thread 0 Thread 1 Busy Idle Thread 2 In Critical Thread 3 Give a graphical example of why synchronization contention is a major issue in multithreaded applications. Synchronization contention is a major issue in multithreaded applications. Frequent use of synchronization objects is a result of large amount of data sharing between threads. This can negatively impact the performance of your application. Data sharing between threads results in an increase in synchronization overheads. Synchronization overheads increase because a thread needs to wait for the release of the synchronization objects prior to acquiring them. Sections of code protected by synchronization should be small in size and maintain the correct code. This practice minimizes the idle time that threads spend waiting to gain access to the protected code sections. It is a good programming practice to strive for minimal data dependency between threads. You should allow threads to execute independently in parallel and eliminate idle wait time due to synchronization. In multithreaded applications, locks are used to synchronize entry to critical regions of code that access shared resources. While one thread is inside a critical region, no other thread can enter the critical region. Therefore, critical regions serialize execution of the code within the region. In addition, excessive synchronization can result in lock contentions. Threads may sit idle waiting for an object or a lock held by another thread inside a critical region. Idle threads affect the performance of your application. Therefore, you should try to minimize the lock contention on synchronization objects in multithreaded applications. Synchronization contention is caused by excessive synchronization. If you create more threads and you need to access the shared data by all these threads in a synchronized way, it will lead to contention. Consider an application in which you have created four threads: Thread 0, Thread 1, Thread 2, and Thread 3. You want to access a shared variable by each thread and update it by each thread. You need to implement synchronization in a way that only one thread can access the variable at a given time and other threads have to wait. Notice the amount of time each thread spends being idle awaiting access to the critical region. If you increase the number of threads, the wait time to access the shared variable will increase. If you have more synchronized blocks and more threads, the wait time may increase, if your application is not designed to handle contention. Clarify their queries (if any) related to this slide. Time Intel® Thread Profiler View to Identify Thread Contention 2018/9/20
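A sketch of the contention pattern pictured above, assuming every thread funnels each update of a shared counter through one critical section; the thread and update counts are illustrative. Most of each thread's time is spent waiting to enter the protected region rather than computing.

#include <windows.h>

#define NUM_THREADS        4
#define UPDATES_PER_THREAD 100000L

static CRITICAL_SECTION cs;
static long sharedCounter = 0;

DWORD WINAPI Worker(LPVOID)
{
    for (long i = 0; i < UPDATES_PER_THREAD; ++i) {
        EnterCriticalSection(&cs);   // only one thread at a time; the others sit idle
        sharedCounter++;
        LeaveCriticalSection(&cs);
    }
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&cs);
    HANDLE h[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; ++t)
        h[t] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; ++t)
        CloseHandle(h[t]);
    DeleteCriticalSection(&cs);
    return 0;
}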
43
Synchronization Fixes
Some potential solutions to eliminate or reduce the use of synchronization objects are: Eliminate Synchronization, Use Local Storage, Use Atomic Updates, Minimize Critical Regions, Minimize Critical Section, Use Mutex. The intrinsic used on Windows to perform atomic increment of a shared counter is shown below:

static long counter;
// Slow
EnterCriticalSection (&cs);
counter++;
LeaveCriticalSection (&cs);

static long counter;
// Fast
InterlockedIncrement (&counter);

Provide some potential solutions to eliminate the need for synchronization objects. Synchronization is expensive and in most cases unavoidable. You can use Intel Thread Profiler and VTune Performance Analyzer Event-Based Sampling to identify issues related to frequent synchronization. After you identify the issue, the next step is to take actions to resolve it. Some potential solutions to eliminate or reduce the use of synchronization objects are: Eliminate Synchronization: The best solution is to eliminate the need for synchronizing resources in your application. However, this may not be possible if the application shares data resources among threads. Use Local Storage: A common method to reduce the usage of synchronization objects is to use local variables instead of protected global variables. Periodic updates of global variables from local copies reduce the number of times a synchronization object needs to be acquired. You may be able to implement local storage by allocating more space on a thread's stack or by using thread-local storage APIs. Use Atomic Updates: Use atomic updates for shared memory locations whenever possible instead of protecting them with heavier synchronization objects. This helps you prevent storage conflicts and reduce the impact of synchronization. On Windows, you can use the InterlockedIncrement intrinsic instead of synchronization objects to perform atomic increment of shared counters. Minimize Critical Regions: Critical regions ensure data integrity when multiple threads attempt to access shared resources and serialize the execution of code within the critical region. Therefore, threads should spend minimum time inside a critical region to reduce the amount of time other threads sit idle waiting to gain access. Programmers who use locks need to balance the size of critical regions against the overhead of acquisition and release of locks. Minimize Critical Section: You may encounter several problems while using Critical Sections. A common problem occurs if a thread holding a Critical Section suddenly crashes or exits without calling the LeaveCriticalSection() function. Critical Sections are not kernel objects, so the kernel cannot detect the thread crash and cannot clean up the operating system resources associated with the thread. The next thread to wait for the Critical Section cannot test the object for abandonment, so that thread, and any other thread that waits on the Critical Section, will become deadlocked. You can overcome this problem by using kernel objects such as a mutex. Use Mutex: If required, a mutex can be established and shared between multiple processes and between different threads within the same process. Creating and using a mutex object inside the kernel involves large overheads. If a thread owns a mutex, no other thread can acquire it until the thread holding it releases it. The occurrence of deadlock is rare because mutexes can be recovered by other threads if the holding thread terminates before releasing. 
Such a mutex is called abandoned and will be acquired by the next thread that attempts to hold it. Clarify their queries (if any) related to this slide. 2018/9/20
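A rough sketch of the atomic-update and local-storage advice above, assuming hypothetical names (the counter, the worker routine, and the loop bound are illustrative, and the CRITICAL_SECTION is assumed to be initialized elsewhere with InitializeCriticalSection):

#include <windows.h>

static long counter = 0;            // shared by all threads
static CRITICAL_SECTION cs;         // assumed initialized at startup

// Slow: every increment enters and leaves the critical section.
void IncrementWithCriticalSection(void)
{
    EnterCriticalSection(&cs);
    counter++;
    LeaveCriticalSection(&cs);
}

// Faster: a single atomic operation, no lock object needed.
void IncrementWithInterlocked(void)
{
    InterlockedIncrement(&counter);
}

// Often fastest for hot loops: accumulate into a thread-local variable and
// fold the result into the shared counter once, when the work is finished.
DWORD WINAPI Worker(LPVOID arg)
{
    long localCount = 0;                          // lives on this thread's stack
    for (int i = 0; i < 100000; ++i)
        localCount++;                             // no synchronization in the loop
    InterlockedExchangeAdd(&counter, localCount); // one atomic update per thread
    return 0;
}

The thread-local variant trades one synchronized update per increment for one per thread, which is the kind of reduction in impact time that Intel Thread Profiler makes visible.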
44
Intel® Thread Profiler View to Identify Object Contention
Identifying Object Contention What is all this? These four threads… This object caused all of the impact …are impacting threads by this object Demonstrate how synchronization object contention is identified in Intel Thread Profiler and how to use the second level filtering can be done. In the first figure, the impact time has been attributed to one CRITICAL_SECTION object. When you position the pointer on the object name box, a ToolTip appears with information that you can use to identify the point in the source code where the object was created and destroyed. You can determine the exact object under consideration. Question for Discussion: You may have a case where one or multiple threads are solely responsible for the impact time. You can focus your tuning efforts on these threads and make the fixes. How can you identify the threads that result in the impact time using a given synchronization object? Clarify their queries (if any) related to this slide. Intel® Thread Profiler View to Identify Object Contention 2018/9/20
45
Activity 3 Objectives: Use Intel® Thread Profiler in order to find a synchronization contention performance issue within a threaded application. Demonstrate some of the filtering capabilities of the Intel Thread Profiler. Goals: Understand the thread activity in the threaded version of the numerical integration example. Use the Thread Profiler groupings. Examine synchronization and its effect on performance. Fix the performance issues. Review Points: Grouping objects and threads provide the information on which objects impact what threads. Apply the heuristics from labs for locating bottlenecks in the source code. For longer running applications, the difference in first and second runtimes is negligible. Introduction to the Third Lab Activity. Explain the participants the objective for the activity. Review the lessons of the Third Lab Activity. Clarify their queries (if any) related to this slide. 2018/9/20
46
Parallel Overhead Defining Parallel Overheads:
Parallel time is the amount of time spent running code within parallel sections. An efficient and well-tuned threaded application will spend the majority of its time running multiple threads. Parallel overhead is an estimate of the time that threads spend within parallel regions not executing the user's code. Resolving Parallel Overheads: You can reduce parallel overheads if you parallelize your applications properly. Explain how parallel overheads affect multithreaded applications. Parallel overhead is an estimate of the time that threads spend within parallel regions not executing the user's code. For example, parallel overheads can include time spent in: Starting and stopping threads. Assigning work to threads. Other book-keeping activities. Clarify their queries (if any) related to this slide. 2018/9/20
47
Granularity Notes on Granularity:
In parallel computing, granularity is the ratio of computation to communication: the amount of work done in each parallel task between synchronizations. Fine-grained or tightly coupled parallelism means that the individual tasks are relatively small in terms of code size and execution time. Coarse-grained or loosely coupled parallelism means that tasks are relatively large and more computational work is done between synchronizations. Decreasing the granularity (using finer-grained tasks) increases the potential for parallelism and speedup, but it also correspondingly increases the overheads of synchronization and communication. Explain how granularity affects multithreaded applications. With fine-grained parallelism, data is communicated frequently, often only one or a few memory words at a time, so performance can suffer from communication overheads: more frequent synchronization is required among threads relative to the amount of computational work done. With coarse-grained parallelism, data is transferred among processors rarely; however, if granularity is too coarse, performance is more likely to suffer from load imbalance. Clarify their queries (if any) related to this slide. 2018/9/20
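A minimal OpenMP sketch of this trade-off, assuming a hypothetical array sum (the array, its size, and the function names are illustrative):

#include <omp.h>
#define N 1000000
double data[N];

// Fine-grained: synchronizing on every element makes each parallel task
// smaller than the cost of the synchronization, so threads mostly wait.
double SumFineGrained(void)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += data[i];              // one lock acquisition per element
    }
    return sum;
}

// Coarse-grained: each thread works on a large block privately and
// synchronizes only once, when its partial result is combined.
double SumCoarseGrained(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];              // no locking inside the loop
    return sum;
}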
48
General Optimizations
The two different types of optimization techniques that you can perform at the application level are: Serial Optimizations: Ensure that you do serial optimizations, library substitutions, and vectorization before threading the applications. Serial optimizations along the critical path should reduce execution time and enhance the performance of your application. Parallel Optimizations: You can do parallel optimizations to reduce the synchronization object contention and balance the workload between threads. You might implement a function decomposition to optimize your parallel code. Present advice on general tuning and some specific tuning advice for parallel applications. Based on measurements and the performance objectives, optimizations can be performed at the application, database, or infrastructure level. Let us look at the two different types of optimization techniques you can perform at the application level: Serial Optimizations: Ensure that you do serial optimizations, library substitutions, and vectorization before threading the applications. This helps avoid time that you may spend on serial modifications later. Serial modifications can reduce the granularity of the computations and slow the effort to thread your code. Serial optimizations along the critical path should reduce execution time and enhance the performance of your application. Parallel Optimizations: You can do parallel optimizations in your application to reduce the synchronization object contention and balance the workload between threads. You might implement a functional decomposition to optimize your parallel code. Clarify their queries (if any) related to this slide. 2018/9/20
49
Summary Intel® Thread Profiler helps you tune multithreaded code faster for optimized performance on multi-core processors. It graphically illustrates performance and helps to locate threading bottlenecks, such as parallel overhead, excessive synchronization, and load imbalance. Intel Thread Profiler supports several different compilers, performs binary instrumentation on 32- and 64-bit applications, and uses critical path analysis. The longest execution flow is known as the critical path in Intel Thread Profiler for Windows. Intel Thread Profiler for Windows defines five classifications of concurrency level: idle, serial, under-utilized, fully utilized, and over-utilized. Along the critical path, Intel Thread Profiler for Windows defines four classifications of thread interactions: cruise time, overhead time, blocking time, and impact time. Summarize all the key points learned in the chapter. 2018/9/20
50
Summary (Continued) Critical path analysis shows how threads interact with each other and with the cores during execution. This analysis focuses on the execution flow to identify its impact on the overall performance of the application. Intel® Thread Profiler provides different views and filters, such as Profile View, Timeline View, and Source View, to assist and organize analysis of your program's threading and synchronization. Static assignment implies that you assign the work to the threads in an application when the computation begins. Dynamic assignment implies that you assign the work to threads as the computation proceeds. You can use the Intel Thread Profiler views to identify the idle and active time for each thread. This helps you identify load imbalances. Varying active time for each thread indicates that different amounts of work were assigned to threads. You can eliminate or reduce the need for synchronization objects by using local storage, implementing atomic updates, and minimizing the usage of critical regions. Summarize all the key points learned in the chapter. 2018/9/20
51
Summary (Continued) Parallel overheads can be defined as the amount of time required to coordinate threads instead of executing the application code. Granularity refers to the amount of computation done in parallel, relative to the amount of synchronization needed between threads. Decreasing the granularity (using finer-grained tasks) increases the potential for parallelism and speedup, but it also correspondingly increases the overheads of synchronization and communication. Serial optimizations, such as library substitutions and vectorization, should reduce execution time and enhance the performance of your application. Parallel optimizations in your application reduce the synchronization object contention and balance the workload between threads. Summarize all the key points learned in the chapter. 2018/9/20
52
Agenda Course Introduction Multithreaded Programming Concepts
Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the topics to be covered in this chapter. 2018/9/20
53
Threaded Programming Methodology
Threading Programming Methodology Chapter 7 2018/9/20
54
Objectives At the end of the chapter, you will be able to:
Estimate the effort required to thread an application by using the prototyping technique. Use Intel tools to improve the performance of applications threaded by using the OpenMP model. Ask the participants what they expect from the session. After the participants share their expectations, introduce the following objectives of the session: Threading increases the performance of an application on multi-core or multiple-processor platforms. The application uses concurrency and parallel execution to increase its performance. Concurrency helps increase the throughput on single- or multi-core systems. However, it also increases the complexity of the design, testing, and maintenance of code. Despite these drawbacks, threading scales up the level of parallelism to the number of cores available and helps in the efficient utilization of system resources. This chapter presents techniques and a methodology to thread a serial application. The first step is to use Intel® VTuneTM Performance Analyzer to analyze a serial application to identify the sections of code that could benefit most from threading. You can use the OpenMP model to quickly prototype these sections of code and then use Intel® Thread Checker and Intel® Thread Profiler to identify any coding and performance issues specific to threading. If the prototype threading is beneficial, you can keep the OpenMP implementation, or, if the project requires an explicit threading model, translate the OpenMP code to that model. Explain the objectives of the session. Clarify their queries (if any) related to these objectives. 2018/9/20
55
Threading Concepts Revisited
Review the following threading concepts: What is parallelism? Amdahl’s law Processes and threads Benefits and risks of using threads Provide participants the agenda for the next few slides. Revise all the threading concepts discussed in Chapter 1. Clarify their queries (if any) related to these objectives. 2018/9/20
56
What is Parallelism? Definition:
Parallelism occurs when two or more processes or threads execute simultaneously. You can achieve parallelism for threading architectures in the following two ways: Multiple Processes: Communication between processes happens through Inter-Process Communication (IPC). Single Process with Multiple Threads: Communication between threads happens through the shared memory location. Review all the concepts on parallelism. Explain that the single process, multiple threads, and shared memory is the parallel model that will be used in the session. Parallelism occurs when two or more processes or threads execute simultaneously. It requires multiple cores. For example, threads in a multithreaded application on a shared-memory multiprocessor execute in parallel, and each thread has its own set of execution resources. In order to achieve parallel execution in a software application, hardware needs to provide a platform that supports the simultaneous execution of multiple threads. You can achieve parallelism for threading architectures in the following two ways: Multiple Processes with Multiple Threads: Consider two applications on two different processes. The two processes being separate, threads need to communicate with each other through a message passing protocol, through inter-process communication (IPC). Identify predictable threads that can be executed in parallel, analyze program features, and predict future control flows and data values. Hence, you can improve performance by executing multiple threads simultaneously within multiple processor cores. Single Process with Multiple Threads: Another way to achieve thread level parallelism is to have a single process and spawn threads within that process. The communication between threads in this case happens through the shared memory location. One thread writes data to the agreed upon memory location and the receiving thread reads from that location. It depends upon the programming logic to ensure that the write() function executes before the read() function, otherwise you may receive garbled data. Clarify their queries (if any) related to this slide. 2018/9/20
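A minimal sketch of single-process, shared-memory communication with Windows threads, assuming hypothetical names; the event enforces the write-before-read ordering described above:

#include <windows.h>
#include <stdio.h>

static int sharedMessage;          // agreed-upon shared memory location
static HANDLE dataReady;           // signals that the write has completed

DWORD WINAPI WriterThread(LPVOID arg)
{
    sharedMessage = 42;            // write to the shared location
    SetEvent(dataReady);           // tell the reader the data is valid
    return 0;
}

DWORD WINAPI ReaderThread(LPVOID arg)
{
    WaitForSingleObject(dataReady, INFINITE);   // do not read before the write
    printf("received %d\n", sharedMessage);
    return 0;
}

int main(void)
{
    dataReady = CreateEvent(NULL, TRUE, FALSE, NULL);  // manual-reset, initially not signaled
    HANDLE h[2];
    h[0] = CreateThread(NULL, 0, ReaderThread, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, WriterThread, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    return 0;
}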
57
Serial code limits speedup.
Amdahl’s Law Amdahl's Law: Is a law governing the speedup of using parallel processors on an application, versus using only one serial processor. Describes the upper bound of parallel execution speedup. [Figure: a serial run time divided into a serial portion (1-P) and a parallel portion P; with n = 2 the parallel portion shrinks to P/2, and with n = ∞ it shrinks toward zero, leaving only the serial portion (1-P).] n = number of processors. Tparallel = {(1-P) + P/n} Tserial. Speedup = Tserial / Tparallel. Explain and illustrate Gene Amdahl’s observation about the maximum theoretical speedup, from parallelization, for a given algorithm. In chapter 1, you also learned about metrics by which you can measure the performance benefit of parallel programming. You are now aware of the concept of speedup of a program, which is the ratio of the time it takes a program to execute in serial (with one core) to the time it takes to execute in parallel (with multiple cores). Amdahl's Law is a law governing the speedup of using parallel processors on an application, versus using only one serial processor. The more parallelized the software, the better the speedup. Gene Amdahl in 1967 examined the maximum theoretical performance benefit of a parallel solution relative to the best performance of a serial solution. According to Amdahl’s law, the program speedup is a function of the fraction of a program that is accelerated and by how much that fraction is accelerated. According to Amdahl's law, Tparallel = {(1-P) + P/n} Tserial and Speedup = Tserial / Tparallel = 1 / {(1-P) + P/n}. Consider a serial code composed of a parallelizable portion P and a serial portion 1-P in equal proportions (P = 0.5), where: Tserial is the time it takes to execute the serial code. Tparallel is the time it takes to execute the threaded code. n is the number of cores. If you substitute 1 for the number of cores, you obtain no speedup. If you have a dual-core platform that performs half the work in parallel, the result is Speedup = 1/(0.5 + 0.5/2) = 1/0.75 ≈ 1.33. The calculation shows a 33 percent speedup because the run time, given by the denominator, is 75 percent of the original run time: if P is perfectly divided into two parallel components on two cores, the overall time Tparallel is reduced to 75 percent of Tserial. For n = 8, Speedup = 1/(0.5 + 0.5/8) = 1/0.5625 ≈ 1.78. By setting n = ∞ and assuming that the best sequential algorithm takes one unit of time, you obtain Speedup = 1/(1-P). In the limit of perfect parallelization of P = 0.5 on very many cores, the overall time approaches 50 percent of the original Tserial. Amdahl assumes that the addition of processor cores is perfectly scalable. The Speedup = 1/(1-P) form of the law shows that the maximum benefit a program can expect from parallelizing some portion of the code is limited by the serial portion of the code. For example, according to Amdahl’s law, if 10 percent of your application is spent in serial code, then the maximum speedup that can be obtained is 10x, regardless of the number of processors. Endlessly increasing the number of processor cores only shrinks the parallel term of the denominator. Conversely, if a program is only 10 percent parallelized, the maximum theoretical benefit is that the program can run in 90 percent of the sequential time. Questions to Ask Students Does overhead play a role? Answer: Yes; see the code for finding prime numbers on the next slide, where threads are more efficient than processes. Are unwarranted assumptions built in about scaling? Do the serial and parallel portions increase at the same rate with increasing problem size? 
Answer: This can lead to a brief aside about the complementary point of view, Gustafson’s law, which assumes the parallel portion grows more quickly than the serial. Clarify their queries (if any) related to this slide. 1.0/0.75 = 1.33 (n = 2); 1.0/0.5 = 2.00 (n = ∞). Serial code limits speedup. 2018/9/20
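A small worked example of the formula, assuming P = 0.5 as on the slide (the helper function is illustrative, not part of the course code):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - P) + P / n),
   where P is the parallelizable fraction and n is the number of cores. */
double AmdahlSpeedup(double P, int n)
{
    return 1.0 / ((1.0 - P) + P / n);
}

int main(void)
{
    double P = 0.5;   /* half the run time is parallelizable */
    printf("n = 2: %.2fx\n", AmdahlSpeedup(P, 2));   /* 1 / 0.75   = 1.33x */
    printf("n = 8: %.2fx\n", AmdahlSpeedup(P, 8));   /* 1 / 0.5625 ≈ 1.78x */
    /* As n grows without bound, speedup approaches 1 / (1 - P) = 2x. */
    return 0;
}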
58
Processes and Threads Relationship of threads with a process:
Modern operating systems load programs as processes and categorize processes as having two roles: Resource holder Execution (thread) Every process has at least one thread, which is the main thread that initializes the process and begins executing the initial instructions. Threads can create other threads within the process. Each thread gets its own stack. All threads within a process share code and data segments. A process starts executing at its entry point as a thread. [Figure: a process containing a main() thread and Threads 1 through N, each with its own stack, all sharing the code segment and data segment.] Review the relationship between processes and threads. Every process has at least one thread, which is the main thread that initializes the process and begins executing the initial instructions. Threads can create threads within the process. Each thread gets its own stack. All threads within a process share code and data segments. A process starts executing at its entry point as a thread. Clarify their queries (if any) related to this slide. Threads in a Process 2018/9/20
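A minimal Windows-threads sketch of this relationship, assuming hypothetical names: the global variable lives in the shared data segment, while each thread's local variable lives on that thread's own stack.

#include <windows.h>
#include <stdio.h>

int sharedData = 42;                     // data segment: one copy, visible to all threads

DWORD WINAPI ThreadFunc(LPVOID arg)
{
    int localCopy = sharedData;          // stack variable: private to this thread
    localCopy += (int)(INT_PTR)arg;      // changing it does not affect other threads
    printf("thread %d computed %d\n", (int)(INT_PTR)arg, localCopy);
    return 0;
}

int main(void)
{
    HANDLE h[2];
    for (int i = 0; i < 2; i++)
        h[i] = CreateThread(NULL, 0, ThreadFunc, (LPVOID)(INT_PTR)(i + 1), 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);   // wait for both threads to finish
    for (int i = 0; i < 2; i++)
        CloseHandle(h[i]);
    return 0;
}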
59
Threads − Benefits and Risks
Benefits of using threads are: Increased performance Better resource utilization Efficient IPC through shared memory Risks of using threads are: Increased complexity of the application Difficult to debug (data races, deadlocks, and livelocks) Review the benefits and risks involved with threaded applications. Some of the advantages of using the threaded applications are: Increased performance and better resource utilization: Threads can communicate faster with each other because they share the same address space. This allows threads to access the global variables of any process making communication simpler and quicker. Therefore, multithreaded programs can run multiple threads simultaneously, finishing more tasks in less time. This improves the overall system performance and enhances the processor utilization. IPC through shared memory is more efficient: It refers to inter-process communication (IPC), passing messages or data from one process to another. This can be done in an easier way through shared memory than across network wires of distributed systems. For the latter, a thread in a process should initiate a send function. This function moves data from user space to kernel space, the data is moved through the NIC onto the network and into the receiving system's network interface card (NIC). At the receiver, the data is set into kernel space until a receive function call is made to move it into user space buffers. With shared memory, threads merely write into and read from an agreed upon shared location. The threads should ensure that the write completes before the read operation or data may be lost. In addition, using threaded applications has some disadvantages. Some of the risks of using the threaded applications are: Increases complexity of the application: Adding threads and managing the threaded codes increases the complexity of the application. In some cases, it may be necessary to modify the algorithm to make it more amenable to threading. Both of these code changes can add complexity to the application that was not present in the serial version. Difficult to debug: Debugging is difficult with traditional means since threads are scheduled asynchronously. Therefore, repeating errors may not be easy. In addition, the bugs are non-deterministic, that is, they may not occur during each test, and a QA process designed for serial code will very likely miss bugs in threaded code. Using traditional debuggers and other tools can affect the interleaving of thread execution that leads to an error. Thus, if you cannot repeat the bad behavior of the code, using a debugger (or printf statements) overrides the erroneous scheduling, and debugging threaded code can be difficult without specialized tools. Clarify their queries (if any) related to this slide. 2018/9/20
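A minimal sketch of why threaded bugs are non-deterministic, assuming a hypothetical unsynchronized counter; this is exactly the kind of data race that Intel Thread Checker is designed to catch:

#include <windows.h>
#include <stdio.h>

static long counter = 0;           // shared, but updated without synchronization

DWORD WINAPI RacyWorker(LPVOID arg)
{
    for (int i = 0; i < 1000000; i++)
        counter++;                 // read-modify-write: two threads can interleave here
    return 0;
}

int main(void)
{
    HANDLE h[2];
    h[0] = CreateThread(NULL, 0, RacyWorker, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, RacyWorker, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    // Expected 2000000, but the printed value can change from run to run, and
    // attaching a debugger or adding printf calls often hides the failing interleaving.
    printf("counter = %ld\n", counter);
    return 0;
}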
60
Questions Before Threading Applications
Answer the following questions before threading an application: Where to thread? How much redesign or effort is required? Is it worth threading a selected region? How long will it take to thread? What should be the expected speedup? Will the performance meet expectations? Will it scale when you add more threads or data? Which threading model to use? Explain key considerations for developers beginning to thread an application: Where to thread? Identify the hotspots in the application. The hotspots are the most time-consuming regions. Identifying these regions helps you focus the tuning effort on regions that have the greatest potential for performance improvement. Then, you can thread the code that calls these regions or the regions themselves if the former is not possible. How much redesign or effort is required? Estimate the amount of work required to redesign the code to incorporate threading. This will help you define the time required to enhance the code. Is it worth threading a selected region? After you identify portions of code that you need to thread, analyze the advantages and disadvantages of threading those portions. Evaluate the benefit of threading these portions of code against the amount of time required to thread, test, debug, tune, and maintain the code. How long will it take to thread? Estimate the total time that you will require to thread the code. Then, you can estimate the total cost involved in threading the code. What should be the expected speedup? Estimate the speedup of the application after threading the code by using Amdahl’s Law. Theoretically, this is the maximum speedup you can expect. Will the performance meet expectations? Decide whether the potential speedup can achieve the required speedup of the project goals. For example, if you need to run the threaded code with a 3.5X speedup with four threads on four cores, and the estimated theoretical limit is only 2.5X on four cores. In such a case, you need to determine whether there is more parallelism that you can use or whether the project is worth threading. Will it scale when more threads or data are added? Consider whether the threading changes proposed are scalable with addition of threads, data, or processor cores. This is important because future platforms may have additional cores that allow you to use more threads and increase the size of the datasets. You can include this feature in the design if you require scalability in the future. Which threading model to use? Choosing the right threading model minimizes the amount of time you spend in modifying, debugging, and tuning a threaded code. For compiled applications, you can choose between native models or OpenMP model. Clarify their queries (if any) related to this slide. 2018/9/20
61
Case Study − Prime Number Generation
Consider the code to find prime numbers:

bool TestForPrime(int val)
{  // let us start checking from 3
   int limit, factor = 3;
   limit = (long)(sqrtf((float)val)+0.5f);
   while( (factor <= limit) && (val % factor) )
      factor ++;
   return (factor > limit);
}

void FindPrimes(int start, int end)
{
   int range = end - start + 1;
   for( int i = start; i <= end; i += 2 ) {
      if( TestForPrime(i) )
         globalPrimes[gPrimesFound++] = i;
      ShowProgress(i, range);
   }
}

Explain the prime number algorithm and code to be used for all the nine lab activities in this chapter. This serial code searches for prime numbers within a given range of positive integers. Each odd number in the range is tested by dividing it by all the potential factors. If a factor evenly divides the tested number, the number is a composite number. If all the factors leave a remainder when you divide the tested number, the number is a prime number. Two functions are important in this computation. The FindPrimes() function receives the range for testing, then proceeds to call the TestForPrime() function on each odd number in the range. The loop iterator is incremented by 2. It is assumed that start is an odd number greater than or equal to 3, which is guaranteed in the main program. If the TestForPrime() function returns the value TRUE, a global array is used for storing the prime number, and the array index is incremented to store the next prime number found. To keep the user informed about the progress of the computation, the FindPrimes() function calls the ShowProgress() function, which counts the numbers that have been tested and prints notations of the percentage of completion of the task on the screen. The TestForPrime() function does the primary testing work by accepting a number to be tested for being prime. It first finds the limit of the numbers that are considered to be potential factors: the limit for the factor search is set at the square root of the number under consideration, because any factor greater than the square root of a number must be paired with a factor that is less than the square root. Then, all the numbers between 3 and the limit are tested as potential factors. At each iteration of the inner loop, the next factor is calculated by incrementing the local factor variable. This loop will exit under one of the following two conditions: the factor has been incremented to a number greater than the limit, or the modulus operation results in a zero. If the loop terminates because of the second condition, the number is composite and the function returns FALSE. If no modulus operation results in a zero remainder, the factor has been incremented past the limit and the TestForPrime() function returns TRUE. Two different regions in the code contain independent tasks: you can thread either of these regions. In the TestForPrime() function, each division of potential factors does not rely on the order or results of any other division. In the FindPrimes() function, each possible prime number within the range can be tested independently assuming that the order of the prime numbers is unimportant. 
The VTune analysis (Lab 2) helps you decide the best place for threading between two such possibilities. However, there is only one region in the code where you can successfully use OpenMP—the FindPrimes() function. The loop in the TestForPrime() function is a while loop and cannot be threaded by using OpenMP. Clarify their queries (if any) related to this slide. 2018/9/20
62
Activity 1 Objectives: Steps Involved:
Compile and run the serial version of the code for finding prime numbers. Observe the behavior of the serial application. Steps Involved: Locate the PrimeSingle directory. Compile the code with Intel compiler in Visual Studio. Run the code few times with different ranges. Introduction to the First Lab Activity, whose purpose is to build the initial, serial version of the application. Explain the participants the objectives for the activity. Clarify their queries (if any) related to the objectives. 2018/9/20
63
Design and Implementation
Development Methodology Analysis Design and Implementation Debugging Tuning Phases of Generic Threading Methodology A typical threading methodology consists of the following phases: Analysis: Find computationally intense code. Design and Implementation: Determine how to implement threading solution and select a threading model to build concurrency in the application. Debugging: Detect any problems resulting from using threads. Tuning: Achieve the best parallel performance. Define the methodology to use when migrating a serial application to a threaded one. A typical threading methodology consists of the following phases: Analysis: Measure the performance of a serial application and identify the critical components of code you can multithread during the analysis phase. Design and Implementation : After analysis, determine whether you can decompose the computational work of the application into independent tasks and design multithreading methods. Select an appropriate threading model to build concurrency in the application. Debugging: Check whether the resulting multithreaded code generates accurate results. Tuning: Compare the performance of the multithreaded application with the serial application and verify the performance gains. Tune the multithreaded application for maximum performance gains and eliminate all problems. Clarify their queries (if any) related to this slide. 2018/9/20
64
Design and Implementation
Tools and Development Methodology You can apply the knowledge of Intel’s tools in the development cycle for the design and development of any threaded application. Analysis Design and Implementation Debugging Tuning Intel® VTuneTM Performance Analyzer Intel® Thread Checker Intel® Debugger Intel® Thread Profiler VTune Performance Analyzer Intel® Performance Libraries: IPP and MKL OpenMP (Intel® Compiler) Explicit Threading (Windows, P threads) Assigns details and visually reinforce the points made on the previous slide. Specific tools and threading models are inserted into the general outline made on the previous slide to point out the iterative nature of both debugging and the overall development cycle. The different ways in which you can use different Intel tools during the development stages of the threaded application are: Analysis: During the analysis phase, you need to profile a serial application to determine the regions of the application that would benefit most from threading. Then, choose an appropriate threading model for these regions in which threading is a viable option. To do this, first determine the type of parallelism that characterizes each candidate section. You can use Intel® VTuneTM Performance Analyzer in the analysis phase to identify critical paths by using Call-Graph Analysis. You can also use VTune to perform Sampling analysis for determining the hotspots along the critical path. Design and Implementation: After profiling your application, select an appropriate design strategy for threading. In the design phase, you identify what computations can be executed concurrently. Then, select an appropriate threading model. The implementation phase involves converting the design elements to actual code by using an appropriate threading model. Before you write the code to create and manage threads, you need to inspect the code and determine whether you can use any threaded performance library functions in the application. If you can use any of the Intel Performance Libraries such as integrated performance primitives (IPP) and math kernel library (MKL), you can achieve your threading performance goals without considering more drastic changes or re-engineering the source code. You can use Intel compilers to parallelize certain loops in your code. Use the compiler auto-parallel reports to determine how much of the code was parallelized or why the other loops should not be made parallel. Then, you can modify the code to correct the concerns of the compiler. Debugging: During the debugging phase, you fix the bugs to ensure the correctness of the application and meet product requirements. You can use Intel Thread Checker and Intel Debugger to debug multithreaded applications. Debugging threaded applications is very tricky because debuggers change the run-time performance, which may mask race conditions. You can use Intel Thread Checker to locate threading errors such as race conditions, deadlocks, stalled threads, lost signals, and abandoned locks. Intel Thread Checker provides very valuable parallel execution information and debugging hints. It monitors the OpenMP pragmas, Windows threading APIs, and all memory accesses by using source code or binary instrumentation. Tuning: After debugging, you need to tune your code for maximum performance. During the tuning phase, examine the execution performance and the correctness of the threaded application. 
You can use Intel Thread Profiler to identify and locate bottlenecks that limit the parallel performance of your multithreaded application. It performs program execution flow and critical path analysis to determine whether any threading delay in a multithreaded application will affect the overall execution time. VTune Performance Analyzer analyzes your application to find hotspots. For those hotspots already threaded, examine those parts of the code to determine if the changes will affect performance. If hotspots are not threaded, you can consider threading your code in order for these new points in your application to run concurrently. Clarify their queries (if any) related to this slide. Tools Used During Different Stages of Threading Methodology 2018/9/20
65
Analysis Phase The goal of the analysis phase is to determine the regions of potential parallelism in the application. You need to answer the following threading questions during the analysis phase: Where to thread? Is it worth threading a selected region? Provide participants the agenda of the next few slides. The goal of the analysis phase is to measure the performance of the serial application and determine the regions of potential parallelism in the application. Baseline timing is part of the overall analysis, which is necessary to measure the impact of any threading efforts during the Testing phase. You need to answer the following threading questions during the analysis phase: Where to thread? Is it worth threading a selected region? You should carefully examine the regions of the code that consume a large percentage of run time. If you can thread code that executes the hotspots of the application, you will reap the most benefit. Find regions of the code that appear to have minimal dependencies and appear to be data-parallel. This will decrease the amount of threading work and restructuring of code. To measure the performance of a serial application, you can use a representative workload that uses most of the code paths being analyzed. The primary tool that you use in this phase is Intel® VTuneTM Performance Analyzer, which helps you to identify the regions in the code where maximum computations will be performed. After you select the workload, run the application on the workload to gather both sampling and call graph analysis. Clarify their queries (if any) related to this slide. 2018/9/20
66
Identifies the time-consuming regions.
Analysis − Sampling Use VTune Sampling to find hotspots in your application: Let us use the project PrimeSingle for analysis: Usage: ./PrimeSingle <start> <end>

bool TestForPrime(int val)
{  // let us start checking from 3
   int limit, factor = 3;
   limit = (long)(sqrtf((float)val)+0.5f);
   while( (factor <= limit) && (val % factor) )
      factor ++;
   return (factor > limit);
}

void FindPrimes(int start, int end)
{  // start is always odd
   int range = end - start + 1;
   for( int i = start; i <= end; i += 2 ) {
      if( TestForPrime(i) )
         globalPrimes[gPrimesFound++] = i;
      ShowProgress(i, range);
   }
}

Introduce and explain further the role of VTune Sampling: Initially state the workload (find all primes between 1 and ). Show an extract from the VTune user interface highlighting the function TestForPrime(). Show the corresponding source code fragment. The objective of the analysis phase is to find suitable locations for threading. Threading code that executes the hotspots identified by VTune Sampling provides a good opportunity to get the maximum benefit from threading. OpenMP makes it practical to add threading to all the lowest-level loops in your code. Each may have only a small incremental performance gain, but these may all add up to a significant gain overall. In the code example to find the prime numbers discussed earlier, you can use VTune Sampling to locate the most time-consuming functions. The tool should point to the TestForPrime() function. Thus, based on Sampling alone, the programmer would initially look to thread within the TestForPrime() function. Clarify their queries (if any) related to this slide. Identifies the time-consuming regions. 2018/9/20
67
Used to find proper level in the call-tree to thread
Analysis − Call Graph Use VTune Call Graph to find the right level to thread your application: This is the level in the call tree where you need to thread. Introduce and explain further the role of VTune Call Graph: Initial view is excerpt from the call graph user interface (bold red arrows show busiest branch). Assertion made that the FindPrimes() function is the right level to thread. After you identify the most suitable regions of code for parallelization, you use Call Graph analysis to determine the type of parallelism to implement. It is highly recommended by Intel that you should do the Call graph analysis after Sampling for the following reasons: Confirm the Sampling results. If you cannot thread the function(s) found by Sampling, the call tree can point to functions that call the time-consuming portion of the code, but may be more amenable to threading. It is more often the case that to create a solution that will scale to larger numbers of cores and threads, it will be beneficial to find the highest point in the call tree that can be threaded, but calls the hotspot routines. Call Graph analysis is appropriate because sampling information may not be suitable for all types of applications. Sometimes, sampling results will have flat profiles, and deciding the right code level at which you can thread may not be apparent. Alternately, the hotspots may point to a function that results in the most amount of execution time because it also has the most number of calls. Use the call graph data to understand the problem better. If you run the Sampling and Call Graph analyses on the code example to find the prime numbers, it appears that the TestForPrime() function is the most time-consuming function. However, if you divide the Total Self Time (1,454,635 microseconds) by the number of calls (500,000), you will find that the average execution time of the TestForPrime() function is less than 3 microseconds per call. This time is not sufficient to overcome the overhead of threading with the TestForPrime() function. Therefore, you will find that the FindPrimes() function is the next function in the hierarchy. If you can thread that function, calls to the TestForPrime() function would run concurrently and keep the granularity of the threaded computations in that function at the coarsest level possible. Algorithmic examination of the functions will show that the TestForPrime() function could not be threaded by using OpenMP due to the while loop. While dealing with much larger codes, code examination may not be feasible. Sampling and Call Graph analyses provide you some directed hints as to where threading can be most beneficial. Clarify their queries (if any) related to this slide. Used to find proper level in the call-tree to thread 2018/9/20
68
Analysis Where to thread? Is it worth threading a selected region?
Consider the following for the threading decision during the analysis phase: Where to thread? FindPrimes() function Is it worth threading a selected region? Regions of the code that appear to have minimal dependencies Regions of the code that appear to be data-parallel Regions of the code that consume over 95 percent of the run time Baseline measurement Further analyze the assertion made in the previous slide. In addition, introduce the baseline measurement. The goal of the analysis phase is to prepare a baseline measurement of the performance of the serial application and determine the regions of potential parallelism in the application. Consider the following for the threading decision during the analysis phase: Where to thread? Is it worth threading a selected region? Regions of the code that appear to have minimal dependencies Regions of the code that appear to be data-parallel Regions of the code that consume over 95 percent of the run time Baseline timing is part of the overall analysis, necessary to measure the impact of any threading efforts. Clarify their queries (if any) related to this slide. 2018/9/20
69
Activity 2 Objective: Steps Involved:
Run serial code through Intel® VTuneTM Performance Analyzer to locate portions of the application that use the most computation time. These will be the parts of the code that are the best candidates for threading to make a positive performance impact. Steps Involved: Run code with ‘ ’ range to get baseline measurement. Make note for future reference. Run VTune analysis on serial code. Identify the function that takes the most time. Introduction to the Second Lab Activity, whose purpose is to generate a baseline serial-code measurement and run the VTune sampling analysis. Explain the participants the objective for the activity. Clarify their queries (if any) related to the objective. 2018/9/20
70
Design and Implement a Threading Solution
After you identify the potentially beneficial point in the code for threading, you need to select a threading model to implement this solution. The various programming models available for threading your code are: Foster’s Design Method Data Decomposition Functional Decomposition Pipelined Decomposition Provide participants the agenda for the next few slides. The decomposition method determines the type of data restructuring while converting a serial application to a threaded application, if required. While dealing with much larger codes, code examination may not be feasible. Sampling and Call Graph analyses provide you some directed hints as to where threading can be most beneficial. Clarify their queries (if any) related to this slide. 2018/9/20
71
Foster’s Design Methodology
Notes on Foster’s design methodology: Ian Foster, in his book Designing and Building Parallel Programs, proposed a design methodology for parallel programming. The design methodology is an exploratory approach in which machine-independent issues are considered early and machine-specific aspects of design are delayed until late in the design process. This methodology structures the design process as four distinct stages — partitioning, communication, agglomeration, and mapping. In the first two stages, the focus is on concurrency and scalability. The third and fourth stages emphasize locality and other performance-related issues. Introduce Foster’s design methodology for parallel programming. To utilize the benefits of Hyper-Threading Technology, you need to thread your applications correctly. The first step in designing a threaded application from a serial application is to describe the parts of the application that need to be threaded in terms of a parallel programming model. Ian Foster proposed a design methodology for parallel programming. Ian Foster’s 1994 book is well-known to practitioners of this dark art, and his book is available online at Clarify their queries (if any) related to this slide. 2018/9/20
72
Foster’s Design Methodology (Continued)
Four stages in Foster’s design methodology: Partitioning Divide computation and data to be processed during computation into small tasks. Communication Determine the amount and pattern of communication required to coordinate task execution. Agglomeration Combine tasks into larger tasks to improve performance. Mapping Assign agglomerated tasks to processors or threads. [Figure: a problem is partitioned into initial tasks, communication channels are added between tasks, tasks are agglomerated into combined tasks, and the combined tasks are mapped onto the final program.] Graphically illustrate the four steps in Foster’s design methodology. The stages of Foster’s design methodology are: Partitioning: In the partitioning phase, decompose the computation and data into small tasks. Define a large number of small tasks to obtain a fine-grained decomposition of the problem. You can then apply domain and functional decomposition to various components of a single problem, or to the same problem, to obtain alternative parallel algorithms. Communication: The independent tasks generated by the partitioning stage execute concurrently. Generally, the computation for one task requires data associated with another task, so data must be transferred between the tasks for the computation to progress. To transfer data, two tasks need a channel on which one task can send messages and the other can receive them. In the communication phase, you specify this information flow and determine the communication required to coordinate task execution. Agglomeration: In the agglomeration phase, estimate the performance requirements and the cost to implement the task and communication structures defined in the first two stages. You can combine small tasks into larger tasks to improve performance or reduce development costs. Revisit the decisions you made in the partitioning and communication phases to obtain efficient execution on parallel computers. Try to reduce the communication costs and retain flexibility in terms of scalability. Mapping: In the mapping phase, assign each task to a thread or core. In this phase, you need to maximize core utilization and minimize communication costs. You can specify mapping statically if tasks have approximately the same amount of work. To achieve load balance with unknown or uneven amounts of work per task, you can specify mapping of tasks dynamically. Clarify their queries (if any) related to this slide. Stages of Foster’s Design Methodology 2018/9/20
73
Example: Weather Modeling Application
Functional Decomposition [Figure: a Weather Modeling Application decomposed into an Atmosphere Model, Ocean Model, Land Surface Model, and Hydrology Model.] Functional Decomposition: The technique used is task-level parallelism. The focus is on the computation to be performed. Divide the computation into smaller tasks, then associate each task with the corresponding data on which it operates. All the tasks are independent and constituents of the same computation problem. Illustrate by example the functional or task decomposition in a weather modeling application. Functional Decomposition: Functional decomposition involves dividing the computation into small, independent tasks and then associating the corresponding data on which a task operates. This decomposition method is traditionally used to thread desktop applications that include independent tasks such as screen update, disk read, disk write, and print. In functional decomposition, focusing on computations can reveal structure in a problem. Functional decomposition works at a different level than data decomposition. In functional decomposition, you map independent work to asynchronous threads. Each domain in the above application, such as atmosphere, hydrology, ocean, and land surface, can be treated independently, leading to a task-parallel design. For functional-decomposition problems, it is often best to use explicit threading because these are situations where you can define the roles of different threads by having different functionalities. Clarify their queries (if any) related to this slide. 2018/9/20
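A minimal Windows-threads sketch of this task-parallel design; the four model functions are hypothetical stand-ins, not code from an actual weather application:

#include <windows.h>

// Hypothetical stand-ins for the independent domain computations.
DWORD WINAPI AtmosphereModel(LPVOID p)  { /* ... */ return 0; }
DWORD WINAPI OceanModel(LPVOID p)       { /* ... */ return 0; }
DWORD WINAPI LandSurfaceModel(LPVOID p) { /* ... */ return 0; }
DWORD WINAPI HydrologyModel(LPVOID p)   { /* ... */ return 0; }

void RunOneTimeStep(void)
{
    // Task-level parallelism: each thread runs a different function.
    HANDLE h[4];
    h[0] = CreateThread(NULL, 0, AtmosphereModel,  NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, OceanModel,       NULL, 0, NULL);
    h[2] = CreateThread(NULL, 0, LandSurfaceModel, NULL, 0, NULL);
    h[3] = CreateThread(NULL, 0, HydrologyModel,   NULL, 0, NULL);
    WaitForMultipleObjects(4, h, TRUE, INFINITE);   // wait for all four domains
    for (int i = 0; i < 4; i++)
        CloseHandle(h[i]);
}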
74
Example: Ocean Modeling Application
Data Decomposition Data Decomposition: The technique used is data parallelism. Decompose the data associated with a problem into small pieces of approximately equal size. Associate each operation with the data on which it operates. Same operation is performed on different data. Illustrate by example, the data decomposition in weather modeling application. Data Decomposition: In the domain decomposition approach to problem partitioning, you need to divide the data associated with the problem into small pieces of equal size. Then, associate the computation with the data on which it operates. This partitioning yields a number of tasks, each comprising some data and a set of operations on that data. The data that is decomposed may be the input to the program, the output computed by the program, or intermediate values maintained by the program. Different partitions may be possible based on different data structures. You can first focus on the largest data set or on the data accessed most frequently or on the data associated with the hotspots of the application. Different phases of the computation may operate on different data structures or demand different decompositions for the same data structures. In this case, treat each phase separately and then determine how the decompositions and parallel algorithms developed for each phase fit together. In the above grid for an ocean modeling application, divide the large data set of grid points into blocks and assign those blocks to individual threads. The grid point data set is both the largest and the most frequently accessed data structure. For data-decomposition problems, the OpenMP model is easily used because these are often situations where you need to assign multiple threads to perform the same functionality on different data. Data parallelism is a common technique in high performance computing (HPC) applications. Clarify their queries (if any) related to this slide. Example: Ocean Modeling Application 2018/9/20
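A minimal OpenMP sketch of the same idea, assuming a hypothetical grid and update (the array, its dimensions, and the operation are illustrative):

#include <omp.h>
#define ROWS 1024
#define COLS 1024
float grid[ROWS][COLS];

// Data-level parallelism: every thread runs the same update, each on its
// own block of rows of the shared grid.
void UpdateGrid(void)
{
    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] = 0.5f * grid[i][j];   // placeholder for the real stencil
}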
75
Pipelined Decomposition
Notes on Pipelined Decomposition: Pipelined Decomposition method is an alternative method of multithreading that effectively meets the conditions and challenges of data and task decomposition. In the pipelined decomposition method, individual threads can perform computations in independent stages in a process. In a functional pipeline, you can assign a different thread to compute data at each stage of the pipeline. In a data pipeline, threads processes all stages on a single data instance. Introduce pipelined decomposition, which can apply to either task or data decomposition. The Pipelined Decomposition method is an alternate technique to divide and organize data and computation across threads within applications that have a specific structure. In the pipelined decomposition method, individual threads can perform the computations of a sequence of independent stages. Consider an algorithm where processing occurs sequentially in stages such as stage 1, stage 2, stage 3, and so on. The results of stage N are used as the input for stage N+1. This arrangement of computation and data passing is known as a pipeline. You can divide the stages in the pipeline and assign threads based on functional or data decomposition. In a functional pipeline, you assign a separate thread to compute data at each stage of the pipeline. The data then passes between the stages. When a thread completes the work in the stage that the thread has been assigned, it passes the data to the other thread. This resembles the automobile assembly line, where each worker is a thread, and building each part of a car represents a task. Each worker adds its own part to the car and passes it on to the next worker to add the next portion. In a data pipeline, each thread processes all stages on a given data instance. Consider an automobile assembly line where a single worker builds an entire car. A worker passing through all the stages of the assembly works on the same car until completion does this. Using a functional pipeline or a data pipeline may depend on the dependencies of data between the stages of the pipeline. If there are no dependencies, and each data instance can be worked on in parallel, a data pipeline will work. If there are dependencies as data passes between stages, such as the results of the Nth data instance are needed to process the (N+1)th instance for a given stage, a functional pipeline will be needed. In a functional pipeline, ensure that correct pipelined data sharing occurs through overlapping critical sections. Bind each pipeline or processing stage of the application by an overlapping Critical Section entry or exit. Consider an airlock on a spacecraft. To enter from space, you must enter through the external door and close it. After pressurizing the airlock, you can open the inner door and enter the capsule. Similarly, data must be available from the previous stage before the thread controlling the current stage can begin computations. Clarify their queries (if any) related to this slide. 2018/9/20
76
Example − LAME Encoder What is LAME Encoder?
LAME Encoder is an example of an application that uses pipelined decomposition. The MP3 encoder called LAME is an open-source project. LAME is an educational tool used for learning about MP3 encoding. Goals of the LAME project: Improve the psychoacoustic quality of MP3 encoding. Improve the speed of MP3 encoding. Introduce a particular application, the LAME audio encoder, to set up the next slide showing LAME in a pipelined decomposition. MPEG Layer III, also known as MP3, is the most popular format for PC-based audio encoding-decoding today. MP3 encoders produce encoded audio files that are smaller than un-encoded digital audio files. They do so by encoding only sound frequencies that can be perceived by human hearing. MP3 audio encoders accept a wave file as input, either from a user's local disk or ripped from a CD, and produce an MP3 file as output. The Lame MT project with full description and source code is available online at Clarify their queries (if any) related to this slide. 2018/9/20
77
LAME Pipeline Strategy
[Diagram: the four pipeline stages and their sub-steps — Prelude (fetch next frame, frame characterization, set encode parameters), Acoustics (psycho analysis, FFT long/short, filter assemblage), Encoding (apply filtering, noise shaping, quantize and count bits), Other (add frame header, check correctness, write to disk); over time, threads T1–T4 work on frames N, N+1, N+2, and N+3 concurrently, separated by a hierarchical barrier.] Show how the LAME compute sequence maps to a pipelined threading approach. The MP3 process that LAME employs has at least four stages: prelude, acoustics, encoding, and other. In the prelude stage, you input the data, get the next frame, and set some of the encoding parameters. In the next stage, acoustics, you perform the psychoacoustic analysis: check which frequencies are out of range and apply the FFTs and filters. Then you perform the encoding in the encoding stage: shape the noise, apply filters to the range, cut off the numbers outside the range, and quantize the bits. In the other stage, write the computations to the disk, add frame headers, and check for correctness. You perform functional decomposition of the LAME Encoder. Assign four threads, one per stage, and start sending frames of music down the pipeline to be encoded. At some time, thread 1 starts work on the prelude of frame N. Thread 1 can then work on the next frame and pass the prelude stage results to thread 2, which starts working on the acoustics portion of the Nth frame. Then, encoding starts in thread 3 for the Nth frame. In the final stage, thread 4 computes the output. With the pipeline method, you can work on parts of four different frames with four threads concurrently. Clarify their queries (if any) related to this slide. Illustration of the Computation in the LAME Encoder Pipeline Strategy 2018/9/20
78
Rapid prototyping with OpenMP
Design − Prime Number Generation Consider the code for finding prime numbers: What is the expected benefit? How do you achieve this with the least effort? How long would it take to thread? How much re-design or effort is required? Speedup(2P) = 100/(96/2+4) = ~1.92X Rapid prototyping with OpenMP Return to the example for finding prime numbers to approach the design stage. Introduce OpenMP as a prototyping thread model. Although OpenMP is introduced for prototyping, it may (of course) prove efficient enough to be the thread model of choice for this example. Questions to Ask Students Where does this 2P speedup claim come from? Clarify their queries (if any) related to this slide. 2018/9/20
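The 1.92X estimate comes from Amdahl's Law, under the assumption (taken from the earlier profiling numbers) that about 96% of the serial run time lies in the parallelizable prime-testing loop and about 4% remains serial:

S(p) = \frac{1}{(1 - f) + f/p}, \qquad f = 0.96,\; p = 2
\Rightarrow\; S(2) = \frac{100}{96/2 + 4} = \frac{100}{52} \approx 1.92\times

Letting p grow without bound gives a ceiling of 1/(1 - f) = 25X, which is why even a small serial fraction still limits scaling.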
79
! Review OpenMP Definition:
OpenMP, Open specifications for Multi Processing, is an application program interface (API) used for writing portable, multithreaded applications. OpenMP has the following features: Provides a set of compiler directives—data embedded in source code—for multithreaded programming. Combines serial and parallel code in a single source. Supports incremental parallelism. Supports coarse-grain and fine-grain parallelism. Supports and standardizes loop-level parallelism. Review the concepts on OpenMP. OpenMP was launched as a standard in 1997. Industry collaborators included Intel, but not Microsoft (who were invited but not interested at the time); Microsoft now (2006) supports OpenMP in its compilers. Introduce OpenMP by stating that OpenMP, Open specifications for Multi Processing, is an application program interface (API) that was formulated in 1997. It is used for writing portable, multithreaded applications. OpenMP has made it easy to create threaded Fortran and C/C++ programs. Unlike Windows threads, OpenMP does not require you to create, synchronize, load balance, and destroy threads. OpenMP is not formally accepted as an industry standard for programming, but it is used widely. OpenMP provides a set of compiler directives—data embedded in source code—that informs compilers about the intent of compilation. Compiler directives tell the compiler how to compile. For example, the #include <stdio.h> compiler directive informs the compiler to include the stdio.h file and compile the code based on functions related to this header file. Earlier, software vendors had their own sets of compiler directives. OpenMP standardizes almost 20 years of compiler-directed threading experience by combining all the directives from various vendors. OpenMP is compiler-directed. If the compiler does not recognize a directive, it ignores the directive and compiles the enclosed code as ordinary serial code. OpenMP provides a special syntax that programmers can use to create multithreaded code and avoid rewriting the threaded version of the code. It combines serial and parallel code in a single source. OpenMP also supports incremental parallelism. Incremental parallelism allows you to first modify a segment of the code and test it for better performance. You can then successively modify and test the other code segments. This helps you give priority to critical problems in a code segment. OpenMP also supports both coarse-grain and fine-grain parallelism. Coarse-grain parallelism means individual tasks are relatively big in terms of code size and execution time, whereas fine-grain parallelism means individual tasks are relatively small in terms of code size and execution time. OpenMP also supports and standardizes loop-level parallelism. It uses special syntax to parallelize loops where hotspots or bottlenecks appear. As a result, threads can execute different iterations of the same loop simultaneously. For the participants who want to learn more about OpenMP, recommend that they visit Highlight that the current specification of OpenMP is OpenMP 2.5. Clarify their queries (if any) related to this slide. ! Note For more information on OpenMP, visit The current specification of OpenMP is OpenMP 2.5. 2018/9/20
80
Review OpenMP (Continued)
The fork-join model is the parallel programming model for OpenMP, where: The master thread creates a team of parallel worker threads. Parallelism is added incrementally; the sequential program evolves into a parallel program. [Diagram: Master Thread → FORK → Parallel Regions → JOIN → Master Thread] Introduce the slide by saying that OpenMP is based on the fork-join model that facilitates parallel programming. Every threading activity in OpenMP follows the fork-join model. Explain the fork-join model with the help of the figure on the slide: All programs based on the fork-join model begin as a single process. In the figure, the application starts executing serially with a single thread called the master thread. The execution of a program following the fork-join model proceeds through two stages: Fork: When the master thread reaches a parallel region or a code segment that must be executed in parallel by a team of threads, it creates a team of parallel threads called worker threads that execute the code in the parallel region concurrently. Join: At the end of the parallel region, the worker threads synchronize and terminate, leaving only the master thread. In OpenMP implementation, threads reside in a thread pool. A thread pool refers to a common pool of threads from which the threads are assigned to the tasks to be processed. The thread that is assigned to a task completes the task and returns to the pool to wait for the next assignment without terminating. When the threads are assigned to a parallel region that has more work than the earlier one, additional threads are added at run time from the thread pool. Threads are not destroyed until the end of the last parallel region. A thread pool can prevent your machine from running out of memory due to continuous creation of threads in your program. The fork-join method enables incremental parallelism. You do not need to thread the entire algorithm. For example, if there are three hotspots in the code, you can concentrate on the hotspots one by one based on their severity levels. Clarify their queries (if any) related to this slide. Fork-Join Model 2018/9/20
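A minimal fork-join sketch, assuming nothing beyond the OpenMP runtime itself: the master thread forks a team at the parallel region, each worker reports its ID, and the team joins back into the master thread at the closing brace.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial region: master thread only\n");

    #pragma omp parallel     /* FORK: the master creates a team of worker threads */
    {
        int id = omp_get_thread_num();
        printf("parallel region: hello from thread %d of %d\n",
               id, omp_get_num_threads());
    }                        /* JOIN: the team synchronizes; only the master continues */

    printf("serial region again: master thread only\n");
    return 0;
}

Built with an OpenMP-enabled compiler switch (for example /Qopenmp on the Intel compiler), the block between the braces executes once per thread in the team.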
81
OpenMP in Prime Number Generation
Consider a specific syntax of OpenMP implemented into the code for finding prime numbers (callouts: the parallel keyword creates threads for this parallel region; the for keyword divides the iterations of the for loop):

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

Show a specific syntax of OpenMP implemented into the primes code. This is simply a first attempt to thread the FindPrimes() function. Since this function has been identified as a potentially beneficial point in the code for threading and the for loop here can be used with OpenMP, this is what is attempted first. The original source code is not modified because the threading is introduced by pragmas; native thread methods require changes to the serial sources. Note that the parallel region, in this case, is the for loop. This single pragma does two things. First, it creates a parallel region that will be executed by threads; this is signified by the parallel keyword in the pragma. Thus, threads become active (created if they have not already been, or woken up if they have been put to sleep between parallel regions) and begin execution. Second, the for keyword divides the iterations among the threads. If the for keyword were not present, each thread would execute all iterations of the loop. However, since you only want to test each potential prime once, dividing the iterations among the threads gives the chance to execute in parallel and perform all the work needed, but no more. Division of the iterations will be done by the default method defined in the user's OpenMP implementation. Most likely this will be done as if schedule(static) were specified, dividing the iterations into a number of chunks equal to the number of threads; each thread is assigned one chunk of consecutive iterations. The final slide build shows results (number of primes and total time) for the image created with this pragma. Clarify their queries (if any) related to this slide. 2018/9/20
82
Activity 3 Objective: Steps Involved:
Build and run the OpenMP threaded version of the code for finding prime numbers. Steps Involved: Locate the PrimeOpenMP directory and solution. Compile the code. Run the code with the ‘ ’ range for comparison. Find the speedup. Introduction to the Third Lab Activity, whose purpose is to build and run an OpenMP version of primes. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objective. 2018/9/20
83
Speedup of 1.40X (less than 1.92X)
Result Analysis of Activity 3 Questions for discussion after compiling and running the OpenMP threaded version of the code for finding prime numbers: What is the expected benefit? How do you achieve this with the least effort? How long would it take to thread? How much re-design or effort is required? Is this the best speedup possible? Speedup of 1.40X (less than 1.92X) Discuss the results obtained in the previous lab activity, whose purpose was to build and run an OpenMP version of primes. The measured speedup of 1.40X was lower than the expected 1.92X. Clarify their queries (if any) related to this slide. 2018/9/20
84
No! The answers are different each time …
Debugging for Correctness Results after running the OpenMP threaded code for finding prime numbers: No! The answers are different each time … Introduce and stress the importance of correctness. In the example shown, each run produces a different number – the bug is non-deterministic. On some platforms, the answer may be correct 9 times out of 10. Students can test their own implementation (previous lab) on multiple runs. Clarify their queries (if any) related to this slide. Is this threaded implementation right? 2018/9/20
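A minimal, self-contained demonstration (separate from the lab sources) of why the answers differ from run to run: the unsynchronized increment below is a load-modify-store sequence, so concurrent threads can overwrite each other's updates and the final count varies between runs.

#include <stdio.h>

int main(void)
{
    int count = 0;   /* shared by all threads, like gPrimesFound in the lab code */

    /* count++ is not atomic: each thread loads the value, adds one, and stores
       it back, so two threads can read the same old value and one increment is
       lost. Run this a few times: the result usually falls short of 2000000. */
    #pragma omp parallel for
    for (int i = 0; i < 2000000; i++)
        count++;     /* data race: needs a critical section or an atomic update */

    printf("count = %d (expected 2000000)\n", count);
    return 0;
}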
85
Implementing Intel® Thread Checker
Intel® Thread Checker pinpoints threading bugs like data races, stalls, and deadlocks. [Diagram: Primes.exe → Binary Instrumentation → Primes.exe (instrumented) + DLLs (instrumented) → Runtime Data Collector → threadchecker.thr (result file), analyzed with Intel Thread Checker inside the Intel® VTune™ Performance Analyzer.] Introduce Intel Thread Checker as a tool to address threading correctness and outline its implementation. The code is typically instrumented at the binary level, though source instrumentation is also available. To find threading diagnostics including errors, Intel Thread Checker library calls record information about threads, including memory accesses and APIs used. Binary instrumentation is added at run-time to an already built binary module, including applications and dynamic or shared libraries. The instrumentation code is automatically inserted when you run an Intel Thread Checker activity in the VTune environment or the Microsoft .NET Development Environment. Both Microsoft Windows and Linux executables can be instrumented for IA-32 processors, but not for Itanium® processors. Binary instrumentation can be used for software compiled with any of the supported compilers. The final build shows the UI page moving ahead to the next slide. Clarify their queries (if any) related to this slide. Intel Thread Checker Addressing Threading Correctness 2018/9/20
86
Implementing Thread Checker (Continued)
Introduce Intel Thread Checker as a tool to address threading correctness and outline its implementation. Clarify their queries (if any) related to this slide. Intel® Thread Checker Source Code View 2018/9/20
87
Activity 4 Objective: Steps Involved:
Use Intel® Thread Checker to analyze the threaded application and check for threading errors. Steps Involved: Create the Intel Thread Checker activity. Run the application. Are any errors reported? Introduction to the Fourth Lab Activity, whose purpose is to run the Intel Thread Checker analysis illustrated on the previous slide. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objective. 2018/9/20
88
Debugging for Correctness
How much re-design or effort is required? How long would it take to correct? Intel Thread Checker reported only two dependencies, so the effort required should be low. Address the question of how much effort (in terms of cost) will be required to successfully thread this application. As asserted in the slide, with only the two dependencies (on the gPrimesFound and gProgress global counters), the debugging effort should be manageable. Clarify their queries (if any) related to this slide. 2018/9/20
89
Correcting Data Races
Way to correct the race conditions on the gPrimesFound and gProgress shared counters:

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
        #pragma omp critical          // create a Critical Section for this reference
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

and, inside ShowProgress():

#pragma omp critical                  // create a Critical Section for both these references
{
    gProgress++;
    percentDone = (int)(gProgress/range*200.0f+0.5f);
}

Show one way to correct the race conditions on the gPrimesFound and gProgress counters. Critical Sections can only be accessed by one thread at a time, so this solution should correct the race condition. Note the key point, one thread at a time – the Critical Section is, by design, no longer parallel. Clarify their queries (if any) related to the objectives. 2018/9/20
90
Activity 5 Objectives: Steps Involved:
Correct the threading errors identified by Intel® Thread Checker. Modify and run the OpenMP version of the code for finding prime numbers. Steps Involved: Add critical region pragmas to the code. Compile the code. Run the application from within Intel Thread Checker. If errors are still present, make appropriate fixes to the code and run again in Intel Thread Checker. Run the code with the ‘ ’ range for comparison. Compile the code and run the application outside Intel Thread Checker. What is the speedup? Introduction to the Fifth Lab Activity, whose purpose is to correct the race conditions discovered in the primes code. The resulting image is then checked for results and performance. Explain to the participants the objectives of the activity. Clarify their queries (if any) related to the objectives. 2018/9/20
91
No! From Amdahl’s Law, expected speedup is close to 1.9X.
Fixing the Bugs Adding critical region pragmas to the code has fixed the threading errors. However, the performance has slipped to ~1.33X. Is this the best that can be expected from this algorithm? Show that the Critical Sections method fixed the bug, but the performance is lower than expected. The slide shows a correct answer, but remind the students that this does not guarantee there is no bug. For example, race conditions, if present, may show up only rarely. To be more rigorous, one would re-run the Intel Thread Checker. Clarify their queries (if any) related to this slide. No! From Amdahl’s Law, expected speedup is close to 1.9X. 2018/9/20
92
Common Performance Issues
Common performance issues identified by Intel® Thread Profiler are: Load Imbalance: improper distribution of parallel work. Contention on Synchronization Objects: excessive use of global data and contention for the same synchronization object. Parallel Overhead: overhead due to thread creation and scheduling. Granularity: insufficient parallel work relative to the threading overhead. List some common performance issues. Revise all the concepts about threading issues such as load imbalance, contention on synchronization objects, parallel overhead, and granularity. Clarify their queries (if any) related to this slide. 2018/9/20
93
Implementing Intel® Thread Profiler
Intel® Thread Profiler pinpoints performance bottlenecks in threaded applications. [Diagram: Primes.c → Compiler (source instrumentation, /Qopenmp_profile) → Primes.exe (instrumented); Primes.exe → Binary Instrumentation → Primes.exe (instrumented) + DLLs (instrumented) → Runtime Data Collector → Bistro.tp / guide.gvs (result file), analyzed with Intel® Thread Profiler inside the Intel® VTune™ Performance Analyzer.] Introduce Intel Thread Profiler as a tool to address threading performance issues. Intel Thread Profiler pinpoints performance bottlenecks in threaded applications. When you compile a program containing OpenMP directives and use the /Qopenmp option during compilation, the Intel compiler translates your OpenMP directives into code and subroutine calls to the OpenMP Runtime Engine. These calls implement the multithreading capabilities you want. All threads are created automatically, and the distribution of work among threads is also assigned automatically. The Runtime Engine responds to the settings of several environment variables that control how the work is scheduled and assigned to various threads, what output intervals to use, how many threads to use, and so forth. If you use the /Qopenmp_profile option instead, the normal OpenMP Runtime Engine is replaced by an OpenMP statistics-gathering Runtime Engine. Before you begin, you need to link and instrument your application by using calls to the OpenMP statistics-gathering Runtime Engine. The Runtime Engine's calls are required because they collect performance data and write it to a file. Compile your application by using an Intel Compiler and then link your application to the OpenMP Runtime Engine by using the /Qopenmp_profile option. After compiling and linking your application, the following happens when you execute it: your application operates in parallel, and the special OpenMP Runtime Engine creates a .gvs file containing data about the operation of your program. Clarify their queries (if any) related to this slide. Intel Thread Profiler Addressing Threading Performance 2018/9/20
94
Views — Intel® Thread Profiler for OpenMP
The different views available with Intel® Thread Profiler for OpenMP interface are: Summary View Regions View Threads View Data Editor View Introduce the different views available with Intel Thread Profiler for OpenMP interface. In Intel Thread Profiler, there is a different interface when examining OpenMP codes versus analyzing codes that use explicit threads. You can use Intel Thread Profiler to view your application’s data. The VTune Performance Analyzer displays the Intel Thread Profiler performance data in four tabbed views. Each view provides a different perspective on your data and helps you identify regions with a good potential for performance speedup. By using the OpenMP interface of Intel Thread Profiler, you can find the most noticeable performance problem. Clarify their queries (if any) related to this slide. 2018/9/20
95
Summary View of OpenMP Interface
Introduce the Summary View of the OpenMP interface for Intel Thread Profiler. The Summary View of the OpenMP interface is the default view that appears when you run Intel Thread Profiler. This view shows program-wide time distribution along with estimated speedups. Each color in the histogram corresponds to a type of activity. For example, green corresponds to the execution time in parallel regions, red corresponds to load imbalance time between threads, orange corresponds to the time spent with threads that use locks, grey corresponds to time spent in a synchronized state, and yellow corresponds to parallel overhead time. When you run Intel Thread Profiler for the code for finding prime numbers, the Summary and Threads Views of the OpenMP interface display the amount of synchronization time. This is the most prominent performance problem in this code. Clarify their queries (if any) related to this slide. Summary View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
96
Summary View of OpenMP Interface (Contd.)
Speedup Graph Estimates threading speedup and potential speedup based on an Amdahl’s Law computation. Provides upper and lower bound estimates. Introduce and discuss points of the Intel Thread Profiler interface by using results from the code for finding prime numbers. You can see the speedup estimate graph on the right side in the Summary View. The graph uses Amdahl's Law to project an estimate of the speedup that you can theoretically achieve in your program or region. The graph uses data from the actual profile and extrapolates it to higher thread numbers by using Amdahl’s law. The green line represents the upper estimate of the parallel speedup. The red line represents the lower bound of the speedup based on the particular run of Intel Thread Profiler. In the Summary View, you can open multiple Activity results at a time. When you make code adjustments, you can run Intel Thread Profiler for an application each time. Then, you can drag multiple Activity results to compare your application's performance across multiple results that you obtain after each modification. Each histogram represents a different Activity result. You can see whether the changes to source code have the desired effect of improving performance. Summary View displays an aggregated summary of the performance for the entire application. You can use the Regions View to examine the performance characteristics of individual parallel and serial regions within the application. You can display the Regions View by using the Regions tab. Clarify their queries (if any) related to this slide. Summary View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
97
Regions View of OpenMP Interface
[Diagram legend: serial and parallel regions.] Introduce and discuss the Regions View of the OpenMP interface of Intel Thread Profiler. The Regions View displays the performance of each region in all active Activity results. Intel Thread Profiler divides the OpenMP application into the serial and parallel regions in the Regions View. The letter S in the names of the regions in the Regions View denotes the serial regions. In the figure, A0 represents Activity 0, S1 represents the first serial region, and S2 represents the second serial region. The letter R denotes the parallel regions. The parallel regions are interleaved with the serial regions. In codes where you have various parallel regions, the performance problems may be restricted to one or two small subsets of the parallel regions. In such a case, you do not need to modify the entire code. Instead, locate those specific regions of the code that have those performance problems and modify only those regions of the code. Intel Thread Profiler for OpenMP provides this capability through the Regions View. For example, you can find regions that have load imbalance issues, locks, and so on. By using the OpenMP analysis, the obvious question is about the most noticeable performance problem that the tool displays. Another view available in the OpenMP interface is the Threads View. You can view the Threads View by using the Threads tab. Clarify their queries (if any) related to this slide. Regions View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
98
Threads View of OpenMP Interface
Introduce and discuss the Threads View of OpenMP interface for Intel Thread Profiler. The Threads View displays a summary of the performance of each thread. Each bar in the view represents the time used by a specific thread during the execution of your program. The display may be program-wide or a sum of the execution times in a subset of the program regions. Clarify their queries (if any) related to this slide. Threads View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
99
Profile View for Explicit Threads
Continue the introduction and discussion of the Intel Thread Profiler interface by using results from the code for finding prime numbers. By using the OpenMP interface of Intel Thread Profiler, you can find the most noticeable performance problem. However, it does not have the ability to point to specific synchronization objects or code lines within an OpenMP parallel region. Therefore, you need to use Intel Thread Profiler for Explicit Threads to analyze the OpenMP code. When you run Intel Thread Profiler for the code for finding prime numbers, you will find overhead time that did not appear in the OpenMP views. The above figure illustrates the Profile View for Explicit (Windows) Threads of the OpenMP code. Clarify their queries (if any) related to this slide. Timeline View for Intel® Thread Profiler 2018/9/20
100
Why so many transitions?
Transition Source View for Explicit Threads Why so many transitions? Continue the introduction and discussion of the Intel Thread Profiler interface by using results from the code for finding prime numbers. In addition to displaying the overhead time, the Timeline View has some interesting anomalies between active and less active times. If you zoom in and check the Transition Source View, you will see that the active time corresponds to the synchronization. However, the less active time transitions of the critical path can be traced down to the printf statement in the ShowProgress() function. Questions for Discussion: If you are only printing out once for every progress update, why is the critical path bouncing around the two threads so many times? Why is there synchronization in the printf that causes the critical path to bounce back and forth? Clarify their queries (if any) related to this slide. Transition Source View for Intel® Thread Profiler 2018/9/20
101
Back to the design stage.
Tuning for Performance This implementation has implicit synchronization calls. This limits scaling performance due to the resulting context switches. Continue the introduction and discussion of the Intel Thread Profiler interface by using results from the code for finding prime numbers. Clarify their queries (if any) related to this slide. Back to the design stage. 2018/9/20
102
Activity 6 Objectives: Steps Involved:
Use Intel® Thread Profiler to analyze the threaded application. Use Intel Thread Profiler to find any problems that might be hampering the parallel performance. Steps Involved: Use the /Qopenmp_profile option to compile and link the code. Create an Intel Thread Profiler Activity for explicit threads. Run the application from within Intel Thread Profiler. Find the line in the source code that causes the threads to be inactive. Introduction to the Sixth Lab Activity, whose purpose is to run the Intel Thread Profiler analysis on the code for finding prime numbers. Explain to the participants the objectives of the activity. Clarify their queries (if any) related to the objectives. 2018/9/20
103
This change should fix the contention issue.
Fixing the Performance Issues Is that much contention expected? Modified version (prints only at each 10% step):

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Original version:

void ShowProgress( int val, int range )
{
    int percentDone;
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 )
        printf("\b\b\b\b%3d%%", percentDone);
}

Address the performance problem identified in the preceding slides and lab. The test if( percentDone % 10 == 0 ) does NOT print only at every 10th step; it prints much more often. The slide build introduces a fix. Questions for Discussion: Why is the original algorithm not printing as infrequently as intended? Why does the fix correct this problem? Clarify their queries (if any) related to this slide. This change should fix the contention issue. The algorithm has many more updates than the 10 needed for showing the progress of the function. 2018/9/20
104
Speedup is 2.32X ! Is that right?
Design − Fixing the Performance Issues Goal: Eliminate the contention due to implicit synchronization. Speedup is 2.32X ! Is that right? Address the performance problem identified in the preceding slides and lab. The number of prints is more than expected due to the truncation in calculating the integer percentage progress. As an example, if the code is testing 500,000 odd numbers, one percent of the numbers will have been done after 5,000 numbers have been tested. Hence, the progress will not roll over to the next percentage until 5,000 numbers have been tested, and each of those 5,000 tests causes the printf to execute because the truncated percentage stays at the same multiple of 10 for the whole interval. If there are 10 progress updates and each one prints 5,000 times, the total number of times the printf is executed is 50,000. This is much more than the expected 10 prints. The printf function is implemented with mutual exclusion (implicit synchronization) to ensure that the print is completed from one thread before another thread's print is begun, because you are using the thread-safe version of the runtime library. For all of the above, the ShowProgress() function needs to be modified to only print the expected 10 times. Clarify their queries (if any) related to this slide. 2018/9/20
105
Speedup is actually 1.40X (<<1.9X)!
Performance − Corrected Baseline Timing The original baseline measurement included the flawed progress output, so the baseline algorithm must be updated as well. Is this the best you can expect from this algorithm? Address the performance problem identified in the preceding slides and lab. Clarify their queries (if any) related to this slide. 2018/9/20
106
Activity 7 Objective: Steps Involved:
Correct the implicit synchronization problem caused by calling printf more times than necessary. Steps Involved: Modify the ShowProgress() function (both serial and OpenMP versions) to print only the needed output. Recompile and run the code. Ensure that no instrumentation flags are used. What is the speedup from the serial version now?

if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
    printf("\b\b\b\b%3d%%", percentDone);
    lastPercentDone++;
}

Introduction to the Seventh Lab Activity, whose purpose is to introduce the performance fix outlined in the preceding slides. After the changes to the ShowProgress() function are made, the threaded code appears to achieve super-linear speedup. This is because the original serial code is still printing 50,000 times for 500,000 numbers tested; even without synchronization, the overhead of those function calls adds significant time to the serial execution. Lab 7 implements the proposed solution in both serial and threaded code. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objective. 2018/9/20
107
Still have 62% of execution time in locks and synchronization.
Performance Revisited Still have 62% of execution time in locks and synchronization. Show the result of Intel Thread Profiler analysis on the newly-corrected primes code and explain that performance problems persist. Now, re-run Thread Profiler to see what the next most prominent performance problem will be. This turns out to be a very large percentage of synchronization time to overall run time. In the OpenMP analysis views, the printf synchronization appears to have been hidden in the parallel execution time. This is due to the binary instrumentation not looking for explicit synchronization outside of the OpenMP framework. Question for Discussion: Can something be done with the explicit synchronization that is needed in the code? The next few slides attempt to follow some of the advice given when looking at synchronization and how to minimize the use and need for synchronization. The code is modified by using local variables and reducing the amount of code needed to be put in critical regions. Clarify their queries (if any) related to this slide. Summary View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
108
Understanding the Performance Impact
To understand the performance impact, examine the lock protecting the gPrimesFound counter. Look at the OpenMP locks. Proposed change (atomic InterlockedIncrement):

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ){
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

Current code (the lock is in a loop):

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ){
        if( TestForPrime(i) )
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

To understand the performance impact, examine the locks protecting the gPrimesFound and gProgress counters. Clarify their queries (if any) related to this slide. 2018/9/20
109
This lock is also acquired within a loop.
Understanding Performance Impact (Contd.) To understand the performance impact, examine the lock protecting the gProgress counter. Look at the second lock. Proposed change (atomic InterlockedIncrement):

void ShowProgress( int val, int range )
{
    long percentDone, localProgress;
    static int lastPercentDone = 0;
    localProgress = InterlockedIncrement(&gProgress);
    percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Current code:

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

This lock is also acquired within a loop. To understand the performance impact, examine the locks protecting the gPrimesFound and gProgress counters. Clarify their queries (if any) related to this slide. 2018/9/20
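The reason the modified ShowProgress() captures localProgress is that InterlockedIncrement atomically adds one and returns the resulting value, so the thread can compute its percentage from a private copy instead of re-reading the shared counter, which another thread may already have bumped again. A small sketch of that pattern, independent of the course sources (the counter name here is illustrative):

#include <windows.h>

static volatile LONG gCounter = 0;   /* illustrative shared counter, not the lab's gProgress */

/* Atomically adds one and hands back a private snapshot of the new value,
   so later computations never have to re-read the shared counter. */
LONG bump_progress(void)
{
    LONG mine = InterlockedIncrement(&gCounter);  /* no critical section required */
    return mine;
}

Note that the return value is the new (already incremented) value, not the old one that a post-increment expression would have yielded.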
110
Activity 8 Objective: Steps Involved:
Modify the explicit critical regions to reduce contention and limit the amount of time threads spend in each region. Steps Involved: Modify the OpenMP critical regions to use InterlockedIncrement instead. Recompile and run the code. What is the speedup from the serial version now? Introduction to the Eighth Lab Activity, whose purpose is to introduce the code change cited and measure its impact. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objective. 2018/9/20
111
Consider a four thread example.
Examine the Causes of Load Imbalance Consider a four-thread example. Show a summary profile for the performance of the code for finding prime numbers (as modified and improved to this point in the presentation), running on two threads. Run Thread Profiler again. Highlight that after any changes to the threading for performance reasons, you should run the modified code through Intel Thread Checker in order to ensure that no new threading errors have been introduced. If there are errors, then these must be fixed before proceeding to Intel Thread Profiler. The most obvious performance problem is now load imbalance. The algorithmic explanation for the load imbalance is given in the next slide; a solution is proposed in the one after that. The performance analysis of the code for finding prime numbers reveals that there are load imbalance issues in the code. Some threads perform comparatively more work than others or take more time to complete it. The figure clearly displays the load imbalance issue between the two threads. One thread is busy executing throughout, whereas the other thread is idle for some time. Now, consider the same example by using four threads. Clarify their queries (if any) related to this slide. Threads View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
112
Examine Causes of Load Imbalance (Contd.)
[Diagram: workload triangle over the number range, with markers at 250000, 500000, and 750000 — Thread 0: 342 factors to test; Thread 1: 612 factors to test; Thread 2: 789 factors to test; Thread 3: 934 factors to test.] Examine the causes of the load imbalance observed in the profile of the primes code. The stair-step pattern illustrates that each successive thread takes additional time to finish executing its respective task. A triangle illustrates conceptually the nature of the workload, which increases with increasing number range. The precise workload is stated for each thread. It explicitly shows that more steps are required as the code searches for prime numbers among larger numbers. After you identify the load imbalance issue by analyzing the performance with Intel Thread Profiler, you then need to fix the identified load imbalance. Clarify their queries (if any) related to this slide. Threads View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
113
Fix Load Imbalance Issues
To fix the load imbalance, distribute the work more evenly among the threads:

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
    for( int i = start; i <= end; i += 2 ){
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

Speedup achieved is 1.68X. Introduce a method to address the load imbalance inherent in the algorithm for finding prime numbers. To balance the workload between threads, distribute the work more evenly among the threads. The first triangle shows the unequal distribution of work among the threads, while the second triangle shows the workload interleaved to achieve a more even distribution. You can now add an OpenMP schedule clause in the code for finding prime numbers to achieve a more even distribution. This in turn balances the workload among the threads and increases the execution speed of the program. Clarify their queries (if any) related to this slide. 2018/9/20
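A small, self-contained illustration (not from the lab sources) of what schedule(static, 8) does: iterations are grouped into chunks of eight and the chunks are dealt out round-robin, so expensive high-number chunks and cheap low-number chunks end up interleaved across the threads instead of one large contiguous block per thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* With two threads, thread 0 gets iterations 0-7, 16-23, 32-39, ...
       and thread 1 gets 8-15, 24-31, 40-47, ... */
    #pragma omp parallel for schedule(static, 8)
    for (int i = 0; i < 64; i++)
        printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}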
114
Activity 9 Objective: Steps Involved:
Correct the load balance issue between threads by distributing the loop iterations in a better fashion. Steps Involved: Modify the code: add the schedule(static, 8) clause to the OpenMP parallel for pragma. Recompile and run the code. What is the speedup from the serial version now? Introduction to the Ninth Lab Activity, whose purpose is to introduce static OpenMP scheduling and measure its impact. Explain to the participants the objective of the activity. Clarify their queries (if any) related to the objective. 2018/9/20
115
Final Intel® Thread Profiler Run
Speedup achieved is 1.80X. Show the performance profile of the final version of code for finding prime numbers with all corrections and load balancing implemented. Clarify their queries (if any) related to this slide. 2018/9/20
116
Summary View of OpenMP Interface for Intel® Thread Profiler
Comparative Analysis Threading an application requires multiple iterations through the software development cycle. Summarize the results at each step in the performance tuning process and emphasize the iterative nature of the process. You can run your application with different configuration settings, such as different numbers of threads, or by using the throughput mode with the OpenMP Runtime Engine instead of the turnaround mode, and compare the results to measure the performance improvements between the different runs. Clarify their queries (if any) related to this slide. Summary View of OpenMP Interface for Intel® Thread Profiler 2018/9/20
117
Summary You need to identify the region of the code to thread and the time and redesign effort required to thread it. You should also consider the expected speedup and performance after threading, the threading model, and the scalability of the threaded application. You can use the Intel® VTuneTM Performance Analyzer in the analysis phase to find code hotspots by using Sampling and Call-Graph analyses. Ian Foster’s design methodology for parallel programming structures the design process as four distinct stages: partitioning, communication, agglomeration, and mapping. In the data decomposition approach, decompose the data associated with a problem into small pieces of approximately equal size and then associate the computation that is to be performed with the data. Functional decomposition involves dividing the computation into smaller tasks and then associating the corresponding data to each. Summarize all the key points learned in the chapter. 2018/9/20
118
Summary (Continued) In the pipelined decomposition method, individual threads can perform computations in independent stages in a process. The pipelined decomposition can apply to either task or data decomposition. In the design and implementation phase, you can use explicit threading methods such as Windows threads or POSIX threads that use library calls to create, manage, and synchronize threads. You can use Intel® Thread Checker and Intel® Debugger to debug the multithreaded applications for correctness during the debug phase. During the tuning phase, Intel® VTuneTM Performance Analyzer and Intel® Thread Profiler help you tune a multithreaded application. These threading tools help you by making it easy to see and probe the activities of all threads on a system. Summary View, Regions View, Threads View, and Data Editor View are the four views available with Intel Thread Profiler for OpenMP interface. Summarize all the key points learned in the chapter. 2018/9/20
119
Summary (Continued) The Summary View provides a program-wide look at performance results for the selected Activity results. The Threads View displays the performance of each thread in all active Activity results. The Regions View displays the performance of each region in all active Activity results. The Data Editor View enables you to play what-if games with your data. It shows you how changing numbers in your data results affects the Speedup Plot. You can run your application with different configuration settings, such as different numbers of threads, and compare the results to measure the performance improvements between the different runs. Explicit threading is usually best when the amount of work scales with the number of independent tasks, and compiler-directed threading is usually best when the amount of work scales with the amount of data. Summarize all the key points learned in the chapter. 2018/9/20
120
Agenda Course Introduction Multithreaded Programming Concepts
Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the topics to be covered in this chapter. 2018/9/20