Parallel Processing and Multi-Core Programming in the Cloud-Computing Age Prof. Chih-Hung Wu Dept. of Electrical Engineering National University of Kaohsiung Email: johnw@nuk.edu.tw URL: http://www.johnw.idv.tw Multi-Core Programming for Windows. Note: Part of this PPT file is from Intel Software College (Intel.com) 2018/9/19
Agenda Course Introduction Multithreaded Programming Concepts Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the chapters to be covered in this course. 2018/9/19
Agenda Course Introduction Multithreaded Programming Concepts Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the topics to be covered in this chapter. 2018/9/19
Chapter 1 Multithreaded Programming Concepts 2018/9/19
Evolution of Multi-Core Technology Single-Core Processors → Instruction-Level Parallelism → Data-Level Parallelism → Thread-Level Parallelism (Hyper-Threading) → Dual-Core Processors → Multi-Core Processors Explain briefly the evolution of multi-core technology from single-core processors. Intel facilitates parallel computing and enhances the performance of the microprocessor architecture. Parallel computing involves using more than one computer or processor simultaneously to execute a program. In the early days of computing, operating systems could run only one program at a time; they were incapable of performing simultaneous tasks. For example, the system would stop processing when you printed a document. Innovations in the computing environment introduced multitasking, in which the operating system could suspend one program and run another. The short suspension time between programs gives the impression that multiple programs are executing simultaneously. In the past, performance scaling in single-core processors was achieved by increasing the clock frequency. As processors shrink and clock frequencies rise, transistor leakage current increases. This leads to excess power consumption and overheating. In addition, higher clock speeds aggravate memory latency because memory access times have failed to keep pace with increasing clock frequencies. Therefore, continuously increasing the clock speed of large monolithic cores is not a viable long-term strategy. Since 1993, Intel processor designs have supported parallel execution at various levels. The first of these enhancements provided instruction-level parallelism through an out-of-order execution pipeline and multiple functional units to execute instructions in parallel. Data-level parallelism was enabled through vector registers and special instructions that could use these registers. Multimedia Extensions (MMX) were first introduced in 1997 and were quickly followed by Streaming SIMD Extensions (SSE), which have had several enhancements in instructions and hardware support. In 2002, Intel utilized additional copies of execution resources to execute two separate threads simultaneously on the same processor core. This technology was termed Hyper-Threading Technology, which paved the way for the next major step. In 2005, Intel introduced dual-core processor–based platforms that contain two full processing cores in a single chip. This trend will continue until there are several cores in a single chip. Multi-core architectures allow performance to track Moore's Law by addressing the frequency-scaling problems mentioned previously. Clarify their queries (if any) related to this slide. 2018/9/19
What is a Thread? A thread: Is a single sequential flow of control within a program. Is a sequence of instructions that is executed. 2 x Single Thread Running Within a Program 2 x Multiple Threads Running Within a Program Present and explain the concept of a thread. A thread is a sequential flow of control within a program. It is a sequence of instructions that is executed. Every program has at least one thread called the main thread. This thread initializes the program and begins executing the initial instructions. This thread also holds the process information. The main thread can create multiple threads to perform different tasks. However, if the main thread terminates, the process also terminates and destroys any threads that may still be executing. Explain the difference between single-threaded and multithreaded programs. Clarify their queries (if any) related to this slide. 2018/9/19
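The following is a minimal sketch, not taken from the original deck, of how a Windows program might create a second thread alongside the main thread (the Windows-threads chapter covers the API in detail); the function and variable names are illustrative only.

#include <windows.h>
#include <stdio.h>

// Work performed by the worker thread; it runs as its own sequential flow of control.
DWORD WINAPI HelloFromThread(LPVOID param)
{
    printf("Hello from the worker thread\n");
    return 0;
}

int main(void)
{
    // The main thread creates one additional thread within the same process.
    HANDLE hThread = CreateThread(NULL, 0, HelloFromThread, NULL, 0, NULL);

    printf("Hello from the main thread\n");

    // Wait for the worker to finish; if main returned now, the process
    // would terminate and destroy any threads still executing.
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    return 0;
}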
Why Use Threads? Benefits of using threads are: Increased performance Better resource utilization Efficient data sharing Risks of using threads are: Data races Deadlocks Code complexity Portability issues Testing and debugging difficulty Threading can enhance the performance and responsiveness of a program. However, if used improperly, it can lead to degraded performance, unpredictable behavior, and error conditions. Therefore, applications should be threaded based on the need of the program and the execution capabilities of the deployment platform. Present the advantages of using threads: Increased Performance: Using threads increases performance and the ability to use multi-core processors. Threads share data faster because they share the same address space. Multi-core processors increase the computer’s capabilities and resources. Therefore, you can complete more tasks in less time by using multiple threads in multi-core processors. This leads to higher multithreaded throughput. Better Resource Utilization: Threads can help you effectively utilize the hardware resources. They can even reduce latency in single-core processors. For example, if a thread blocks a resource for some external event, such as a memory access or input/output (I/O), having another thread ready to execute will keep the processor busy. Efficient Data Sharing: Distributed memory systems or clusters share data through message-passing methods. These methods involve an active send and receive operation that the participating processes execute. The data is moved into kernel space through a network card across a wire through the receiving machine’s network card. The data is then moved into the kernel memory from where it enters the user memory space of the receiver. To use shared memory to share data, the sending thread writes to a common, agreed-upon location that the receiving thread subsequently reads. You need to coordinate this transfer to ensure that the write operation is executed before the read operation and not vice versa. Present the risks of using threads: There are various risks involved in using threads. Improper use of threads can lead to degraded performance, unpredictable behavior, and errors. There are also some algorithms and computations that are inherently serial. Therefore, not every application should be threaded. Adding threads increases the complexity of the code and thus the maintenance costs. Clarify their queries (if any) related to this slide. 2018/9/19
Multithreaded Program What is a Process? Modern operating systems load programs as processes. To the operating system, processes have two roles: Resource holder Execution (thread) unit Definition: A process has the main thread that initializes the process and begins executing the instructions. 2 x Explain the concept of a process. Modern operating systems load programs as processes. To the operating system, processes have two roles: resource holder and execution unit. Resource holder refers to the job of the process to hold memory, instruction pointers, file pointers, and other system resources assigned to the process. A thread is a single sequential flow of control within a process. Therefore, an execution unit is the thread that processes the program instructions and utilizes the resources. When the process is terminated, all resources are returned to the system. In addition, any active threads that might be running are terminated and the resources assigned to them, such as stack and other local storage, are returned to the system. Clarify their queries (if any) related to this slide. Process Multithreaded Program 2018/9/19
Processes and Threads Relationship of threads with a process: A process has the main thread that initializes the process and begins executing the instructions. Any thread can create other threads within the process. Each thread gets its own stack. All threads within the process share code and data segments. Stack Thread main() Stack Stack Thread 1 Thread N Code Segment Explain the concept of threads in a process. A process has the main thread that initializes the process and begins executing the instructions. Any thread can create other threads within the process. Each thread gets its own stack. All threads within the process share code and data segments. Clarify their queries (if any) related to this slide. Data Segment Threads in a Process 2018/9/19
Processes Vs Threads The following table lists the differences between threads and processes:
Category | Thread | Process
Address space | A thread shares the process address space with other threads. | A process has its own address space, which the operating system protects.
Interaction | A thread interacts with the other threads of the program by using primitives of the concurrent programming language or library within the shared memory of the process. | A process interacts with other processes through operating system primitives and through shared locations in the operating system kernel.
Context switching | Context switching is light, because only the current register state needs to be saved. | Context switching is heavy, because the entire process state must be preserved.
Present the differences between threads and processes. Clarify their queries (if any) related to this slide. 2018/9/19
Concurrency in Multithreaded Applications Definition: Concurrency occurs when two or more threads are in progress simultaneously. Concurrent threads can execute on a single processor. Thread 1 Thread 1 Thread 1 Thread 2 Thread 2 Thread 2 Concurrency in Multithreaded Applications Present the concept of concurrency in multithreaded applications. Concurrency occurs when two or more threads are in progress simultaneously. Concurrent threads can execute on a single processor. For example, the threads in a multithreaded application on a single processor execute concurrently, and the processor switches the execution resources between the threads. Clarify their queries (if any) related to this slide. 2018/9/19
Parallelism in Multithreaded Applications Definition: Parallelism occurs when multiple threads execute simultaneously. Parallel threads can execute on multiple processors. Thread 1 Thread 2 Parallelism in Multithreaded Applications Present the concept of parallelism in multithreaded applications. Parallelism occurs when multiple threads execute simultaneously. Parallel threads can execute on multiple processors. For example, threads in a multithreaded application on a shared-memory multiprocessor execute in parallel, and each thread has its own set of execution resources. Highlight that many professionals use the terms concurrency and parallelism interchangeably. However, concurrent is a superset that includes parallel as a case that requires hardware. Clarify their queries (if any) related to this slide. 2018/9/19
Introduction to Design Concepts The best time for threading while developing an application is during the design phase. The following are the design concepts: Threading for functionality Threading for performance Threading for turnaround Threading for throughput Decomposing the work Task decomposition Data decomposition Introduce the concept and importance of design in multithreaded applications. Multithreaded programming uses threads to concurrently execute multiple operations. It focuses on design, development, and deployment of threads within an application and the coordination between the threads and their respective operations. The best time for threading while developing an application is during the design phase. In this phase, you can accommodate all data and code restructuring related to threading. This helps reduce the effort in the overall development while minimizing any redesign. State the various design concepts. Clarify their queries (if any) related to these objectives. 2018/9/19
Threading for Functionality You can assign different threads to different functions done by the application. With threading for functionality: Chances of function overlapping are rare. It is easier to control the execution of concurrent functions within an application. Dependencies could persist between functions even without direct interference between computations. Describe and define the idea of threading for functionality. To simplify code, you can design it to assign different threads for functions such as a thread each for input, the graphical user interface (GUI), computation, and output. Threading for functionality is the easiest method because the chances of function overlapping are rare. This makes it easier to control execution of the concurrent functions within an application. Threading is easier than switching functions within a serial code. By assigning different threads to different functions, all the functions will be independent of each other. However, there can still be dependencies between functions even if there is no direct interference between computations. Clarify their queries (if any) related to this slide. 2018/9/19
Threading for Functionality – Example Different people involved in building a house are: Bricklayer Carpenter Roofer Plumber Painter Present an example that explains the concept of threading for functionality. Consider a situation where you need to build a house. To complete the job faster, you require several people, each doing smaller and specialized tasks. You may require a bricklayer to build the walls, a carpenter to make the floors, doors, and windows, a roofer to build the roof, a plumber to do the water fittings, and a painter to paint the house. All these people will perform their dedicated task. Questions for Discussion: What kinds of dependencies are there between the tasks that go into building a house? A: There will be dependencies between the workers’ tasks. For example, the roofer cannot start constructing the roof and the painter cannot paint the walls until the bricklayer builds the walls. In addition, you cannot carpet the floor until the carpenter lays the floor. Therefore, by assigning different tasks to different people, all the people will be independent of each other when they do their work. As a result, many tasks can be done in parallel. However, the dependencies will require some scheduling and coordination between tasks. Therefore, dividing the work among all these people helps you build the house faster. Clarify their queries (if any) related to this slide. Example: Building a House 2018/9/19
Example: Postal Service Threading for Performance You can thread a serial code in an application to increase the performance of computations. Example: Different tasks involved in a postal service are: Post office branches Mail sorters Delivery people Long distance transporters Describe and define the idea of threading for performance. For applications that require a large amount of computation, several serial optimization techniques are available that can increase performance. After these optimizations, you should consider threading the code to further improve efficiency and performance. Initially threading the computations may give a speedup boost to performance. If serial optimizations are done after adding threading, the serial optimizations can reduce the amount of computation enough that the threading added originally may become a detriment to performance. Therefore, the overhead of threading may be a much greater proportion of the execution time, such that, removing the threading could boost performance. Threading an application by dividing computations to be run in parallel is known as threading for performance. You can thread code in applications to either improve turnaround or improve throughput. Present an example that explains the concept of threading for performance. The postal service has multiple branches and multiple workers. Specialized workers such as sorters, delivery people, and transporters operate together to deliver millions of letters to people much faster than one person. Clarify their queries (if any) related to this slide. Example: Postal Service 2018/9/19
Example: Setting Up Dinner Table Threading for Turnaround Threading for Turnaround refers to completing a single job in the smallest amount of time possible. Example: Different tasks involved in setting up a dinner table are: One waiter organizes the plates. One waiter folds and places the napkins. One waiter decorates the flowers and candles. One waiter places the utensils, such as spoons, knives, and forks. One waiter places the glasses. You can thread codes in applications to either improve turnaround or improve throughput. Turnaround refers to completing a single job in the smallest amount of time possible. Present an example that explains the concept of threading for turnaround. Consider a real-world scenario where the job is to set up a banquet room. Setting all the tables involves a number of tasks, such as placing plates, glasses, utensils, and flowers on each table. In such situations, one waiter can be assigned to each table. On the other hand, waiters can be specialized. One waiter can do all the plates on all the tables; another could do all the glasses, and so on. In both methods, all the banquet tables could be set up in a fixed amount of time. Clarify their queries (if any) related to this slide. Example: Setting Up Dinner Table 2018/9/19
Example: Setting Up Banquet Tables Threading for Throughput Throughput refers to accomplishing the most tasks in a fixed amount of time. Example: Different tasks involved in setting up banquet tables are: Multiple waiters required. Each waiter performs one specific task for all the tables. Specialized waiters for placing the napkins. Specialized waiters for decorating the flowers and candles. Specialized waiters for placing the utensils, such as spoons, knives, and forks. Specialized waiters for placing the glasses. Throughput refers to accomplishing the most tasks in a fixed amount of time. Present an example that explains the concept of threading for throughput. Consider a real-world scenario where the job is to set up a banquet room. Setting all the tables involves a number of tasks, such as placing plates, glasses, utensils, and flowers on each table. In such situations, one waiter can be assigned to each table. On the other hand, waiters can be specialized. One waiter can do all the plates on all the tables; another could do all the glasses, and so on. In both methods, all the banquet tables could be set up in a fixed amount of time. Clarify their queries (if any) related to this slide. Example: Setting Up Banquet Tables 2018/9/19
Decomposing the Work Logical chunking, or breaking down a program into individual tasks and identifying the dependencies between them, is known as decomposition. The following table lists the decomposition methods, the respective design strategy, and implementation areas:
Decomposition | Design | Comments
Task | Different activities assigned to different threads. | Common in applications with several independent functions.
Data | Multiple threads performing the same operation but on different blocks of data. | Common in audio processing, imaging, and scientific programming.
Describe and define the idea of decomposing the work. Transition from a serial programming model to a parallel programming model requires a modification in the flow of the process. Application developers or programmers should identify those activities that can be executed in parallel. To do this, they need to consider programs as a set of tasks with dependencies between them. Logical chunking, or breaking down a program into individual tasks and identifying the dependencies between them, is known as decomposition. Present the various decomposition methods, the respective design strategy, and their implementation areas. A program can be decomposed in several ways, such as by the number of tasks or by the size of the data. Clarify their queries (if any) related to this slide. 2018/9/19
Task Decomposition Decomposing a program based on the functions that it performs is called task or functional decomposition. Key points to remember about task decomposition are: In multithreaded applications, you can divide the computation based on the natural set of independent tasks. Functional decomposition refers to mapping independent functions to threads that execute asynchronously. Parallel computing requires certain modifications to individual functions to preserve dependencies and to avoid race conditions. Describe and define the idea of task decomposition. In multithreaded applications, you can divide the computation based on the natural set of independent tasks. Decomposing a program based on the functions that it performs is called task or functional decomposition—mapping independent functions to threads that execute asynchronously. Parallel computing requires certain modifications to individual functions to preserve dependencies and to avoid race conditions. Clarify their queries (if any) related to this slide. 2018/9/19
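As a hedged illustration (not from the deck), task decomposition can be expressed with OpenMP sections, which the OpenMP chapter covers later; the two functions below stand in for any pair of independent activities and are purely illustrative.

#include <omp.h>
#include <stdio.h>

// Two independent activities of a hypothetical application.
void update_display(void) { printf("updating the display\n"); }
void compute_results(void) { printf("computing results\n"); }

int main(void)
{
    // Each section is an independent task that the runtime may assign to a different thread.
    #pragma omp parallel sections
    {
        #pragma omp section
        update_display();

        #pragma omp section
        compute_results();
    }
    return 0;
}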
Example: Weeding and Mowing a Lawn Task Decomposition – Example Consider a situation in which you want to weed and mow your lawn. You have two gardeners. You can assign the task to the gardeners based on the type of activity. Present an example that explains the concept of task decomposition. Suppose you want to weed and mow your lawn. You have two gardeners. You can assign the task to the gardeners based on the type of activity. One gardener can mow the lawn, and the other gardener can weed it. You can consider mowing and weeding as two separate tasks of the main function of cleaning the lawn. To get the work done, you need to coordinate the tasks between them. Therefore, ensure that the gardener who will weed the lawn is not sitting in the middle of the lawn that needs to be mowed. Clarify their queries (if any) related to this slide. Example: Weeding and Mowing a Lawn 2018/9/19
Example: Grading the Answer Sheets Data Decomposition Dividing large data sets whose elements can be computed independently and associating the needed computation among threads is known as data decomposition. Key points to remember about data decomposition are: The same independent operation is applied repeatedly to different data. Computation-intensive tasks with a large degree of independence, such as computation-intensive loops in applications, are good candidates for data decomposition. Example: The job is to grade a large stack of answer sheets for a test. Describe and define the idea of data decomposition. In any threaded design, the first area to thread in the code is the most time-consuming area. These time-consuming computations usually involve large data sets. Dividing large data sets whose elements can be computed independently and associating the needed computation among threads is known as data decomposition. The same independent operation is applied repeatedly to different data. Computation-intensive tasks with a large degree of independence, such as computation-intensive loops in applications, are good candidates for data decomposition. Present an example that explains the concept of data decomposition. Assume you are teaching a class that has several hundred students. After all the students have taken the final exam, you need to grade a large stack of test papers. Assume you have several helpers. If you divide the stack of test papers and give a portion to each helper, you have performed data decomposition. In this case, the data is the stack of test papers and the work associated with the data is grading the exams. Clarify their queries (if any) related to this slide. Example: Grading the Answer Sheets 2018/9/19
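A small hedged sketch (not part of the deck) of data decomposition with an OpenMP parallel loop: the same independent operation is applied to different blocks of the array, and each thread receives its own block of iterations; the array name and size are arbitrary.

#include <omp.h>

#define N 100000

double data[N];   // the large data set whose elements are independent

int main(void)
{
    // The loop iterations are divided among the threads; each element is
    // computed without reference to any other element, so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        data[i] = data[i] * 2.0 + 1.0;
    }
    return 0;
}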
Which Decomposition Method to Use? Task decomposition and data decomposition can often be applied to the same problem, depending on the number of available resources and the size of the task. Examples: Data decomposition: In the mural painting example, you can divide the wall into two halves and assign each of the artists one half and all the colors needed to complete the assigned area. Task decomposition: In the final exam grading problem, if the graders take a single key and grade only those exams that correspond to that key, it would be considered task decomposition. Alternatively, if there are different types of questions in the exam, such as multiple choice, true/false, and essay, the job of grading could be divided based on the tasks to specialists in each of those question types. Based on the constraints of the system, it is extremely important to decide the type of decomposition you will use. The same problem can be used as an example for both decomposition methods. Task decomposition and data decomposition can often be applied to the same problem. They are not mutually exclusive. It may just be a matter of perspective. Depending on the number of available resources and the size of the task, the data or the task can be decomposed. In the final exam grading problem, to cut down on cheating, four variations of the test are created. In such a case, is data decomposition or task decomposition the best solution? The choice of the design strategy depends on the scenario. If each grader has all four keys and knows how to determine which key should be used, this is still data decomposition. However, if the graders take a single key and grade only those exams that correspond to that key, it would be considered task decomposition. Alternatively, in the above situation, if there are different types of questions in the exam, such as multiple choice, true/false, and essay, the job of grading could be divided based on the tasks to specialists in each of those question types. Different decompositions have different advantages. In situations where you can easily identify and divide the independent tasks by functionality, task decomposition becomes the strategy. However, data decomposition is beneficial where you can easily segregate the data or you have reason to believe that the application may be required to handle larger data sets as more cores become available. The main reason for threading codes is to enhance the performance of the application. Sometimes, the choice of the decomposition strategy becomes difficult. In such cases, the problem domain dictates the choice. The choice between data or task decomposition is dictated by the problem domain. If the amount of work scales with the number of independent tasks, the problem is probably best suited to task decomposition. If the amount of work scales with the amount of independent data, data decomposition is probably best. Choosing the most appropriate decomposition strategy for an application can make effective parallelization much easier. Clarify their queries (if any) related to this slide. 2018/9/19
Introduction to Correctness Concepts Using threads helps you enhance performance by allowing you to run two or more concurrent activities. Correctness Concepts: Critical regions Mutual exclusion Synchronization Introduce the concept and importance of correctness in multithreaded applications. Threads can help you enhance the performance of applications. With multithreading, you can run two or more concurrent activities. However, threads can make code more complex. These complexities arise from concurrent activities that run in parallel. Managing simultaneous activities and their possible interactions can lead to problems related to synchronization and communication. Clarify their queries (if any) related to these objectives. 2018/9/19
Race Conditions Race conditions: Are the most common errors in concurrent programs. Occur because the programmer assumes a particular order of execution but does not guarantee that order through synchronization. A Data Race: Refers to a storage conflict situation. Occurs when two or more threads simultaneously access the same memory location while at least one thread is updating that location. Results in two possible conflicts: Read/Write conflicts Write/Write conflicts Define the concept of data races between threads. Race conditions are the most common errors in concurrent programs. They occur because the programmer assumes a particular order of execution but does not guarantee that order through synchronization. A data race (or storage conflict) occurs when two or more threads simultaneously access the same memory location while at least one thread is updating that location. The two possible conflicts that can arise are: Read/Write Conflicts Write/Write Conflicts. In writing multithreaded programs, understanding which data are shared and which are private becomes important for performance and program correctness. Data races can lead to computations that generate incorrect results. Present an example of race conditions: Consider a group of children playing musical chairs. The order in which multiple people land on the same chair determines who continues and who is eliminated. In the case of memory locations, the value of a location that multiple threads tried to write would be whatever value was written last. For correct results, there is an assumption about the order in which threads have access to shared resources. Usually, this is related to the original serial execution order. However, because of the nondeterministic scheduling of threads for execution, a race condition results among the threads if nothing has been done to guarantee that the desired order is preserved. Clarify their queries (if any) related to this slide. 2018/9/19
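A minimal sketch, assuming an OpenMP build, of the write/write conflict described above: two or more threads increment the same shared counter without synchronization, so the final value depends on thread scheduling; the counter and loop bound are illustrative.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;   // shared by all threads

    // Data race: counter++ is a read-modify-write sequence, so two threads can
    // read the same old value and one of the updates is lost.
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        counter++;
    }

    // Frequently prints less than 1000000, and the result can change from run to run.
    printf("counter = %d\n", counter);
    return 0;
}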
Critical Region Critical Regions: Are portions of code that access (read and write) shared variables. Must be protected to ensure data integrity when multiple threads attempt to access shared resources. Define the concept of critical regions and Critical Section within a threaded application. Order of execution and order of variable access is well defined in a serial application. Order of execution between multiple threads depends on the operating system scheduling and may change between different runs. Therefore, race conditions result from assuming an order of execution between threads. However, this does not guarantee the expected order. Therefore, the biggest challenge in writing multithreaded application lies in ensuring that in a real-world environment, threads act in a predictable manner. This avoids potential problems of deadlocks and data corruption due to race conditions. Critical regions are portions of code that access shared variables. Accessing shared variables involves both reading and writing the variables. You need to protect critical regions to ensure data integrity when multiple threads attempt to access shared resources. Clarify their queries (if any) related to this slide. 2018/9/19
Mutual Exclusion Mutual exclusion refers to the program logic used to ensure single-thread access to a critical region. Describe and define the concept of mutual exclusion between threads within a threaded application. Only one thread should ever be executing in a critical region at any time. This criterion is enforced by mutual exclusion. Mutual exclusion refers to the program logic used to ensure single-thread access to a critical region. When a thread is executing code that accesses a shared resource in a critical region, any other thread that might desire entry to the critical region must wait to access that region. Present an example that explains the concept of mutual exclusion. Consider an ATM machine. There are a number of people who need to access the ATM machine to withdraw money, but only one person at a time is allowed to access it. The first person in the queue accesses the ATM machine. Until that person finishes using the machine, everyone else in the queue waits outside. Therefore, mutual exclusion is the logic that protects access to resources inside the critical region. It excludes the possibility of multiple threads concurrently accessing the critical region. Clarify their queries (if any) related to this slide. Example: ATM Machine 2018/9/19
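A hedged sketch (not from the deck) of mutual exclusion with a Windows CRITICAL_SECTION object: only one thread at a time may execute the increment between Enter and Leave, so the race from the earlier counter example disappears; names and counts are illustrative.

#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION cs;   // synchronization object guarding the critical region
LONG counter = 0;      // shared resource

DWORD WINAPI Worker(LPVOID param)
{
    for (int i = 0; i < 100000; i++) {
        EnterCriticalSection(&cs);   // only one thread at a time passes this point
        counter++;                   // the critical region
        LeaveCriticalSection(&cs);   // release so a waiting thread may enter
    }
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&cs);

    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);

    printf("counter = %ld\n", counter);   // reliably 200000 with the lock in place
    DeleteCriticalSection(&cs);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
    return 0;
}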
Synchronization Synchronization: Is the process of implementing mutual exclusion to coordinate access to shared resources in multithreaded applications. Controls the relative order of thread execution and resolves any conflicts among threads that might produce unwanted behavior. Example: Library Introduce and describe the general idea of an object that can be used to synchronize or coordinate execution of threads with each other. Implementing mutual exclusion to coordinate access to shared resources is known as synchronization. You can use synchronization to control the relative order of thread execution and resolve any conflicts among threads. The most common way to implement mutual exclusion is to have one thread hold a synchronization object. Other threads waiting to enter a critical region must wait for the thread holding the object to exit the critical region and release the object. Another useful synchronization method is known as event-based synchronization, in which a thread is blocked until a particular program event occurs. For example, threads can be made to wait until another thread loads the data needed to do a computation. There are several types of synchronization objects. You can synchronize threads using read/write locks, semaphores, mutexes, condition variables, events, barriers, and Critical Sections. Which object to use depends on the type of synchronization needed and the objects available within the threading model being used. Present an example that explains the concept of synchronization. Consider how books are managed in a library. Suppose a member has checked out a book. Many libraries keep a list of other members who want the same book. They have to wait until the book is returned. Questions for Discussion: Once the book is returned to the library, who will get to check it out next? A: It depends on the policy of the library. It could be the next person on the list, someone with higher priority who is moved to the top of the list, or the next patron who asks for the book before anyone on the list has been contacted. Clarify their queries (if any) related to this slide. 2018/9/19
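As an illustration of the event-based synchronization mentioned in the notes, here is a minimal sketch using a Windows event object: one thread blocks until another thread signals that the data it needs has been loaded; the variable names and the value 42 are purely illustrative.

#include <windows.h>
#include <stdio.h>

HANDLE dataReady;      // auto-reset event used for signaling
int sharedData = 0;    // data produced by one thread and consumed by another

DWORD WINAPI Consumer(LPVOID param)
{
    // Block until the producer signals that sharedData has been written.
    WaitForSingleObject(dataReady, INFINITE);
    printf("consumer read %d\n", sharedData);
    return 0;
}

int main(void)
{
    dataReady = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset, initially not signaled

    HANDLE t = CreateThread(NULL, 0, Consumer, NULL, 0, NULL);

    sharedData = 42;       // "load the data needed to do a computation"
    SetEvent(dataReady);   // wake the waiting thread

    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    CloseHandle(dataReady);
    return 0;
}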
Barrier Synchronization Barrier Synchronization Is used when all threads must finish a portion of the code before proceeding to the next section of code. Is usually done to ensure that all updates in the section of code prior to the barrier have completed before the threads begin execution of the code past the barrier. Define and introduce the concept of barrier synchronization. Barrier synchronization is used when all threads must finish a portion of the code before proceeding to the next section of code. This is usually done to ensure that all updates in the section of code prior to the barrier have completed before the threads begin execution of the code past the barrier. Present an example that explains the concept of barrier synchronization. In a race, all the participants first meet at the start line and then proceed. The start line in a race can be considered as the barrier where all the participants meet and then the process continues further after all have arrived. Clarify their queries (if any) related to this slide. Example: Race 2018/9/19
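A small hedged sketch (not from the deck) of barrier behavior in OpenMP: the implicit barrier at the end of the first worksharing loop guarantees that every element of a[] has been written before any thread begins the second loop that reads it; array names and sizes are illustrative, and an explicit #pragma omp barrier could be placed between arbitrary code sections instead.

#include <omp.h>

#define N 10000
double a[N], b[N];

int main(void)
{
    #pragma omp parallel
    {
        // Phase 1: every thread updates its share of a[].
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5;

        // Implicit barrier here: no thread starts phase 2 until all of a[] is written.

        // Phase 2: each thread may now safely read any element of a[].
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = a[i] + a[N - 1 - i];
    }
    return 0;
}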
Deadlocks Deadlock: Occurs when a thread waits for a condition that never occurs. Most commonly results from the competition between threads for system resources held by other threads. The four necessary conditions for a deadlock are: Mutual exclusion condition Hold and wait condition No preemption condition Circular wait condition Introduce the concept of deadlocks. Deadlock occurs when a thread waits for a condition that never occurs. This problem most commonly results from the competition between threads for system resources that are already held by other threads. That is, a deadlock occurs whenever a thread is blocked waiting for a resource held by another thread and the second thread is blocked waiting for a resource held by the first thread. The desired resource will never become available because neither thread can release the resource it already holds. Due to this, the threads involved cannot proceed. Present an example that explains the concept of deadlocks. Consider a traffic jam at an intersection where you are unable to turn your car. Assume no driver is willing to back up, and trees, mailboxes, lampposts, or other obstacles on the corners prevent the cars from turning around. All routes are blocked and there is no way to proceed. This kind of deadlock is usually known as gridlock. The four necessary conditions for a deadlock, also known as the Coffman conditions as listed in a 1971 article by E. G. Coffman, are: Mutual exclusion condition: A condition where a resource is either assigned to one thread or is available. Hold and wait condition: A condition where threads already holding resources may request new resources. No preemption condition: A condition where only the thread holding a resource may release it. Circular wait condition: A condition where two or more threads form a circular chain in which each thread waits for a resource that the next thread in the chain holds. Clarify their queries (if any) related to this slide. Example: Traffic Jam 2018/9/19
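A hedged sketch (not from the deck) of the circular-wait condition with two Windows critical sections: each thread holds one lock and waits for the other, so neither can proceed; lock and thread names are illustrative. Acquiring the locks in the same agreed-upon order in every thread removes the deadlock.

#include <windows.h>

CRITICAL_SECTION lockA, lockB;

DWORD WINAPI Thread1(LPVOID param)
{
    EnterCriticalSection(&lockA);   // holds A ...
    Sleep(10);                      // give the other thread time to grab B
    EnterCriticalSection(&lockB);   // ... and waits for B
    LeaveCriticalSection(&lockB);
    LeaveCriticalSection(&lockA);
    return 0;
}

DWORD WINAPI Thread2(LPVOID param)
{
    EnterCriticalSection(&lockB);   // holds B ...
    Sleep(10);
    EnterCriticalSection(&lockA);   // ... and waits for A: circular wait, deadlock
    LeaveCriticalSection(&lockA);
    LeaveCriticalSection(&lockB);
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&lockA);
    InitializeCriticalSection(&lockB);

    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, Thread1, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, Thread2, NULL, 0, NULL);

    WaitForMultipleObjects(2, t, TRUE, INFINITE);   // with the deadlock, this never returns
    return 0;
}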
Example: Robin Hood and Little John Livelock Livelock refers to a situation when: A thread does not progress on assigned computations, but the thread is not blocked or waiting. Threads try to overcome an obstacle presented by another thread that is doing the same thing. Introduce the concept of livelocks. A livelock refers to a situation when a thread does not progress on assigned computations, but the thread is not blocked or waiting. The threads in a livelock situation try to overcome an obstacle presented by another thread that is doing the same thing. Present an example that explains the concept of livelocks. Consider a situation where you meet someone traveling in the opposite direction in a hallway that is too narrow to accommodate both of you abreast. Both of you will move back and forth trying to squeeze by the other. Both of you continue moving, but neither of you is making any progress. The story where Robin Hood meets Little John on a log bridge is another example from literature. Clarify their queries (if any) related to this slide. Example: Robin Hood and Little John 2018/9/19
Introduction to Performance Concepts Simple Speedup Computing Speedup Efficiency Granularity Load Balance Introduce the concept and importance of performance in multithreaded applications. Clarify their queries (if any) related to these objectives. 2018/9/19
Simple Speedup Speedup measures the time required for a parallel program to execute versus the time the best serial code requires to accomplish the same task. Speedup = Serial Time / Parallel Time According to Amdahl's law, speedup is a function of the fraction of a program that is parallel and of how much that fraction is accelerated. Speedup = 1 / [S + (1 - S)/n + H(n)] Define the concept of speedup. To quantitatively determine the performance benefit of parallel computing, you can compare the elapsed run time of the best serial algorithm with the elapsed run time of the parallel program. This ratio is known as the speedup, and it measures the time required for a parallel program to execute versus the time the best serial code requires to accomplish the same task. Therefore, Speedup = Serial Time / Parallel Time. To determine the theoretical limit on the performance gained by increasing the number of processor cores and threads in an application, Gene Amdahl in 1967 examined the maximum theoretical performance benefit of a parallel solution relative to the best-case performance of a serial solution. According to Amdahl's Law, speedup is a function of the fraction of a program that is parallel and of how much that fraction is accelerated. Therefore, Speedup = 1 / [S + (1 - S)/n + H(n)], where S is the fraction of time spent executing the serial portion of the parallelized version, n is the number of processors, and H(n) is the parallel overhead. The numerator in the equation assumes that the program takes one unit of time to execute the best sequential algorithm. Clarify their queries (if any) related to this slide. 2018/9/19
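The formula above can be checked with a few lines of code; the helper below is a hedged sketch (the function name and sample values are illustrative) that evaluates Amdahl's expression for the fence-painting example used on the next slides, where 60 of the 360 minutes are serial, so S = 1/6.

#include <stdio.h>

// Speedup = 1 / (S + (1 - S)/n + H(n)), with S the serial fraction,
// n the number of processors, and h the parallel overhead H(n).
double amdahl_speedup(double S, int n, double h)
{
    return 1.0 / (S + (1.0 - S) / n + h);
}

int main(void)
{
    // Ignoring overhead: 2 painters give about 1.7X, and the limit is 6.0X.
    printf("%.1fX\n", amdahl_speedup(1.0 / 6.0, 2, 0.0));        // prints 1.7X
    printf("%.1fX\n", amdahl_speedup(1.0 / 6.0, 1000000, 0.0));  // approaches 6.0X
    return 0;
}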
Example: Painting a Picket Fence Computing Speedup – Example Painting a picket fence requires: 30 minutes of preparation (serial). One minute to paint a single picket. 30 minutes to clean up (serial). Present an example that explains the concept of speedup. Consider painting a fence. Suppose it takes 30 minutes to get ready to paint and 30 minutes for the cleanup process after painting. For this illustration, assume that only a single person can do both of these tasks. Serial preparation time would include things such as getting tarps laid out, shaking and opening paint cans, and getting brushes out. Serial cleanup time would involve putting things away or washing. The fence used in the example can be any kind of fence that has small atomic units such as pickets, bricks, slats, or stones. Assume that it takes 1 minute to paint one single picket (or other unit). It will take one person 360 minutes to paint an entire fence of 300 pickets. This time includes the preparation, painting, and cleanup time. Clarify their queries (if any) related to this slide. Example: Painting a Picket Fence 2018/9/19
Computing Speedup Consider how speedup is computed for different numbers of painters:
Number of Painters | Time | Speedup
1 | 30 + 300 + 30 = 360 | 1.0X
2 | 30 + 150 + 30 = 210 | 1.7X
10 | 30 + 30 + 30 = 90 | 4.0X
100 | 30 + 3 + 30 = 63 | 5.7X
Infinite | 30 + 0 + 30 = 60 | 6.0X
Show how speedup is computed in the fence-painting example for various numbers of painters. Now, consider how speedup is computed in the fence-painting example for different numbers of painters. The table shows the computed speedups for different numbers of painting crew members. The overhead time needed to coordinate activities between painters (getting painters to their assigned portions of the fence, ensuring that there is no physical interference from putting too many people into a small area, and other activities) has been ignored. Infinite painters is the theoretical maximum and means that the time to paint the fence can be as close to zero as desired. This example shows how parallel speedup is ultimately limited by the amount of serial execution. Questions for Discussion: What if you use a spray gun to paint the fence? In such a case, what happens if the fence owner uses a spray gun to paint 300 pickets in 1 hour? A: The spray gun introduces a better serial algorithm. Future speedup calculations must then use this serial algorithm timing because no one would go back to an inferior serial algorithm when given the chance to use something better. The new serial time with the spray gun is then 120 minutes (30 + 60 + 30) with the same serial setup and cleanup time. If no spray guns are available for multiple workers, what is the maximum parallel speedup? A: The assumption in the example is that the spray gun cannot be used if there is more than one painter. Therefore, there is no advantage from the spray gun in the parallel case. The theoretical maximum speedup in light of the spray gun is 2.0X with an infinite number of painters (120 / 60 = 2.0). Clarify their queries (if any) related to this slide. 2018/9/19
Parallel Efficiency Parallel Efficiency: Is a measure of how efficiently processor resources are used during parallel computations. Is equal to (Speedup / Number of Threads) * 100%. Consider how efficiency is computed with different numbers of painters:
Number of Painters | Time | Speedup | Efficiency
1 | 30 + 300 + 30 = 360 | 1.0X | 100%
2 | 30 + 150 + 30 = 210 | 1.7X | 85%
10 | 30 + 30 + 30 = 90 | 4.0X | 40%
100 | 30 + 3 + 30 = 63 | 5.7X | 5.7%
Infinite | 30 + 0 + 30 = 60 | 6.0X | very low
Define and show how to compute the efficiency metric for parallel processors (threads). Parallel efficiency is a measure of how efficiently processor resources are used during parallel computations. It is expressed as a percentage: Efficiency = (Speedup / Number of Threads) * 100%. Low efficiency may prompt the user to run the application on fewer processors and free up resources to run something else, maybe another threaded process or other users' codes. Present an example that explains the concept of efficiency: Consider the example of painting the picket fence by multiple painters. Assume that you knew that all painters were only busy for an average of less than 6 percent of the entire job time but are still getting paid for the whole time that the job was being conducted. Would you feel you were getting your money's worth from the 100 painters? Would you want to hire a smaller crew that was busier for longer time periods? The table above gives the computed efficiencies of several different numbers of painting crew members. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Field and Farmers Granularity Definition: An approximation of the ratio of computation to synchronization. The two types of granularity are: Coarse-grained: Concurrent calculations that have a large amount of computation between synchronization operations are known as coarse-grained. Fine-grained: Cases where there is very little computation between synchronization events are known as fine-grained. Describe and define the concept of granularity. Granularity is more difficult to quantify than parallel speedup or efficiency. It has been defined as the ratio of computation to synchronization. Concurrent calculations that have a large amount of computation between synchronization operations are known as coarse-grained. Cases where there is very little computation between synchronization events are known as fine-grained. Synchronization, by definition, serializes execution. Therefore, programmers should strive for more coarse-grained activity in threads. Present an example that explains the concept of granularity. Consider a field divided between two farmers. Each farmer can harvest his half of the field without much synchronization (coordination) with the other farmer. Therefore, the speedup in harvesting the entire field should be about 2.0X. If two farmers can harvest the field in half the time, what about using five workers? What about 20? Or 100? How many more farmers can be added so that there is enough work for each farmer and most of the time spent will be on harvesting and not trying to avoid getting hit by tractors or injured by the sharp tools wielded by other farmers? The harvest illustration shows that work cannot be divided indefinitely. The problem is initially coarse-grained but becomes successively finer as the field is further divided. At some point, the partitions become so small that the farmers get in each others’ way. The overhead of synchronizing so many farmers begins to outweigh the benefit of parallelism. Clarify their queries (if any) related to this slide. Example: Field and Farmers 2018/9/19
Example: Cleaning Banquet Tables Load Balance Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work. Most effective distribution is such that: Threads perform equal amounts of work. Threads that finish first sit idle. Threads finish work close to the same time. Define and illustrate the idea of load balance between threads. Another challenge in writing efficient multithreaded applications is balancing the workload among multiple threads. The most effective distribution is to have equal amounts of work per thread in a way that the threads should finish close to the same time. If more work is assigned to some threads than to other threads, the threads with less computation will sit idle waiting for the threads that have more to be accomplished. Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work. Present an example that explains the concept of load balance. Consider the situation where the task is to clean the banquet tables. If the load is not balanced, the waiter with more work will take more time. Other waiters with less work need to wait for the first waiter to complete his work. As a result, they remain idle. Therefore, to balance the workload, it is better to assign the same number of tables to each person. Even with the same number of tables assigned, there may be more work to be done on some tables. For example, some tables may have had only a few diners, or other groups were extremely messy and cleaning up will require more effort. Balancing the workload among the threads in a multithreaded application implies that you should try and keep all the cores busy to get maximum performance. Large numbers of equal-sized tasks can be divided and distributed to balance the load better. However, you should be careful that the work is divided equally, not just the number of tasks, which may require different amounts of work. Clarify their queries (if any) related to this slide. Example: Cleaning Banquet Tables 2018/9/19
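A hedged sketch (not from the deck) of one way to improve load balance when tasks differ in size, using OpenMP's dynamic schedule: a thread that finishes a quick "table" immediately grabs the next one instead of sitting idle; the clean_table routine and the table count are purely illustrative stand-ins for work of varying cost.

#include <omp.h>

#define NUM_TABLES 64

// Stand-in for a task whose cost varies from table to table.
int clean_table(int table)
{
    volatile int work = 0;
    for (int i = 0; i < (table % 7 + 1) * 100000; i++)
        work += i;
    return work;
}

int main(void)
{
    // schedule(dynamic, 1): tables are handed out one at a time, so a thread
    // that draws a messy table does not leave the other threads idle at the end.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < NUM_TABLES; t++) {
        clean_table(t);
    }
    return 0;
}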
Summary A thread is a discrete sequence of related instructions that is executed independently. It is a single sequential flow of control within a program. The benefits of using threads are increased performance, better resource utilization, and efficient data sharing. The risks of using threads are data races, deadlocks, code complexity, portability issues, and testing and debugging difficulty. Every process has at least one thread, which is the main thread that initializes the process and begins executing the initial instructions. All threads within a process share code and data segments. Concurrent threads can execute on a single processor. Parallelism requires multiple processors. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) Turnaround refers to completing a single task in the smallest amount of time possible, whereas accomplishing the most tasks in a fixed amount of time refers to throughput. Decomposing a program based on the number and type of functions that it performs is called functional decomposition. Dividing large data sets whose elements can be computed independently, and associating the required computation among threads, is known as data decomposition in multithreaded applications. Applications that scale with the number of independent functions are probably best suited to functional decomposition, while applications that scale with the amount of independent data are probably best suited to data decomposition. Race conditions occur because the programmer assumes a particular order of execution but does not use synchronization to guarantee that order. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) Storage conflicts can occur when multiple threads attempt to simultaneously update the same memory location or variable. Critical regions are parts of threaded code that access (read or write) shared data resources. To ensure data integrity when multiple threads attempt to access shared resources, critical regions must be protected so that only one thread executes within them at a time. Mutual exclusion refers to the program logic used to ensure single-thread access to a critical region. Barrier synchronization is used when all threads must have completed a portion of the code before proceeding to the next section of code. Deadlock refers to a situation when a thread waits for an event that never occurs. This is usually the result of two threads requiring access to resources held by the other. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) Livelock refers to a situation when threads are not making progress on assigned computations, but are not idle waiting for an event. Speedup is the metric that characterizes how much faster the parallel computation executes relative to the best serial code. Parallel Efficiency is a measure of how busy the threads are during parallel computations. Granularity is defined as the ratio of computation to synchronization. Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work. Summarize all the key points learned in the chapter. 2018/9/19
Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Chapter 2 Tools Foundation – Intel Compiler and Intel VTune Performance Analyzer 2018/9/19
Introduction to Intel Tools - Foundation Two of the tools that Intel offers to make threading easier and more comprehensive are: Intel® Compiler: Accelerates application performance. Supports multithreading capabilities in multi-core and multiprocessor systems. Intel® VTune™ Performance Analyzer: Identifies sections where an application can be threaded. Monitors performance of the application and the computer. Helps you tune your application. Introduce the two important Intel tools, the Intel® compiler and the Intel® VTune Performance Analyzer, by providing definitions. Explain that the Intel compiler has the following features: Accelerates application performance. Supports multithreading capabilities in multi-core and multiprocessor systems. Explain that the Intel VTune Performance Analyzer has the following features: Identifies sections where an application can be threaded. Monitors performance of the application and the computer. Helps tune the application. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Intel® Compilers Intel compilers: Support IA-32, Intel® Core processor, and Intel® Itanium® architectures. Can be used with Microsoft Visual C++, GNU C++, and Compaq Visual Fortran. Offer advanced technology to optimize applications. Provide support for threaded applications. Are compatible with industry processes and standards such as the C++ application binary interface. Allow you to compile applications for both 32-bit and 64-bit environments. Are compatible with Intel® multi-core processors. Define the Intel compiler. Explain the features of the Intel compilers: Intel offers compilers that run on various operating systems, such as Microsoft Windows and Linux. The compilers support architectures such as IA-32 and Intel® Itanium®, as well as systems with Intel® EM64T. You can use these compilers with multiple programming languages, and they are compatible with Microsoft Visual C++, GNU C++, and Compaq Visual Fortran. These compilers offer advanced technology to optimize applications and provide support for threaded applications. These compilers are compatible with industry processes and standards such as the C++ application binary interface. You can use these compilers in leading software development environments, such as Microsoft Visual Studio and the Eclipse IDE. The compilers allow you to compile applications for both 32-bit and 64-bit environments and are compatible with Intel® multi-core processors. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Optimization Switches The Intel® compiler supports certain switches that enable optimization, as shown in the table:
Windows | Linux/Mac | Description
/Od | -O0 | Disables optimizations
/Zi | -g | Creates symbols and provides symbol information
/O1 | -O1 | Optimizes for size of binary code (server code)
/O2 | -O2 | Optimizes for speed (default)
/QaxP | -axP | Optimizes for Intel processors with SSE3 capabilities
/O3 with /QaxP | -O3 with -axP | Optimizes for data cache: loopy and floating-point code
Introduce the optimization switches by stating that the Intel compiler supports specific switches that enable optimization. Emphasize the point that this session covers coarse-grain switches that govern general compiler behavior and advanced multiple-phase optimizations. Explain the difference between the coarse-grain and fine-grain switches. Explain the optimization switches given in the table. Discuss the various misconceptions about the optimization switches: /O1 is used to optimize for speed while keeping the size of the binary code in check. /O1 turns off inline expansion of functions and other optimizations that could increase the size of the binary code. The /O2 switch is the default compiler switch for speed optimization. /O2 offers the advantage of performing inline expansion of some user functions. The /O3 and /QaxP switches are used together for code that has double- or triple-nested loops. The /O3 switch performs loop optimizations such as loop interchange. A misconception about the /O3 switch is that it always increases the speed of applications. Some applications may run faster with /O1, some with /O2, and some with /O3. You may not need the /O3 switch if you are developing server and database applications; it is better suited to computational loops containing floating-point operations. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Advanced Optimization Options Some of the advanced optimization features of Intel® compilers are as follows: Profile-Guided Optimizations Inter-Procedural Optimizations Compiler-Based Vectorization List out the advanced optimization features of Intel compilers one by one: Profile-Guided Optimizations Inter-Procedural Optimizations Compiler-Based Vectorization State that each of the three features will be taken up in detail in the subsequent topics. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Profile-Guided Optimizations Definition: Profile-guided optimization (PGO) is used for analyzing application workload at run time. PGO has the following features: Detects and provides opportunities to optimize the performance of the application. Can be used for optimizing function ordering, reorganizing code layout for improved instruction cache memory utilization, and accurate branch predictions. Helps in ordering the sequence of cases in a C/C++ switch statement. PGO is a family of switches. Define profile-guided optimization. Discuss its features: Detects and provides opportunities to optimize the performance of the application. Can be used for optimizing function ordering, reorganizing code layout for improved instruction cache memory utilization, and accurate branch predictions. Helps in ordering the sequence of cases in a C/C++ switch statement. Clarify queries (if any) of the participants related to this slide. 2018/9/19
PGO Process PGO is a three-step process, which involves dynamic analysis. [Figure: Step 1: compile the source files with prof_gen to produce an instrumented executable. Step 2: run the instrumented executable to produce files with dynamic run-time information. Step 3: recompile the source files together with the dynamic run-time information files using prof_use to produce a file with dynamic profile information.] Introduce the slide by stating that PGO is a three-step process, which involves dynamic analysis. The execution-time characteristics of the application are recorded, and this information is later fed to other compiler optimization phases during recompilation. As a result, the run-time behavior is encoded in the optimizations. To use PGO, you must use a family of profile-guided switches at different times in the development process. First, you must compile the code by using the prof_gen switch to generate an instrumented binary. The prof_gen switch instruments the code to collect run-time statistics. As a result, an instrumented executable file is created. The instrumentation keeps count of the number of times a code block is entered and exited. However, this may cause the application to run five times slower than an uninstrumented application. Based on these counts, PGO determines the code lines that are hot, or often used, and cold, or seldom used. Code instrumented with PGO should be run against one or more representative workloads. This step creates a dynamic data (.dyn) file. Every time you run the instrumented executable file, you obtain a new .dyn file with a different name. These are computer-generated names. The recommended procedure is to run the binary code several times with as many representative workloads as required. For each run, and therefore each workload, a new .dyn file is created. In the last step, PGO merges all the .dyn information files into a single file by using the prof_use compiler switch. This switch also creates the final optimized binary application. The .dyn files are merged to form a dynamic profile information (.dpi) file. This file is used by the compiler to modify and rearrange the code to optimize execution, based on the information gathered from the runs executed in the second step. Clarify queries (if any) of the participants related to this slide. PGO Process 2018/9/19
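The following is a minimal sketch of the three-step PGO workflow, assuming the icl/icc driver names; the slides use the prof_gen and prof_use switch names, and the exact spelling (for example /Qprof_gen versus /Qprof-gen) depends on the compiler version, so treat the build lines as illustrative. The program itself is a hypothetical branchy workload whose execution counts give PGO something to learn from.
/* Hypothetical three-step PGO workflow:
 *   Step 1: compile instrumented      icl /Qprof_gen pgo_demo.c
 *   Step 2: run representative load   pgo_demo.exe   (produces .dyn files)
 *   Step 3: recompile with feedback   icl /Qprof_use pgo_demo.c
 */
#include <stdio.h>
#include <stdlib.h>

/* A branchy function: the profile tells the compiler which path is hot. */
static int classify(int value)
{
    if (value % 97 == 0)        /* cold path for most workloads */
        return -1;
    return value % 4;           /* hot path */
}

int main(void)
{
    long histogram[5] = {0};
    int i;

    /* Representative workload: the counts recorded here drive block
     * ordering, inlining, and switch-case ordering during step 3. */
    for (i = 0; i < 1000000; i++) {
        int c = classify(rand());
        histogram[c + 1]++;
    }

    for (i = 0; i < 5; i++)
        printf("bucket %d: %ld\n", i - 1, histogram[i]);
    return 0;
}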
PGO Process (Continued) PGO performs the following optimizations: Basic block ordering Better register allocation Better decision of functions to inline Function ordering Switch-statement optimization Better vectorization decisions PGO uses the prof_gen switch for optimization. Additional functionality is provided by prof_genx. prof_genx enables use of two Intel tools: Code Coverage (codecov) Test Prioritization (tselect) Discuss the various optimizations that PGO performs: Basic block ordering: The compiler uses profile information to closely order the frequently executed blocks according to the address in a function. This leads to better I-cache utilization. Better register allocation: Allocation of registers to hold the values of variables and temporaries is according to the most frequently executed basic blocks. As a result, the overflow of variables and temporaries to memory (when there is a shortage of registers) is more likely to occur in the infrequently executed basic blocks. Better decision of functions to inline: Inlining incurs code size costs, but reduces execution time by removing the function call. By obtaining the execution counts of basic blocks and routines, the compiler can inline heavily executed functions. There is little value in inlining a routine that is minimally executed, assuming that the profile is representative. Function ordering: Function ordering is similar to basic block ordering. It places heavily executed functions closer together according to address. This step enables better paging because of higher utilization of instructions within every page. Switch-statement optimization: The compiler can pull out the high-frequency cases and perform an explicit comparison for these cases at two possible points: before doing the tests for the lower-frequency cases or before performing the indirect jump whose target address is read at run time. The indirect jump itself is costly on an out-of-order machine with branch prediction (a small illustrative example follows these notes). Better vectorization decisions: Vectorization can incur a code size cost, but can greatly reduce execution time in some cases. Therefore, it is better to avoid vectorization of loops with low trip counts or loops that are not executed frequently. Profile information can be used to provide the compiler with the heavily executed loops with high trip counts. Introduce the PGO switch, prof_genx. Bring out the similarities and differences between the prof_gen and prof_genx switches. List the two Intel tools that PGO enables: Code Coverage (codecov): The codecov tool picks up a combination of the static and dynamic information and provides an exact picture of the code traversed at run time. Test Prioritization (tselect): The tselect tool performs specific analyses of the code and identifies strategic regression tests to run on the basis of previous profile runs. Clarify queries (if any) of the participants related to this slide. 2018/9/19
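As an illustration of the switch-statement optimization described above, the sketch below shows a hypothetical interpreter loop in which one case dominates. Nothing in the source changes for PGO; the comment simply marks the case that a representative profile would identify as hot, so the compiler can test it explicitly before the remaining cases or the indirect jump.
/* Hypothetical opcode mix: the point is what PGO does with the profile,
 * not the source itself. */
#include <stdio.h>

enum opcode { OP_LOAD, OP_STORE, OP_ADD, OP_HALT };

static long execute(const enum opcode *program, long n)
{
    long acc = 0;
    long i;
    for (i = 0; i < n; i++) {
        switch (program[i]) {
        case OP_ADD:    /* hot: a profiled run shows most opcodes land here,
                         * so PGO can compare against OP_ADD first */
            acc += 1;
            break;
        case OP_LOAD:   acc += 2; break;
        case OP_STORE:  acc -= 1; break;
        case OP_HALT:   return acc;
        }
    }
    return acc;
}

int main(void)
{
    enum opcode program[8] = { OP_ADD, OP_ADD, OP_ADD, OP_LOAD,
                               OP_ADD, OP_ADD, OP_STORE, OP_HALT };
    printf("result = %ld\n", execute(program, 8));
    return 0;
}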
Inter-Procedural Optimizations Definition: Inter-Procedural Optimization (IPO) performs a static analysis of the application at link time along with the details of the variables and functions seen during a typical link step. IPO has the following features: Improves application performance in programs that contain frequently used small- or medium-sized functions. Is useful for programs that contain function calls within loops. Performs analysis for multiple files or over entire programs to detect and perform optimizations. Define inter-procedural optimization: Inter-Procedural Optimization (IPO) performs a static analysis of the application at link time along with the details of the variables and functions seen during a typical link step. This link-time analysis information is used during subsequent compilations for effective optimizations. This analysis is used to arrange the code logically to eliminate dead code, enable inlining, and allow better register usage to enhance application performance. Discuss its features: IPO helps improve application performance in programs that contain frequently used small- or medium-sized functions. IPO is useful for programs that contain function calls within loops. You can do this analysis for multiple files or over entire programs to detect and perform optimizations. Clarify queries (if any) of the participants related to this slide. 2018/9/19
IPO Process IPO is a two-step process. [Figure: Step 1: compile the source files with IPO to produce files with intermediate language information. Step 2: link with IPO to produce the optimized executable.] Start the slide by stating that IPO is a two-step process. Explain the IPO process in the following manner: The IPO process requires source files to be compiled with the IPO option. This creates object (.o) files that contain the intermediate language (IL) that the compiler uses in a later step. On linking, the compiler combines all the IL information and analyzes it for optimization opportunities. Typical optimizations of the IPO process include procedure inlining and reordering, eliminating dead or unreachable code, and substituting known values for constants. Clarify queries (if any) of the participants related to this slide. IPO Process 2018/9/19
IPO Optimization Switches IPO uses two optimization switches:
Mac/Linux | Windows | Description
-ip | /Qip | Enables inter-procedural optimization inside a file
-ipo | /Qipo | Enables inter-procedural optimization across files
Consider the following function:
function_a() { … function_b() … }
function_b() { … }
Explain the optimization switches given in the table. With /Qip, the analysis is limited to only the source file. With /Qipo, the analysis spans all source files. Code generation in module A can be affected by the processes in module B. Explain the concept of the IPO switches by using an example of a nested function call, as sketched below. When you use the /Qip switch to compile this program, optimizations for single-file compilation are enabled. The compiler can inline function_b() inside function_a() rather than calling the external function_b(). However, it depends on the size of the nest or the number of instances where function_b() is called within function_a(). Clarify queries (if any) of the participants related to this slide. 2018/9/19
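A minimal, self-contained version of the function_a()/function_b() example is sketched below, split across two hypothetical files (main.c and helper.c) so that the difference between /Qip (within a file) and /Qipo (across files) is visible. The file names and build lines are assumptions for illustration.
/* Hypothetical build lines:
 *   Windows:  icl /Qipo main.c helper.c
 *   Linux:    icc -ipo  main.c helper.c
 */

/* ---- helper.c ---- */
int function_b(int x)
{
    return x * x + 1;
}

/* ---- main.c ---- */
#include <stdio.h>

int function_b(int x);           /* defined in helper.c */

int function_a(int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += function_b(i);    /* call inside a loop: a prime candidate
                                  * for cross-file inlining under IPO */
    return sum;
}

int main(void)
{
    printf("%d\n", function_a(10));
    return 0;
}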
Compiler-Based Vectorization Compiler-based vectorization allows you to invoke the Streaming SIMD Extensions (SSE) capabilities of the underlying processor. Suppose you are dealing with single-precision, floating-point elements within the following scalar loop:
float a[1000], b[1000], c[1000];
int i;
for (i=0; i<1000; i++)
    c[i] = b[i] + a[i];
[Figure: sequence of execution without vectorization (one scalar addition per iteration, with three of the four register slots not used) versus the sequence of execution with vectorization (four packed floats added per iteration).]
Define compiler-based vectorization. Compiler-based vectorization allows you to invoke the Streaming SIMD Extensions (SSE) capabilities of the underlying processor. The capabilities allow some loops to be computed 2 to 16 times faster by using SSE registers and SSE instructions more efficiently. This advanced optimization feature analyzes loops and determines when it is safe and effective to compact loop-centric array data from several loop iterations into a single register and manipulate it with MMX™- and SSE-style instructions. Vectorization reduces the number of iterations of the target loop and increases the speed of computation. Depending on the data type, the loop trip count may be reduced by a factor of almost 16. Explain how vectorization works using the scalar loop shown on the slide. In the normal course, the processor goes through 1000 iterations, one element at a time. Without vectorization, three-fourths of the register is wasted. When the loop is vectorized, four floats can be processed at one time. The number of iterations is now reduced from 1000 to 250. Clarify queries (if any) of the participants related to this slide. 2018/9/19
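A self-contained version of the scalar loop above is sketched below, sized to match the 1000-iteration discussion. The build lines are assumptions that use the processor-specific switches introduced on the following slides.
/* Hypothetical build lines:
 *   Windows:  icl /O2 /QxP vec_add.c
 *   Linux:    icc -O2 -xP  vec_add.c
 */
#include <stdio.h>

#define N 1000

int main(void)
{
    float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(2 * i);
    }

    /* With SSE, four single-precision floats fit in one 128-bit register,
     * so a vectorized build performs roughly N/4 = 250 packed additions
     * instead of 1000 scalar ones. */
    for (i = 0; i < N; i++)
        c[i] = b[i] + a[i];

    printf("c[999] = %f\n", c[N - 1]);
    return 0;
}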
Supported Data Sizes for Vectorization SSE3 Instructions Streaming SIMD Extensions 3 (SSE3), also known as Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 architecture. SIMD – MMX, SSE, SSE2, SSE3 Support [Figure: supported data sizes for vectorization across MMX*, SSE, SSE2, and SSE3: 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword, 4x floats, 2x doubles.] State that Streaming SIMD Extensions 3, also known as Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 architecture. The MMX technology did not include support for the dqword size data. In fact, MMX supported only the register sizes that are blue colored in the figure. SSE doubled the register length to include the red extensions of the MMX. SSE3 includes 13 new instructions designed to reduce the number of instructions needed to execute program tasks. These instructions are targeted to improve specific application areas, such as media and gaming applications. SSE3 offers the capability to work horizontally in a register, as opposed to the more or less strictly vertical operation of all the previous SSE instructions. SSE3 includes instructions to add and subtract the multiple values stored within a single register. Clarify queries (if any) of the participants related to this slide. Supported Data Sizes for Vectorization 2018/9/19
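To make the horizontal SSE3 operations concrete, the sketch below uses the standard SSE/SSE3 intrinsics headers to add the four floats packed in one 128-bit register. The example is an illustration only; it assumes a compiler and processor with SSE3 support (for example, -msse3 on GCC or an SSE3-targeting switch on the Intel compiler).
/* Horizontal add within a 128-bit register using SSE3 intrinsics. */
#include <stdio.h>
#include <pmmintrin.h>   /* SSE3 intrinsics (pulls in SSE/SSE2 headers) */

int main(void)
{
    /* Pack four floats into one 128-bit register: {1, 2, 3, 4}. */
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);

    /* Two horizontal adds reduce the register to a single sum. */
    __m128 h = _mm_hadd_ps(v, v);   /* {1+2, 3+4, 1+2, 3+4} */
    h = _mm_hadd_ps(h, h);          /* {10, 10, 10, 10}     */

    printf("horizontal sum = %f\n", _mm_cvtss_f32(h));  /* prints 10.0 */
    return 0;
}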
Compiler-Based Vectorization (Continued) Compiler-based vectorization is processor-specific. A compiler switch is used to control the level of vectorization that instructions should incorporate into the application being compiled based on the target processor. Two of the most recent important processor flag values that you can use with the Intel compiler are:
Processor Value | Windows | Linux | Mac
W | /QxW | -xW | Does not apply
P | /QxP | -xP | Vectorization occurs by default
P | /QaxP | -axP | (see the next slide on automatic processor dispatch)
Compiler-based vectorization is processor-specific. A compiler switch is used to control the level of vectorization that instructions should incorporate into the application being compiled based on the target processor. Two of the most recent important processor flag values that you can use with the Intel compiler are /QxW and /QxP. Use /QxW if you want to generate instructions and optimize for Intel® Pentium® 4 compatible processors including MMX, SSE, and SSE2. Use /QxP if you want to generate instructions and optimize for Intel® processors with SSE3 capability, including Core Duo. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Automatic Processor Dispatch – ax[?] The automatic processor dispatch switch is used in situations where: The target processor may be unknown. The application may run on many generations of processors. You want to take advantage of high-level vectorization available on most of the recent processors. The automatic processor dispatch version of the vectorization switch uses processor-specific instructions and performs the vectorization:
Linux | Windows | Description
-axP | /QaxP | Enables a generic scalar code path as well as a processor-specific vectorized path
Explain that the automatic processor dispatch switch is used in situations where: The target processor may be unknown. The application may run on many generations of processors. You want to take advantage of high-level vectorization available on most of the recent processors. The automatic processor dispatch version of the vectorization switch uses processor-specific instructions and performs the vectorization. It includes a generic code path to ensure that the application runs on older processors as well. For example, you can use the /QaxP switch to optimize for Intel® Core Duo processors with SSE3 capabilities and generate generic code that runs on all IA-32 processors. This switch uses the SSE3 vectorization capabilities of the Core Duo. However, it adds a small code path to run on non-Core Duo processors. This switch involves low overhead and may result in slightly larger binaries. However, this is highly application-dependent. Advanced optimization options or very large programs may require additional resources during compilation, because different versions of binary code are generated for the same source wherever the optimization is possible. Use the following transition to the next slide: Intel compilers provide features, such as auto-parallelization, OpenMP threading technology, and parallel diagnostics, that support parallelization of applications. Let us look at the parallelization aspect of Intel compilers. Clarify queries (if any) of the participants related to this slide. ! Note The /QaxP switch involves low overhead and may result in slightly larger binaries. However, this is highly application-dependent. 2018/9/19
Auto-Parallelization The auto-parallelization switch automatically converts serial source code into the equivalent multithreaded code:
Switch | Windows | Linux/Mac
Auto-Parallelization | /Qparallel | -parallel
Ways to increase the probability of gaining performance from the auto-parallelization switch are: Increase/decrease the computation threshold: /Qpar_threshold [n] Use in combination with PGO and IPO Use the compiler reporting feature: /Qpar_report [n] Explain the auto-parallelization feature: The auto-parallelization feature of Intel compilers automatically converts serial source code into the equivalent multithreaded code. Intel compilers provide a switch that performs this task. Explain the /Qparallel switch. Auto-parallelization threads loop iterations to run in parallel. The compiler can identify easy loop candidates for parallelization. Such loops contain no dependencies between iterations (a small example loop follows the note below). Auto-parallelization has various benefits. Some programs yield a free performance gain on multi-core systems, whereas other programs may result in reduced performance levels. Discuss the ways to increase the probability of gaining performance from the auto-parallelization switch: Increasing/decreasing the computation threshold: This method may assist the compiler in creating a more successful binary. The switch that guides the compiler heuristics for loops is /Qpar_threshold [n]. This option sets a threshold for auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. This option is used for loops whose computation work volume cannot be determined at compile time. Here, n is an integer whose value is the threshold for the auto-parallelization of loops. The value of n can range from 0 to 100: /Qpar_threshold0: Loops get auto-parallelized, regardless of computation work volume. /Qpar_threshold100: Loops get auto-parallelized only if profitable parallel execution is almost certain. The default value for n is 75. The intermediate values between 1 and 99 represent the percentage probability of a profitable speedup. For example, setting n to 50 parallelizes a loop only if there is a 50 percent probability of the code speeding up when it is executed in parallel. The compiler applies a heuristic to balance the overhead of creating multiple threads against the amount of work available to be shared among the threads. Using in combination with PGO and IPO: This step can help the compiler make the correct choices while threading the code. Using the compiler reporting feature: This feature provides diagnostic information by generating a report on the loops that are successfully parallelized and the dependencies that prevent parallelization of other loops. The switch that generates reports is /Qpar_report [n]. Here, n is the level of diagnostic messages to be displayed. Possible values for n are: 0: No diagnostic messages are displayed. 1: Diagnostic messages are displayed indicating successfully auto-parallelized loops. The compiler also issues a LOOP AUTO-PARALLELIZED message for parallel loops. 2: Diagnostic messages are displayed indicating successfully and unsuccessfully auto-parallelized loops. 3: Diagnostic messages as specified by 2, plus additional information about any proven or assumed dependencies inhibiting auto-parallelization, are displayed. The default value for n is 1. To display parallel diagnostic messages, you need to specify the option on the command line. Clarify queries (if any) of the participants related to this slide. !
Note For more information on automatic parallelization, you can visit http://www3.intel.com/cd/software/products/asmo-na/eng/compilers/clin/278607.htm. 2018/9/19
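A minimal sketch of a loop that the auto-parallelizer can safely thread is shown below: each iteration writes a distinct array element and reads nothing produced by another iteration, so there are no cross-iteration dependencies. The build lines are illustrative assumptions using the /Qparallel, -parallel, and /Qpar_report switches listed above.
/* Hypothetical build lines:
 *   Windows:  icl /O2 /Qparallel /Qpar_report2 autopar.c
 *   Linux:    icc -O2 -parallel  autopar.c
 */
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    int i;

    for (i = 0; i < N; i++)
        b[i] = (double)i;

    /* Independent iterations: a candidate for auto-parallelization. */
    for (i = 0; i < N; i++)
        a[i] = b[i] * 2.0 + 1.0;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}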
OpenMP Threading Technology OpenMP is a pragma-based approach to multithreading. The OpenMP switch that you can use for the compiler to recognize pragmas is:
Switch | Windows | Linux
OpenMP | /Qopenmp | -openmp
Consider the code snippet:
#pragma omp parallel
#pragma omp for
for (i=0; i<MAX; i++)
    A[i] = c*A[i] + B[i];
Define OpenMP. OpenMP is a pragma-based approach to multithreading. It is a way of informing the compiler about the need to parallelize the application. Explain the openmp switch with the help of the code example given on the slide: For the compiler to recognize pragmas, you need to use a switch. When you insert the omp parallel pragma, you specify the code to be parallelized. The omp for pragma divides the iterations of the for-loop among all the executing threads. To find out the code blocks that have been parallelized, you can use the /Qpar_report [n] feature. State that OpenMP will be discussed in detail in later sessions. Clarify queries (if any) of the participants related to this slide. 2018/9/19
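A self-contained version of the OpenMP snippet on the slide is sketched below. The pragmas and the /Qopenmp and -openmp switches are the ones shown above; MAX, the array contents, and the build lines are illustrative assumptions.
/* Hypothetical build lines:
 *   Windows:  icl /Qopenmp omp_demo.c
 *   Linux:    icc -openmp  omp_demo.c
 */
#include <stdio.h>
#include <omp.h>

#define MAX 1000

int main(void)
{
    float A[MAX], B[MAX];
    const float c = 3.0f;
    int i;

    for (i = 0; i < MAX; i++) {
        A[i] = (float)i;
        B[i] = 1.0f;
    }

    /* The parallel region spawns a team of threads; omp for divides the
     * loop iterations among them (the loop variable is private). */
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < MAX; i++)
            A[i] = c * A[i] + B[i];
    }

    printf("A[MAX-1] = %f (threads available: %d)\n",
           A[MAX - 1], omp_get_max_threads());
    return 0;
}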
Parallel Diagnostics Parallel diagnostics is mostly used in conjunction with Intel® Thread Checker to diagnose threading bugs. The parallel diagnostics switch for Intel compilers to instrument source code is:
Switch | Windows | Linux
Intel Thread Checker Source Instrumentation | /Qtcheck | -tcheck
To use the /Qtcheck switch, you must install Intel Thread Checker. Parallel diagnostics is mostly used in conjunction with Intel® Thread Checker. Let us understand which compiler switch is used for Intel Thread Checker to collect more in-depth information. Explain the /Qtcheck switch. To use the /Qtcheck switch, you must install Intel Thread Checker. Clarify queries (if any) of the participants related to this slide. ! Note For more information on parallel diagnostics, see the Intel Thread Checker documentation. 2018/9/19
Intel® VTune™ Performance Analyzer Utility: Intel® VTuneTM Performance Analyzer is a powerful and an easy-to-use tool. It collects, analyzes, and displays performance data for a wide variety of applications. The VTune Performance Analyzer performs the following functions: Collects performance data from the system Organizes and displays the data in a variety of interactive views Identifies potential performance issues and can suggest improvements Define Intel VTune Performance Analyzer. State that Intel VTune Performance Analyzer is a powerful and an easy-to-use tool. It collects, analyzes, and displays performance data for a wide variety of applications. The VTune Performance Analyzer can help you identify and locate the sections of code that show the highest amount of activity during a specific period. The VTune Performance Analyzer also displays how an application interacts with an operating system or other software applications, such as drivers. Features such as call graph and sampling make the VTune Performance Analyzer an efficient analyzing tool. The tool helps you tune an application at different levels, such as system, application, and computer architecture. The VTune Analyzer includes activities and data collectors for collecting the performance-related data. Activities control collection of data. Within an activity, you can specify the types of performance data you wish to collect. For each type of performance data, you need to configure the appropriate data collector to collect the requested performance data. Different data collectors collect different types of performance data. Explain the functions of the VTune Analyzer: Collects performance data from the system. Organizes and displays the data in a variety of interactive views, such as system-wide, source code, or processor-instruction perspective. Identifies potential performance issues and suggests improvements. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Supported Environments There are two ways to install and run the Intel® VTune™ Performance Analyzer: For native data collection: Profile applications that are running on the system with the VTune Performance Analyzer installed on it. For remote data collection: Also known as the Host/Target environment. Run profiling experiments on other systems on your subnet with VTune Performance Analyzer remote agents installed on them. State that the Intel® VTuneTM Performance Analyzer can be installed for local as well as remote data collection. Explain local and remote data collection: You can install the Analyzer for local and remote data collections. In local data collection, you profile applications running on the system with Intel VTune Performance Analyzer installed on it. In case of remote data collection (also called the host/target environment), you run profiling experiments on other systems with VTune Analyzer remote agents installed on them. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Native Performance Analysis The system performing local or native data collection can belong to any of the following processor families: Intel® Core and IA-32 Processors: Microsoft Windows* operating systems—GUI and command line interface (CLI) Red Hat Linux*—GUI and CLI SuSE Linux*— GUI and CLI Itanium® Family Processors: Microsoft Windows operating systems—GUI and CLI Red Hat Linux—GUI and CLI SuSE Linux—GUI and CLI State that the system performing local or native data collection can belong to any of the following processor families: Intel® Core and IA-32 Processors: Microsoft Windows operating systems—GUI + command line Red Hat Linux—GUI and CLI SuSE Linux—GUI and CLI Itanium® Family Processors: Red Hat Linux—GUI and Command Line (CLI) Mention that the VTune Analyzer supports Red Hat, SuSE Linux, and Windows. Remote profiling is possible from Windows to Windows, Windows to Linux, or Linux to Linux. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Remote Data Collection: Host/Target Environment The Intel® VTune™ Performance Analyzer is installed on a host system, and the remote agent is installed on the target system. Host System Windows operating system Controls target View results of data collection Target System IA-32 or Itanium® processor family Windows or Linux LAN Connection The VTune Performance Analyzer supports remote data collection. Such an environment is called a host/target environment. In such a case, the VTune Performance Analyzer is installed on a host system, and the remote agent is installed on the target system. Explain the host/target environment with the help of the figure on the slide. The host and target are connected to each other over a LAN. The host system runs a Windows operating system. This system controls the target and views the results of data collection. The target system can belong to the following processor families: IA-32 or Itanium processor family Windows or Linux Clarify queries (if any) of the participants related to this slide. 2018/9/19
Intel® VTuneTM Performance Analyzer The Intel® VTuneTM Performance Analyzer offers features that help in performance tuning: Sampling: Calculates the performance of an application over a period and for various processor events. Call Graph: Provides a graphical view of the flow of an application and helps you identify critical functions and timing details. Counter Monitor: Provides system-level performance information such as resource utilization during the execution of an application. This functionality works only on Windows. Hotspots View: Helps identify the area of code that consumes the maximum CPU time. Tuning Assistant: Provides tuning advice by analyzing the performance data. The tuning advice provides a guideline for the programmer to improve the performance of an application. This functionality works only on Windows. The Intel VTune Performance Analyzer offers features that help in performance tuning. You can use these features to detect and correct performance problems and optimize the applications for better performance. State the various features of the VTune Analyzer: Sampling: Calculates the performance of an application over a period and for various processor events. Call Graph: Provides a graphical view of the flow of an application and helps you identify critical functions and timing details. Counter Monitor: Provides system-level performance information such as resource utilization during the execution of an application. This functionality works only on Windows. Hotspots View: Helps identify the area of code that consumes the maximum CPU time. Tuning Assistant: Provides tuning advice by analyzing the performance data. The tuning advice provides a guideline for the programmer to improve the performance of an application. This functionality works only on Windows. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Sampling – A Detailed Study Sampling is the process of collecting performance data by observing the processor state at regular defined intervals. Sampling has the following features: Identify Hotspots: A hotspot is a section of code that contains a significant amount of activity for some internal processor event, such as clockticks, cache misses, or disk reads. Identify Bottlenecks: A bottleneck is an area in code that slows down the execution of an application. To optimize code, you need to remove the bottlenecks. Define sampling as the process of collecting performance data by observing the processor state at regular intervals. The VTune Performance Analyzer collects data about the required application and the system on which the application is running. You can then analyze this data to determine how to modify code to reduce execution time and improve the performance of the application. Discuss the features of sampling: Identify Hotspots: A hotspot is a section of code that contains a significant amount of activity for some internal processor event, such as clockticks, cache misses, or disk reads. Identify Bottlenecks: A bottleneck is an area in code that slows down the execution of an application. To optimize code, you need to remove the bottlenecks. Highlight the difference between a hotspot and a bottleneck. Explain that a hotspot states where to focus your attention when looking for bottlenecks. All bottlenecks are hotspots, but all hotspots need not necessarily be bottlenecks. Clarify queries (if any) of the participants related to this slide. ! Note Usually people confuse a hotspot and a bottleneck. A hotspot shows where to focus your attention when looking for bottlenecks. All bottlenecks are hotspots, but all hotspots need not necessarily be bottlenecks. 2018/9/19
Demo: Find the Hotspot Objective: Identify hotspots with the Intel® VTune™ Performance Analyzer. The main idea of this activity is to identify how to use sampling to identify hotspots on sample code using the VTune Performance Analyzer. Demonstrate to the participants how to identify hotspots with sampling using the VTune Analyzer. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Benefits of Profiling with Sampling Profiling with sampling has three key benefits: You need not modify your code. Sampling is system-wide. It is not restricted to your application. Sampling involves low overhead. Its results are statistically valid, and they are most accurate when perturbation is low. Discuss the benefits of profiling with sampling: You need not modify your code. However, you must compile or link with symbols and line numbers. Sampling is system-wide. It is not restricted to your application. In fact, you can see the activity of operating system code, including drivers. Sampling involves low overhead. Its results are most accurate when perturbation is low. You can further reduce the overhead by decreasing the number of samples or turning off progress meters on the user interface. Question to Ask Participants What if you are developing the application on a server that is much more powerful (better CPU, more RAM, and a faster disk) than the systems your target audience is likely to have? You can use remote sampling rather than local sampling. You can use a remote agent, which comes with the analyzer. To do this, you would install the analyzer GUI on one system and the small remote agent on the server that resembles the target server, where the application resides. In this manner, the effects of the analyzer itself are largely removed from the profiling analysis, and your system-wide views will be even more relevant in terms of what the application users will see. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Sampling Collector The sampling collector is responsible for collecting the sampling data. Define sampling collector: The sampling collector is responsible for collecting the sampling data. It periodically interrupts the processor to obtain the execution context. Describe the Process View. A process is an intrinsic combination of code, data, and several operating system resources. The VTune Performance Analyzer displays a system-wide view of all the processes running on your system during data collection. The Process View displays the following panes: Process: Displays processes running in the application. Events: Displays events corresponding to the processes. Selection Summary: Displays the cumulative number of samples and events monitored for the processes selected in the chart. Legend: Provides details about the events in each process. A high number of events in a particular process indicates high CPU usage, which can indicate potential performance bottlenecks. Clarify queries (if any) of the participants related to this slide. Process View 2018/9/19
Sampling Collector (Continued) Describe the Thread View. The Thread View displays all threads that run within the selected processes. Threads are not shared across processes. However, threads from different processes can run a common module. The Thread View displays the following panes: Thread: Displays threads running in the application. Events: Displays events corresponding to the threads. Selection Summary: Displays the cumulative number of samples and events monitored for threads selected in the chart. Legend: Provides details about the events in each thread. By default, threads are named Thread0, Thread1, and so on. To distinguish threads while analyzing data, you can rename each thread with a meaningful name. Clarify queries (if any) of the participants related to this slide. Thread View 2018/9/19
Sampling Collector (Continued) Describe the Module View. Modules are executable binary image files. The Module View displays all the modules within the selected threads running during the period of sampling. The Module View displays the following panes: Module: Displays information, such as process, clockticks samples, and original module path, about modules running in a process. Selection Summary: Displays the cumulative number of samples and events monitored for modules selected in the chart. Legend: Provides details about the events in each module. Modules that have been called frequently during sampling data collection display the highest number of events or the most CPU time. Clarify queries (if any) of the participants related to this slide. Module View 2018/9/19
Sampling Collector (Continued) Describe the Hotspots View. The Hotspot View displays function names associated with selected modules that have symbol information available. The Hotspot View displays the following panes: Function: Displays functions running in the selected module. Events: Displays events corresponding to the functions. Selection Summary: Displays the cumulative number of samples and events monitored for functions selected in the chart. Legend: Provides details about the events in each function. If only the executable file is available, the nearest available external function name or the module name and offset are shown. If the executable and debug information are not available, the Hotspot view displays only the relative virtual address (RVA) information. Clarify queries (if any) of the participants related to this slide. Hotspots View 2018/9/19
Sampling Collector (Continued) Describe the Source View. You can view the source of applications you sample by drilling down from the Hotspot view to the Source View. The Source View displays the source code for the selected hotspot. The Source View displays the following panes: Source: Displays the source code or disassembly code of the module. The pane contains the function from which you drilled down. The columns on the left of the source pane show line numbers and RVA information. The columns to the right show the events that you configured for the sampling collector in the activity. An empty field indicates that the line of code did not generate any samples. Summary: Displays information about each function in a table format. This information includes the function’s RVA, size, a summary of the number of events the VTune Performance Analyzer recorded for each function, and any event ratios configured in the activity. You can double-click a function in the summary pane to navigate between functions in the source pane. For applications that have symbol and line number information, you can view the source code. However, for applications that do not have symbol and line number information, the VTune Performance Analyzer will only be able to display the assembly code. Clarify queries (if any) of the participants related to this slide. Move to the next slide by saying: The VTune Performance Analyzer provides two types of sampling mechanisms to collect data, Time-Based Sampling (TBS) and Event-Based Sampling (EBS). However, before performing TBS or EBS, you need to configure the sampling collector by using the different wizards available in the VTune Performance Analyzer. Source View 2018/9/19
Time-Based Sampling Time-Based Sampling (TBS) helps to reveal the routines in which the application spends the most time. This feature is applicable for Windows* only. When you perform an activity by using TBS, the Intel® VTuneTM Performance Analyzer performs the following functions: Executes the application you launched. Interrupts the processor at the sampling interval and collects data on the current process executing. Continues to collect sampling data until the specified application terminates or the specified sampling duration ends. Analyzes the collected data, creates an activity result in the Tuning Browser window, and displays the data collected for each module. Time-Based Sampling (TBS) is triggered by the timer services of the operating system after every N processor clockticks. This type of sampling helps to reveal the routines in which the application spends the maximum time. This feature is only applicable for Windows. In TBS, the VTune Performance Analyzer collects samples of an activity at regular intervals. TBS uses the operating system timer to calculate the time interval for collecting samples. The default time interval is one millisecond (ms). The collected samples present the performance data of all the processes that were running on the computer during sampling. The process that spent the most time in execution should contain the largest number of samples. When you perform an activity by using TBS, the VTune Performance Analyzer performs the following functions: Executes the application you launched. Interrupts the processor at the sampling interval and collects data on the current process executing. Continues to collect sampling data until the specified application terminates or the specified sampling duration ends. Analyzes the collected data, creates an activity result in the Tuning Browser window, and displays the data collected for each module. Highlight that TBS is not supported on Linux currently. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Event-Based Sampling Event-Based Sampling (EBS) is triggered by a processor event counter underflow. A few things to remember about EBS are as follows: EBS is performed on processor events, such as L2 cache misses, branch mispredictions, and retired floating-point instructions. EBS helps you determine which process, thread, module, function, or code line in the application generates the largest number of chosen processor events. You choose one or more processor events of interest from the Events list when configuring the sampling collector. EBS does not work on laptops and on non-Intel® processors. Event-Based Sampling (EBS) is triggered by a processor event counter underflow. Events that can be tracked are specific to the processor. Some of the events are L2 cache misses, branch mispredictions, and retired floating-point instructions. TBS is performed on the basis of operating system time, whereas EBS is performed on processor events. When you run an application, processor events also affect the performance of the application. EBS helps you determine which process, thread, module, function, or code line in the application generates the largest number of chosen processor events. To use EBS, you choose one or more processor events of interest from the Events list when configuring the sampling collector. You can also select a Sample After value to be used as the starting value of the event counter. When an event is detected during the execution of your application, the counter is decremented. When the counter reaches zero, an interrupt is raised, a sample of the current processing state is saved, and the counter is reset to the start value, known as the Sample After value. Choose a Sample After value that generates approximately 1000 samples per second of execution time. This number of samples will yield statistically significant results and keep the overhead of sampling low. If you do not know or cannot estimate a good Sample After value, you can use a calibration run. This runs the sampling twice. The first run counts the total number of chosen processor events to best choose a good Sample After value. The second run gathers the EBS data for the chosen event. If you select more than one event to monitor, you may get one calibration run and one data collection run per event. Define a cache miss as a request for memory that is not found in the cache. Cache, or cache memory, is a fast storage buffer in the central processing unit of a computer. EBS does not work on laptops and on non-Intel® processors. Clarify queries (if any) of the participants related to this slide. 2018/9/19
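A small worked example of choosing a Sample After value, following the guideline of roughly 1000 samples per second, is shown below. The clockticks event and the 2 GHz clock rate are assumptions chosen only for illustration.
/* Worked example: Sample After value for a clockticks event. */
#include <stdio.h>

int main(void)
{
    const double events_per_second = 2.0e9;      /* clockticks on an assumed 2 GHz CPU */
    const double target_samples_per_sec = 1000.0; /* guideline from the notes above */

    /* One sample is taken each time the counter counts down from the
     * Sample After value to zero. */
    double sample_after = events_per_second / target_samples_per_sec;

    printf("Suggested Sample After value: %.0f\n", sample_after);  /* 2000000 */
    return 0;
}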
Sampling Over Time View The Sampling Over Time functionality shows how sample distributions change with time. Discuss the Sampling Over Time functionality. The VTune Analyzer for Windows helps identify time-varying performance characteristics. The Sampling Over Time functionality shows how sample distributions change with time. Using Sampling Over Time, you can zoom in to specific time regions to focus your attention on the required execution periods. The Sampling Over Time View displays samples collected with respect to time for a single event. One of the most common uses of the Sampling Over Time View is to identify when threads are running serially or in parallel. The Sampling Over Time View displays the following panes: Sampling Results: Displays the names of the selected items, such as processes, modules, or threads. Time: Displays the samples collected over time. This panel is divided into squares, each square representing a unit of time. The color of a square indicates the number of samples collected for that unit of time. The range of colors is from red, which indicates a large number of samples, to green, which indicates a small number of samples. White squares indicate that there were no event samples within the given time period. You can use the Sampling Over Time View to gather the following information: Context switching: From the Sampling Over Time View data, you can determine excessive context switching. Context switching is the process of switching from one executing thread to another without losing the state of the first thread. Context switching is said to be excessive if the following ratio is high: (Context Switches/sec) / ((number of processors) * (processor speed / 100 MHz)) If you have a server application, you can use a small pool of threads to process work requests as they arrive. If your system already uses thread pooling, you can try to reduce the number of server threads. A small number of threads is often the most efficient configuration, since the number of context switch targets will be smaller. Processor utilization: You can identify idle processors at any given time. A processor is idle if clockticks samples are collected for the system process or the idle thread. Therefore, if the system process receives samples in the Sampling Over Time View, there is scope for improving processor utilization at that time. Temporal location of hotspots: During event-based sampling for a particular event, the number of events may vary with time. With the Sampling Over Time View, you can see the specific periods of time when a large number of events occurs. For example, you may see that the number of L2 cache load misses per retired instruction is moderate across the entire workload, even though there are periods of execution with a relatively large number of cache misses. The Sampling Over Time View of the L2 cache load misses enables you to see any period during the execution of the module when a large number of cache misses occurred. Thread interaction: From the regular sampling views, you can see the number of threads in an application. However, you cannot see how they interact with each other. By using the Sampling Over Time View, you can view patterns of thread behavior and thread interaction. Highlight that the Sampling Over Time View is not available for the Hotspot View. Clarify queries (if any) of the participants related to this slide. Sampling Over Time View 2018/9/19
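A small worked example of the context-switching ratio quoted above is shown below. The input values (8000 context switches per second, 2 processors, 2000 MHz) are hypothetical; only the formula comes from the notes.
/* Worked example of the context-switching ratio. */
#include <stdio.h>

int main(void)
{
    const double switches_per_sec = 8000.0;  /* hypothetical OS counter reading */
    const double processors = 2.0;
    const double speed_mhz = 2000.0;

    double ratio = switches_per_sec / (processors * (speed_mhz / 100.0));

    printf("context-switch ratio = %.1f\n", ratio);  /* 8000 / (2 * 20) = 200 */
    return 0;
}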
Sampling Over Time Usage Model The steps that you need to follow to bring up the Sampling Over Time analysis are: Collect sampling data. Select the items of interest from the process, thread, or modules view. Click the Display Over Time button to enable the Display Sampling Over Time View. Highlight the region of interest. Click the Zoom In button to magnify the selected item. Click the Display regular sampling view for selected time-range button to see the process/thread/address histogram for the selected (zoomed) time region. Discuss the steps to bring up the Sampling Over Time analysis as stated on the slide. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Demo: Use Sampling Over Time Objective: Identify how to use the Sampling Over Time functionality. The main idea of this activity is to get familiar with the interface of the Sampling Over Time functionality. Demonstrate to the participants how to use the Sampling Over Time functionality. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Call Graph – A Detailed Study The call graph collector of the Intel® VTuneTM Performance Analyzer helps you obtain information about the functional flow of an application. Call graph helps you: Obtain information about the number of times a function is called from a specific location. Obtain information about the amount of time spent in each function when executing its code. Define call graph. The call graph collector of the VTune Performance Analyzer helps you obtain information about the functional flow of an application. Call graph also helps you obtain information about the number of times a function is called from a specific location and the amount of time spent in each function when executing its code. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Call Graph Profiling Call graph performs the following functions: Tracks the function entry and exit points of your code at run time. Uses this data to determine program flow, critical functions, and call sequences. Requires the instrumentation of target binaries. Only profiles the code in Ring 3 or application-level modules at run time. Explain the functions of call graph: The call graph profiling functionality tracks function entry and exit points of your code at run time. The data collected is used to determine program flow, critical functions, and call sequences. Call graph profiling is not system-wide profiling. It requires the instrumentation of target binaries and only profiles the code in Ring 3 or application-level modules at run time. Ring 0 (kernel and driver) modules are not instrumented. With call graph, you do not need to use a special compiler or to insert special API calls into your code. You only need to generate debug information when you build the application. Clarify queries (if any) of the participants related to this slide. ! Note With call graph, you do not need to use a special compiler or to insert special API calls into your code. You only need to generate debug information when you build the application. 2018/9/19
What can you Profile? Call graph can profile different software applications: Win32* applications Stand-alone Win32 DLLs Stand-alone COM+* DLLs Java* applications .NET* applications ASP.NET* applications Linux32* applications Linux64* applications List the various software that can be profiled using call graph: Win32 applications Stand-alone Win32 DLLs Stand-alone COM+ DLLs Java applications .NET applications ASP.NET applications Linux32 applications The VTune Analyzer can also analyze mixed code such as C++ and .NET code simultaneously. This feature is one of the VTune Analyzer’s most useful abilities. Clarify queries (if any) of the participants related to this slide. Move to the next slide by saying that the VTune Performance Analyzer displays results of the call graph in three synchronized views: graph, call list, and functional summary. 2018/9/19
Call Graph View Bright orange nodes indicate functions with the highest self time. The red lines show the critical path. The critical path is the most time-consuming call path. It is based on self time. Describe the Call Graph View. State that the Call Graph View displays the graphical structure of the application that you run by using the VTune Performance Analyzer. The Call Graph View also displays the caller function, the callee function, and time information. The time information provided by the graph view enables you to: Estimate the performance of the application. Find potential performance bottlenecks. Find the critical path. From the Call Graph View, you can identify the critical path in the application or module. The critical path is the most time-consuming path in the call sequence of an application that originates from the root function. In a call graph, the critical path is indicated by a thick red edge. This path contains the maximum edge time values in the application. The edge time is the total time taken by a callee function to execute when it is called from a specific caller function. Bright orange nodes indicate functions with the highest self time. Clarify queries (if any) of the participants related to this slide. Call Graph View 2018/9/19
Call Graph Navigation View State that the Call Graph Navigation View is helpful when dealing with large applications. Explain that the Call Graph Navigation View provides an overview of the entire call graph and an indication of what portion is being displayed in the graph pane. Clarify queries (if any) of the participants related to this slide. Call Graph Navigation View 2018/9/19
Call Graph Call List View State that the Call Graph Call List View displays information about the time and calls of the focus function. Mention that the Call Graph Call List View has three different sections: Focus function: Displays the function selected in the upper pane. Callers: Displays the time spent by the functions that call the focus function on those calls. Callees: Displays the time spent in the functions that the focus function calls. Clarify queries (if any) of the participants related to this slide. Call Graph Call List View 2018/9/19
Call Graph Function Summary View State that the Function Summary View lists all the functions that your application calls, along with their performance metrics, in a tabular form. Clarify queries (if any) of the participants related to this slide. Call Graph Function Summary View 2018/9/19
Call Graph Metrics The Call Graph Function Summary View displays various call graph metrics:
Performance Metric | Description
Total time | Time measured from a function's entry to its exit point.
Self time | Total time in a function, excluding time spent in its children.
Total wait time | Time spent in a function and its children when the thread is blocked.
Wait time | Time spent in a function when the thread is suspended, excluding time spent in the child functions.
Calls | Number of times the function is called.
Class | Class or COM interface to which the function belongs.
Callers | Number of caller functions that called the function.
Callees | Number of callee functions the function called.
Introduce the concept of call graph metrics by stating that different results in the Call Graph View represent different cases. Explain the various performance metrics: Total time: Time spent from the start of execution of a function until the termination of execution. Self time: Time spent in executing a function, including the time spent waiting between executions of activities. However, self time does not include the time spent on child (callee) functions. For example, if a function takes 10 ms from entry to exit and spends 6 ms in its callees, its total time is 10 ms and its self time is 4 ms. Total wait time: Time spent on a function and its child functions when the thread is suspended. Wait time: Time spent on a function when the thread is suspended. However, the wait time does not include the time spent on the child functions. Calls: Number of times the function is called. Class: Class or COM interface to which the function belongs. Callers: Number of caller functions that called the function. Callees: Number of callee functions the function called. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Instrumentation Instrumentation: Does not change the functionality of the program. Is the process of modifying a program such that dynamic information is recorded during program execution. Increases the size of the code and the time it takes to run the application. By default, the Intel® VTuneTM Performance Analyzer instruments all application functions and system-level exports. There are two kinds of instrumentation: Source: Refers to source code instrumentation and provides detailed diagnostics. Binary: Is added at run time to an already built binary module and may not be able to yield as much detail as source instrumentation. Introduce the call graph instrumentation levels by stating that call graph can significantly increase your application execution time. You can reduce this increase by adjusting the module instrumentation levels. The lower the instrumentation level, the less instrumentation code is added to the application, and the less the application run time is affected. Explain the various instrumentation levels: All Functions: Every function in the module is instrumented. Custom: You can specify which functions are required. If you do not have debug information, you will only be able to select from the exported functions. Export: Every function in the module's export table is instrumented. Minimal: The module is instrumented but no data is collected for it. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Call Graph Instrumentation Levels The call graph instrumentation levels help to reduce the increase in application execution time:
Instrumentation Level | Description | Debug Information Required
All Functions | Every function in the module is instrumented. | Yes
Custom | You can specify which functions are required. If you do not have debug information, you will only be able to select from the exported functions. |
Export | Every function in the module's export table is instrumented. | No
Minimal | The module is instrumented but no data is collected for it. |
Introduce the call graph instrumentation levels by stating that, during the profiling session, call graph can significantly increase your application execution time. You can reduce this increase by adjusting the module instrumentation levels. The lower the instrumentation level, the less instrumentation code is added to the application, and therefore the less the application run time is affected compared with higher instrumentation levels. Explain the various instrumentation levels: All Functions: Every function in the module is instrumented. Custom: You can specify which functions are required. If you do not have debug information, you will only be able to select from the exported functions. Export: Every function in the module's export table is instrumented. Minimal: The module is instrumented but no data is collected for it. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Call Graph – Collector Page Introduce the topic by stating that sometimes there is a need to instrument only a few modules among the existing modules. State that the Call Graph – Collector page helps you choose the modules that you want to instrument and specify the level of instrumentation for the selected modules. Explain the steps to view the Call Graph – Collector page: In the VTune Performance Environment window, on the menu bar, click the Configure option. A drop-down menu opens. On the drop-down menu, click the Options option to display the Options dialog box. In the Options dialog box, in the left panel, under the Call Graph option, select the Collector sub-option to display the Call Graph – Collector page in the right panel. Explain the two main options of the Call Graph – Collector page: Limit collection buffer size to: The Limit collection buffer size to option is selected to adjust the size of the buffer used to store the data collected during call graph profiling. When the buffer is full, the VTune Analyzer suspends profiling, writes the data to the disk, and resumes profiling. Buffer size is limited by the amount of real memory available during the monitoring process. For long runs and large applications, if you do not specify a size, the computer may experience low memory. Enable COM tracing: The Enable COM tracing option allows call graph to instrument all COM calls even if instrumentation level of some modules is minimal. Clarify queries (if any) of the participants related to this slide. Call Graph – Collector Page 2018/9/19
Comparing Sampling and Call Graph
The differences between sampling and call graph are listed below:
Overhead: Sampling has low overhead; call graph has higher overhead.
Analysis data available: Sampling is system-wide; call graph covers Ring 3 only, on your application call tree.
What gets profiled: Sampling produces a system-wide address histogram; call graph shows the function-level hierarchy with call counts, times, and the critical path.
Compilation requirements: For function-level drill-down with sampling, you must have debug information; call graph requires relinking with /fixed:no and instruments automatically.
Basis for results: Sampling can be based on time and other processor events; call graph results are based on time.
Explain the differences between sampling and call graph. Even if you plan to do call graph analysis, running sampling first helps you identify the modules that need to be analyzed. Therefore, you can limit the highest call graph overhead to only those modules of interest. A call graph involves a higher overhead than sampling in the space and time required to run. System-wide sampling, which is event based, accurately identifies where the program is spending its time with negligible overhead. This overhead is typically less than 1 percent. Call graph, in contrast, determines calling sequences and finds the critical path, but it has a higher overhead. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Using VTune for Threaded Applications There are two uses of the Intel® VTuneTM Performance Analyzer related to threading performance: To improve the threading model: If your application is single-threaded, the first step in improving scaling is to add multithreading. If your application is already multithreaded, you can use sampling and call graph to improve performance and scaling on a multiprocessor system. To improve the efficiency of computation: You can identify code regions that have a high impact on application performance by using sampling and call graph data collectors. You can also analyze data in the Call Graph view to determine your program flow and identify the most time-consuming function calls and call sequences. Explain the uses of the VTune Performance Analyzer related to threading performance: Improving the threading model Improving the efficiency of computation You can use the VTune Analyzer to determine how much threading a sample section of code might benefit the overall execution time. You can also determine whether the implemented threads are balanced. A threading model is how you design your threading applications, such as how many threads to spawn, when to synchronize, and when to use multithreading. Using VTune, you can understand the overhead based on hotspot analysis and then modify the code, reanalyze, and make changes if necessary until the performance is optimized and all the processors are utilized efficiently. An effective threading model scales your application's performance effectively. The performance increases proportionally as you add more processing units, such as physical processors and Hyper-Threading Technology hardware threads. If your application is single-threaded, the first step in improving scaling is to add multithreading. If your application is already multithreaded, you can use sampling and call graph to improve performance and scaling on a multiprocessor system. VTune also helps improve the efficiency of computation. You can identify code regions that have a high impact on application performance by using sampling and call graph data collectors. You can use the Complete Performance Analysis wizard to create an Activity with the sampling data collector. After sampling, you can start at a high level by selecting all processes and modules that you want to tune. You can also analyze data in the Call Graph view to determine your program flow and identify the most time-consuming function calls and call sequences. Then, you can select high-impact code regions identified in the Call Graph views and double-click to drill down to source view for those regions. By doing this, you can incorporate algorithmic improvements that could speed up your application. Clarify queries (if any) of the participants related to this slide. 2018/9/19
VTune Analyzer on Serial Applications You can use the Intel® VTuneTM Performance Analyzer on serial applications to: Track areas of serial code that might parallelize. Determine what parts of the application you can thread to speed up the application. Three areas of program execution can cause performance slowdowns: • CPU-bound processes • Memory-bound processes • I/O-bound processes Explain that the VTune analyzer can track areas of serial code that might parallelize. You can also determine what parts of the serial application you can thread to speed up the application. Usually, three areas of program execution can cause performance slowdowns: CPU-bound processes: These processes perform slow or non‑pipelined operations. The operations include calculating square root or doing a floating point divide. Threaded, CPU-bound applications can potentially run twice as fast on dual-core processors. Memory-bound processes: These are processes that inefficiently use available memory or have large numbers of occurrences of cache misses. Tuned, memory-bound applications may potentially run 50 percent faster than before on multi-core processors. I/O-bound processes: These processes wait on synchronous I/O, formatted I/O, or when there is library- or system-level buffering. There may not be significant improvement in the performance of I/O-bound applications on multi-core processors. VTune Analyzer helps find main performance bottlenecks in your application by using sampling. You can further determine whether it is profitable to thread the performance bottlenecks by identifying the local and global hotspots. You can optimize local hotspots by introducing improvements such as reducing the number of cache misses and branch mispredictions. You can then use call graph to analyze the calling sequence of the program to find a more appropriate place to thread. Clarify queries (if any) of the participants related to this slide. 2018/9/19
VTune Analyzer on Multithreaded Applications You can use the Intel® VTuneTM Performance Analyzer on multithreaded applications to determine if your current threading model is balanced. You can detect load imbalance in two ways: • By inspecting the amount of time that threads take to execute by using both sampling and call graph. • By viewing the CPU information by clicking the CPU button on the VTune Performance Analyzer toolbar. Explain that the VTune analyzer can help determine if your current threading model is balanced. A threading model is balanced if each thread has an equal amount of work as other active threads of the application. You can detect load imbalance in two ways: Inspect the amount of time that threads take to execute by using both sampling and call graph: In the Thread View, each of the threads should consume the same amount of time. If not, you must distribute the amount of work among the threads. You can also use the Sampling Over Time View to gather information about context switching and processor utilization. In call graph, you can study the Self Time column in the Function Summary View to identify functions with large self time. View the CPU information by clicking the CPU button on the VTune Analyzer toolbar: The VTune Analyzer colors the samples by processor. In an ideally balanced situation, each processor executes equal amounts of work. This balance appears as an equal amount of color distributed across all of the samples on the graph. VTune Analyzer also provides statistical data on CPU utilization. You can configure the Data Collector functionality to measure the CPU utilization time. Then, you can view it in the Process View in the Selection Summary panel. This gives an idea of the CPU idle time. Additional threads can utilize this idle time. Threads that require idle resources can show up as ntoskrnl or intelppm.sys in the Process View. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Process View of Process During Sampling Results CPU Utilization Explain that the VTune Analyzer also provides statistical data on CPU utilization. You can configure the Data Collector functionality to measure the CPU utilization time. Then, you can view it in the Process View in the Selection Summary panel. This gives an idea of the CPU idle time. Additional threads can utilize this idle time. Threads that require idle resources can show up as ntoskrnl or intelppm.sys in the Process View. The Process View may sometimes display an idle task. Point out that in the figure on the slide, the VTIdle.exe process approximates the amount of CPU idle time during a run. Clarify queries (if any) of the participants related to this slide. Approximate CPU Utilization Process View of Process During Sampling Results 2018/9/19
Windows Command Line Interface The Windows command line interface (CLI) collects sampling data from the command line. CLI can help you automate the data collection process. You can either view the data in the Intel® VTuneTM Performance Analyzer or export the data as ASCII text. To invoke the Windows CLI, type vtl at the command prompt. The Windows CLI follows the syntax: >vtl <command [command option]> … The VTune Performance Environment provides several types of analyses to help you understand the performance of your software. It allows you to add Activities to the project created by default. Within an Activity, you can specify the types of performance data you want to collect by using different analysis techniques. The VTune Performance Environment enables you to manipulate and examine the project, Activities, and Activity results from the command line. Explain the Windows command line interface: The Windows command line interface (CLI) collects sampling data from the command line. CLI can help you automate the data collection process. You can run data collection several times, collect data based on different inputs, and store them in files to analyze later. You can either view the data in the VTune Performance Analyzer or export the data as ASCII text. State the CLI syntax. Inform the participants that help and more examples are available under Start > Programs > Intel® VTune™ Performance Analyzer > Help for the Command Line. Clarify queries (if any) of the participants related to this slide. ! Note For in-depth help and examples, go to Start > Programs > Intel® VTune™ Performance Analyzer > Help for the Command Line 2018/9/19
Sample Command Lines
Some sample command lines and their descriptions are listed below (Command: Description):
$ vtl -help : To list out the command help
$ vtl -help -c sampling : To list supported processor events for the server
$ vtl create <activity name [options]> : To create an activity
$ vtl run <activity name> : To run an activity
$ vtl show : To show the hidden project
$ vtl show -a : To show all details
Explain some sample command lines. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Sample Command Lines (Continued)
Some more sample command lines and their descriptions are listed below (Command: Description):
$ vtl view <activity name::result [options]> : To view results of a particular activity type
$ vtl delete <activity name> : To delete a specific activity
$ vtl delete : To delete the last activity
$ vtl delete -all : To delete the entire project
$ vtl version : To list the software version
$ vtl query -lc : To list the installed collectors
Explain some more sample command lines. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Sample Command Lines – Example Consider the following example command line: >vtl show activity -c sampling show run a1 show The commands used in the above example and their descriptions are: vtl show command helps you view the current project state. activity -c sampling command helps you create a new Activity with the sampling collector. show command helps you view the project again. run command helps you run the Activity. a1 show command helps you view the project once again with Activity results. Explain the example on the slide. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Summary Intel® compilers support multithreading capabilities and accelerate performance of applications because of dual-core and multiprocessor features. Profile-Guided Optimization (PGO) helps analyze the application workload to detect and provide opportunities to optimize the performance of the application. Inter-Procedural optimization (IPO) helps arrange code in a logical manner to eliminate dead code, enable inlining, and allow better register use to improve application performance. Compiler-based vectorization allows you to invoke the Streaming SIMD Extensions (SSE) capabilities of the underlying processor. The auto-parallelization feature automatically converts serial source code into its equivalent multithreaded code. OpenMP is a pragma-based approach to parallelism. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) The Intel® VTuneTM Performance Analyzer can help you identify and locate the sections of code that show the highest amount of activity during a specific period. Sampling is the process of collecting performance data by observing the processor state at regular intervals. A hotspot is a section of code that contains a significant amount of activity for some internal processor event, such as clockticks, cache misses, or disk reads. A bottleneck is an area in code that slows down the execution of an application. All bottlenecks are hotspots, but all hotspots need not necessarily be bottlenecks. Sampling does not require you to modify your code. However, you must compile or link with symbols and line numbers. Sampling is system-wide. It is not restricted to your application. In fact, you can see the activity of operating system code, including drivers. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) Sampling involves low overheads. Its validity is highest when perturbation of the code execution is low. You can further reduce overheads by decreasing the number of samples or turning off progress meters on the user interface. Time-based sampling (TBS) is triggered by the timer services of the operating system after every N processor clockticks. This type of sampling helps to reveal the routines in which the application spends the maximum time. Event-based sampling (EBS) is triggered by a processor event counter underflow. Events that can be tracked are specific to the processor. Some of the events are L2 cache misses, branch mispredictions, and retired floating-point instructions. The Sampling Over Time functionality shows how sample distributions change with time. Using Sampling Over Time, you can zoom in specific time regions to focus your efforts to the required execution periods. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) The call graph collector of the Intel® VTuneTM Performance Analyzer helps you to obtain information about the functional flow of an application. Instrumentation is the process of modifying a program such that dynamic information is recorded during program execution. Instrumentation does not change the functionality of the program. However, it increases the size of the code and the time it takes to run the application. The VTune Performance Analyzer can be used to improve the efficiency of computation and improve the threading model. The Windows command line interface (CLI) collects sampling data from the command line. CLI can help you automate data collection process. You can either view the data in the VTune Performance Analyzer or export the data as ASCII text. Summarize all the key points learned in the chapter. 2018/9/19
Chapter 3 Programming with Windows Threads 2018/9/19
! Win32 HANDLE Type Definition: A HANDLE is an opaque kernel object. A HANDLE is used to refer to various kernel objects, such as a thread, a mutex, an event, or a semaphore. All object creation functions return a HANDLE. Introduce the slide: All processes start with the execution of a single thread. This thread with which the process starts the execution is known as the main thread. There can be more than one thread in a process. Despite sharing the same address space and certain resources, each of these threads operates independently. Each thread has its own stack, which is usually managed by the operating system. The most basic thread creation mechanism provided by Microsoft is the CreateThread() function. Alternative functions to create threads are available in the Microsoft C library. Before you consider the prototype of the CreateThread() function, examine the HANDLE object type. The default size of the thread stack is 1MB. Define Windows HANDLE type. Windows objects are referenced by variables of HANDLE type, which is an opaque kernel object. A HANDLE is used to refer to various kernel objects, such as a thread, a mutex, an event, or a semaphore. All the object creation functions return a HANDLE, irrespective of whether you create a mutex, a thread, or a semaphore. You should never access or manipulate a HANDLE directly. Instead, you should always refer it through its respective API function. For example, if the HANDLE is a thread, you manipulate it through thread functions, and if the HANDLE is a semaphore, you manipulate it through semaphore functions. Clarify their queries (if any) related to this slide. ! Note You should never access or manipulate a HANDLE directly; instead, you should always refer it through its respective API function. 2018/9/19
Win32 Thread Creation The structure of the CreateThread() function is: HANDLE CreateThread( LPSECURITY_ATTRIBUTES lpThreadAttributes, DWORD dwStackSize, LPTHREAD_START_ROUTINE lpStartAddress, LPVOID lpParameter, DWORD dwCreationFlags, LPDWORD lpThreadId); // Out Present and explain the prototype of the CreateThread() function along with the parameters involved. This is the function prototype for CreateThread(), which creates a Windows thread that begins execution of the function provided as the third parameter. The CreateThread() function returns a HANDLE that is used as a reference to the newly created thread. The following parameters are used in the CreateThread() function: lpThreadAttributes: Every kernel object has security attributes. Therefore, the first parameter, ThreadAttributes, allows the programmer to specify the security attributes of the thread. This parameter is a pointer to a SECURITY_ATTRIBUTES structure, and it can be NULL. A null value sets up default security for the object. The security of the object mainly decides which processes can access and manipulate the object. dwStackSize: The StackSize parameter allows the user to specify the amount of stack space that needs to be reserved for the thread. Windows provide the threads with a default stack size by setting the parameter to zero. The default value of the stack size is 1 megabyte. However, you can create different stack sizes within threads to optimize memory usage. lpStartAddress: The third parameter, StartAddress, is a pointer to the function and the start address of a thread in the code after its launch. It points to the global function on which the thread will begin its execution. In response, the operating system returns a HANDLE to identify that thread. This implies that StartAddress specifies a function pointer to the actual code that the thread runs. lpParameter: The fourth parameter is a pointer and a single, 32-bit parameter value that is passed to the thread. The thread function can accept a single LPVOID parameter. The parameter is passed as an argument to the StartAddress. If no parameters are accepted by the thread function, this value can be NULL. If the thread function accepts more than one parameter, you can encapsulate all the parameters into a single structure. Then, you can send a pointer to the structure as the fourth parameter of the CreateThread() function call and decompose the structure into its components as the first thing done in the threaded version of the function. dwCreationFlags: The fifth parameter specifies various configuration options and allows you to control the creation of the thread, such as starting as a suspended thread. The default (using ‘0’ for this parameter) is to start the thread as soon as it has been created by the system. To create a thread that is suspended after creation, use the defined constant CREATE_SUSPENDED. lpThreadId: ThreadId is the unique number assigned to the thread at creation. This is the variable that receives the thread’s identifier. It is an out parameter, and the value assigned will be unique among all the running threads. If the CreateThread() function fails to create a thread, it will return NULL. You can find the reason for the failure by calling the GetLastError() function. It is recommended that you check error codes from all Windows API calls. After successfully creating a thread, the CreateThread() function returns the HANDLE of the new thread. After you create the thread and receive its HANDLE, you can set the thread’s priority. 
You can use the SetThreadPriority() function to set the priority value for the specified thread. This value, together with the priority class of the thread's process, determines the thread's base priority level. Alternative functions are available to create threads within the Microsoft C library. The CreateThread() function provides a flexible and easy-to-use mechanism for creating threads in Windows applications. However, the CreateThread() function does not perform per-thread initialization of C runtime data blocks and variables. Therefore, you cannot use the CreateThread() function in any application that uses the C runtime library. Instead, Microsoft provides two additional functions, _beginthread() and _beginthreadex(). Avoid using the _beginthread() function because it does not include security flags or attributes and does not return a thread ID. The arguments for _beginthreadex() are the same as for the CreateThread() function. Therefore, it is recommended that you use the _beginthreadex() function for writing applications that use the C runtime library, as sketched below. Clarify their queries (if any) related to this slide. 2018/9/19
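The following is a minimal sketch, not taken from the course labs, of creating a thread with _beginthreadex(); the thread function name helloFunc is illustrative. Because _beginthreadex() initializes the per-thread C runtime data, library calls such as printf() are safe inside the thread function.
#include <stdio.h>
#include <windows.h>
#include <process.h>    // _beginthreadex
// Thread functions passed to _beginthreadex() return unsigned and use the __stdcall convention
unsigned __stdcall helloFunc(void *arg)
{
  printf("Hello from a C runtime thread\n");
  return 0;
}
int main(void)
{
  unsigned threadId;
  // Arguments mirror CreateThread(); the uintptr_t result is cast to a HANDLE
  HANDLE hThread = (HANDLE)_beginthreadex(NULL, 0, helloFunc, NULL, 0, &threadId);
  WaitForSingleObject(hThread, INFINITE);
  CloseHandle(hThread);
  return 0;
}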
LPTHREAD_START_ROUTINE The CreateThread() function expects a pointer to some global function with the prototype LPTHREAD_START_ROUTINE. The thread function: Returns a DWORD. Uses the WINAPI calling convention. Receives a single LPVOID (void*) parameter. DWORD WINAPI MyThreadStart(LPVOID p); Present more details about the function that is threaded via CreateThread(). The CreateThread() function expects a pointer to some global function. The prototype of the global function is LPTHREAD_START_ROUTINE, which returns a DWORD and has the calling convention of WINAPI. You only obtain a single parameter, LPVOID (void *), to the function that is to be threaded. Therefore, when you start a function on a thread, you can only pass one argument to that function. If you are threading serial code and the function that you need to thread expects multiple arguments, you may require some alterations. These alterations may include setting up a structure to hold all the original function arguments and using a pointer to the structure as the single parameter to the threaded version; a sketch of this structure-based approach follows below. Clarify their queries (if any) related to this slide. 2018/9/19
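As a minimal sketch of the structure-based approach described above (the WorkArgs structure and its fields are illustrative, not part of the course code), multiple arguments can be packed into one structure and a pointer to it passed as the single LPVOID parameter:
#include <stdio.h>
#include <windows.h>
typedef struct {
  int    id;
  double scale;
} WorkArgs;
DWORD WINAPI workFunc(LPVOID pArg)
{
  WorkArgs *args = (WorkArgs*)pArg;   // decompose the structure first
  printf("Thread %d, scale %f\n", args->id, args->scale);
  return 0;
}
int main(void)
{
  WorkArgs args = { 7, 0.5 };         // must stay valid until the thread is done with it
  HANDLE hThread = CreateThread(NULL, 0, workFunc, &args, 0, NULL);
  WaitForSingleObject(hThread, INFINITE);
  CloseHandle(hThread);
  return 0;
}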
Using Explicit Threads The following modifications to the code are required to use an explicit threading model: Identify the portions of the code that can be executed in parallel. Encapsulate the code into a function. If the serial code is already a function, you may need to write a driver function to coordinate the work of multiple threads. Add a CreateThread() call to assign thread(s) to execute the function. Describe the major code modifications that are needed to use an explicit threading model. To modify the code to use an explicit threading model, you need to do the following: Identify the portions of the code that can be executed in parallel. Encapsulate that code into a function. If the serial code is already a function, you might need to write a driver function to coordinate the work of multiple threads. The driver function accepts a structure of multiple parameters and then calls the function that really does the computation. Moreover, splitting up the tasks among threads can be done in the driver function before calling the routine that does the work. Add a CreateThread() call to assign thread(s) to execute the function. This means that you need to call CreateThread() to map functions to threads. In explicit threading, there is no parent-child relationship enforced between threads. However, when a thread spawns a thread, you may construct a parent-child relationship between these two threads. In explicit threading models, such as Windows threads, all threads are potential creators. Therefore, a thread can create another thread that can turn around and terminate its creator. The only exception to this is the main thread, because the main thread contains the process information. Therefore, if the main thread terminates, the process stops and all other threads are destroyed automatically. Clarify their queries (if any) related to this slide. 2018/9/19
Cleaning up after Threads
The syntax for the CloseHandle() function used to release a thread's HANDLE is: BOOL CloseHandle (HANDLE hObject);
The syntax for the function used to terminate a thread is: VOID ExitThread( DWORD dwExitCode );
The syntax for the function used to retrieve the termination status of the specified thread is: BOOL GetExitCodeThread( HANDLE hThread, LPDWORD lpExitCode );
Present the method to clean up after threads are done. After a thread exits, it is a good idea to return the system resources used by the thread back to the control of the operating system. The CloseHandle() function releases the HANDLE so that the system can free the resources allocated to the thread or other kernel object once it is no longer in use. Unutilized thread HANDLEs take up memory space. Continuous creation of threads without cleaning up the HANDLEs of threads that have completed computation and terminated can lead to memory leaks. If you do not destroy threads, you may eventually reach the maximum number of threads that can be supported within a process, and then you will not be able to create any new threads. Even though the exit of the process will perform the resource reclamation automatically, it is a good practice to clean up the resources of threads that have finished execution. Use the ExitThread() function to terminate a thread along with a return code. Then use the GetExitCodeThread() function to obtain the return value from the thread. The parameter dwExitCode specifies the exit code for the calling thread. You can use the GetExitCodeThread() function to retrieve a thread's exit code. If the primary thread calls the ExitThread() function, the application exits. The state of the thread object becomes signaled, releasing any other threads that had been waiting for the thread to terminate. The thread's termination status changes from STILL_ACTIVE to the value of the dwExitCode parameter. If the GetExitCodeThread() function returns a nonzero value, it indicates success; if the value returned is zero, it indicates failure. To obtain extended error information, you can call the GetLastError() function. The parameters involved in the GetExitCodeThread() function are: hThread: It represents the HANDLE to the thread. lpExitCode: It is a pointer to the 32-bit variable to receive the thread termination status. If the specified thread has not terminated, the termination status returned is STILL_ACTIVE. The following list shows termination statuses that can be returned if the thread has terminated: The exit value specified in the ExitThread() or TerminateThread() function. The return value from the thread function. The exit value of the thread's process. Clarify their queries (if any) related to this slide. 2018/9/19
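A minimal sketch, assuming an illustrative thread function named computeFunc(), of waiting for a thread, reading its return value with GetExitCodeThread(), and releasing the HANDLE with CloseHandle():
#include <stdio.h>
#include <windows.h>
DWORD WINAPI computeFunc(LPVOID arg)
{
  return 42;                                // the return value becomes the thread's exit code
}
int main(void)
{
  DWORD exitCode = 0;
  HANDLE hThread = CreateThread(NULL, 0, computeFunc, NULL, 0, NULL);
  WaitForSingleObject(hThread, INFINITE);   // wait until the thread terminates
  if (GetExitCodeThread(hThread, &exitCode))
    printf("Thread exit code: %lu\n", exitCode);
  CloseHandle(hThread);                     // return the kernel object's resources to the OS
  return 0;
}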
Example: Thread Creation
Consider the following example that uses Windows threads:
#include <stdio.h>
#include <windows.h>
DWORD WINAPI helloFunc(LPVOID arg)
{
  printf("Hello Thread\n");
  return 0;
}
main()
{
  HANDLE hThread = CreateThread(NULL, 0, helloFunc, NULL, 0, NULL);
}
Explain the code. The independent task in this example is to print the phrase Hello Thread encapsulated into the helloFunc() function. The DWORD WINAPI declaration specifies that this function can be mapped to a thread when you call the CreateThread() function. When the program starts execution, the main (process) thread begins execution of the main() function. The main thread creates a child thread that will execute the helloFunc() function. After creation of the helloFunc thread, the main thread encounters the program termination, which will destroy all the child threads. If the created thread does not have enough time to print Hello Thread before the main thread exits, nothing will be printed from execution of this application. The main thread holds the process information. When the process terminates, all the threads terminate. Therefore, to ensure that the desired printing is accomplished, the main thread must wait for the created thread to finish before it exits. Therefore, to avoid writing applications that spawn threads and end before any useful work begins, you need a mechanism to wait for threads to finish their processing. Question for Discussion: What are the possible outcomes for the above code? Answer: Two possible outcomes: The message Hello Thread is printed on screen. Nothing is printed on screen. This outcome is more likely than the previous one. The main thread is the process, and when the process ends, all threads are cancelled, too. Thus, if the process ends before the operating system has had the time to set up the created thread and begin its execution, the thread will die a premature death when the process ends. Clarify their queries (if any) related to this slide. What Happens? 2018/9/19
Waiting for Windows* Threads
Consider the following example to print "Hello Thread" using Windows threads:
#include <stdio.h>
#include <windows.h>
BOOL threadDone = FALSE;
DWORD WINAPI helloFunc(LPVOID arg)
{
  printf("Hello Thread\n");
  threadDone = TRUE;
  return 0;
}
main()
{
  HANDLE hThread = CreateThread(NULL, 0, helloFunc, NULL, 0, NULL);
  while (!threadDone); // wasted cycles!
}
Discuss the execution of the code. Waiting for a thread by using a while statement is not a good idea because it uses a lot of processor resources. Consider the method of making the master thread wait and prevent it from exiting the process before the spawned thread executes. While the message "Hello Thread" will eventually be printed, the main thread is in a spin-wait until the value of threadDone is changed by the created thread. At this point, the thread has completed all the required computation, and you have not lost any work if the ending process kills the thread rather than the thread ending naturally through its return. However, if running on a single-processor system or with Hyper-Threading Technology, the main thread may be spinning its wheels, soaking up thousands or millions of CPU cycles before the operating system swaps it out and allows the created thread to execute. Therefore, this is not a good idea. Waiting for a thread this way is not recommended because it can waste processor resources. When waiting for another thread to terminate, consider the following two cases: Wait until the thread terminates. Wait a predefined amount of time for thread termination. The time for which the waiting thread must wait for a thread to terminate depends on the purpose of waiting and on whether there is other computation that can be done instead of being blocked waiting. The next section discusses the API functions that Windows provides to wait on thread termination and other kernel objects. Questions for Discussion: What are the potential race conditions in this code? Answer: The race condition in this case is that the main thread can easily terminate before the created thread has had a chance to do much work. In addition, another potential race condition is that I/O in general is not thread safe. How can you modify the main function as a method of waiting for the thread to finish? Answer: In the above example, you can try including getch() in the main function as a method of waiting for the thread to finish. In other words, to avoid writing applications that spawn threads and then end before any useful work has the chance to begin, you need some mechanism to wait for threads to finish their processing. Clarify their queries (if any) related to this slide. Not a good idea! 2018/9/19
Waiting for a Thread
The function prototype for the WaitForSingleObject() function is: DWORD WaitForSingleObject( HANDLE hHandle, DWORD dwMilliseconds);
You can use the WaitForSingleObject() function to wait on a single thread to terminate. WaitForSingleObject() will return one of the following values: WAIT_OBJECT_0, WAIT_TIMEOUT, WAIT_ABANDONED, WAIT_FAILED.
Present and describe the WaitForSingleObject() routine. You can use the WaitForSingleObject() function to wait on a single thread to terminate. This function requires a HANDLE as a parameter, which can be for any kernel object, such as a thread, an event, a mutex, or a semaphore. When the input HANDLE is a thread, the thread calling the WaitForSingleObject() function will block waiting for the thread, whose HANDLE has been sent as the parameter, to terminate. HANDLEs have two states, signaled and non-signaled. When a thread exits, its HANDLE becomes signaled; otherwise it is non-signaled. The WaitForSingleObject() function will block the calling thread until the HANDLE is signaled; for threads, this is when the thread has terminated execution. The second parameter is the time limit to wait for the completion of the call. If this wait time expires, the function will return regardless of whether the HANDLE is signaled or not. In this instance, the return code indicates that the expired time caused the return. Microsoft defines a special constant, INFINITE, to indicate that the calling thread wants to wait indefinitely for the HANDLE to be signaled. In this way, you can notify the operating system that the calling thread has no other work to do until this particular event occurs, and therefore, it can be moved off the run queue to the wait queue. The operating system can then switch to a thread that is in the ready-to-run state. If you use a non-INFINITE value for this parameter, check the return code to determine why the function returned. The WaitForSingleObject() function will return one of the following values: WAIT_OBJECT_0: The WaitForSingleObject() function returns this value when the object that is being waited on enters the signaled state. WAIT_TIMEOUT: The WaitForSingleObject() function returns this value when the specified timeout value occurs prior to the object entering the signaled state. WAIT_ABANDONED: If the HANDLE refers to a Mutex object, this return code indicates that the thread that owned the mutex did not release the mutex prior to termination. WAIT_FAILED: This value indicates that an error has occurred. To get additional error information regarding the cause of the failure, use the GetLastError() function. Clarify their queries (if any) related to this slide. 2018/9/19
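A minimal sketch (the thread function slowFunc() and the timing values are illustrative) of calling WaitForSingleObject() with a finite timeout and checking its return value:
#include <stdio.h>
#include <windows.h>
DWORD WINAPI slowFunc(LPVOID arg)
{
  Sleep(5000);                                     // simulate five seconds of work
  return 0;
}
int main(void)
{
  HANDLE hThread = CreateThread(NULL, 0, slowFunc, NULL, 0, NULL);
  DWORD rc = WaitForSingleObject(hThread, 1000);   // wait at most one second
  if (rc == WAIT_OBJECT_0)
    printf("Thread finished\n");
  else if (rc == WAIT_TIMEOUT)
    printf("Thread still running; do other work and wait again\n");
  else
    printf("Wait failed, error %lu\n", GetLastError());
  WaitForSingleObject(hThread, INFINITE);          // now wait for completion
  CloseHandle(hThread);
  return 0;
}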
Waiting for Many Threads The function prototype for the WaitForMultipleObjects() function’s syntax is: The WaitForMultipleObjects() function has the following parameters: nCount lpHandles fWaitAll dwMilliseconds DWORD WaitForMultipleObjects ( DWORD nCount, CONST HANDLE *lpHandles, // array BOOL fWaitAll, // wait for one or all DWORD dwMilliseconds); Present and describe the WaitForMultipleObjects() routine. The WaitForMultipleObjects() function is used to wait for multiple signaled objects. It waits until one or all the specified objects are in the signaled state or the time-out interval elapses. The WaitForMultipleObjects() function has the following parameters: nCount: It specifies the number of HANDLEs that must be waited upon from the array of HANDLEs. The value for nCount cannot exceed the maximum number of object HANDLEs specified by the MAXIMUM_WAIT_OBJECTS constant. The nCount elements from the array are sequential starting from the address lpHandles to lpHandles[nCount-1]. lpHandles: This parameter specifies an array of object HANDLEs to wait on. fWaitAll: It determines whether the wait will be for all nCount objects or for any one of the objects. If fWaitAll is set to TRUE, the WaitForMultipleObjects() function only returns when all HANDLEs (threads) have been signaled. If fWaitAll is FALSE, the WaitForMultipleObjects() function returns when any one or more of the threads is signaled. When waiting for one of many threads, the return code is used to determine which HANDLE was signaled. The return value specifies the first completed thread in the list. dwMilliseconds: The fourth parameter or the time out value is the same as that for the WaitForSingleObject() function. If fWaitAll is set to FALSE, and if the number of objects in the signaled state is greater than one, the array index of the first signaled or abandoned value in the array—starting at array index zero—is returned. WaitForMultipleObjects() can wait for a maximum of 64 objects, for example, threads. This is the value defined by the operating system (MAXIMUM_WAIT_OBJECTS). Clarify their queries (if any) related to this slide. 2018/9/19
Using Wait Functions The Wait functions will always block the calling thread until the object(s) become signaled or the time limit expires. Using different objects within the Wait functions can mean that separate calls have a different purpose that is dependent on each different object. The Wait function’s behavior is defined by the object referred to by the HANDLE: For a Thread: Signaled means terminated. For a Mutex: Signaled means available. For a Semaphore: Signaled state means that the count of the semaphore is greater than zero. For an Event: Signaled state means that the event is in a signaled state. Some thread has called the SetEvent() function or PulseEvent() function. Discuss the Wait functions. The Wait functions will always block the calling thread until the object(s) become signaled or the time limit expires. Thus, using different objects within the Wait functions can mean that separate calls have a different purpose that is dependent on each different object. It is expensive to deal with kernel objects. Manipulating kernel objects is more expensive than manipulating other objects. The advantage of kernel objects is that you can carry out inter-process coordination. There may be situations when one process may need to hold a mutex of a different process, which you can do through the kernel object. If you only need intra-process coordination, you can replace the mutex with a Critical Section, which is cheaper. The WaitForSingleObject() function enters kernel irrespective of whether the lock is achieved. However, the EnterCriticalSection() function enters kernel only when lock is not achieved. Therefore, it is recommended to use EnterCriticalSection() function for applications where there is not much contention. The Wait function’s behavior is defined by the object referred to by the HANDLE: For a Thread: Signaled means the thread is terminated. For a Mutex: Signaled means the mutex is available. For a Semaphore: Signaled state means that the count of the semaphore is greater than zero. Thus, a WaitFor call will return after having decremented the count by 1. For an Event: Signaled state means that the event is in a signaled state. Some thread has called the SetEvent() function or PulseEvent() function. The use of an event is to signal other threads that some event or condition of the computation has been achieved. Clarify their queries (if any) related to this slide. 2018/9/19
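A minimal sketch, not from the course labs, of inter-thread signaling with an event object; the names g_hEvent and workerFunc are illustrative. The worker blocks in WaitForSingleObject() until the main thread calls SetEvent():
#include <stdio.h>
#include <windows.h>
HANDLE g_hEvent;                               // kernel event object shared by both threads
DWORD WINAPI workerFunc(LPVOID arg)
{
  WaitForSingleObject(g_hEvent, INFINITE);     // blocks until the event is signaled
  printf("Worker released by the event\n");
  return 0;
}
int main(void)
{
  g_hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset event, initially non-signaled
  HANDLE hThread = CreateThread(NULL, 0, workerFunc, NULL, 0, NULL);
  Sleep(1000);                                 // simulate some setup work
  SetEvent(g_hEvent);                          // signal the event; the waiting thread is released
  WaitForSingleObject(hThread, INFINITE);
  CloseHandle(hThread);
  CloseHandle(g_hEvent);
  return 0;
}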
Example: Multiple Threads
Consider the following example that uses the WaitForMultipleObjects() function:
#include <stdio.h>
#include <windows.h>
const int numThreads = 4;
DWORD WINAPI helloFunc(LPVOID arg)
{
  printf("Hello Thread\n");
  return 0;
}
main()
{
  HANDLE hThread[numThreads];
  for (int i = 0; i < numThreads; i++)
    hThread[i] = CreateThread(NULL, 0, helloFunc, NULL, 0, NULL);
  WaitForMultipleObjects(numThreads, hThread, TRUE, INFINITE);
}
Present an example of usage of the WaitForMultipleObjects() function. This is a better example of waiting for threads. In this case, multiple threads are performing the same function. When a thread is running, its HANDLE is non-signaled. When a thread exits, its thread HANDLE is signaled. The above code creates multiple threads, stores the HANDLEs in the hThread array, and then waits for all threads to terminate by passing the HANDLE array to the WaitForMultipleObjects() function. This effectively prevents the main thread from terminating until all other work has finished executing. Clarify their queries (if any) related to this slide. 2018/9/19
Activity 1: “HelloThreads” Objective: Find and resolve a common data race problem involving parameter passing. Modify the previous example code to print out: Appropriate “HelloThreads” message. Unique thread number. Use for loop variable of CreateThread() loop. Sample Output: Introduction to the First Lab Activity. Explain the participants the objective for the activity. Provide them useful hints and tips. Provide a sample output. Clarify their queries (if any) related to the objectives. Hello from Thread #0 Hello from Thread #1 Hello from Thread #2 Hello from Thread #3 2018/9/19
What is printed for myNum? What’s Wrong? An example solution that may have been attempted by some of the students: DWORD WINAPI threadFunc(LPVOID pArg) { int* p = (int*)pArg; int myNum = *p; printf( “Thread number %d\n”, myNum); } . . . // from main(): for (int i = 0; i < numThreads; i++) hThread[i] = CreateThread(NULL, 0, threadFunc, &i, 0, NULL); Present an example solution that may have been attempted by some of the students to explain what may have gone wrong. Explanation: Problem is passing *address* of “i”; value of “i” is changing and will likely be different when thread is allowed to run than when CreateThread() was called. Involve the participants by asking them how this plays out when threads are running in parallel. Clarify their queries (if any) related to this slide. What is printed for myNum? 2018/9/19
Hello Threads Timeline
Time chart that explains why the wrong value of myNum gets printed in the preceding code example (columns: Time, Main thread, Thread 0, Thread 1):
T0: main: i = 0 | Thread 0: --- | Thread 1: ---
T1: main: create(&i) | Thread 0: --- | Thread 1: ---
T2: main: i++ (i == 1) | Thread 0: launch | Thread 1: ---
T3: main: create(&i) | Thread 0: p = pArg | Thread 1: ---
T4: main: i++ (i == 2) | Thread 0: myNum = *p (myNum = 2) | Thread 1: launch
T5: main: wait | Thread 0: print(2) | Thread 1: p = pArg
T6: main: wait | Thread 0: exit | Thread 1: myNum = *p (myNum = 2)
Timing chart to explain why the wrong value of myNum is being printed in the preceding code example. Run through the time steps of the table. Again, this is just an example to emphasize timing issues. Notice that both threads (0, 1) get the same value for myNum because they both de-reference the same address for myNum. It is also possible in this scenario that thread 0 gets a value of myNum = 1 while thread 1 gets a value of myNum = 2. But this is still incorrect; correct would be for the threads to get values of 0 and 1, respectively. Mention that this is called a "data race": more than one thread accesses the same variable without synchronization. Clarify their queries (if any) related to this slide. 2018/9/19
Threading Problems in Multithreaded Applications Deadlocks Livelocks Granularity Load Imbalance Data Races Discuss the threading problems. After implementing multithreading in an application, you may experience problems due to nondeterministic scheduling and the interactions of threads when accessing shared resources. Therefore, not all code should be threaded and, when done, threading should be done carefully. You can use parallel execution to improve the performance of a concurrent application. However, you may encounter threading problems such as: Data Races Deadlocks Livelock Granularity Load Imbalance Clarify their queries (if any) related to this slide. Threading Problems in Multithreaded Applications 2018/9/19
Race Conditions Definition: Race conditions occur as a result of data dependencies in which multiple threads attempt to update the same memory location, or variable, after the code is threaded. They may not be apparent at all times. You can use Intel® Thread Checker to detect data races at the time of execution. The two possible conflicts that can arise as a result of data races are: Read/Write Conflict and Write/Write Conflict. Introduce/Review the concept of a data race. Race conditions are the most common errors in concurrent programs. They occur as a result of data dependencies in which multiple threads attempt to update the same memory location, or variable, after the code is threaded. It can be difficult to identify data race conditions. In writing multithreaded programs, understanding which data is shared and which is private becomes important for performance and program accuracy. The operating system can schedule a thread and its order of execution in a nondeterministic or seemingly random fashion. If you do not coordinate and synchronize the interactions of threads, the operating system may eventually use an execution order that will yield incorrect results. A data race occurs when two or more threads access the same memory space and one of them writes the data while another reads the same data concurrently, or two or more threads write the data at the same time. The two possible conflicts that can arise as a result are: Read/Write Conflicts and Write/Write Conflicts. Data race conditions may not be apparent at every stage of program execution. Therefore, identifying data races can be the biggest challenge while working with multithreaded applications. To detect data races at the time of execution, Intel provides a powerful threading tool, Intel® Thread Checker. Where the program results depend on the relative interactions of two or more threads, unsynchronized access to shared memory can introduce race conditions. Consider a situation where threads may be reading a location that is updated concurrently with a value from another thread. In such a scenario, you need to ensure that writes and reads are atomic. Otherwise, incorrect data may be read. Clarify their queries (if any) related to this slide. 2018/9/19
How to Avoid Data Races? The two ways by which you can prevent data races in multithreaded applications are: Scope variables to be local to each thread. Examples are: Variables declared within threaded functions, Variables allocated on the thread's stack, and Thread Local Storage (TLS). Control concurrent access by using critical regions. Examples of synchronization objects that can be used are: Mutex, Semaphore, Event, Critical section. Present the two methods that can be used to avoid data races. The two most common ways by which you can avoid data races are: You can scope variables to be local to each thread. Therefore, if a thread has its own variables in its scope, it can modify those variables without considering that other threads are also modifying their own copy of the variable at the same time. The common examples of local variables are: Variables declared within threaded functions, Variables allocated on the thread's stack, and Thread Local Storage (TLS); a minimal TLS sketch follows below. You can also use mutually exclusive access to critical regions to avoid data races. A critical region is a block of code in which multiple threads concurrently access and update the same variable. You can enclose such potentially sensitive regions of code in a critical region. Mutual exclusion enforces single-thread-at-a-time access to a critical region. Synchronization objects are used to encode mutually exclusive access to critical regions. Examples of synchronization objects that can be used are: Mutex, Semaphore, Event, Critical Section. Clarify their queries (if any) related to this slide. 2018/9/19
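A minimal sketch of the Thread Local Storage (TLS) option mentioned above; the variable names are illustrative and this is only one way to use the TLS API. Each thread stores and reads its own value under the same TLS index, so no two threads share the storage:
#include <stdio.h>
#include <windows.h>
#define NUMTHREADS 4
DWORD g_tlsIndex;                              // one index shared by all threads; each slot value is per-thread
DWORD WINAPI threadFunc(LPVOID pArg)
{
  TlsSetValue(g_tlsIndex, pArg);               // store this thread's private pointer
  int myNum = *(int*)TlsGetValue(g_tlsIndex);  // later reads see only this thread's value
  printf("Thread number %d\n", myNum);
  return 0;
}
int main(void)
{
  HANDLE hThread[NUMTHREADS];
  int tNum[NUMTHREADS];
  g_tlsIndex = TlsAlloc();
  for (int i = 0; i < NUMTHREADS; i++) {
    tNum[i] = i;                               // a unique location per thread avoids the race on i
    hThread[i] = CreateThread(NULL, 0, threadFunc, &tNum[i], 0, NULL);
  }
  WaitForMultipleObjects(NUMTHREADS, hThread, TRUE, INFINITE);
  for (int i = 0; i < NUMTHREADS; i++)
    CloseHandle(hThread[i]);
  TlsFree(g_tlsIndex);
  return 0;
}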
Solution: Local Storage
A solution to the data races within the preceding example code, storing the variables locally, is:
DWORD WINAPI threadFunc(LPVOID pArg)
{
  int myNum = *((int*)pArg);
  printf("Thread number %d\n", myNum);
  return 0;
}
. . .
// from main():
for (int i = 0; i < numThreads; i++) {
  tNum[i] = i;
  hThread[i] = CreateThread(NULL, 0, threadFunc, &tNum[i], 0, NULL);
}
Offer a solution to the data races within the preceding example code. Generally, encapsulating the work to be done by threads and then creating the threads to execute it can be relatively simple. The challenge in writing multithreaded applications lies in ensuring that the executing threads interact in an orderly and guaranteed manner and avoid potential problems, such as deadlocks and data corruption caused by data races. You can prevent data races by scoping variables to be local to each thread. This implies that you can store those variables locally when the values stored there are not needed globally by all threads. You can declare the variables locally or use some API calls to protect them. Local storage maintains a local copy of the variable in each thread. You can solve the problem of passing the address of i by saving the current value of i in a location that will not change. Ensure that each thread gets a pointer to a unique element of the tNum array. This is not complete local storage, as it simply ensures that each thread accesses a separate memory location by using a different element of the array as a parameter to the function. myNum is local storage of which each thread has a separate copy. Clarify their queries (if any) related to this slide. 2018/9/19
Solution: Using Synchronization Objects The objects that are used to synchronize concurrent access requests to shared resources include: Mutexes Critical Sections Semaphores Events Discuss additional ways to prevent data races. Microsoft defines several different types of synchronization objects as part of the Windows API. Synchronization objects are additional ways to prevent data races. The objects that are used to synchronize concurrent access requests to shared resources include: Mutexes: A mutex is a synchronization object that can be held by a single thread. You can achieve mutual exclusion to critical regions by allowing only the thread holding the mutex to execute within the region. Critical Sections: The Windows Critical Section object is an intra-process mutex. Events: Multiple threads within an application need a mechanism that can be used for inter-thread signaling. Events are the synchronization objects provided by Microsoft that may be used for this purpose. Semaphores: A semaphore is a synchronization object that has an associated count. This count can be used to limit access of a particular critical region to a certain number of threads. Clarify their queries (if any) related to this slide. 2018/9/19
Windows Mutexes Windows Mutexes are: Kernel objects. Synchronization objects that can be held by a single thread. Shared between processes along with threads. Operations performed on Windows Mutexes are: CreateMutex() // To create a new mutex WaitForSingleObject() // To wait and lock a mutex ReleaseMutex() // To unlock a mutex Provide cursory details about the Windows Mutex. Mutex is an abbreviation used for mutual exclusion. It is a synchronization object that can be held by a single thread. You can achieve mutual exclusion to critical regions by allowing only the thread holding the mutex to execute within the region. To use a mutex, you must first make a call to the CreateMutex() function, which returns a HANDLE of mutex. The HANDLE of the mutex is used in the WaitForSingleObject() function to determine whether or not a thread may access a critical region. This means you use WaitForSingleObject() to lock the mutex. If the mutex is not locked, the mutex HANDLE will be in signaled state. When the WaitForSingleObject() function returns, the mutex HANDLE is in non-signaled state and the thread is considered to hold the mutex. You can unlock the mutex with the ReleaseMutex() function. When a thread is about to leave a critical region, it makes a call to the ReleaseMutex() function, which indicates that the thread is exiting the Critical Section. Mutexes are kernel objects. If required, a mutex can be established and shared between multiple processes. You can create it and share it between different processes along with threads. Creating a Mutex object inside the kernel involves large overheads. The mutex, being a kernel object, requires calls into kernel space. Therefore, the mutex has a heavy cost associated with its use. However, a lighter alternative with essentially the same functionality is also available. Clarify their queries (if any) related to this slide. 2018/9/19
Mutex Creation The function prototype for creating a mutex is: The CreateMutex() function has the following parameters: lpMutexAttributes bInitialOwner lpName HANDLE WINAPI CreateMutex( LPSECURITY_ATTRIBUTES lpMutexAttributes, BOOL bInitialOwner, LPCSTR lpName); // text name for object Discuss the parameters involved in the function for creating a mutex. lpMutexAttributes: This parameter is a pointer to a SECURITY_ATTRIBUTES structure. If the value for this parameter is NULL, the mutex gets a default security descriptor. bInitialOwner: If the value for this parameter is TRUE, the calling thread holds the mutex. However, if another process calls the CreateMutex() function in order to create a shared mutex or open the mutex after the mutex has been created, this parameter is ignored and the calling thread will not hold the mutex. lpName: The third parameter specifies the name of the Mutex object. If lpName is NULL, the Mutex object is created without a name. Clarify their queries (if any) related to this slide. 2018/9/19
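A minimal sketch (the variable names and the trivial work in threadFunc() are illustrative) of the mutex life cycle: CreateMutex() to create it, WaitForSingleObject() to lock it, ReleaseMutex() to unlock it, and CloseHandle() to destroy it:
#include <stdio.h>
#include <windows.h>
#define NUMTHREADS 4
HANDLE g_hMutex;
int    g_sum = 0;
DWORD WINAPI threadFunc(LPVOID arg)
{
  WaitForSingleObject(g_hMutex, INFINITE);     // lock: blocks until the mutex is signaled (available)
  g_sum += 1;                                  // critical region: one thread at a time
  ReleaseMutex(g_hMutex);                      // unlock
  return 0;
}
int main(void)
{
  HANDLE hThread[NUMTHREADS];
  g_hMutex = CreateMutex(NULL, FALSE, NULL);   // FALSE: the calling thread does not own it initially
  for (int i = 0; i < NUMTHREADS; i++)
    hThread[i] = CreateThread(NULL, 0, threadFunc, NULL, 0, NULL);
  WaitForMultipleObjects(NUMTHREADS, hThread, TRUE, INFINITE);
  printf("g_sum = %d\n", g_sum);
  for (int i = 0; i < NUMTHREADS; i++)
    CloseHandle(hThread[i]);
  CloseHandle(g_hMutex);
  return 0;
}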
Windows Critical Section Critical Sections are: Declared with the CRITICAL_SECTION data type. Not kernel objects. Lightweight, user-space, intra-process Mutex objects. Operations performed on Windows Critical Sections are: InitializeCriticalSection(&cs) // To create a Critical Section DeleteCriticalSection(&cs) // To destroy a Critical Section void WINAPI InitializeCriticalSection( LPCRITICAL_SECTION lpCriticalSection ); Discuss the initial description of the Windows Critical Section object and provide more details on the use of the Windows Critical Section type. Critical Sections are very useful and provide a better mutex mechanism. CRITICAL_SECTION is a data type used to declare Critical Section objects. Critical Sections must be initialized before they are used the first time. Critical Sections are not kernel objects. They are lightweight, user-space, intra-process Mutex objects. They can also be referred to as synchronization blocks. A Critical Section is better than a mutex because it is not a kernel object, so there is less overhead when calling functions related to a Critical Section than there is with calls to manipulate a mutex. Critical Sections cannot be shared between processes. Therefore, if you do not need to share resources between processes, there is no reason to pay the higher cost of a mutex. When a thread terminates holding a mutex, that mutex is known as a dangling mutex. A dangling mutex can be recovered by the next thread that requests it. However, you cannot recover a dangling Critical Section. If a thread terminates holding a Critical Section, that Critical Section is lost. The size of critical regions is important. If possible, larger critical regions should be split into multiple code blocks. This is particularly important for code that is likely to experience significant thread contention on the synchronization objects protecting critical regions. Each critical region has an entry and an exit point. As a programmer, you can use the CRITICAL_SECTION data structure in situations where you want to synchronize access by a group of threads in a single process. The semantics of using CRITICAL_SECTION objects are different from those of Mutex and Semaphore objects. Critical Sections run in user space and do not incur the performance penalty of transferring control to the kernel to acquire a lock. The CRITICAL_SECTION API defines the following functions that operate on CRITICAL_SECTION objects: InitializeCriticalSection(&cs): This function is used to create a Critical Section object. The parameter lpCriticalSection is a pointer to the Critical Section object. The InitializeCriticalSection() function does not return a value. The threads of a single process can use a Critical Section object for mutual-exclusion synchronization. The order in which threads will access the Critical Section is not guaranteed. DeleteCriticalSection(&cs): This function is used to destroy a Critical Section. It releases all resources used by an unowned Critical Section object. The parameter lpCriticalSection is a pointer to the Critical Section object. The object must have been previously initialized with the InitializeCriticalSection() function. The DeleteCriticalSection() function does not return a value. Deleting a Critical Section object releases all system resources used by the object. If a Critical Section is deleted while it is still owned, the state of the threads waiting for ownership of the deleted Critical Section is undefined. Clarify their queries (if any) related to this slide. 
void WINAPI DeleteCriticalSection( LPCRITICAL_SECTION lpCriticalSection ); 2018/9/19
Windows Critical Section (Continued) EnterCriticalSection(&cs) // To enter the protected code LeaveCriticalSection(&cs) // To exit the Critical Section EnterCriticalSection(&cs): Blocks a thread if another thread is in the critical region. Returns when no thread is in the critical region. void WINAPI EnterCriticalSection( LPCRITICAL_SECTION lpCriticalSection ); void WINAPI LeaveCriticalSection( LPCRITICAL_SECTION lpCriticalSection ); Continue the discussion of the Windows Critical Section object and provide more details on use of the Windows Critical Section type. The CRITICAL_SECTION API defines the following functions that operate on CRITICAL_SECTION objects: EnterCriticalSection(&cs): This function is used to enter the protected code in the critical region. The parameter lpCriticalSection is a pointer to the Critical Section object. The EnterCriticalSection() function does not return a value. LeaveCriticalSection(&cs): This function is used to exit the critical region. The parameter lpCriticalSection is a pointer to the Critical Section object. The LeaveCriticalSection() function does not return a value. The EnterCriticalSection() function blocks the thread from entering the critical region if any other thread is accessing the resources in the critical region. It returns when there is no thread in the critical region. Therefore, you can protect access to shared, modifiable data by bracketing the code with the EnterCriticalSection() and LeaveCriticalSection() functions. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Critical Section Consider the following example code demonstrating the use of Critical Section: #define NUMTHREADS 4 CRITICAL_SECTION g_cs; // why does this have to be global? int g_sum = 0; DWORD WINAPI threadFunc(LPVOID arg ) { int mySum = bigComputation(); EnterCriticalSection(&g_cs); g_sum += mySum; // threads access one at a time LeaveCriticalSection(&g_cs); return 0; } main() { HANDLE hThread[NUMTHREADS]; InitializeCriticalSection(&g_cs); for (int i = 0; i < NUMTHREADS; i++) hThread[i] = CreateThread(NULL,0,threadFunc,NULL,0,NULL); WaitForMultipleObjects(NUMTHREADS, hThread, TRUE, INFINITE); DeleteCriticalSection(&g_cs); } Present example code demonstrating use of Critical Section. Point out features and function calls of example code. In the above code, the variable g_sum is updated within the critical region. Threads can access it only one at a time. Question for Discussion: Why is the call to bigComputation() not placed inside the critical section? Answer: bigComputation() is not called inside the critical section because each thread would then exclude all the other threads from running their own, independent calls to bigComputation(). This would make the code effectively serial. Clarify their queries (if any) related to this slide. 2018/9/19
Numerical Integration Example Information and code to compute Pi by numerical integration: static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++) { x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf("Pi = %f\n",pi); } [Figure: plot of f(x) = 4.0/(1+x^2); the integral of 4.0/(1+x^2) dx from 0 to 1 equals pi.] Present information and code to compute pi by numerical integration. Point out all the salient features of the source code. Questions for Discussion: Where is the bulk of the work found in the code? Where should we focus attention if we were to thread this code? Answers: The for-loop. Now ask the participants to thread this serial code. Clarify their queries (if any) related to this slide. 2018/9/19
Activity 2: Computing Pi Objectives: Thread the numerical integration code by using Windows Threads. Find any data races and resolve threading errors by using Critical Sections. Questions for Discussion: How can the loop iterations be divided among the threads? What variables can be local and what variables need to be visible to all threads? Introduction of the Second Lab Activity. Explain the objectives for the activity. This is a serial version of the source code. It uses a "sum" variable that could give a clue to an efficient solution, that is, a local partial sum variable that is updated each loop iteration. This code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals with such a short module. Getting the answer correct with multiple threads is enough of a goal for this. Clarify their queries (if any) related to the objectives. 2018/9/19
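As a reference point for the activity, the following is a minimal sketch of one possible solution, assuming four threads and interleaved iterations; the real lab solution may be organized differently. Each thread keeps a local partial sum, and the Critical Section protects only the final update of the shared total:
#include <windows.h>
#include <stdio.h>

#define NUMTHREADS 4

static long num_steps = 100000;
static double step;
static double g_sum = 0.0;       // shared accumulator
CRITICAL_SECTION g_cs;           // protects g_sum

DWORD WINAPI piFunc(LPVOID arg)
{
    int id = (int)(INT_PTR)arg;  // thread index 0..NUMTHREADS-1
    double x, localSum = 0.0;
    for (long i = id; i < num_steps; i += NUMTHREADS) {  // interleaved iterations
        x = (i + 0.5) * step;
        localSum += 4.0 / (1.0 + x * x);
    }
    EnterCriticalSection(&g_cs);
    g_sum += localSum;           // one thread at a time updates the shared sum
    LeaveCriticalSection(&g_cs);
    return 0;
}

int main(void)
{
    HANDLE hThread[NUMTHREADS];
    step = 1.0 / (double)num_steps;
    InitializeCriticalSection(&g_cs);
    for (int i = 0; i < NUMTHREADS; i++)
        hThread[i] = CreateThread(NULL, 0, piFunc, (LPVOID)(INT_PTR)i, 0, NULL);
    WaitForMultipleObjects(NUMTHREADS, hThread, TRUE, INFINITE);
    DeleteCriticalSection(&g_cs);
    printf("Pi = %f\n", step * g_sum);
    return 0;
}
Keeping localSum private to each thread answers both discussion questions: only the single g_sum update needs to be visible to all threads, so contention on the Critical Section is negligible.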
Windows Events Windows events are used to signal other threads about events such as: Computations have completed. Data is available. Message is ready. Threads wait for signal by using the following Wait functions: WaitForSingleObject(): To wait for a single event WaitForMultipleObjects(): To wait for many or one of many events The two types of Windows events are: Auto-reset events Manual-reset events Present introduction to Windows event objects. An Event object can be used to notify one or more threads about an occurrence of some kind. This means that these Windows objects can be used to signal other threads about events such as computations have completed or data is available or a message is ready to be received. Events are kernel objects and are represented by a HANDLE. This HANDLE is not valid until you call the CreateEvent() function. An event can be in one of the two possible states: signaled and nonsignaled. When the state of the event shifts from nonsignaled to signaled, one or more threads waiting for the event are released to continue execution. Threads can wait for a single event by using the WaitForSingleObject() function, and wait for many events or one of many events by using the WaitForMultipleObjects() function. Clarify their queries (if any) related to this slide. 2018/9/19
Types of Windows Events Consider the following two types of Windows events: Auto-reset Events Manual-reset Events Event stays signaled until one waiting thread is released. Event stays signaled until explicitly reset to nonsignaled by an API call. If no thread is waiting, state remains signaled. If all waiting threads are released, state remains signaled. After the thread is released, state is reset to nonsignaled. Threads not originally waiting may start wait and be released. Present details on the two types of Events and how they differ. The two types of Windows events are: Auto-reset Events: An auto-reset event stays signaled until one waiting thread is released. The signal state is persistent, which means that the event will remain in the signaled state until one thread that has waited on the event has been released. After the thread is released, the state of the event is reset to nonsignaled. Manual reset Events: A manual reset event stays signaled until it is explicitly reset to nonsignaled by an API call. When a manual reset event is signaled, all waiting threads are released. Since the event must be manually reset to the nonsignaled state, any thread, other than those waiting for the event when the event was signaled, could enter a wait on the event, and will be released before the event is reset. Clarify their queries (if any) related to this slide. Caution: Be careful when using WaitForMultipleObjects() to wait for ALL events. 2018/9/19
Windows Event Creation The function prototype for the CreateEvent() function’s syntax is: If bManualReset is set to: TRUE: Manual-reset event FALSE: Auto-reset event If bInitialState is set to: TRUE: Event to begin in signaled state FALSE: Event to begin unsignaled HANDLE CreateEvent( LPSECURITY_ATTRIBUTES lpEventAttributes, BOOL bManualReset, // TRUE => manual reset BOOL bInitialState, // TRUE => begin signaled LPCSTR lpName); // Text name for object Present details along with parameter involved in the CreateEvent() function to create and initialize an event object. lpEventAttributes: This parameter is a pointer to a SECURITY_ATTRIBUTES structure. If the value for this parameter is NULL, the event gets a default security descriptor. bManualReset: The second parameter is used to specify the type of event: Manual reset or Auto-reset. If bManualReset is TRUE, a manual reset event will be created, which requires a call to the ResetEvent() function to return the Event object to the non-signaled state; if bManualReset is set to FALSE, an auto-reset event will be created, which returns the event to the non-signaled state after a single thread has waited on the event and been released. bInitialState: If bInitialState is TRUE, the event begins in the signaled state; and if bInitialState is FALSE, the event will begin in the non-signaled state. lpName: The programmer may specify a name for the event in the fourth parameter lpName. Providing a name creates a system-wide event. Clarify their queries (if any) related to this slide. 2018/9/19
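As a quick illustration of the second and third parameters (the handle names here are hypothetical), the two fragments below create an auto-reset event and a manual-reset event, both unnamed and initially nonsignaled:
HANDLE hAutoEvt   = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset, starts nonsignaled
HANDLE hManualEvt = CreateEvent(NULL, TRUE,  FALSE, NULL);  // manual-reset, starts nonsignaled
Signaling hAutoEvt releases one waiter and reverts to nonsignaled automatically, while hManualEvt stays signaled until ResetEvent() is called.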
Event Set and Reset The three functions related to setting the signaled state of events are: SetEvent() ResetEvent() PulseEvent() BOOL SetEvent( HANDLE event ); BOOL ResetEvent( HANDLE event ); Present details on the three functions related to setting the signaled state of Events. The threads communicate among themselves by using the signaling mechanism of events. You can send signals between threads by using the SetEvent() function, in conjunction with the wait routines, the WaitForSingleObject() function and the WaitForMultipleObjects() function. The three functions related to setting the signaled state of Events are: The SetEvent() is used to set an event to a signaled state. The ResetEvent() is used to reset a manual-reset event. The PulseEvent() function works differently on auto-reset and manual reset events. For manual reset events, all threads currently waiting on the event are released and the event is automatically reset; for auto-reset events, one waiting thread is released before the event is reset. In the case where no thread is waiting on either event type, the signal is cancelled and does not persist. This implies that only if a thread is waiting on an event that is pulsed, will the thread(s) be released. MSDN does not recommend the use of the PulseEvent() function. Clarify their queries (if any) related to this slide. BOOL PulseEvent( HANDLE event ); 2018/9/19
Example: Using Events Consider the following example of threaded code that searches for an item by using events: HANDLE hObj[2]; // 0 is event, 1 is thread DWORD WINAPI threadFunc(LPVOID arg) { BOOL bFound = bigFind(); if (bFound) { SetEvent(hObj[0]); // signal data was found bigFound(); } moreBigStuff(); return 0; } Present global declarations and threaded function for example. In this example, the main thread will create a thread to search for some item. If the thread finds the item, it sends a signal to the main thread. The main thread will perform some other computation and then check on the progress of the searching thread. The main thread will print a message out if the item is found and will print another message upon termination of the searching thread. When the function is executed, it performs a search by using the bigFind() function. If the search is successful, the event is signaled and the found item is processed. If the item is not found, the event is not signaled and a different type of processing is done before the thread terminates. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Main Function Consider the FIRST HALF of the main routine for the following example: . . . hObj[0] = CreateEvent(NULL, FALSE, FALSE, NULL); hObj[1] = CreateThread(NULL,0,threadFunc,NULL,0,NULL); /* Do some other work while thread executes search */ DWORD waitRet = WaitForMultipleObjects(2, hObj, FALSE, INFINITE); switch(waitRet) { case WAIT_OBJECT_0: // event signaled printf ("found it!\n"); WaitForSingleObject(hObj[1], INFINITE) ; // fall through case WAIT_OBJECT_0+1: // thread signaled printf ("thread done\n"); break ; default: printf ("wait error: ret %u\n", waitRet); } Present FIRST HALF of the main routine for the example. Examine a portion of the main() routine. Notice that the event is created and the thread is launched. Each HANDLE is assigned into a separate element of the hObj array. After the searching thread is started, the main thread might continue with some other execution. Eventually the main thread needs to determine whether the searching thread has been successful in its search. Within the call to the WaitForMultipleObjects() function, the third parameter, bWaitAll, has the value FALSE. This will cause the main thread to wait for either of the two HANDLEs to become signaled. The lowest indexed HANDLE found to be signaled will trigger the waiting thread to be released. The return code will indicate which HANDLE was found in the signaled state. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Main Function (continued) Consider the SECOND HALF of the main routine for the following example: . . . hObj[0] = CreateEvent(NULL, FALSE, FALSE, NULL); hObj[1] = CreateThread(NULL,0,threadFunc,NULL,0,NULL); /* Do some other work while thread executes search */ DWORD waitRet = WaitForMultipleObjects(2, hObj, FALSE, INFINITE); switch(waitRet) { case WAIT_OBJECT_0: // event signaled printf ("found it!\n"); WaitForSingleObject(hObj[1], INFINITE) ; // fall through case WAIT_OBJECT_0+1: // thread signaled printf ("thread done\n"); break ; default: printf (“wait error: ret %u\n", waitRet); } Present SECOND HALF of the main routine for the example. The switch statement interprets the return code from the WaitForMultipleObjects call. If the item was found, WAIT_OBJECT_0 will be returned as the lowest indexed HANDLE that was signaled. After printing the message, the code waits on thread termination by using the WaitForSingleObject() function and falls through to the print for that event. If the item was not found, the thread termination will be the HANDLE that releases the main thread and only the thread termination message is printed. Questions for Discussion: What happens if the item was found and the search thread has terminated before the main thread calls the WaitForMultipleObjects() function? What changes will the code logic need if the order of the event and thread HANDLE were reversed in the hObj array? Clarify their queries (if any) related to this slide. 2018/9/19
Windows Semaphores A semaphore is a synchronization object that: Is used to allow more than one thread in a Critical Section. Is used to keep a count of the number of available resources. Was formalized by Edsger Dijkstra in 1968. The two operations on semaphores can be represented as: Wait [P(s)]: Thread waits until s > 0, then s = s - 1 Post [V(s)]: s = s + 1 Present concept information about the semaphore synchronization object. A semaphore is a special type of variable used for synchronization. A semaphore can be used to allow more than one thread in a Critical Section. Alternatively, the semaphore is the synchronization object that is used to keep a count of the number of available resources. The semaphore synchronization object was introduced by Edsger Dijkstra in his 1968 paper, "The Structure of the 'THE'-Multiprogramming System". The Semaphore object has an associated count, s, and is manipulated by two basic atomic operations, Wait and Post. These operations were originally designated as P and V, respectively, by Dijkstra. The two operations on semaphores can be represented as: Wait [P(s)]: Thread waits until s > 0, then s = s - 1 Post [V(s)]: s = s + 1 The Wait operation blocks the calling thread if the semaphore value is zero. The Post operation signals a waiting thread to allow it to resume operation. If there is no thread waiting, the semaphore value is incremented. The most common use for semaphores is to represent a fixed number of resources that are available to threads. When the value of the semaphore becomes zero, there are no available resources. Threads must then wait until another thread releases one of the resources and increments the semaphore count. Consider an example. There may be a restriction on the upper limit of the number of windows that an application can create. The application uses a semaphore with a maximum count equal to the window limit, decrements the count whenever a window is created, and increments the count whenever a window is closed. The application specifies the Semaphore object in a call to one of the wait functions before each window is created. When the count is zero, it indicates that it has reached the window limit. As a result, the Wait function blocks execution of the window-creation code. Clarify their queries (if any) related to this slide. 2018/9/19
Windows Semaphore Creation The function prototype for creating a semaphore is: The value of lSemInitial must satisfy 0 <= lSemInitial <= lSemMax. The value of lSemMax must be > 0. HANDLE CreateSemaphore( LPSECURITY_ATTRIBUTES lpSemaphoreAttributes, LONG lSemInitial, //Initial count value LONG lSemMax, //Maximum value for count LPCSTR lpSemName); //Text name for object Present details of the CreateSemaphore() function to initialize a semaphore object. lpSemaphoreAttributes: This parameter is a pointer to a SECURITY_ATTRIBUTES structure. If the value for this parameter is NULL, the semaphore gets a default security descriptor. lSemInitial: The second parameter specifies the initial count for the Semaphore object. This value must be greater than or equal to zero and less than or equal to lSemMax. The state of a semaphore is signaled when its count is greater than zero and nonsignaled when it is zero. lSemMax: This parameter specifies the maximum count for the Semaphore object. This value must be greater than zero. lpSemName: This parameter specifies the name of the Semaphore object. Clarify their queries (if any) related to this slide. 2018/9/19
WAIT and POST Operations You can use the WaitForSingleObject() function to wait on semaphores. For WaitForSingleObject(): If semaphore count = 0, thread waits. If semaphore count > 0, decrement the count by 1 and return. Consider the following code to increment semaphore (Post Operation): Increase semaphore count above zero by ReleaseSemaphore(). Release the previous count through lpPreviousCount. BOOL ReleaseSemaphore( HANDLE hSemaphore, LONG cReleaseCount, LPLONG lpPreviousCount); Describe how the Wait and Post operations are implemented in Windows Threads. You can use the WaitForSingleObject() function to wait on semaphores. The WaitForSingleObject() function will block the calling thread if the semaphore count is equal to zero. This means that if the count is equal to zero, the thread waits. After another thread calls the ReleaseSemaphore() function to increment the count above zero, the WaitForSingleObject() function will decrement the count by 1 and return, allowing the calling thread to proceed. If more than one thread is waiting on the same semaphore, the number of threads that can resume execution will be the lesser of the number of threads waiting and the value of the semaphore count. If the count is already greater than zero when called by a thread, the count is decremented and the thread does not get blocked. The ReleaseSemaphore() function is able to increment the count by a value other than one. If the count of the semaphore is greater than the initialized maximum, the call fails and returns FALSE; no adjustment to the count will be made. You can use the lpPreviousCount to return the previous semaphore value. However, ensure to exercise caution with this value because other threads may affect the count of the semaphore, which could negate the utility of the returned count. Clarify their queries (if any) related to this slide. 2018/9/19
Semaphore Uses Semaphores can be used: To control access to data structures of limited size, such as queues, stacks, and deques. To limit the number of threads executing within a given region of code. To control access to a finite number of resources such as file descriptors and tape drives. A binary semaphore [0,1] can act as a mutex. Present some common use scenarios for semaphores. Semaphores can be used to control access to data structures of limited size, such as queues, stacks, and deques. Implementing a queue with an array will limit the number of elements that can reside in the queue. You need to initialize the semaphore with the maximum queue length. You can use the WaitForSingleObject() function before placing something new on the queue. If the queue is full, this will block the thread attempting to place an item in the queue. You can use ReleaseSemaphore() when removing something from the queue, which will increment the count and allow a waiting thread to proceed. If performance is affected by too many threads being active, perhaps on Hyper-Threading systems, a semaphore can be used to limit the number of threads executing within a given region of code. You can use semaphores to control access to a finite number of resources such as file descriptors and tape drives. A binary semaphore, one whose range of values is [0, 1], can act as a mutex. A binary semaphore can replace a mutex or CRITICAL_SECTION; however, since the semaphore is a kernel object, it will have overheads similar to the mutex. Clarify their queries (if any) related to this slide. 2018/9/19
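A minimal sketch of the second use case, assuming we want at most MAXACTIVE threads inside expensiveWork() at any time; hSem, MAXACTIVE, and expensiveWork() are illustrative names, not part of the Windows API or the course labs:
#include <windows.h>

#define MAXACTIVE 2

HANDLE hSem;                 // created once in main(): hSem = CreateSemaphore(NULL, MAXACTIVE, MAXACTIVE, NULL);
void expensiveWork(void);    // assumed user-defined function

DWORD WINAPI limitedFunc(LPVOID arg)
{
    WaitForSingleObject(hSem, INFINITE);  // count > 0: decrement and enter
    expensiveWork();                      // at most MAXACTIVE threads run this at once
    ReleaseSemaphore(hSem, 1, NULL);      // increment the count, releasing one waiter
    return 0;
}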
Semaphore Cautions The various cautions and warnings about the use of semaphores are: Any thread can release a semaphore, not just the last waiting thread. There is no concept of ownership with semaphores. Code is safer when semaphores are programmed with matching WaitForSingleObject() and ReleaseSemaphore() function calls by the same thread. Provide some cautions and warnings about the use of semaphores. Semaphores must be used carefully. You should keep the following cautions and warnings related to the usage of semaphores in mind when deciding to use this synchronization object: Any thread can release a semaphore. It is not necessary that the last waiting thread releases the semaphore. Therefore, a thread in a completely different part of the program, one that has nothing to do with the region of code protected by the semaphore, can call the ReleaseSemaphore() function. Doing so is a programming error. To avoid this, like a lock object, semaphores should be programmed with matching wait and release operations by the same thread. This is a good programming practice. With a mutex, the concept of ownership exists; however, there is no concept of ownership with semaphores. If you own a mutex, no other thread can acquire it until it has been unlocked. If a thread terminates while holding the mutex, the mutex can be locked by another thread. The return code from the Wait function call will indicate that the mutex was abandoned. Even though the semaphore is also a kernel object, there is no facility to recognize an abandoned semaphore. Code is safer when semaphores are programmed with matching WaitForSingleObject() and ReleaseSemaphore() function calls by the same thread. However, this practice can lead to problems if a thread terminates before the release operation because at that time the count will be off from the expected value. As a result, there may be a deadlock or a lack of performance. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Semaphore as Mutex Consider the following example code that demonstrates the semaphore as a mutex: HANDLE hSem1, hSem2; FILE *fd; int fiveLetterCount = 0; main() { HANDLE hThread[NUMTHREADS]; hSem1 = CreateSemaphore(NULL, 1, 1, NULL); // Binary semaphore hSem2 = CreateSemaphore(NULL, 1, 1, NULL); // Binary semaphore fd = fopen("InFile", "r"); // Open file for read for (int i = 0; i < NUMTHREADS; i++) hThread[i] = CreateThread(NULL,0,CountFives,NULL,0,NULL); WaitForMultipleObjects(NUMTHREADS, hThread, TRUE, INFINITE); fclose(fd); printf ("Number of five letter words is %d\n", fiveLetterCount); } Provide global declarations and main routine for example. You can use a binary semaphore as a mutex. Consider an example code that demonstrates the semaphore as a mutex. The purpose of the code will be to read lines from a file and count the number of five-letter words that are contained in the file. The code globally defines two semaphores, a file pointer, and an integer to hold the total number of five-letter words that are found in the input file. The main routine creates the semaphores (initial value of 1, maximum value of 1), opens the file for reading, creates threads, waits for all threads to terminate, and prints out results. Clarify their queries (if any) related to this slide. 2018/9/19
Example: Semaphores Consider the following threaded function as an example code demonstrating implementation of semaphores: DWORD WINAPI CountFives(LPVOID arg) { BOOL bDone = FALSE ; char inLine[132]; int lCount = 0; while (!bDone) { WaitForSingleObject(hSem1, INFINITE); // access to input bDone = (GetNextLine(fd, inLine) == EOF); ReleaseSemaphore(hSem1, 1, NULL); if (!bDone) if (lCount = GetFiveLetterWordCount(inLine)) { WaitForSingleObject(hSem2, INFINITE); // update global fiveLetterCount += lCount; ReleaseSemaphore(hSem2, 1, NULL); } } return 0; } Present threaded function code for example. In the threaded code, the semaphores are used like a mutex to protect access to the input file, such that only one thread at a time may read from the file, and to protect the update of the fiveLetterCount global variable. GetNextLine() and GetFiveLetterWordCount() are user-defined functions. Clarify their queries (if any) related to this slide. 2018/9/19
Activity 4: Using Semaphores Objectives: Identify the global data accessed by threads. Resolve data races by using binary semaphores. The main idea of this activity is to develop the ability to use binary semaphores to control access to shared variables. Introduction to the Fourth Lab Activity. Explain the objectives of the activity to the participants. Clarify their queries (if any) related to the objectives. 2018/9/19
Summary A Windows HANDLE is an opaque kernel object that should be manipulated or modified only by Windows API functions. You can create threads to execute work encapsulated within functions. When a thread terminates, its resources are not released until all open Handles to the thread are closed. The Wait functions are used to wait on the thread synchronization or other kernel objects. The commonly used functions are WaitForSingleObject() and WaitForMultipleObjects(). Race conditions are the most common errors in concurrent programs. They occur as a result of execution dependencies, in which multiple threads attempt to update the same memory location or variable. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) You can prevent data races by scoping the variables local to each thread or by using synchronization objects, such as events, semaphores, mutexes, and Critical Sections, to enforce mutual exclusion on critical regions. Critical Sections are lightweight, user-space, intra-process Mutex objects. A critical region is a section of code in which shared variables, on which multiple threads depend, are updated. Windows events can be used for inter-thread signaling. A semaphore is an extension of a mutex and may be used to allow more than one thread in a critical region. Alternatively, a semaphore is the synchronization object that can be used to keep a count of the number of available resources. Summarize all the key points learned in the chapter. 2018/9/19
Programming with OpenMP Chapter 4 Programming with OpenMP 2018/9/19
Introduction to OpenMP Definition: OpenMP, Open specifications for Multi Processing, is an application programming interface (API) used for writing portable, multithreaded applications. OpenMP has the following features: Provides a set of compiler directives (instructions embedded in source code) for multithreaded programming Combines serial and parallel code in single source Supports incremental parallelism Supports coarse-grain and fine-grain parallelism Supports and standardizes loop-level parallelism Introduce OpenMP by stating that OpenMP, Open specifications for Multi Processing, is an application programming interface (API) that was formulated in 1997. It is used for writing portable, multithreaded applications. OpenMP has made it easy to create threaded Fortran and C/C++ programs. Unlike Windows threads, OpenMP does not require you to create, synchronize, load balance, and destroy threads. OpenMP is not formally accepted as an industry standard for programming, but it is used widely. OpenMP provides a set of compiler directives (instructions embedded in source code) that inform the compiler about the intent of compilation. Compiler directives tell the compiler how to compile. For example, the #include <stdio.h> directive informs the compiler to include the stdio.h file and compile the code using the contents of this header file. Earlier, shared-memory platform vendors had their own sets of compiler directives to implement shared-memory parallelism. OpenMP standardizes almost 20 years of compiler-directed threading experience by combining common functionality from various vendors into a portable API. OpenMP provides a special syntax that programmers can use to create multithreaded code. It enables you to combine serial and parallel code in a single source. If your compiler does not recognize OpenMP directives, it ignores them and compiles the enclosed code as ordinary serial code, yielding a serial executable. OpenMP also supports incremental parallelism. Incremental parallelism allows you to first thread one segment of the code and test it for better performance. You can then successively modify and test other code segments. This helps you prioritize threading the most performance-critical code segments. If threading results in worse performance, you can remove the new OpenMP directives and proceed to the next code segment. OpenMP also supports both coarse-grain and fine-grain parallelism. OpenMP also supports and standardizes loop-level parallelism. It uses special syntax to parallelize loops where performance hotspots or bottlenecks have been identified. As a result, threads can execute different iterations of the same loop simultaneously. For the participants who want to learn more about OpenMP, recommend them to visit http://www.openmp.org. Highlight that the current specification of OpenMP is OpenMP 2.5. Clarify queries (if any) of the participants related to this slide. ! Note For more information on OpenMP, visit http://www.openmp.org. The current specification of OpenMP is OpenMP 2.5. 2018/9/19
OpenMP Directives Some commonly used OpenMP directives for C/C++ are: #pragma omp critical #pragma omp parallel #pragma omp parallel for private(A,B) Some commonly used OpenMP directives for Fortran are: C$OMP SINGLE PRIVATE(X) C$omp parallel reduction(+:A,B) C$OMP SECTIONS !$OMP BARRIER CALL OMP_SET_NUM_THREADS(10) Introduce some commonly used OpenMP directives, grouped by language. For C/C++: #pragma omp critical, #pragma omp parallel, and #pragma omp parallel for private(A,B). For Fortran: C$OMP SINGLE PRIVATE(X), C$omp parallel reduction(+:A,B), C$OMP SECTIONS, !$OMP BARRIER, and the run-time library call CALL OMP_SET_NUM_THREADS(10). Clarify queries (if any) of the participants related to this slide. 2018/9/19
Fork-Join Model Fork-join model is the parallel programming model for OpenMP where: Master thread creates a team of parallel worker threads. Parallelism is added incrementally; the sequential program evolves into a parallel program. [Figure: the master thread forks a team of worker threads at each parallel region and joins them at the end of the region.] Introduce the slide by saying that OpenMP is based on the fork-join model that facilitates parallel programming. Every threading activity in OpenMP follows the fork-join model. Explain the fork-join model with the help of the figure on the slide: All programs based on the fork-join model begin as a single process. In Figure 4.2, the application starts executing serially with a single thread called the master thread. The execution of parallel regions of a program following the fork-join model proceeds through the following two stages: Fork: When the master thread reaches a parallel region or a code segment that must be executed in parallel by a team of threads, it creates a team of parallel threads called worker threads that execute the code in the parallel region concurrently. Join: At the end of the parallel region, the worker threads synchronize and terminate, leaving only the master thread. In an OpenMP implementation, threads reside in a team or thread pool. A thread pool refers to a common set of threads to which tasks to be processed are assigned. The thread that is assigned to a task completes the task and returns to the pool to wait for the next assignment without terminating. When the threads are assigned to a parallel region that has more work than a previous parallel region, additional threads can be added at run time to the thread pool. Threads are not destroyed until the end of the last parallel region. A thread pool can prevent your machine from running out of memory because a small number of threads are created and scheduled to execute the tasks in parallel, as needed, rather than continuously creating threads in your program for each task. Continuous creation of threads can result in out-of-memory errors. The fork-join method enables incremental parallelism. You do not need to thread the entire algorithm. For example, if there are three hotspots in the code, you can concentrate on the hotspots one by one based on their severity levels. Clarify queries (if any) of the participants related to this slide. Fork-Join Model 2018/9/19
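A small OpenMP illustration of the fork-join model (not from the course material): the master thread runs alone until the parallel pragma forks a team, and execution is serial again after the implicit join:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial part: master thread only\n");      // before the fork

    #pragma omp parallel                               // fork: team of worker threads created
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                                  // join: implicit barrier, team idles

    printf("serial part again: master thread only\n"); // after the join
    return 0;
}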
OpenMP Fundamentals The other fundamental parts of OpenMP are: Parallel constructs Work-sharing constructs Data environment constructs Synchronization constructs Extensive API library for finer control Introduce the other fundamental parts of OpenMP: Parallel constructs: Indicate to the compiler to use multiple threads to facilitate parallel execution of code. Work-sharing constructs: Indicate to the compiler to generate code to automatically distribute workload across the team of threads. Data environment constructs: Control data environment during the execution of parallel constructs by making variables shared or private. Synchronization constructs: Ensure the consistency of shared data and synchronize threads in parallel execution to avoid data races. Extensive API library for finer control: Contains features that provide implementation-specific capabilities and run-time control. Clarify queries (if any) of the participants related to this slide. 2018/9/19
OpenMP Pragma Syntax Most parallelism in OpenMP is specified through the use of compiler directives or pragmas. For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…] For Fortran, you use other sentinels: c$omp *$omp !$omp Introduce the OpenMP pragma syntax by saying that most parallelism in OpenMP is specified through the use of compiler directives or pragmas, which are embedded in the C/C++ or Fortran source code. All OpenMP directives begin with a sentinel containing omp. Introduce the pragma for C and C++. Mention the other sentinels that can be used with Fortran. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Understanding Parallel Region Constructs Parallel Region Constructs: The parallel region construct is used to parallelize serial code. A team of worker threads is created to execute the code in parallel. Parallel regions refer to structured blocks of code that can be executed in parallel by a team of threads. A parallel region is specified by the parallel pragma. A Structured Block: #pragma omp parallel { more: res[id] = do_big_job(id); if (conv(res[id])) goto more; } printf("All done\n"); An Unstructured Block: if (conv(res[id])) goto done; goto more; done: if (!really_done()) goto more; Explain parallel region constructs as constructs used to parallelize serial code by creating a team of worker threads that execute the code in parallel. Parallel regions refer to structured blocks of code that can be executed in parallel by a team of threads. Structured blocks refer to the code segments that have one entry point at the top and one exit point at the bottom. You can specify a parallel region by using the omp parallel pragma. The omp parallel pragma operates over a single statement or block of statements enclosed within curly braces. Explain structured blocks with the help of the figures on the slide. Explain that the unstructured block in the figure allows the execution to jump out of the parallel region and to jump into the parallel region from outside the region. A structured block does not allow jumps into or out of the block in this manner. When compiling for OpenMP, the compiler detects the instances where the execution jumps into or out of parallel regions and does not compile those code segments. As a result, a compiler error is generated. The only jumps that the compiler allows are STOP statements in Fortran and exit() functions in C/C++. Variables within the parallel region are shared variables by default. However, there are exceptions to these variables, such as loop index variables, which are implicitly made private. All threads can automatically access these variables. You can also change this access to have a local copy of a variable, known as a private variable, attached to each thread. Clarify queries (if any) of the participants related to this slide. Structured and Unstructured Code Blocks 2018/9/19
Parallel Region Execution [Figure: at #pragma omp parallel, the master thread forks worker threads 1, 2, and 3; all threads rejoin at the implicit barrier at the end of the region.] Explain the execution in parallel regions with the help of the figure on the slide. By the fork-join paradigm, you can assume that worker threads are created at the beginning of a parallel region. These worker threads synchronize and terminate at the end of the region. The OpenMP team of threads is created only once, when the first parallel region starts to execute. At the end of a parallel region, the team of threads is put to sleep. After the first parallel region, all other parallel regions encountered will wake the team of threads to begin execution of the code within the region. This scheme is more economical for your application's parallel performance than creating and destroying threads at each parallel region. In the figure, #pragma omp parallel indicates the beginning of a parallel region. When the parallel region begins, the code enclosed within the braces is executed by the worker threads. Each of the three worker threads executes the statements within the region. Each worker thread executes only the path of statements that it needs to process. All parallel regions have an implicit barrier at the end. At the implicit barrier, all the worker threads need to synchronize and wait for the other threads to reach that point in the code. When all the threads reach the implicit barrier, they synchronize and only the master thread continues execution beyond the barrier. If any worker thread terminates within a parallel region, all the worker threads in the team terminate. In this case, any work done by the team prior to the last barrier crossed is guaranteed to be complete. However, the work done by each worker thread after the last barrier and before the termination of the thread is unspecified. Threads are numbered from 0 (master thread) to N-1. Clarify queries (if any) of the participants related to this slide. Parallel Region Execution 2018/9/19
Activity 1: “HelloWorlds” Objective: Use the most common OpenMP C statements to parallelize serial code. Questions for Discussion: Why is the output different between the two versions of the application that were created and run? Why do we not get multiple prints of the same iteration number in the second version? Introduction to the First Lab Activity. Explain to the participants the objective for the activity. Question to Ask Participants Why is the output different between the two versions of the application that were created and run? Why do we not get multiple prints of the same iteration number in the second version? The for-iteration variable is shared when not included in the scope of the pragma. Thus, each thread is incrementing the single iteration variable from the multiple, concurrent for-loops being executed. Clarify queries (if any) of the participants related to this slide. 2018/9/19
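The lab source is not reproduced here, but the following sketch illustrates the issue the discussion questions point to; the loop bounds and variable names are made up. In the first region the index i is declared outside the pragma and is therefore shared, so concurrent loops race on it; in the second, a work-sharing parallel for divides the iterations so each one is printed exactly once:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i;

    #pragma omp parallel          // version 1: i is shared -- a data race on the loop index
    {
        for (i = 0; i < 8; i++)
            printf("v1: thread %d sees i = %d\n", omp_get_thread_num(), i);
    }

    #pragma omp parallel for      // version 2: iterations divided, loop index private
    for (int j = 0; j < 8; j++)
        printf("v2: thread %d runs j = %d\n", omp_get_thread_num(), j);

    return 0;
}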
Understanding Work-Sharing Constructs A work-sharing construct divides the work across multiple threads. The work-sharing construct is defined within a parallel construct. Consider the following example of using the for directive: [Figure: The for Directive. Within #pragma omp parallel, the #pragma omp for splits the 12 iterations (i = 0 to 11) into blocks of 4 per thread in the team, with an implicit barrier at the end.] #pragma omp parallel #pragma omp for for (i=0; i<12; i++) c[i] = a[i] + b[i]; Introduce work-sharing constructs by saying that a work-sharing construct divides the work across multiple threads. The work-sharing construct is defined within a parallel construct. A work-sharing construct does not launch new threads. Instead, it distributes the execution activity among the worker threads created by the parallel construct. Next, introduce the for directive: The omp for directive is used to identify work-sharing constructs. The compiler distributes the iterations of the loop immediately following the omp for directive among the worker threads of the enclosing parallel region. The worker threads then execute the assigned for-loop iterations in parallel. Present the syntax for the for directive: #pragma omp parallel #pragma omp for for (i=0; i<N; i++) { Do_Work(i); } The omp for directive must be inside a parallel region, and it precedes the for-loop. Explain the execution of the for directive with the help of the figure on the slide. To understand the functioning of the omp for directive, consider the following code example: #pragma omp for for (i=0; i<12; i++) c[i] = a[i] + b[i]; In this code example, the for-loop has 12 iterations. If there are three worker threads available in the parallel region, the iterations are split among these three threads. Each thread is assigned an independent set of iterations. The figure displays a static division of iterations based on the number of threads. The master thread forms a team of three worker threads. Each worker thread performs 4 out of the total 12 iterations. There is an implicit barrier at the end of the omp for directive. The threads must wait at the end of a work-sharing construct at this implicit barrier. No thread can execute further until all threads in the team reach the barrier. The code following the for-loop may rely on the results of the computations within the loop. In the serial code, the for-loop completes before proceeding to the next computation. Therefore, the barrier at the end of the construct is enforced to maintain serial consistency. Clarify queries (if any) of the participants related to this slide. 2018/9/19
! Combining Pragmas Both these code samples perform the same function: #pragma omp parallel { #pragma omp for for (i=0; i< MAX; i++) res[i] = huge(); } #pragma omp parallel for for (i=0; i< MAX; i++) { res[i] = huge(); } Code Segment 2 Code Segment 1 State that both the code snippets use the for directive and perform the same function. Highlight that if there is parallel work that the threads need to do before or after the for-loop, you need to use two separate directives. However, if you only need the for-loop to be executed in parallel, you can combine the omp parallel and the omp for directives in a single directive as #pragma omp parallel for. Clarify queries (if any) of the participants related to this slide. ! Note If there is parallel work that the threads need to do before or after the for loop, you need to use Code Segment 1. 2018/9/19
Understanding Data Environment Constructs Control data environment during the execution of parallel regions. Help define the scope of data variables. Help eliminate data races on shared variables by declaring them as private variables. OpenMP uses a shared-memory programming model. Shared variables for C/C++ are: File scope variables Static variables Define data environment constructs: Data environment constructs help control the data environment during the execution of parallel regions. These constructs help define the scope of data variables. They also help eliminate data races on shared variables by declaring them as private variables. Explain the concept of shared variables by stating that OpenMP uses a shared-memory programming model. Most variables are shared by default. Explain the code snippet on the slide. In the code snippet, the variable i is a shared global variable. Mention that, for C/C++, the file scope and static variables are shared. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Exceptions to Shared Variables Some exceptions to variables that are shared are: Stack variables Automatic variables Loop index variables Present the exceptions to variables that are shared: Stack Variables: Stack variables in functions that are called from parallel regions are private. Actual parameters and variables declared locally are placed on the stack. Automatic Variables: Automatic variables within a statement block are private. These variables are primarily related to Fortran. Loop Index Variables: Loop index variables in work-sharing constructs are private. In C/C++, the first loop index variable in nested loops following a #pragma omp for is private. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Data Scope Attributes Data scope attributes help change the default characteristic of variables. The syntax for changing the default status of variables is: default(shared | none) Data scope attributes support the following clauses: private shared default(shared | none) Define data scope attributes as attributes that help change the default characteristic of variables. The data scope attributes are specified at the start of a parallel region or work-sharing construct. Present the syntax for changing the default status of variables: default(shared | none) The default clause enables you to control the sharing attribute of variables that are referenced in a parallel construct and whose sharing attributes would otherwise be determined implicitly. Further, explain that data scope attributes also support two clauses: private shared Clarify queries (if any) of the participants related to this slide. 2018/9/19
The Private Clause The private clause uses the following syntax to produce copies of variables for each thread: private(variable_name,…) void* work(float* c, int N) { float x, y; int i; #pragma omp parallel for private(x,y) for (i=0; i<N; i++) { x = a[i]; y = b[i]; c[i] = x + y; } } Present the syntax on the slide: private(variable_name,…) Any variable declared outside the parallel region is shared in the parallel region by default. Therefore, without the private clause, the x and y variables are shared among all the threads. Any thread can use them randomly. As a result, these variables are populated and summed without any order. In such a case, the resultant vector c[] has a random value. You can prevent such problems by giving every thread a local copy of the variables. To do this, use the private clause. The private clause creates copies of variables for each thread. Explain the code example: The private clause declares variables x and y as private variables. Private variables are not initialized when they are included in the parallel region. Therefore, the assignment of values to x and y in this code is done before reading the values. The private variables are destroyed at the end of the construct for which they are declared. The shared copies of the private variables are undefined at the end of the construct. Therefore, the values of x and y before the function returns are undefined and should not be relied upon. Clarify queries (if any) of the participants related to this slide. 2018/9/19
The Shared Clause The shared clause uses the following syntax to enable multiple threads to access variables: shared(variable_name,…) float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for (int i=0; i<N; i++) { sum += a[i] * b[i]; } return sum; } Explain the shared clause with the help of the code snippet on the slide. Highlight that, in the code snippet, sum is declared as a shared variable. Multiple threads read and update sum. This leads to a data race. Therefore, the result is likely to be incorrect. Since the default is for all variables to be shared in OpenMP parallel regions, the shared clause is rarely needed. You would use it to access the shared copy of a variable within a parallel region that has declared the variable to be private, or you would use it to document that the variables in the parallel region are shared. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Private Clause Cannot be Used in all Cases Consider the following example code for calculating the dot product of two vectors: float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for for (int i=0; i<N; i++) { sum += a[i] * b[i]; } return sum; } Explain the example code on the slide. Consider a code example in which you need to multiply the elements from two one-dimensional vectors and determine the total of all the products. This is known as the dot product of the two vectors. Assume that there are two threads and the value of N is 10. Then, the first thread computes the sum for the first five iterations and the second thread computes the sum for the remaining five iterations. In the above code snippet, sum is a shared variable by default. Multiple threads read and update sum. This leads to a data race. Therefore, the result is likely to be incorrect. To avoid this, you need to protect the shared data. You cannot declare sum as a private variable because the dot_prod() function needs the value of sum after the parallel region. Below you will see different methods to protect the sum variable in order to compute the correct answer. Clarify queries (if any) of the participants related to this slide. What is Wrong? 2018/9/19
OpenMP Critical Construct The critical construct uses the following syntax to define a critical region in a structured block: #pragma omp critical [(lock_name)] float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for for (int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; } Introduce the slide by saying that you can protect shared variables by using the critical construct. The critical construct defines a region in a structured block as a critical region. Only one thread is allowed to execute the code in the critical region at a time. Present the syntax for the critical construct: #pragma omp critical [(lock_name)] Consider a situation in which a thread is executing inside a critical region and another thread reaches that critical region and attempts to execute it. The critical construct blocks the second thread from entering the region until the first thread exits. Explain the code snippet on the slide: In the above code example, the update of sum is enclosed in the critical construct, which allows only one thread at a time to update sum. This prevents simultaneous access of sum by multiple threads. Clarify queries (if any) of the participants related to this slide. 2018/9/19
OpenMP Critical Construct (Continued) Consider the following example that uses the critical construct: float R1, R2; #pragma omp parallel { float A, B; #pragma omp for for (int i=0; i<niters; i++) { B = big_job(i); #pragma omp critical consum(B, &R1); A = bigger_job(i); #pragma omp critical consum(A, &R2); } } Present another example that uses the critical construct: In the example, all unnamed critical regions are treated as the same region. In the above example, threads await their turn to enter either critical region since only one thread at a time may call the consum() function. Thus, if there is a thread executing the consum(B,&R1) call, no other thread will be allowed to execute in this critical region, but, also, no thread will be allowed to execute in the region containing the consum(A,&R2) function call. This protects the shared variables R1 and R2 from race conditions. However, since one critical region only updates R1 and the other only updates R2, there is no need to exclude threads from the second consum() call just because another thread is executing the first consum() call. You can also assign names to critical regions. Only one thread is allowed to execute within each named region. Naming the critical constructs allows multiple threads to exist in different critical regions. It also enhances the parallel performance of your application. Clarify queries (if any) of the participants related to this slide. 2018/9/19
OpenMP Reduction Clause Consider the following example where the reduction clause is useful: float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for for (int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; } Present the example code on the slide where the reduction clause is useful. Explain that the threads should compute sum in parallel for their assigned iterations and combine the results of the computations at the end. Present the syntax for the reduction clause: reduction(operator:variable_list) Explain the usage of the reduction clause: In the above code snippet, it appears that the threads compute sum in parallel for their assigned iterations. However, since the only computation within the loop is protected by a critical region, there is no chance that any computation will be done in parallel, no matter how many threads or cores are used. Computing a single value from a collection of data is a common computational activity, known as a reduction. The dot product of two vectors is a reduction. Since reduction computations are so common, OpenMP has a clause to compute reductions of data in parallel. When executing with a reduction clause, private copies of all the variables in the list are created for each thread. These copies are initialized, depending on the operator given in the clause, and updated locally. At the end of the construct, the local copies are combined through the operator into a single value and this combined value is stored in the original shared variable. Highlight that the variables in the variable_list must be shared in the enclosing parallel region. Clarify queries (if any) of the participants related to this slide. reduction(operator:variable_list) 2018/9/19
Example – Reduction Clause Consider the following example code for computing the dot product of two one-dimensional arrays using the reduction clause: The iterations of the for loop are distributed in equal-sized blocks among the threads in the team: If there are four threads, four private copies of sum are maintained. At the end of the parallel construct, the four copies of sum are combined. The master thread's global copy of sum is updated. #pragma omp parallel for reduction(+:sum) for (i=0; i<N; i++) { sum += a[i] * b[i]; } Present an example code for computing sum using the reduction clause. In the example, you compute the value of sum by using the reduction clause. If there are four threads, four private copies of sum are maintained, each initialized to 0. At the end of the parallel construct, the four copies of sum are combined by addition. This final value of sum is stored in the master thread's global copy of sum. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Reduction Operations You can use the following range of associative and commutative operators with the reduction clause (operator : initial value of each private copy): + : 0, * : 1, - : 0, & : ~0, | : 0, ^ : 0, && : 1, || : 0. Present the various associative and commutative operators that can be used with the reduction clause and the initial value given to each thread's private copy. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Reduction Operations (Continued) Some statements where you can typically use the reduction clause variables:

x = x op expr
x binop= expr
x = expr op x   // except for subtraction
x++
++x
x--
--x

In the above statements: x is a scalar variable in the list. expr is a scalar expression that does not reference x. op is not overloaded and is one of +, *, -, /, &, ^, |, &&, and ||. binop is not overloaded and is one of +, *, -, /, &, ^, and |. Present statements where the reduction clause can be used. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Activity 2: Computing Pi Objective: Compile and run an OpenMP program. Questions for Discussion: What variables can be shared? What variables need to be private? What variables should be set up for reductions? Introduction of the Second Lab Activity. Explain to the participants the objective for the activity. This is a serial version of the source code. It uses a sum variable that could give a clue to an efficient solution, that is, local partial sum variable that is updated each loop iteration. This code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals with such a short module. Getting the answer correct with multiple threads is enough of a goal for this. Clarify queries (if any) of the participants related to this slide. 2018/9/19
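The lab source is not reproduced on the slide; for discussion, here is a minimal sketch of the classic midpoint-rule pi integration and one possible OpenMP solution using a reduction (num_steps, the variable names, and the printf format are assumptions, not taken from the lab files). It also illustrates the discussion questions: step and num_steps stay shared (read-only), x is private to each iteration, and sum is the reduction variable.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  const long num_steps = 100000;            /* illustrative step count */
  const double step = 1.0 / (double)num_steps;
  double sum = 0.0;

  /* Approximate pi by integrating 4/(1+x^2) over [0,1]. */
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;            /* private: declared inside the loop */
    sum += 4.0 / (1.0 + x * x);
  }

  printf("pi is approximately %.10f\n", step * sum);
  return 0;
}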
Understanding the Schedule Clause Facilitates work sharing. Describes how iterations of a loop are divided among the threads in the team. The three main scheduling clauses are: static dynamic guided Describe the schedule clause as a clause that facilitates work sharing. It describes how iterations of a loop are divided among the threads in the team. Introduce the three main scheduling clauses: static dynamic guided Clarify queries (if any) of the participants related to this slide. 2018/9/19
schedule(static [,chunk]) The Static Schedule The syntax for the static scheduling clause is: The static scheduling clause: Divides loop iterations into pieces of size chunk. Statically assigns blocks to threads in a round-robin manner. schedule(static [,chunk]) Present the syntax for the static scheduling clause: schedule(static [,chunk]) The static scheduling clause divides loop iterations into pieces of size chunk and statically assigns them to the threads in a round-robin manner before the loop execution begins. The last chunk to be assigned may have a smaller number of iterations. If the chunk size is not specified, the static schedule divides the iterations into chunks that are approximately equal in size and distributes iterations among the threads in the team so that no thread has more than one chunk. Clarify queries (if any) of the participants related to this slide. If the chunk size is not specified, the static schedule divides the iterations into chunks that are approximately equal in size and distributes iterations among the threads in the team so that no thread has more than one chunk. ! Note 2018/9/19
schedule(dynamic [,chunk]) The Dynamic Schedule The syntax for the dynamic scheduling clause is: The dynamic scheduling clause: Divides loop iterations into pieces of size chunk. Assigns chunks to threads as they finish their previous work: when a thread completes its iterations, it grabs the next set of iterations dynamically. schedule(dynamic [,chunk]) Present the syntax for the dynamic scheduling clause: schedule(dynamic [,chunk]) The dynamic scheduling clause divides loop iterations into pieces of size chunk and initially assigns one chunk to each thread. When a thread completes its iterations, it dynamically grabs the next chunk. The default chunk size for the dynamic scheduling clause is 1. Clarify queries (if any) of the participants related to this slide. ! Note The default chunk size for the dynamic scheduling clause is 1. 2018/9/19
schedule(guided [,chunk]) The Guided Schedule The syntax for the guided scheduling clause is: The guided scheduling clause: Dynamically assigns blocks of iterations to threads, starting with large blocks. Decreases the block size with each assignment, but never below the chunk size. Has a default chunk size of 1. schedule(guided [,chunk]) Present the syntax for the guided scheduling clause: schedule(guided [,chunk]) The guided scheduling clause initially assigns blocks of iterations that are larger than the chunk size. It starts with large blocks and assigns them to threads dynamically. The assigned block size decreases exponentially with each succeeding assignment until it reaches the chunk size specified. The default chunk size for the guided scheduling clause is 1. Inform the participants about the runtime schedule. In addition to the static, dynamic, and guided scheduling clauses, there is another scheduling clause called the runtime schedule. The runtime scheduling clause takes the schedule kind and chunk size from the string assigned to the environment variable OMP_SCHEDULE at run time. You can find the best schedule empirically and then code it into the application source. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Which Schedule to Use? Situations where the scheduling clauses are most useful are: Scheduling Clause When to Use Static Predictable and similar work per iteration Dynamic Unpredictable, highly variable work per iteration Guided Special case of dynamic to reduce scheduling overhead Runtime Scheduling decision to be made at run time Present situations where the scheduling clauses are most useful. Clarify queries (if any) of the participants related to this slide. The runtime scheduling clause takes a schedule and chunk from the string assigned to the environment variable OMP_SCHEDULE at run time. You can find the best schedule empirically and then code it into the application source. ! Note 2018/9/19
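For reference, a minimal sketch of deferring the schedule decision to run time (the process() function, the n parameter, and the environment-variable values are illustrative assumptions); the schedule(runtime) clause reads OMP_SCHEDULE when the loop starts, so different schedules can be tried without recompiling:

#include <omp.h>

void process(int i);   /* hypothetical per-iteration work, defined elsewhere */

void run(int n)
{
  /* Build once, then experiment from the Windows command prompt, e.g.:
       set OMP_SCHEDULE=dynamic,4
       set OMP_SCHEDULE=guided,8
     and rerun; schedule(runtime) picks up the setting at run time. */
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < n; i++) {
    process(i);
  }
}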
Example – Scheduling Clause Consider the following example that uses the static clause: In the example, chunk = 8. If the value of start is 3, the first chunk is: i = {3,5,7,9,11,13,15,17} #pragma omp parallel for schedule(static,8) for (int i=start; i<=end; i+=2) { if (TestForPrime(i)) gPrimesFound++; } Explain the usage of the static clause: Iterations are divided into chunks of size 8. If the value of start is 3, the first chunk is i = {3,5,7,9,11,13,15,17}. At runtime, further chunks are created similarly. These chunks are then assigned to threads in a round-robin manner. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Additional Features of OpenMP OpenMP provides the following additional features that make parallel multithreaded programming easier and more efficient: Parallel sections The single construct The master construct The nowait clause The barrier construct The atomic construct Introduce the slide by saying that in addition to the features and constructs, OpenMP provides some additional features. These features make parallel multithreaded programming easier and more efficient. Name the features as listed on the slide. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Parallel Sections Parallel sections are independent sections of code that can execute concurrently. The sections construct is used within a parallel region, or combined with the parallel directive as #pragma omp parallel sections, to create parallel sections. The sections construct has the following syntax:

#pragma omp parallel sections
{
  #pragma omp section
    phase1();
  #pragma omp section
    phase2();
  #pragma omp section
    phase3();
}

Introduce parallel sections: Parallel sections are independent sections of code that can execute concurrently. The sections construct is a non-iterative work-sharing construct that contains a set of structured blocks to be divided among and executed by the threads in a team. A parallel sections region allows otherwise serial blocks of code to execute in parallel. Explain the code snippet on the slide to explain the concept of parallel sections. The above example contains three independent function calls: phase1(), phase2(), and phase3(). The sections clause denotes the start of a parallel sections region that is enclosed in curly braces. Each independent block of code is tagged by inserting a #pragma omp section directive at the beginning of the block. If more than one statement is to be included in a section block, braces must be added to enclose these statements. In the above example, there is no need to block code within braces, since each block has only the one function call within it. Also, if you remove the third section directive, two tasks are defined: one task for phase1() and another task for phase2() followed by phase3(). Scheduling of sections to threads is implementation-dependent. Sections are distributed among the threads in the parallel team. Each section is executed only once, and each thread may execute zero or more sections. It is difficult to determine the sequence of execution of sections. As a result, the output of one section cannot serve as the input to another section. Instead, the section that generates output should be moved before the sections construct. There is an implicit barrier at the end of the sections construct. As a result, all threads wait for the other threads to reach the barrier before exiting the construct. Clarify queries (if any) of the participants related to this slide. Parallel Sections 2018/9/19
The Single Construct The single construct specifies that a block of code is executed by only one thread in the team. The single construct has the following syntax: #pragma omp single

#pragma omp parallel
{
  DoManyThings();
  #pragma omp single
    ExchangeBoundaries();   // threads wait here for single
  DoManyMoreThings();
}

Introduce the single construct: it specifies that only one thread executes the enclosed portion of code. Present the syntax of the single construct. You can use the single construct when two parallel regions have negligible serial code in between. In such a case, you can combine the two regions into a single region, reducing the fork-join overhead, and add the single construct for the serial portion. Explain the single construct with the help of the code snippet on the slide. In the above example, the team of threads starts executing in the parallel region. The threads execute the DoManyThings() function concurrently. One of the threads enters the single region, at whose end an implicit barrier exists. The remaining threads skip the single region and wait at the barrier at the end of the single region. When the chosen thread completes the single region and all other threads have reached the barrier, the team of threads concurrently executes the DoManyMoreThings() function. The thread chosen for the single construct is implementation-dependent. Clarify queries (if any) of the participants related to this slide. 2018/9/19
The Master Construct The master construct denotes a block of code to be executed only by the master thread and has the following syntax: #pragma omp master

#pragma omp parallel
{
  DoManyThings();
  #pragma omp master
  {   // if not master, skip to next statement
    ExchangeBoundaries();
  }
  DoManyMoreThings();
}

Introduce the master construct by stating that it denotes the block of code to be executed only by the master thread; unlike the single construct, the executing thread is not chosen arbitrarily, it is always the master thread. Present the syntax of the master construct. In the example presented on the slide, the team of threads starts executing in the parallel region. The threads execute the DoManyThings() function concurrently. Only the master thread is allowed to enter the master region. There is no implicit barrier at the end. Thus, the remaining threads skip the master region and start executing the DoManyMoreThings() function without waiting for the master thread to complete. Excessive use of implicit barriers defeats the purpose of parallel execution. Unnecessary barriers hamper performance because the waiting threads are idle. You can use the nowait clause to remove these barriers, when it is safe to do so. Clarify queries (if any) of the participants related to this slide. Move to the next slide by saying that excessive use of implicit barriers defeats the purpose of parallel execution. Unnecessary barriers hamper performance because the waiting threads are idle. You can use the nowait clause to remove these barriers, when it is safe to do so. 2018/9/19
The Nowait Clause The syntax of the nowait clause, which removes the implicit barrier at the end of a work-sharing construct, is: You can use the nowait clause when the threads would otherwise wait between independent computations, as given below:

#pragma omp for nowait
for (...) {...}

#pragma omp <construct> nowait
{ ... }

#pragma omp for schedule(dynamic,1) nowait
for (int i=0; i<n; i++)
  a[i] = bigFunc1(i);

#pragma omp for schedule(dynamic,1)
for (int j=0; j<m; j++)
  b[j] = bigFunc2(j);

Introduce the nowait clause by stating that it allows threads to ignore the implicit barrier: when you use the nowait clause, the threads do not wait for the other threads at the end of the construct. Present the syntax for the nowait clause. If you want to remove the implicit barrier at the end of a for loop, you can use the following syntax: #pragma omp for nowait for (...) {...} You can use the nowait clause when the threads are likely to wait between independent computations. Explain the code snippet on the slide. In the example, threads enter the first loop and execute it. In the absence of the nowait clause, threads would pause until all iterations of the first loop complete. With the nowait clause, when the work in the first loop is exhausted, threads begin executing work in the second loop. As a result, the nowait clause removes the implicit barrier at the end of the first loop. It is possible that the second loop completes its execution before the first loop. Clarify queries (if any) of the participants related to this slide. 2018/9/19
The Barrier Construct The syntax of the barrier construct, which enforces explicit barrier synchronization, is: #pragma omp barrier No thread can skip the barrier construct, as shown below:

#pragma omp parallel shared(A,B,C)
{
  DoSomeWork(A,B);
  printf("Processed A into B\n");
  #pragma omp barrier
  DoSomeWork(B,C);
  printf("Processed B into C\n");
}

Introduce the barrier construct by stating that it enforces explicit barrier synchronization. This construct synchronizes all threads in the team. Present the syntax for the barrier construct. When a thread encounters a barrier construct, it waits for all the other threads to reach that barrier. All threads then start executing the code that follows the barrier construct. Explain the code snippet on the slide. A, B, and C are shared variables. Threads enter the parallel region and execute the DoSomeWork() function in parallel with A and B. The value of B must be fully updated before any computation involving B is performed in the second DoSomeWork() call. The barrier creates that synchronization point: threads wait for the other threads before entering the next code block. Clarify queries (if any) of the participants related to this slide. 2018/9/19
The Atomic Construct The atomic construct creates a small critical region that typically executes faster than a critical region because of its smaller scope, and uses the following syntax: #pragma omp atomic The atomic construct applies only to the statement that immediately follows it, as given below:

#pragma omp parallel for shared(x,y,index,n)
for (i=0; i<n; i++) {
  #pragma omp atomic
  x[index[i]] += work1(i);
  y[i] += work2(i);
}

Introduce the atomic construct by stating that the atomic construct is used to protect shared variables. This is done by ensuring that the specific memory location is updated atomically. It prohibits multiple threads from writing to the same memory location at the same time, which prevents data races. Present the syntax for the atomic construct. The atomic construct specifies that a specific memory location must be updated atomically. The atomic construct works only on the statement that follows the construct. Explain the code snippet on the slide. In the above example, the value of index[i] may be the same for different values of the variable i, so updates to x must be protected. Therefore, updating an element of the array x[] is declared as atomic. The other computations, such as the call to work1() and the evaluation of index[i], are not protected by the atomic construct; only the update of the x[] element is. If a critical region were used instead of atomic, all computations within the statement (the call to work1(), the evaluation of index[i], and the update of the element in x) would be protected. The atomic construct protects only individual elements of the x[] array. As a result, two different elements of x[] can be updated in parallel, because atomicity applies per memory location. However, use of a critical region would serialize the updates of any two elements of x[]. There are only a few specific types of statements on which you can use the atomic clause:

x binop= expr
x++
++x
x--
--x

In the above statements: x is a scalar variable. expr is a scalar expression that does not reference x. binop, the binary operator, is not overloaded and is one of +, *, -, /, &, ^, |, <<, or >>. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Extensive API The OpenMP API provides implementation-specific capabilities and run-time control. The API also provides library functions that help perform data processing efficiently. Two of the several API functions available are:

Function Name: int omp_get_num_threads(void);
Description: Returns the number of threads currently in the team that executes the parallel region from which it is called. If called from a serial portion of the program or a nested parallel region that is serialized, this function returns 1. The default number of threads is implementation-dependent.

Function Name: int omp_get_thread_num(void);
Description: Returns the thread number of the thread making the call. If called from a serial region, this function returns 0.

Introduce the slide by saying that the OpenMP API provides implementation-specific capabilities and run-time control. The API also provides functions that help perform data processing efficiently. Explain the library functions listed in the table on the slide. To use the function calls, include the <omp.h> header file. The compiler automatically links to the correct libraries. These functions are not usually needed for OpenMP code except for debugging. To fix the number of threads in the parallel region, set the number of threads and then save the number. Clarify queries (if any) of the participants related to this slide. 2018/9/19
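For reference, a small sketch showing how the two calls behave inside and outside a parallel region (the printf text is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* In the serial part, the team size reported is 1 and the thread number is 0. */
  printf("serial: %d thread(s), my id = %d\n",
         omp_get_num_threads(), omp_get_thread_num());

  #pragma omp parallel
  {
    /* Inside the region, each team member reports the full team size
       and its own thread number (0 .. team size - 1). */
    printf("parallel: %d thread(s), my id = %d\n",
           omp_get_num_threads(), omp_get_thread_num());
  }
  return 0;
}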
Determining the Number of Threads To specify the number of threads that execute in a parallel region, you can set an environment variable: set OMP_NUM_THREADS=4 Consider the following example of fixing the number of threads:

#include <omp.h>
void main()
{
  int num_threads;
  // Request the same number of threads as the number of processors.
  omp_set_num_threads(omp_get_num_procs());
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    // Protect this store because memory stores are not atomic.
    #pragma omp single
    num_threads = omp_get_num_threads();
    do_lots_of_stuff(id);
  }
}

Introduce the slide by saying that you can specify the number of threads executing in a parallel region by setting an environment variable. Present the API that helps set the number of threads in the code. There is no standard default value for this environment variable; in general, programmers set the number of threads to be the same as the number of processors. Intel compilers set the number of threads to be the same as the number of processors by default. To fix the number of threads in the parallel region, set the number of threads and save the number. Explain the code snippet on the slide. The code asks the system for the number of processors and sets the number of threads accordingly. Therefore, if your system has four processors, you get four threads. When you use OpenMP, do not over-subscribe, that is, do not create more threads than the number of CPUs. Extra threads have to wait for an idle processor to start execution and must synchronize with the other threads at the implicit barrier, which adds overhead. The pragma omp parallel construct requests the compiler to create threads to execute the code in the parallel region. The parallel region begins by assigning an ID to each worker thread. The pragma omp single construct specifies that only one thread should execute the statement following this pragma. The num_threads variable is shared by default; by assigning it inside the single construct, you ensure that only one thread writes num_threads. Clarify queries (if any) of the participants related to this slide. 2018/9/19
Summary OpenMP is an API used for writing portable, multithreaded applications. OpenMP is a pragma-based approach to parallelism. OpenMP is based on the fork-join programming model, which enables incremental parallelism. Parallel regions refer to structured blocks of code that may be executed in parallel by a team of threads. Variables accessed within the parallel region are shared by default. With OpenMP, it is not a good practice to over-subscribe or have more threads than cores. A work-sharing construct divides the workload across multiple threads. This construct is defined within a parallel construct. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) The for directive distributes the iterations of the next for loop among the worker threads created by the enclosing parallel construct. Data environment constructs control the data environment during the execution of parallel constructs. These constructs help define the scope of data variables. Stack (automatic) variables and loop index variables within work-sharing constructs are exceptions to the shared-by-default rule; they are private. You can protect shared variables by using the critical construct. This construct allows only one thread to execute the enclosed code at a time. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) You can create private copies of variables that will be combined by an associative operation at the end of a region by using the reduction clause. The variables in the variable_list must be shared in the enclosing parallel region. The schedule clause facilitates work sharing by describing how iterations of the loop are divided among the threads in the team. The single construct can be used when only one thread must execute a portion of code within a parallel region. The master construct denotes the block of code to be executed only by the master thread within a parallel region. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) Excessive use of implicit barriers defeats the purpose of parallel execution. Unnecessary barriers hamper performance because waiting threads are idle. The nowait clause removes the implicit barrier at the end of a construct. When you use the nowait clause, the threads do not wait for the other threads at the end of the construct. The barrier construct enforces explicit barrier synchronization and synchronizes all threads in the team. The atomic construct protects the update of shared variables by making such updates execute atomically. The OpenMP API provides implementation-specific capabilities and run-time control. Summarize all the key points learned in the chapter. 2018/9/19
Agenda Course Introduction Multithreaded Programming Concepts Tools Foundation – Intel® Compiler and Intel® VTuneTM Performance Analyzer Programming with Windows Threads Programming with OpenMP Intel® Thread Checker – A Detailed Study Intel® Thread Profiler – A Detailed Study Threaded Programming Methodology Course Summary Provide participants the agenda of the topics to be covered in this chapter. 2018/9/19
Intel® Thread Checker – A Detailed Study Chapter 5 2018/9/19
Objectives At the end of the chapter, you will be able to: Identify threading correctness issues using Intel® Thread Checker. Determine if library functions are thread-safe. Ask the participants, what they expect from the session. After the participants share their expectations, introduce the objectives of the session in the following manner: Developing threaded applications involves complicated processes. You need to locate and identify errors in the applications. Errors such as data races and deadlocks occur when concurrent threads interact with each other, but may only occur intermittently due to asynchronous scheduling. More than one thread may access the memory without synchronization, or threads may wait for an event that never happens. The threading tools provided by Intel such as Intel® Thread Checker and Intel® Thread Profiler, help you locate threading errors and correct performance issues. This chapter focuses on Intel Thread Checker. Explain the objectives of the session. Clarify their queries (if any) related to the objectives. 2018/9/19
Introducing Intel® Thread Checker Is a debugging tool used on threaded applications. Can detect threading bugs in Windows threads, POSIX threads, and OpenMP threaded applications. Detects potential threading related bugs even if they do not occur. Is a plug-in to Intel® VTuneTM Performance Analyzer with the same look, feel, and interface as the VTune Analyzer environment. Can reduce the turnaround time for bug detection and isolation. You can use Intel Thread Checker as: Design aid Debug aid Introduce the slide: Multithreaded applications enhance the performance of your application by enabling parallel execution. However, multithreading has its disadvantages because you may introduce bugs while threading an error-free serial application. Introduce Intel Thread Checker: Intel Thread Checker detects potential threading related bugs even if they do not occur. Therefore, it helps you create correct and safe multithreaded applications. Intel Thread Checker is a debugging tool used on threaded applications. It can detect threading bugs in Windows threads, POSIX threads, and OpenMP threaded applications. It is a plug-in to VTune Performance Analyzer with the same look, feel, and interface as the VTune Analyzer environment. With traditional methods, locating threading bugs may take a very long time. In fact, debugging tools and techniques may even hide the effects of threading bugs, making it even more difficult to establish the cause of errors. However, with Intel Thread Checker, you can reduce the turnaround time for bug detection and isolation. Intel Thread Checker can be used as the following: Design aid: You can create a prototype application with OpenMP by adding appropriate pragmas. Intel Thread Checker can identify conflicts in the application and generate a report. This report helps you analyze the issues and find solutions during the design phase of your application. Debug aid: You can identify actual and potential bugs in threaded applications. Clarify their queries (if any) related to this slide. 2018/9/19
Features of Intel® Thread Checker Intel® Thread Checker has the following features: Supports various compilers Provides a powerful and an intuitive user interface Identifies the threading issues in detail Maps potential errors Provides an application programming interface (API) for user-defined synchronization primitives Explain the features of Intel Thread Checker: Supports various compilers: Intel Thread Checker works with compilers such as: Intel® C++ and Fortran compilers, v7 and higher versions Microsoft Visual C++, v6 Microsoft Visual C++ .NET 2002, 2003, and 2005 Editions Intel Thread Checker can be integrated into Microsoft Visual Studio .NET IDE. Provides a powerful and an intuitive user interface: Intel Thread Checker displays the threading diagnostics in a list. You can categorize, arrange, and sort the diagnostics according to your requirements. A summarizing histogram provides a visual comparison of the amount of diagnostics in any categories. Identifies the threading issues in detail: Intel Thread Checker identifies five types of threading issues: error, warning, caution, information, and remark. If the debug information is available, the tool identifies the source location. Intel Thread Checker also provides one-click help for diagnostics. Maps potential errors: Intel Thread Checker uses an advanced error-detection engine to identify data races and deadlocks. It helps you design effective threaded applications. You can also find errors that may not occur during your manual testing. Provides an application program interface (API) for user-defined synchronization primitives: Intel Thread Checker does not identify user-defined synchronization operations. You can use Intel Thread Checker API functions to identify the start and end of critical regions. Clarify their queries (if any) related to this slide. 2018/9/19
Intel® Thread Checker Basics Intel® Thread Checker analysis: Is a dynamic process that occurs when you run the software. Supports data-driven execution. Monitors the running activity and detects threading issues, such as data races, deadlocks, and thread stalls. Intel Thread Checker concepts: Instrumentation: Instrumentation adds library calls to record information on thread execution, calls to synchronization APIs, and memory accesses. Workload selection: It is recommended that you use the smallest possible data sets because they decrease the execution time of the application. Intel Thread Checker analysis is a dynamic process that occurs when you run the software. It supports data-driven execution. Using Intel Thread Checker, you can set up and run any activity just like a VTune Activity or an experiment. Intel Thread Checker monitors the activity and detects various threading issues, such as data races, deadlocks, and thread stalls. If the activity executes within only a fraction of the application, Intel Thread Checker detects errors only in those portions of code that are executed. Errors in the remaining, unseen portions of the code still remain unknown. Intel Thread Checker monitors the threads and calls to synchronization APIs. It detects when there is a memory access by a thread. Then, it records the instances and looks for potential data races. In order to perform this monitoring, the code must be instrumented. Introduce Intel Thread Checker concepts: Instrumentation: Instrumentation adds library calls to record information on thread execution, calls to synchronization APIs, and memory accesses. Workload selection: It is recommended that you use the smallest possible data sets because they decrease the execution time of the application. Highlight that you must instrument and execute the code containing bugs so that Intel Thread Checker can identify the bugs. Intel Thread Checker cannot perform static analysis on parts of unexecuted code or on dynamic-link libraries (DLLs) that are not instrumented. Clarify their queries (if any) related to this slide. ! Note Threading errors do not need to manifest during Intel Thread Checker analysis. However, you must instrument and execute the code suspected of containing bugs so that Intel Thread Checker can identify any problems. 2018/9/19
Instrumentation Instrumentation: The two kinds of instrumentation are: Adds benign Intel® Thread Checker library calls to the software to be traced. Records the thread information such as thread execution order and calls to synchronization APIs and memory accesses. Increases the size of the code and the time it takes to run the application. The two kinds of instrumentation are: Binary Source Introduce instrumentation by providing its definition: Intel Thread Checker for Windows API instruments software before tracing it to detect errors. Instrumentation adds benign Intel Thread Checker library calls to the software to be traced. These calls record thread execution information. Instrumentation increases the size of the code and the time it takes to run the application. This is because Intel Thread Checker records memory access and threading API calls that each thread uses when your software runs. There are two kinds of instrumentation, binary and source. Clarify their queries (if any) related to this slide. 2018/9/19
Binary Instrumentation Is added at run time to an already built binary module. Is effective in supporting program analysis, debugging, security, and simulation. Can be used for software compiled with any of the supported compilers. Binary instrumentation is recommended in the following situations: Unavailability of an appropriate Intel compiler. Inability to rebuild the application due to shortage of time. Inaccessibility of the source code. Discuss binary instrumentation: Binary instrumentation is added at run time to an already built binary module, which includes applications and dynamic or shared libraries. Binary instrumentation is effective in supporting program analysis, debugging, security, and simulation. When you run an Intel Thread Checker Activity in the VTune environment, binary instrumentation is automatically added to your code. You can use this instrumentation for software that is compiled with any of the supported compilers. To enable binary instrumentation, you must run the application within Intel Thread Checker and get the diagnostics. Binary instrumentation is recommended in the following situations: Inaccessibility of an appropriate Intel compiler Inability to rebuild the application due to shortage of time Inaccessibility of the source code Highlight that binary instrumentation requires the code linked with the /fixed:no switch because it adds code to the application at thread API calls and memory access points. Clarify their queries (if any) related to this slide. ! Note Binary instrumentation requires code linked with the /fixed:no switch because it adds code to the application at thread API calls and memory access points. 2018/9/19
Source Instrumentation Refers to source code instrumentation. Is supported only by Intel® compilers such as Intel C++ or Fortran. Requires you to add the /Qtcheck flag to compilation switches to enable source instrumentation of the threaded application. You can run the application using source instrumentation in two ways: In the Intel® VTuneTM Performance Analyzer environment From the Windows command line Source instrumentation is recommended in the following situations: If you are using Intel compilers such as Intel C++. If you want to run the instrumented program outside the VTune Performance Analyzer, such as for a server application. Discuss source instrumentation: Source instrumentation refers to source code instrumentation, which only Intel compilers for C++ or Fortran support. You add the /Qtcheck flag to the compilation switches to enable source instrumentation of the threaded application. This flag works for OpenMP and Windows threads. You can run the application by using source instrumentation in two ways: In the VTune Analyzer environment: You can run the source code in Intel Thread Checker. This instruments the additional DLLs along with the application. From the Windows command line: Data is collected in the threadchecker.thr results file. You can view the .thr file in the VTune environment. When you run the application from the command line prompt, additional DLLs are not instrumented or analyzed. Source instrumentation is recommended in the following situations: If you are using Intel tools such as the Intel C++ compiler. If you want to run the instrumented program outside the VTune Performance Analyzer environment, such as for a server application. The advantage of using source instrumentation is that you get the most detailed diagnostics, such as variable names, about the detected problems. If you use binary instrumentation, Intel Thread Checker may not identify the variables or locate the specific source lines that are involved in the problem. If an application takes a long time to build, it may not be feasible for you to use source instrumentation on the entire source. However, if the binary code is compiled and linked correctly, an initial pass can be done with binary instrumentation. This identifies the modules that have threading errors. If the problems are not listed in the results, you can recompile specific source files with source instrumentation and rerun the application through Intel Thread Checker. Clarify their queries (if any) related to this slide. 2018/9/19
Instrumentation Level Instrumentation Levels The different levels of instrumentation are: Instrumentation Level Description Full Image Each instruction in the module is instrumented to check for a diagnostic message. Custom Image Same as Full Image. However, a user can disable selected functions from instrumentation. All Functions Turns on full instrumentation for parts of a module that were compiled with debug information. Custom Functions Same as All Functions. However, a user can disable selected functions from instrumentation. API Imports Only system API functions that are needed are instrumented. No user code is instrumented. Module Imports Disables instrumentation. This is the default setting for system images, images without base relocations, and images not containing debug information. Define instrumentation levels: The instrumentation level specifies the amount of instrumentation that applies to specific parts of the application. Intel Thread Checker has three clients: user DLLs, user codes, and system DLLs. It has a default level of instrumentation that it uses if it cannot instrument at the specified level. Explain the levels of instrumentation given in the table on the slide. Depending on the required level of information, you can set the level of instrumentation. Higher levels of instrumentation increase memory usage and analysis time. However, they provide more details. Using the Instrumentation tab from the Intel Thread Checker Collector Configuration window, you can control the instrumentation level to the individual routine level if necessary. You can also manually adjust the levels of instrumentation to increase the speed of your application or control the amount of information gathered. It is recommended that you use the All Functions level as default when you analyze applications and dynamic and shared libraries. If you want Intel Thread Checker to record all diagnostics without any constraints about memory usage, you can use the Full Image level. You should not lower the instrumentation level below All Functions for user code because Intel Thread Checker may not produce an accurate and exhaustive diagnostic list. Clarify their queries (if any) related to this slide. 2018/9/19
Working Guidelines Some guidelines on how to define the smallest possible workload are: Execute the problem code only once per thread to identify the error. Use the smallest possible working data set: Minimize data set size Minimize loop iterations or time steps Minimize update rates Discuss some guidelines on how to define the smallest possible workload: Execute the problem code once per thread to identify the error: Threaded and concurrent code that contains errors needs to be executed only once. Use the smallest possible working data set: Some ways to reduce the workload are: Minimize data set size: You should reduce the data set size as much as possible. For example, execute on 10 data elements instead of 10,000 data elements. Smaller image sizes also reduce the data set size. Minimize loop iterations or time steps: You should simulate minutes instead of days. Minimize update rates: You should lower the frame refresh rate. Intel Thread Checker performs debugging, not performance analysis. Therefore, you should concentrate on lowering the amount of data that Intel Thread Checker processes instead of achieving the highest frame rate. Define your workload to get maximum code coverage of the threaded portions of the code that need to be analyzed. If you cannot reduce the size of the data set easily, you can terminate the analysis from the VTune Analyzer Graphical User Interface (GUI) after a sufficient lapse of time. This helps you reach and execute the required threaded code. Clarify their queries (if any) related to this slide. ! Note If you cannot reduce the size of the data set easily, you can terminate the analysis from the VTune Analyzer Graphical User Interface (GUI) after a sufficient lapse of time. This helps you reach and execute the required threaded code. 2018/9/19
Compilation and Linking Requirements The compilation and linking requirements of applications for Intel Thread Checker analysis are outlined below: Compilation considerations: Use dynamically linked thread-safe run-time libraries: Use the /MD, /MDd, /MT, or /MTd switch. Generate symbolic information: Use the /Zi, /ZI, or /Z7 switch. Disable optimization: Use the /Od switch. Linking considerations: Preserve symbolic information: Use the /DEBUG switch. Specify relocatable code sections: Use the /fixed:no switch. Explain the compilation and linking requirements of applications for Intel Thread Checker analysis: Compilation Considerations: Use dynamically linked thread-safe run-time libraries: You should compile your code with the /MD, /MDd, /MT, or /MTd switch to enable use of thread-safe run-time libraries. The default switch for Microsoft and Intel® compilers is /ML or /MLd. This switch does not use thread-safe run-time libraries. Generate symbolic information: You should use the /Zi, /ZI, or /Z7 switch to compile your code. These switches include symbols within the binary. With symbol information, it is easier for Intel Thread Checker to point out the lines of source code that are involved in any threading errors. Disable optimization: You should use the /Od switch to compile your code. Higher levels of optimization can modify the order of code within an application. Disabling optimization keeps the binary code closer to the originally written source. Intel Thread Checker is more effective at finding errors when optimizations are disabled. If the code displays threading errors with high-level optimization (/O3) and not with optimization disabled (/Od), the problem is more likely with the compiler than with the threads in the application. Linking Considerations: Preserve symbolic information: When you use the /Zi, /ZI, or /Z7 switch for compiling the code to generate symbolic information, you should use the /DEBUG link switch to preserve the symbolic information. Specify relocatable code sections: You should build your code so that it contains a relocation section that allows the code to be relocated in memory. Instrumentation requires that relocation be enabled. Most dynamic-link library (.dll) files can be relocated by default. However, by default, most executable (.exe) files cannot be relocated. Therefore, you should link with the /fixed:no switch to allow for relocation. To add a relocation section, you can set the environment variable LINK to the /fixed:no value and relink your code. Clarify their queries (if any) related to this slide. 2018/9/19
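As a hedged illustration only, these switches might be combined on the command line roughly as follows (the source and output file names are placeholders; consult your compiler documentation for the exact driver behavior and defaults):

rem Intel C++ compiler: thread-safe DLL runtime, symbols, no optimization,
rem then keep symbols and a relocation section at link time.
icl /MDd /Zi /Od myapp.cpp /link /DEBUG /fixed:no

rem Microsoft Visual C++ equivalent.
cl /MDd /Zi /Od myapp.cpp /link /DEBUG /fixed:no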
Creating an Intel® Thread Checker Activity Present the steps to create an Intel Thread Checker Activity: Start VTune Performance Analyzer. The Easy Start window opens. In the Easy Start window, click the New Project button to display the New Project window. In the New Project window, on the Category drop-down menu, select the Threading Wizards option. A list of options appears in the lower display window. From the list of options in the lower display window, select the Intel® Thread Checker Wizard option to display the Intel® Thread Checker Wizard window. Select the application that you want to analyze. Fill in any command line arguments, as needed, in the appropriate place. Once the activity has been run and results have been collected, Intel Thread Checker provides various views for analysis. Clarify their queries (if any) related to this slide. The Intel® Thread Checker Wizard Option 2018/9/19
Diagnostics View Introduce the slide by saying that after you create an Activity and run the application, Intel Thread Checker displays the results in the Diagnostics View. The Diagnostics View displays all the diagnostics that the Intel Thread Checker Analysis Engine generated for your program, and categorizes them by severity. Diagnostics indicate issues or events that are non-deterministic and that may lead to indeterminate or wrong results. Usually they are due to a conflict (data race), which occurs when two different threads access the same memory location at the same time without the proper synchronization needed to guarantee consistent results. Present the screen shot on the slide. In Diagnostics View, you can hide or show a variety of data columns in different formats. When working with columns, you can: Show or hide columns. Sort by a column by clicking that column's heading. Group diagnostics by dragging and dropping columns to the grouping area. Entries in the Diagnostics View include: Context: This column shows how the threads represented by the 1st Access and 2nd Access locations are related. 1st Access refers to the thread that accessed the object first. 2nd Access refers to the thread that referenced the object later. If the diagnostic is generated by two threads of an OpenMP threaded application, the Context column refers to a function. The Context column is also used to group multiple diagnostics involved in a complex pattern such as a deadlock cycle. ID: This column shows a simple numbering of the diagnostics. The values in the ID column are unique to an Activity result and are assigned by Intel Thread Checker. Short Description: This column provides a short description, limited to one row, about the diagnostic. Description: This column provides a longer and more complete description of the diagnostic. Filtered: This column indicates whether the given diagnostic is filtered. False indicates that the diagnostic is not filtered. True indicates that the diagnostic is filtered. Count: There are two count columns available. This column indicates the number of times that Intel Thread Checker generated the same diagnostic. Duplicate Counts: This column indicates redundant or duplicate messages. It includes diagnostics on the same or different stacks and exact duplicates. Severity: This area displays the severity level of issues. The Severity columns in the Diagnostics list indicate the relative seriousness of each diagnostic. Resolving issues with a high severity level first is the most effective way of dealing with a list of diagnostics. Mention the various severity levels as shown on the slide. Clarify their queries (if any) related to this slide. Diagnostics View 2018/9/19
Description and Impact Diagnostics Grouping The various severity categories in order of priority are: Severity Rank Name Description and Impact Examples 4 Error Indicates a likely or actual problem in your program. Errors have the highest priority impact, so you should look at this group first. Data races, deadlocks, and other serious issues fall into this category. 3 Warning Indicates situations that probably will not result in incorrect behavior of your program, but may benefit from fixes. Inaccessible memory. 2 Caution May or may not be an issue; indicates that something is unusual. A thread trying to release a lock which it does not own, or a notify operation that occurred when no other thread was waiting for it, making it a no-op. 1 Informational Conveys general information that is specific to your program. Messages indicating the amount of stack space allocated. Remark Conveys notes that generally do not contain specific data about your program. However, they may provide general information that may apply to your program. Too many errors to display. Explain the table given on the slide. Clarify their queries (if any) related to this slide. 2018/9/19
Diagnostics Grouping View You can group data columns in the diagnostics list in the following ways: Task Action Group diagnostics Drag and drop columns headers to the grey area at the top of the list. Filter some diagnostics out of view Right-click to open the pop-up menu and select Filter Diagnostic. You can then view filters you applied. Sort diagnostics Click a column header to sort data by that column. By default, columns are grouped by context. Add a column Right-click and select Show Column. Remove a column Right-click and select Hide Column. See corresponding source code location Double-click on a diagnostic. Source Views open. Understand diagnostics Right-click on a diagnostic and select Diagnostic Help. Understand columns Right-click on a column and select Column Help. State that a flat list may not be very useful. You can group and sort diagnostics to help you resolve the problems. Explain the table given on the slide. Clarify their queries (if any) related to this slide. 2018/9/19
Diagnostics Grouped by the Short Description Field Diagnostics Grouping View (Continued) Present and explain the screen shot on the slide. By default, columns are grouped by Context. In the screen shot, the distribution histogram displays new group categories, such as thread termination, a different grouping for each of the different data-race conditions. You can now deal with write–write, read–write, and write–read errors conveniently. Clarify their queries (if any) related to this slide. Diagnostics Grouped by the Short Description Field 2018/9/19
Horizontal Mode of the Source Code View Introduce the slide by saying that when you double-click an entry in the diagnostics list, the Source Code View displaying source code locations of the error appears. By default, the 1st and 2nd Access views appear. Present and explain the screen shot on the slide. The 1st Access View shows the location of the line where the first thread was executing when the conflict was noted. The 2nd Access View shows the location of the line where the second thread was executing when the conflict was noted. For example, consider that two threads try to write to variable X simultaneously. In this case, a data race exists. The two Access Views show the lines of code where the two threads were executing. It is possible for both threads to be running the same line of code. You can select the desired view from the drop-down list. To reorganize the display of the Source Code View, you can click the horizontal and vertical swap buttons to toggle between horizontal and vertical views. In the figure, two threads simultaneously execute the same line of code. This is a write–write error because both the threads are trying to update the variable distz. Clarify their queries (if any) related to this slide. Horizontal Mode of the Source Code View 2018/9/19
Steps to Open Diagnostic Help View 1) Right-click here Introduce the slide by saying that to obtain help on a diagnostic, right-click the diagnostic in the diagnostics list to reveal a menu. From the menu, select the Diagnostic Help option. Present and explain the screen shot on the slide. The Diagnostic Help View provides examples of how you can create the selected diagnostic. Online help also includes some advice on the most common causes of the problem and some common fixes to rectify the error. Clarify their queries (if any) related to this slide. 2) Click here Steps to Open Diagnostic Help View 2018/9/19
Activity 1A: Finding Prospective Data Races Objective: Find data races within a simple physical model code using Intel® Thread Checker. Extra Activity: Check the source code lines from some of the diagnostics. Discussion Question: Why is there a conflict on certain lines of code? Introduction to the first part of the First Lab Activity. Explain to the participants the objective for the activity. Question for Discussion: Why is there a conflict on certain lines of code? Clarify their queries (if any) related to this slide. 2018/9/19
Dealing with Large Diagnostics Counts Suggestions while organizing and prioritizing diagnostics are: Add the 1st Access column. Group by the 1st Access column. Sort by the Short Description column. Explain large diagnostics counts: When you use Intel Thread Checker, you are likely to experience large diagnostics counts. Consider that you receive 5000 diagnostics. How would you decide where to start debugging? Not all messages are equally important. One correction in the code may remove many related diagnostics. For example, making a variable local to each thread removes all conflicts related to that variable, which might be referenced by different lines of code. Consider the following suggestions while organizing and prioritizing diagnostics: Add the 1st Access column. Group by the 1st Access column. Sort by the Short Description column. Clarify their queries (if any) related to this slide. 2018/9/19
Add the 1st Access Column Add the 1st Access column if it is not already present. Present the screen shot on the slide. The 1st Access column shows the source code location where the first thread was executing when the memory location involved in the conflict or diagnostic was accessed. To add the 1st Access column, right-click any diagnostic to reveal the context-sensitive menu. Select the Show Column option to reveal a sub-menu. Select the 1st Access option to reveal its sub-menu. Select the 1st Access option. The 1st Access column then displays the source line number where a thread first accessed the variable responsible for the data race. Clarify their queries (if any) related to this slide. Steps to Add the 1st Access Column 2018/9/19
Group by the 1st Access Column Groups errors reported for the same source line. Present the screen shot on the slide. To group by the 1st Access column, drag the 1st Access column to the top bar. Grouping by the 1st Access column groups diagnostics by the line number on which the first thread was executing. This localizes all the errors by source code lines. As a result, each group can be seen as the same issue. Clarify their queries (if any) related to this slide. Diagnostics Grouped by the 1st Access Column 2018/9/19
Sort by the Short Description Column Present the screen shot on the slide. To sort by the Short Description column, click the column header. This will sort diagnostics within each group. When you sort the diagnostics by the Short Description column, you handle the write–write conflicts first because they hold the highest priority. Highlight that solving the write–write conflicts is likely to solve related write–read and read–write errors. Clarify their queries (if any) related to this slide. Diagnostics Sorted by the Short Description Column 2018/9/19
Dependence Analysis Consider the following serial code sample: The three types of dependencies are: Flow dependence or write–read conflict: Between S1 and S2 Anti-dependence or read–write conflict: Between S2 and S3 Output dependence or write–write conflict: Between S3 and S4

S1: A=1.0;
S2: B=A+3.14;
S3: A=(1.0/3.0)*(C-D);
..............................
S4: A=(B*3.8)/2.7;

Introduce the code sample on the slide. The above code sample contains four code statements: S1, S2, S3, and S4, which compute values for the variables A and B. There is a dependence between S1 and S2: S2 cannot be executed until S1 has executed, because the value of A is used to calculate the value of B. As a result, the statements S1 and S2 cannot run on different threads. This is known as flow dependence between S1 and S2. Such a situation, when the two statements are run on different threads, is also known as a write–read conflict. A write–read conflict is a situation in which one thread updates a variable that is concurrently read by another thread. Consider the statements S2 and S3. S2 reads the value of A, and S3 writes the value of A. Therefore, there is anti-dependence between S2 and S3. These two statements cannot be run in parallel: S2 must read the old value of A before S3 overwrites it. Such a situation, when the two statements are run on different threads, is also known as a read–write conflict. A read–write conflict is a situation in which one thread reads a variable that is concurrently updated by another thread. Output dependence occurs between two statements running on different threads during a write–write conflict. In the above code sample, statements S3 and S4 both update the value of A. Therefore, S3 and S4 should not be run in parallel or on different threads at the same time. In such a scenario, there is no certainty about the final value of A. A write–write conflict is a situation in which one thread updates a variable that is subsequently updated by another thread. The compiler can perform dependence analysis on the source code to determine the lines that can be run in parallel. Dependence analysis underlies optimization techniques, especially auto-parallelization of serial code. To guarantee proper execution and results, the statements that contain write–write, write–read, and read–write dependencies must execute in the same relative order. Clarify their queries (if any) related to this slide. 2018/9/19
Race Conditions Revisited Definition: Race conditions occur when the programmer assumes an order of execution that the threads are not guaranteed to follow. The two methods to solve race conditions are: Scope variables local to threads. Control shared access with critical regions. Define race conditions: Race conditions occur when the programmer assumes a particular order of execution of the code; however, the operating system gives no guarantee that threads will run in that order, so the result depends on thread timing. There are two methods to solve race conditions: Scope variables local to threads. Control shared access with critical regions. Clarify their queries (if any) related to this slide. 2018/9/19
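As an illustration, here is a minimal sketch of a data race, assuming two Windows threads updating an unprotected global counter (the counter name and loop count are illustrative). Because the read-modify-write of g_count is not atomic, the printed total can be less than 200000 and can vary from run to run.

#include <windows.h>
#include <stdio.h>

static long g_count = 0;                 /* shared, unprotected */

DWORD WINAPI countUp(LPVOID arg)
{
    for (int i = 0; i < 100000; i++)
        g_count = g_count + 1;           /* unsynchronized read-modify-write */
    return 0;
}

int main(void)
{
    HANDLE h[2];
    h[0] = CreateThread(NULL, 0, countUp, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, countUp, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    CloseHandle(h[0]);
    CloseHandle(h[1]);
    printf("g_count = %ld (expected 200000)\n", g_count);
    return 0;
}

The next two slides cover the two standard remedies: make the racing variable local to each thread, or protect every access to it with a critical region.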
Method 1: Scope Variables Local to Threads When to use: When the value of the variable identified as involved in a potential data race is used inside the parallel region only When using temporary or work variables How to implement: Use the OpenMP scoping clauses. Declare variables within threaded functions. Allocate variables on the thread stack. Use the Thread Local Storage (TLS) API. Explain the first method to solve race conditions: One way to solve race conditions is to scope variables local to threads. This is a good solution when the value of the variable identified as involved in a potential data race is used only inside the parallel region. You can also use this method when dealing with temporary or work variables, such as those used to keep partial sums. Explain the various ways to implement this method: Use the OpenMP scoping clauses: OpenMP provides scoping clauses, such as private, which can make variables local to threads. Declare variables within threaded functions: You can explicitly declare variables that will remain local to each thread. Allocate variables on the thread stack: If Intel Thread Checker identifies a variable as a potential data race, you can allocate that variable on the thread stack by using the alloca() function. Use the Thread Local Storage (TLS) API: The TLS API is available for Windows threads and Pthreads. It guarantees that the storage is accessible only to the individual thread that owns it. Clarify their queries (if any) related to this slide. 2018/9/19
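A minimal sketch of the OpenMP scoping approach, assuming a work variable temp that would otherwise be flagged as a potential data race (the arrays and loop bounds are illustrative); the private clause gives each thread its own copy of temp.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double a[1000], b[1000];
    double temp;                              /* shared by default -> potential data race */
    int i;

    #pragma omp parallel for private(temp)    /* each thread gets a private copy of temp */
    for (i = 0; i < 1000; i++) {
        temp = 2.0 * i;
        a[i] = temp;
        b[i] = temp / 2.0;
    }

    printf("a[10]=%f b[10]=%f\n", a[10], b[10]);
    return 0;
}

The loop index i is made private automatically by the parallel for construct; only temp needs the explicit clause.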
Method 2: Control Access with Critical Regions When to use: When the value of the variable identified within a potential data race is used inside and outside the parallel region To update the value of the same shared variable required by each thread How to implement: Use synchronization objects (mutex, semaphore, and Critical Section). Use synchronization constructs (critical and atomic). Explain the second method to solve race conditions: The second way to solve race conditions is by controlling access to shared variables by using synchronization objects, such as mutexes and Critical Sections. This is a good solution when the value of the variable identified within a potential data race is used inside and outside the parallel region. You can also use this method to update the value of the same shared variable required by each thread. Explain the various ways to implement this method: Use synchronization objects: You can use synchronization objects, such as mutexes, events, semaphores, and Critical Sections, to control the access to shared variables. Use synchronization constructs: You can use OpenMP synchronization constructs, such as critical and atomic, depending on the threading API that you are using. Highlight that it is recommended that you use one lock per data element. Using the same lock or synchronization object for every access to a single data element ensures correct mutual exclusion. Clarify their queries (if any) related to this slide. ! Note It is recommended that you use one lock per data element. Using the same lock or synchronization object for every access to a single data element ensures correct mutual exclusion. 2018/9/19
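A minimal sketch of the second method, assuming a Windows Critical Section that protects a shared running total (the variable names and the work done are illustrative). Every access to g_total goes through the same lock, which is what gives correct mutual exclusion.

#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION g_lock;          /* one lock for the one shared data element */
static long g_total = 0;                 /* shared variable updated by every thread  */

DWORD WINAPI addWork(LPVOID arg)
{
    for (int i = 0; i < 100000; i++) {
        EnterCriticalSection(&g_lock);   /* enter the critical region */
        g_total += 1;                    /* protected update          */
        LeaveCriticalSection(&g_lock);   /* leave the critical region */
    }
    return 0;
}

int main(void)
{
    HANDLE h[2];
    InitializeCriticalSection(&g_lock);
    h[0] = CreateThread(NULL, 0, addWork, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, addWork, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    CloseHandle(h[0]);
    CloseHandle(h[1]);
    DeleteCriticalSection(&g_lock);
    printf("g_total = %ld\n", g_total);
    return 0;
}

With OpenMP, the equivalent protection is a #pragma omp critical or #pragma omp atomic directive around the shared update.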
Activity 1B: Resolving Data Races Objective: Resolve data races found in previous lab using simple threading techniques. Questions for Discussion: Do the answers from the threaded code match with the output from the serial application? Introduction of the second part of the First Lab Activity. Explain to the participants the objective for the activity. Question for Discussion: Do the answers from the threaded code match with the output from the serial application? Clarify their queries (if any) related to this slide. 2018/9/19
Intel® Thread Checker: Implementation Assistant You can use Intel® Thread Checker as a threading assistant in the following ways: Use OpenMP as a prototype to insert threading. Compile and run program in Intel Thread Checker. Review diagnostics to identify the problem areas in your source. Restructure the code or protect access to shared variables depending on the diagnostics. Introduce the slide: You can use Intel Thread Checker in the design phase of the threading methodology as an implementation assistant. Depending on the section of code you want to thread, you can scan the application to identify the shared and private variables. Suppose you are not using Intel Thread Checker and have hundreds of lines of code in the portions of code that you want to thread. You need to analyze variables for dependencies. In such a scenario, you have to manually go through each line of code. Also, you need to discover if different pointers can refer to the same memory location. All of this can be very difficult, especially when your code has multiple levels of calls in its function structure. Intel Thread Checker proves useful in such situations. You can use OpenMP as a prototype to insert threading. Also, you need not bother about identifying shared and private variables in your code. You can simply compile and run your program in Intel Thread Checker. Intel Thread Checker can identify potential data races in your code. Then, you can review diagnostics to identify the problem areas in your source. You can also restructure the code or protect access to shared variables depending on the diagnostics. The next topic focuses on how Intel Thread Checker helps identify deadlocks and thread stalls. Clarify their queries (if any) related to this slide. 2018/9/19
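As a sketch of the prototyping step, assume a candidate serial loop buried in a larger function (accumulate, results, and input are hypothetical names standing in for the real code). Wrapping the loop in an OpenMP parallel for and running the build under an Intel Thread Checker Activity produces diagnostics for any variables that are shared unsafely inside the loop, without manually tracing every call level.

#include <omp.h>

/* hypothetical helper standing in for whatever work the serial loop did */
static void accumulate(double *results, int i, double value)
{
    results[i] = value * value;
}

void prototype_threading(double *results, const double *input, int n)
{
    int i;
    /* Prototype threading only: run this build under an Intel Thread Checker
       Activity to see which variables inside the loop need to be private or
       protected before committing to a final threading design. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        accumulate(results, i, input[i]);
    }
}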
Deadlock Example: Traffic Jam Definition: Deadlock is a situation when a thread waits for an event that never occurs. The most common cause of a deadlock is a locking hierarchy. Define deadlock as a situation when two or more threads wait for an event that never occurs. A locking hierarchy is the most common cause of a deadlock. Present an example to illustrate a locking hierarchy: Consider that you use one lock per data item. To update or manipulate two protected data elements, you need to use two different locks. There is no function in the Windows or Pthreads API that allows you to lock or unlock two locks at the same time, so you must apply the locks in some order. Such a situation is called a locking hierarchy. To avoid deadlock resulting from such locking hierarchies, every thread should acquire the locks in the same order (and typically release them in the reverse order). Clarify their queries (if any) related to this slide. Example: Traffic Jam 2018/9/19
Examples of Deadlock DWORD WINAPI threadA(LPVOID arg) Consider the following example of an obvious locking hierarchy: DWORD WINAPI threadA(LPVOID arg) { EnterCriticalSection(&L1); EnterCriticalSection(&L2); processA(data1, data2); LeaveCriticalSection(&L2); LeaveCriticalSection(&L1); return(0); } DWORD WINAPI threadB(LPVOID arg) { EnterCriticalSection(&L2); EnterCriticalSection(&L1); processB(data2, data1); LeaveCriticalSection(&L1); LeaveCriticalSection(&L2); return(0); } Present the first example code on the slide to illustrate a locking hierarchy. In the code sample, code in the threadA() function uses data elements, data1 and data2. The Critical Section L1 protects the data element data1, and the Critical Section L2 protects the data element data2. Thread A first calls EnterCriticalSection() with L1 and locks it. Then, thread A calls EnterCriticalSection() with L2 and locks it. After using data1 and data2 to execute the processA() function, thread A unlocks the Critical Sections L2 followed by L1. Present the second code sample on the slide. In the second code sample, thread B enters the threadB() function, which uses data1 and data2. Here again, the Critical Section L1 protects data1, and the Critical Section L2 protects the data element data2. However, in this case, thread B first calls EnterCriticalSection() with L2 and locks it. Then, thread B calls the EnterCriticalSection() function with L1 and locks it. After using data1 and data2 to execute the processB() function, thread B unlocks the Critical Sections L1 followed by L2. When threads A and B run in parallel, you may have the situation where thread A locks the Critical Section L1 and thread B locks the Critical Section L2, simultaneously. There is no conflict up to this point. However, when thread A tries to lock L2 and thread B tries to lock L1, conflict occurs. This is because thread B already holds L2, and thread A already holds L1. This results in a deadlock. If a single programmer writes both the threadA() and threadB() functions, such a deadlock is not likely to happen. However, if two programmers write the two functions separately, such a locking hierarchy is more likely to occur. Clarify their queries (if any) related to this slide. ThreadA: L1, then L2 ThreadB: L2, then L1 2018/9/19
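A minimal sketch of the fix, assuming the same Critical Sections L1 and L2 and the same processB() from the slide: if threadB() acquires the locks in the same order as threadA(), neither thread can hold one lock while waiting for the other, so the deadlock cannot form.

DWORD WINAPI threadB(LPVOID arg)
{
    EnterCriticalSection(&L1);       /* same acquisition order as threadA: L1, then L2 */
    EnterCriticalSection(&L2);
    processB(data2, data1);
    LeaveCriticalSection(&L2);       /* release in reverse order */
    LeaveCriticalSection(&L1);
    return(0);
}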
Examples of Deadlock (Continued) Consider the following example of a deadlock: typedef struct { // some data things SomeLockType mutex; } shape_t; shape_t Q[1024]; void swap(shape_t *A, shape_t *B) { lock(A->mutex); lock(B->mutex); // Swap data between A & B unlock(B->mutex); unlock(A->mutex); } Thread 1: swap(&Q[34], &Q[98]); grabs mutex 34. Thread 2: swap(&Q[98], &Q[34]); grabs mutex 98. Consider another example where an array has a large number of elements and threads have to frequently update elements. You can create mutual exclusion on an element in the array by locking the entire array, but this prevents other threads from accessing the other elements and, as a result, lowers performance. When you lock the entire array for one thread, you lock out the application's potential for parallel execution. To improve performance, you can add individual locks to individual elements. In this case, threads that do not update the same element will not interfere with each other, and simultaneous updates may proceed in parallel. In the declaration sample, the structure type shape_t contains some data elements and a lock, and there is an array Q[] of 1024 shape_t elements. Each element in Q[] has its own lock. Now, consider that you want to swap two elements in the array Q[]. In the code sample, the swap() function locks the two shape_t elements A and B, respectively, swaps data between A and B, and unlocks B followed by A. As a result, there is no conflict if two threads try to execute the swap() function on different elements. Now suppose thread 1 tries to swap the thirty-fourth element, Q[34], with the ninety-eighth element, Q[98], and at the same time, thread 2 tries to swap Q[98] with Q[34]. In this scenario, the two threads acquire the same pair of locks in opposite orders: thread 1 grabs the mutex in Q[34], and thread 2 grabs the mutex in Q[98], and each thread now waits for the lock the other one holds. Therefore, even if you program correctly, potential deadlocks from lock ordering may still exist. Question for Discussion: Is there any way to ensure that such a deadlock does not happen? Answer: Any solution requires checking the memory address of each lock and then locking in some fixed order, such as locking the lowest memory address first. If such a conflict happens only one time in a hundred million, the extra check may add unnecessary overhead. Clarify their queries (if any) related to this slide. 2018/9/19
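A minimal sketch of the address-ordering idea from the discussion answer, assuming the same shape_t type and the generic lock()/unlock() calls from the slide: every thread locks the element at the lower memory address first, so all threads share a single global lock order and the circular wait cannot occur.

void swap_ordered(shape_t *A, shape_t *B)
{
    if (A == B)
        return;                              /* nothing to swap; also avoids self-deadlock */

    shape_t *first  = (A < B) ? A : B;       /* lower address is always locked first */
    shape_t *second = (A < B) ? B : A;

    lock(first->mutex);
    lock(second->mutex);
    // Swap data between A & B
    unlock(second->mutex);
    unlock(first->mutex);
}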
Thread Stalls Definitions: Thread stall refers to a situation when a thread waits for an inordinate amount of time, usually for a response from another thread. Dangling locks can create thread stalls. Dangling locks are created when a thread locks a resource and terminates by exception before unlocking the resource. You may want the thread to wait in the following cases: Master thread creates worker threads and waits for them to terminate at the end of their concurrent execution. Thread waits at a barrier for synchronization. Define thread stall as a situation where a thread waits for an inordinate amount of time. In some cases, you may want the thread to wait. For example, a master thread creates worker threads and waits for them to terminate at the end of their concurrent execution. Another example is that of a thread waiting at a barrier for synchronization. Dangling locks are another reason for thread stalls. Dangling locks are created when a thread locks a resource and terminates by exception before unlocking the resource. In such a scenario, any thread that wants the locked resource will wait forever. Ensure that threads release all locks to avoid deadlocks and thread stalls. Clarify their queries (if any) related to this slide. ! Note Ensure that threads release all locks in all cases (even unexpected termination) to avoid deadlocks and thread stalls. 2018/9/19
Example – Dangling Lock Consider the following example of a dangling lock: int data; DWORD WINAPI threadFunc(LPVOID arg) { int localData; EnterCriticalSection(&lock); if (data==DONE_FLAG) return(1); localData=data; LeaveCriticalSection(&lock); process(localData); return(0); } Lock never released Present the code sample on the slide. In the code sample, the thread executing the threadFunc() function holds the Critical Section and reads the data. If the global variable data holds the value DONE_FLAG, processing is finished and the thread immediately returns from the function. If the processing is incomplete, the thread copies the global value from data to the local variable localData, leaves the critical region by releasing the Critical Section, and processes the local data. The problem occurs when the thread takes the early return path and leaves the function without unlocking the Critical Section. In that case, other threads can never enter the Critical Section because it remains locked. The Windows operating system can detect an abandoned mutex, so an abandoned mutex can be reassigned to other threads; Critical Sections do not provide such an option. Clarify their queries (if any) related to this slide. 2018/9/19
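A minimal sketch of the fix, assuming the same global lock, data, DONE_FLAG, and process() from the slide: the Critical Section is released on every exit path, including the early return, so no dangling lock is left behind.

DWORD WINAPI threadFunc(LPVOID arg)
{
    int localData;
    EnterCriticalSection(&lock);
    if (data == DONE_FLAG) {
        LeaveCriticalSection(&lock);   /* release before the early return */
        return(1);
    }
    localData = data;
    LeaveCriticalSection(&lock);
    process(localData);
    return(0);
}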
Activity 2: Identifying and Fixing Deadlocks Objective: Find actual and potential deadlocks and validate that the errors are fixed using Intel® Thread Checker. Questions for Discussion: What build options are required for any threaded software development? What build options are required for binary and source instrumentation? Introduction of the Second Lab Activity. Explain to the participants the objective for the activity. Questions for Discussion: What build options are required for any threaded software development? What build options are required for binary and source instrumentation? Clarify their queries (if any) related to this slide. 2018/9/19
Understanding Thread-Safety A code sample or routine is thread-safe if it functions correctly when multiple threads try to execute it simultaneously. To test for thread-safety, perform the following: Use OpenMP simulations within Intel® Thread Checker to determine any potential conflicts. Use OpenMP sections to create concurrent execution from multiple threads. Define thread-safety: A code sample or routine is thread-safe if it functions correctly when multiple threads try to execute it simultaneously. Consider a routine in which no global variables are updated and no data races or conflicts occur. Such a routine is termed thread-safe. You can use OpenMP and Intel Thread Checker to check whether your application is thread-safe. This is useful when you use third-party libraries and do not have access to the source code. You can create OpenMP sections and use OpenMP simulations within Intel Thread Checker to determine any potential conflicts in those libraries. You can also use the OpenMP sections to simulate or create concurrent execution from multiple threads. Clarify their queries (if any) related to this slide. 2018/9/19
Example – Check for Thread-Safety Consider the following example to understand thread-safety: Check for safety issues between: Multiple instances of routine1(). Instances of routine1() and routine2(). You need to provide data sets that exercise relevant parts of the routines. #pragma omp parallel sections { #pragma omp section routine1(&data1); #pragma omp section routine1(&data2); #pragma omp section routine2(&data3); } Present an example to explain the concept of thread safety: The code sample has two routines, routine1 and routine2. You want to check whether routine1 is thread-safe with itself and whether routine2 is thread-safe with routine1. To do this, you insert OpenMP parallel sections in the code to test all desired permutations of routine calls; each section calls one of the two routines with some data set, and the sections execute concurrently on different threads. Highlight that you need to provide data sets that exercise the relevant parts of the routines. You want to test that routine1 is thread-safe when another thread is calling the same routine, and also test the thread-safety of routine1 and routine2 being called concurrently. Clarify their queries (if any) related to this slide. 2018/9/19
Ways to Ensure Thread-Safety Two ways to ensure thread-safety are: Reentrant code: No globally shared variables are updated by the routine. Mutual exclusion: Shared variables are protected when being updated by the routine. ! Note It is better to write reentrant code than to add synchronization objects. Using reentrant code improves performance and avoids implicit barriers and other overhead. Introduce the slide by saying that it is important to ensure thread-safety in multithreaded applications and in libraries that may be called from multithreaded applications. Explain the two ways to ensure thread-safety: Reentrant code: You can write your routines to be reentrant so that no globally shared variables are updated by the routine. Any variable that the routine changes must be local. You can write code in such a way that it can be interrupted during one task and reentered to perform another task. When the second task completes, the code can resume its original task. Mutual exclusion: If your code accesses or modifies shared variables, you can use mutual exclusion to avoid conflicts with other threads. Consider a situation where multiple threads try to access the shared stdout device to print messages. The printf library uses mutual exclusion to ensure that only one thread accesses stdout to print something. Highlight that it is better to write reentrant code than to add synchronization objects. Using reentrant code improves performance and avoids implicit barriers and potential overhead. Question for Discussion: How can you ensure that third-party libraries are thread-safe? Answer: You can: Read the library documentation. Contact the library author, especially if the documentation is unclear or does not address the issue. Test with Intel Thread Checker, especially if you do not trust the library author. Clarify their queries (if any) related to this slide. 2018/9/19
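A short sketch contrasting the two approaches; the function and buffer names below are hypothetical and only illustrate the idea, not any particular library.

#include <stdio.h>
#include <stddef.h>

/* Not thread-safe: every call updates the same static buffer (a globally shared variable). */
static char g_buffer[64];
const char *format_shared(int value)
{
    sprintf(g_buffer, "value=%d", value);    /* concurrent calls race on g_buffer */
    return g_buffer;
}

/* Reentrant: the caller supplies the storage, so the routine updates no shared state. */
const char *format_reentrant(int value, char *buf, size_t bufSize)
{
    snprintf(buf, bufSize, "value=%d", value);
    return buf;
}

The reentrant version needs no lock and therefore adds no synchronization overhead, which is why the note on this slide recommends reentrant code over adding synchronization objects.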
Activity 3: Testing Libraries for Thread-Safety Objective: Use Intel® Thread Checker to determine if libraries are thread-safe. Introduction of the Third Lab Activity. Explain to the participants the objective for the activity. Clarify their queries (if any) related to this slide. 2018/9/19
Summary Intel® Thread Checker is a debugging tool for threaded software. It is a plug-in to the Intel® VTuneTM Performance Analyzer with the same look, feel, and interface as the VTune Analyzer environment. Intel Thread Checker can be used as a design, debug, and quality aid. Intel Thread Checker supports compilers such as Intel® C++ and Fortran. Intel Thread Checker uses an advanced error-detection engine to identify data races and deadlocks. Intel Thread Checker enables you to sort errors based on various categories, such as severity level, error description, context, function, or variable. Instrumentation adds code to record information on threading and synchronization APIs and memory accesses. This will increase the size of the executable binary. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) To add a relocation section, you can set the environment variable LINK to include the /fixed:no value and relink your code. When you run an Intel® Thread Checker Activity in the Intel® VTuneTM Performance Analyzer environment, binary instrumentation is automatically added to your code. You can use this instrumentation for software compiled with supported compilers. Source instrumentation refers to source code instrumentation, which is only available through Intel compilers. A write–read conflict is a situation when one thread updates an unprotected variable that is subsequently read by another thread. A read–write conflict is a situation when one thread reads an unprotected variable that is subsequently updated by another thread. Summarize all the key points learned in the chapter. 2018/9/19
Summary (Continued) A write–write conflict is a situation when one thread updates an unprotected variable that is subsequently updated by another thread. There are two ways to solve race conditions: scope variables local to threads and control shared access with critical regions. Deadlock is a situation when a thread waits for an event that will never occur. Thread stall refers to a situation when a thread waits for an inordinately long amount of time, usually for a response from another thread. A thread-safe routine is a routine that multiple threads can safely call concurrently. Two ways to ensure thread-safety are reentrant code and mutual exclusion. It is better to write reentrant code than to add synchronization. Using reentrant code improves performance and avoids implicit barriers and potential overhead. Summarize all the key points learned in the chapter. 2018/9/19