
1 Lecture 2 The Art of Concurrency 张奇 复旦大学 COMP630030 Data Intensive Computing

2 Parallelism makes programs run faster

3 Why Do I Need to Know This? What’s in It for Me? There’s no way to avoid this topic – Multicore processors are here now and here to stay – “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Dr. Dobb’s Journal, March 2005)

4 Isn’t Concurrent Programming Hard? Concurrent programming is no walk in the park. With a serial program, execution of your code takes a predictable path through the application. Concurrent algorithms require you to think about multiple execution streams running at the same time.

5 PRIMER ON CONCURRENT PROGRAMMING Concurrent programming is all about independent computations that the machine can execute in any order. Not everything within an application will be independent, so you will still need to deal with serial execution amongst the concurrency.

6 Four Steps of a Threading Methodology Step 1. Analysis: Identify Possible Concurrency – Find the parts of the application that contain independent computations. – Identify hotspots that might yield independent computations.

7 Four Steps of a Threading Methodology Step 2. Design and Implementation: Threading the Algorithm. This step is what the book is all about.

8 Four Steps of a Threading Methodology Step 3. Test for Correctness: Detecting and Fixing Threading Errors. Step 4. Tune for Performance: Removing Performance Bottlenecks.


10 Design Models for Concurrent Algorithms The way you approach your serial code will influence how you reorganize the computations into a concurrent equivalent. – Task decomposition: divide the work into independent tasks that threads execute. – Data decomposition: compute every element of the data independently.

11 Task Decomposition The first step of any concurrent transformation is to identify the computations that are completely independent; where they are not, satisfy or remove the dependencies.

12 Example: numerical integration What are the independent tasks in this simple application? Are there any dependencies between these tasks and, if so, how can we satisfy them? How should you assign tasks to threads?
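A minimal serial sketch of the usual form of this example (assumption: the midpoint-rectangle approximation of pi as the integral of 4/(1+x²) over [0,1]; the function and variable names are placeholders, not taken from the slides):

```cpp
// Serial numerical integration: approximate pi by summing the areas of
// num_rects rectangles under f(x) = 4 / (1 + x*x) on [0, 1].
double compute_pi(long num_rects) {
    double width = 1.0 / num_rects;      // width of each rectangle
    double sum = 0.0;
    for (long i = 0; i < num_rects; ++i) {
        double mid = (i + 0.5) * width;  // midpoint of rectangle i
        sum += 4.0 / (1.0 + mid * mid);  // height at the midpoint
    }
    return sum * width;                  // total area approximates pi
}
```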

13 Three key elements for any task decomposition design What are the tasks and how are they defined? What are the dependencies between tasks and how can they be satisfied? How are the tasks assigned to threads?

14 Two criteria for the actual decomposition into tasks There should be at least as many tasks as there will be threads (or cores). The amount of computation within each task (granularity) must be large enough to offset the overhead that will be needed to manage the tasks and the threads.


16 What are the dependencies between tasks and how can they be satisfied? Order dependency – some task relies on the completed results of the computations from another task. Remedies: schedule tasks that have an order dependency onto the same thread, or insert some form of synchronization to ensure correct execution order.

17 What are the dependencies between tasks and how can they be satisfied? Data dependency – assignments to the same variable that might be done concurrently, or updates to a variable that could be read concurrently. Remedies: create variables that are accessible only to a given thread (thread-local storage), or protect shared updates with atomic operations.

18 How are the tasks assigned to threads? Tasks must be assigned to threads for execution. The amount of computation done by threads should be roughly equivalent. We can allocate tasks to threads in two different ways: static scheduling or dynamic scheduling.

19 How are the tasks assigned to threads? In static scheduling, the division of labor is known at the outset of the computation and doesn’t change during the computation. Static scheduling is best used in those cases where the amount of computation within each task is the same or can be predicted at the outset.

20 How are the tasks assigned to threads? Under a dynamic schedule, you assign tasks to threads as the computation proceeds. The driving force behind the use of a dynamic schedule is to try to balance the load as evenly as possible between threads.
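As a concrete illustration (not from the slides), OpenMP exposes this choice directly through the schedule clause; process() here is a hypothetical per-task function:

```cpp
#include <omp.h>

void process(int i);   // hypothetical function doing the work of task i

void run_tasks(int n) {
    // Static scheduling: iterations are divided among threads before the
    // loop starts; best when every task costs roughly the same.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        process(i);

    // Dynamic scheduling: threads grab batches of 4 iterations as they
    // become free, balancing the load when task costs vary.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; ++i)
        process(i);
}
```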

21 Example: numerical integration What are the independent tasks in this simple application? Are there any dependencies between these tasks and, if so, how can we satisfy them? How should you assign tasks to threads?
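A hedged OpenMP sketch of one possible answer (assumptions as in the serial version above): each rectangle is an independent task, the only dependency is the shared running sum, handled here by a reduction clause, and the schedule clause decides how iterations are assigned to threads.

```cpp
#include <omp.h>

double compute_pi_parallel(long num_rects) {
    double width = 1.0 / num_rects;
    double sum = 0.0;
    // Each iteration (task) is independent; the reduction clause gives each
    // thread a private copy of sum and combines the copies at the end.
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < num_rects; ++i) {
        double mid = (i + 0.5) * width;
        sum += 4.0 / (1.0 + mid * mid);
    }
    return sum * width;
}
```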

22 Data Decomposition Execution is dominated by a sequence of update operations on all elements of one or more large data structures. These update computations are independent of each other. Divide up the data structure(s) and assign those portions to threads, along with the corresponding update computations (tasks).

23 Three key elements for data decomposition design How should you divide the data into chunks? How can you ensure that the tasks for each chunk have access to all data required for updates? How are the data chunks assigned to threads?

24 How should you divide the data into chunks?

25 How should you divide the data into chunks? Granularity of chunk. Shape of chunk – the shape determines which chunks are neighbors and how any exchange of data between them must be handled; be more vigilant with chunks of irregular shapes.

26 How can you ensure that the tasks for each chunk have access to all data required for updates?

27 Example: Game of Life on a finite grid

28 Example: Game of Life on a finite grid

29 Example: Game of Life on a finite grid What is the large data structure in this application and how can you divide it into chunks? What is the best way to perform the division?
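A hedged sketch of one possible data decomposition for a single generation, assuming the grid is a dense 2D array of 0/1 cells and each thread updates a band of rows. Reads may cross band boundaries, but since every read is from the old grid and every write goes to the new grid, no locking is needed within a generation.

```cpp
#include <omp.h>
#include <vector>

using Grid = std::vector<std::vector<int>>;   // 1 = live cell, 0 = dead cell

// Count live neighbors of cell (r, c), treating cells off the edge as dead.
int live_neighbors(const Grid& g, int r, int c) {
    int rows = (int)g.size(), cols = (int)g[0].size(), count = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            int nr = r + dr, nc = c + dc;
            if (nr >= 0 && nr < rows && nc >= 0 && nc < cols)
                count += g[nr][nc];
        }
    return count;
}

// Compute one generation: the rows are divided among the threads.
void step(const Grid& cur, Grid& next) {
    int rows = (int)cur.size(), cols = (int)cur[0].size();
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int n = live_neighbors(cur, r, c);
            next[r][c] = (n == 3 || (cur[r][c] == 1 && n == 2)) ? 1 : 0;
        }
}
```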

30 What’s Not Parallel Algorithms with State – something kept around from one execution to the next. For example, the seed to a random number generator or the file pointer for I/O would be considered state.

31 What’s Not Parallel Recurrences
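A small invented illustration of the pattern: each iteration reads the value produced by the previous iteration, so the iterations cannot simply be divided among threads.

```cpp
// Recurrence: a[i] depends on a[i-1], so iteration i must wait for i-1.
void recurrence(double* a, const double* b, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}
```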

32 What’s Not Parallel Induction Variables
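An invented illustration: the variable k is incremented on every iteration, so its value depends on how many iterations have already executed. The loop can only be threaded after k is rewritten as a closed-form function of i.

```cpp
// Induction variable: k depends on the iteration count, not just on i.
// (Assumes c has at least 3*n + 1 elements.)
void pack_every_third(double* c, const double* d, int n) {
    int k = 0;
    for (int i = 0; i < n; ++i) {
        k += 3;            // induction variable
        c[k] = d[i];       // closed-form equivalent: c[3 * (i + 1)] = d[i]
    }
}
```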

33 What’s Not Parallel Reduction – Reductions take a collection (typically an array) of data and reduce it to a single scalar value through some combining operation.

34 Loop-Carried Dependence
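An invented sketch of a loop-carried dependence: iteration i reads a value that iteration i-1 wrote, so the iterations must run in order (or the dependence must be removed) before the loop can be threaded.

```cpp
// Loop-carried dependence: the read of a[i-1] sees the value written by
// the previous iteration, so iterations cannot run in arbitrary order.
void smooth(double* a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = 0.5 * (a[i] + a[i - 1]);
}
```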


36 Rule 1: Identify Truly Independent Computations It’s the crux of the whole matter!

37 Rule 2: Implement Concurrency at the Highest Level Possible Two directions: bottom-up and top-down. Bottom-up – consider threading the hotspots directly; if this is not possible, search up the call stack. Top-down – first consider the whole application and what the computation is coded to accomplish; if there is no obvious concurrency, distill the computation into successively smaller parts until the concurrency appears. Video encoding application: individual pixels → frames → whole video.

38 Rule 3: Plan Early for Scalability to Take Advantage of Increasing Numbers of Cores Quad-core processors are becoming the default multicore chip. Write flexible code that can take advantage of different numbers of cores. Paraphrasing C. Northcote Parkinson, “Data expands to fill the processing power available.”

39 Rule 4: Make Use of Thread-Safe Libraries Wherever Possible Intel Math Kernel Library (MKL) Intel Integrated Performance Primitives (IPP)

40 Rule 5: Use the Right Threading Model Don’t use explicit threads if an implicit threading model (e.g., OpenMP or Intel Threading Building Blocks) has all the functionality you need.

41 Rule 6: Never Assume a Particular Order of Execution

42 Rule 7: Use Thread-Local Storage Whenever Possible or Associate Locks to Specific Data Synchronization is overhead that does not contribute to the furtherance of the computation, so you should actively seek to keep the amount of synchronization to a minimum, for example by using storage that is local to threads or memory locations that are exclusive to a given thread.

43 Rule 8: Dare to Change the Algorithm for a Better Chance of Concurrency When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution: an O(n log n) algorithm will run faster than an O(n²) algorithm. If you cannot easily turn a hotspot into threaded code, consider replacing the algorithm in the code with a suboptimal serial algorithm that is easier to transform into threaded code.


45 Parallel Sum

46 PRAM Algorithm

47 PRAM Algorithm Can we use the PRAM algorithm for parallel sum in threaded code?

48 A More Practical Algorithm Divide the data array into chunks equal to the number of threads to be used. Assign each thread a unique chunk and sum the values within the assigned subarray into a private variable. Add these local partial sums to compute the total sum of the array elements.
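A hedged OpenMP sketch of this three-step algorithm (the chunking arithmetic and names are illustrative, not taken from the book's code):

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

double parallel_sum(const double* a, std::size_t n, int num_threads) {
    std::vector<double> partial(num_threads, 0.0);   // one slot per thread

    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();        // actual team size (<= num_threads)
        std::size_t chunk = (n + nthreads - 1) / nthreads;
        std::size_t begin = tid * chunk;
        std::size_t end = std::min(n, begin + chunk);

        double local = 0.0;                          // private partial sum
        for (std::size_t i = begin; i < end; ++i)
            local += a[i];
        partial[tid] = local;
    }

    double total = 0.0;                              // combine partial sums serially
    for (double p : partial)
        total += p;
    return total;
}
```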

49 Prefix Scan

50 Prefix Scan PRAM computation for prefix scan

51 Prefix Scan A More Practical Algorithm
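A hedged sketch of the chunk-based approach typically used here (assumptions: an inclusive scan with addition; chunking and names are illustrative): each thread scans its own chunk, one thread scans the chunk totals to produce per-chunk offsets, and each thread then adds its offset back into its chunk.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

void prefix_scan(std::vector<double>& a, int num_threads) {
    const std::size_t n = a.size();
    std::vector<double> chunk_total(num_threads, 0.0);

    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();         // actual team size
        std::size_t chunk = (n + nthreads - 1) / nthreads;
        std::size_t begin = tid * chunk;
        std::size_t end = std::min(n, begin + chunk);

        // Phase 1: each thread scans its own chunk in place.
        for (std::size_t i = begin + 1; i < end; ++i)
            a[i] += a[i - 1];
        if (begin < end)
            chunk_total[tid] = a[end - 1];

        #pragma omp barrier
        // Phase 2: one thread turns the chunk totals into running offsets.
        #pragma omp single
        for (int t = 1; t < nthreads; ++t)
            chunk_total[t] += chunk_total[t - 1];
        // (implicit barrier at the end of the single construct)

        // Phase 3: add the total of all preceding chunks to this chunk.
        double offset = (tid > 0) ? chunk_total[tid - 1] : 0.0;
        for (std::size_t i = begin; i < end; ++i)
            a[i] += offset;
    }
}
```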


53 Implicit Threading Implicit threading libraries take care of much of the minutiae needed to create, manage, and (to some extent) synchronize threads. All the little niggly details are hidden from programmers to make concurrent programming easier to implement and understand. OpenMP implements concurrency through special pragmas and directives inserted into your source code to indicate segments that are to be executed concurrently. These pragmas are recognized and processed by the compiler. Intel TBB uses defined parallel algorithms to execute methods within user-written classes that encapsulate the concurrent operations.

54 OpenMP OpenMP is a set of compiler directives, library routines, and environment variables that specify shared-memory concurrency in FORTRAN, C, and C++ programs. All major compilers support OpenMP: – Microsoft Visual C/C++.NET for Windows – GNU GCC compiler for Linux – Intel C/C++ compilers, for both Windows and Linux

55 OpenMP OpenMP directives demarcate code that can be executed in parallel (called parallel regions) and control how code is assigned to threads. For C and C++ the basic directive is #pragma omp parallel. OpenMP also has an atomic construct to ensure that statements will be executed atomically, and it provides a reduction clause to handle the details of a concurrent reduction.

56 OpenMP
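A hedged sketch (invented example, not the slide's code) showing the three features just mentioned: a parallel loop, the atomic construct, and a reduction clause.

```cpp
#include <omp.h>

// Sum the values and build a small histogram of them concurrently.
// Assumes values[] holds non-negative integers.
long sum_and_histogram(const int* values, int n, int* histogram, int num_bins) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += values[i];              // handled by the reduction clause
        int bin = values[i] % num_bins;
        #pragma omp atomic               // atomic update of a shared counter
        histogram[bin]++;
    }
    return total;
}
```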

57 Intel Threading Building Blocks Intel TBB is a C++ template-based library for loop-level parallelism that concentrates on defining tasks rather than explicit threads. Programmers using TBB can parallelize the execution of loop iterations by treating chunks of iterations as tasks and allowing the TBB task scheduler to determine: – the task sizes – number of threads to use – assignment of tasks to those threads – how those threads are scheduled for execution.

58 Intel Threading Building Blocks
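A hedged sketch (invented example) of the classic TBB style just described, where a user-written class supplies operator() and tbb::parallel_for runs it over chunks of the iteration range:

```cpp
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Body class: TBB calls operator() on chunks (blocked_range) of iterations.
class Scale {
    double* my_a;
    double  my_factor;
public:
    Scale(double* a, double factor) : my_a(a), my_factor(factor) {}
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            my_a[i] *= my_factor;
    }
};

void scale_array(double* a, std::size_t n, double factor) {
    // The TBB task scheduler decides chunk sizes, thread count, and mapping.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), Scale(a, factor));
}
```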

59 Explicit Threading Explicit threading libraries require the programmer to control all aspects of threads, including – creating threads – associating threads with functions – synchronizing threads – controlling the interactions between threads and shared resources.

60 Pthreads Pthreads provides pthread_t as the data type that holds a thread handle. To create a thread and associate it with a function for execution, use the pthread_create() function. When one thread needs to be sure that some other thread has terminated before proceeding with execution, it calls pthread_join().
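A minimal sketch of the calls just listed (error handling omitted; compile with -pthread):

```cpp
#include <pthread.h>
#include <cstdio>

void* worker(void* arg) {
    int id = *static_cast<int*>(arg);
    std::printf("hello from thread %d\n", id);
    return nullptr;
}

int main() {
    pthread_t handle;                                  // pthread_t holds the thread
    int id = 1;
    pthread_create(&handle, nullptr, worker, &id);     // start worker(&id)
    pthread_join(handle, nullptr);                     // wait for it to finish
    return 0;
}
```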

61 Pthreads Threads request the privilege of holding a mutex by calling pthread_mutex_lock(). Other threads attempting to gain control of the mutex will be blocked until the thread that is holding the lock calls pthread_mutex_unlock().
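A minimal sketch of these calls protecting a shared counter (invented example):

```cpp
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;

void* increment(void*) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);      // blocks until this thread holds the mutex
        ++shared_counter;               // protected update of shared data
        pthread_mutex_unlock(&lock);    // lets other threads acquire the mutex
    }
    return nullptr;
}
```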

62 Pthreads A thread blocks and waits for a condition variable to be signaled by calling pthread_cond_wait() on that condition variable. An executing thread calls pthread_cond_signal() on a condition variable to wake up one thread that has been blocked on it. The pthread_cond_broadcast() function wakes all threads that are waiting on the condition variable.
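A minimal sketch of the signaling pattern (invented example): one producer sets a flag and signals, while the consumer waits in a loop to guard against spurious wakeups.

```cpp
#include <pthread.h>

pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
bool data_ready = false;

void* consumer(void*) {
    pthread_mutex_lock(&lock);
    while (!data_ready)                   // re-check the condition on every wakeup
        pthread_cond_wait(&ready, &lock); // atomically releases the lock while waiting
    pthread_mutex_unlock(&lock);
    return nullptr;
}

void* producer(void*) {
    pthread_mutex_lock(&lock);
    data_ready = true;
    pthread_cond_signal(&ready);          // wake one waiting thread
    pthread_mutex_unlock(&lock);
    return nullptr;
}
```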

63 Pthreads

64 Questions?

