CPE 779 Parallel Computing - Spring 2010: Creating and Using Threads. Based on Slides by Katherine Yelick



Parallel Programming Models
A programming model is made up of the languages and libraries that create an abstract view of the machine.
Control
- How is parallelism created?
- How are dependencies (orderings) enforced?
Data
- Can data be shared, or is it all private?
- How is shared data accessed, and how is private data communicated?
Synchronization
- What operations can be used to coordinate parallelism?
- What are the atomic (indivisible) operations?

Simple Example
Consider applying a function square to the elements of an array A and then computing the sum of the squared values:

    A = array of all data
    Asqr = map(square, A)
    s = sum(Asqr)

- What can be done in parallel?
- How long would it take (in big-O) if we have an unlimited number of processors?
[Figure: A is squared elementwise into Asqr, whose elements are then reduced to the sum s.]
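As an added illustration (not from the slides), here is a minimal serial C version of the three-line pseudocode; the comments note where the parallelism lives.

    #include <stddef.h>

    /* Serial sketch of the example (the names A, n, square follow the
     * slide's pseudocode). Each square(A[i]) is independent of the
     * others, so the map step is O(1) with n processors; the additions
     * can be arranged as a balanced tree, so with unlimited processors
     * the whole computation takes O(log n) time. */
    static double square(double x) { return x * x; }

    double sum_of_squares(const double *A, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += square(A[i]);   /* Asqr = map(square, A); s = sum(Asqr) */
        return s;
    }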

Problem Decomposition
In designing a parallel algorithm:
- Decompose the problem into smaller tasks.
- Determine which tasks can be done in parallel with each other, and which must be done in some order.
Conceptualize a decomposition as a task dependency graph: a directed graph with
- nodes corresponding to tasks, and
- edges indicating dependencies: the result of one task is required for processing the next.
A given problem may be decomposed into tasks in many different ways. Tasks may be of the same, different, or indeterminate sizes.
[Figure: task graph for the example, with nodes sqr(A[0]), sqr(A[1]), sqr(A[2]), ..., sqr(A[n]) all feeding a single sum node.]

Example: Multiplying a Matrix with a Vector

    for i = 1 to n
        for j = 1 to n
            y[i] += A[i,j] * x[j];

Dependencies: each output element of y depends on one row of A and all of x.
Task graph: since each output element is independent, the task graph can have n nodes and no dependence edges.
Observations: all tasks are of the same size in terms of number of operations.
Question: could we have more tasks? Fewer? (A sketch of the one-task-per-row version follows.)
Slide derived from: Grama, Karypis, Kumar and Gupta
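As an illustrative sketch (the helper name and row-major layout are assumptions, not from the slide), each of the n independent tasks in this decomposition can be written as a C function computing one output element:

    /* One task per output element y[i]: reads row i of A (stored
     * row-major) and all of x, and writes only y[i]. Since no task
     * reads another task's output, the task graph has n nodes and no
     * dependence edges. */
    void matvec_row_task(int i, int n, const double *A,
                         const double *x, double *y) {
        double acc = 0.0;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }

More tasks are possible (one per product A[i,j] * x[j], at the cost of a reduction per row); fewer are too, as the coarse-grained version on a later slide shows.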

Example: Database Query Processing
Consider the execution of the query

    MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

on the following database:

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000

Slide derived from: Grama, Karypis, Kumar and Gupta

Example: Database Query Processing
One decomposition creates tasks that each generate an intermediate table of entries satisfying a particular clause (Civic, 2001, Green, White); later tasks then combine these intermediate tables pairwise. A sketch of this decomposition in code follows.
Slide derived from: Grama, Karypis, Kumar and Gupta
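A minimal sketch of this idea (the Row type, field names, and mask representation are hypothetical; the slide gives no code): each clause task scans the table and produces a 0/1 mask of matching rows, and combining tasks AND/OR the masks. The clause tasks are mutually independent, while each combining task depends only on its two inputs.

    #include <string.h>

    typedef struct { const char *model; int year; const char *color; } Row;

    /* Clause tasks: all of these can run in parallel with each other. */
    void match_model(const Row *t, int n, const char *m, char *out) {
        for (int i = 0; i < n; i++) out[i] = strcmp(t[i].model, m) == 0;
    }
    void match_year(const Row *t, int n, int y, char *out) {
        for (int i = 0; i < n; i++) out[i] = t[i].year == y;
    }
    void match_color(const Row *t, int n, const char *c, char *out) {
        for (int i = 0; i < n; i++) out[i] = strcmp(t[i].color, c) == 0;
    }

    /* Combining tasks: each depends on the two masks it consumes. */
    void mask_and(const char *a, const char *b, int n, char *out) {
        for (int i = 0; i < n; i++) out[i] = a[i] && b[i];
    }
    void mask_or(const char *a, const char *b, int n, char *out) {
        for (int i = 0; i < n; i++) out[i] = a[i] || b[i];
    }

This slide's decomposition corresponds to one way of parenthesizing the query over these masks; the next slide combines the same masks in a different order.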

Example: Database Query Processing
Here is a different decomposition: the same clause tables are combined in a different order, giving a different task dependency graph. The choice of decomposition will affect parallel performance.
Slide derived from: Grama, Karypis, Kumar and Gupta

Granularity of Task Decompositions
Granularity is related to the number of tasks into which a problem is decomposed.
- Fine-grained: a large number of small tasks.
- Coarse-grained: a smaller number of larger tasks.
[Figure: a coarse-grained version of the dense matrix-vector product, in which each task computes several rows of the output.]
Slide derived from: Grama, Karypis, Kumar and Gupta
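For contrast with the one-row-per-task sketch above, here is a hedged sketch of the coarse-grained decomposition pictured on the slide: p tasks, each computing a contiguous block of roughly n/p rows (function name and blocking scheme are assumptions).

    /* Coarse-grained matrix-vector product: task t of p computes rows
     * [t*n/p, (t+1)*n/p) of y. Fewer, larger tasks than one per row. */
    void matvec_block_task(int t, int p, int n,
                           const double *A, const double *x, double *y) {
        int lo = (int)((long)t * n / p);
        int hi = (int)((long)(t + 1) * n / p);
        for (int i = lo; i < hi; i++) {
            double acc = 0.0;
            for (int j = 0; j < n; j++)
                acc += A[(long)i * n + j] * x[j];
            y[i] = acc;
        }
    }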

Degree of Concurrency
The degree of concurrency of a task graph is the number of tasks that can be executed in parallel.
- It may vary over the execution, so we can speak of the maximum or the average degree of concurrency.
- The degree of concurrency increases as the decomposition becomes finer in granularity.
A directed path in a task graph represents a sequence of tasks that must be processed one after the other. The critical path is the longest such path.
- These graphs are normally weighted by the cost of each task (node), and a path's length is the sum of the weights of its nodes.
Slide derived from: Grama, Karypis, Kumar and Gupta
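As an added sketch (not on the slide; the CSR-style dependence arrays are an assumption): given task weights and tasks already numbered in topological order, the critical-path length falls out of a simple longest-finish-time recurrence.

    /* Critical path of a weighted task DAG. Tasks are numbered 0..n-1
     * in topological order; deps[off[i] .. off[i+1]) lists the tasks
     * that task i depends on. finish[] is caller-provided scratch of
     * length n. Returns the largest finish time over all tasks. */
    double critical_path(int n, const double *weight,
                         const int *deps, const int *off, double *finish) {
        double longest = 0.0;
        for (int i = 0; i < n; i++) {
            double start = 0.0;   /* earliest start = max finish of deps */
            for (int k = off[i]; k < off[i + 1]; k++)
                if (finish[deps[k]] > start) start = finish[deps[k]];
            finish[i] = start + weight[i];
            if (finish[i] > longest) longest = finish[i];
        }
        return longest;
    }

For the square-and-sum task graph with unit weights, every path contains one square node and the sum node, so the critical path has length 2 while the total work is n + 1.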

Limits on Parallel Performance
Parallel time can be made smaller by making the decomposition finer, but there is an inherent bound on how fine the granularity of a computation can be. For example, in multiplying a dense n x n matrix with a vector, there can be no more than Θ(n²) concurrent tasks (one per elementary multiplication). In addition, there may be communication overhead between tasks.
Slide derived from: Grama, Karypis, Kumar and Gupta
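One compact way to state both limits together (work-span notation; an added gloss, not used on the slide): let T1 be the total work (time on one processor) and T∞ the critical-path length (time with unlimited processors). Then on p processors

    Tp ≥ max(T1 / p, T∞),   so   speedup = T1 / Tp ≤ min(p, T1 / T∞).

For the dense matrix-vector product, T1 = Θ(n²) and T∞ = Θ(log n) (each row's additions form a tree), so the achievable speedup is at most Θ(n² / log n) no matter how many processors are used.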

Shared Memory Programming Model
A program is a collection of threads of control, which can be created dynamically.
- Each thread has a set of private variables, e.g., local stack variables.
- Threads also share a set of variables, e.g., static variables, shared common blocks, or the global heap.
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate by synchronizing on shared variables.
[Figure: threads P0, P1, ..., Pn execute code such as s = ... and y = ..s..; each holds a private i (i: 2, i: 5, i: 8) in private memory, while s and y live in shared memory.]
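A minimal runnable sketch of this model using POSIX threads (one concrete realization; the slide itself is implementation-neutral): s lives in shared memory, while each thread's i is private to its own stack.

    #include <pthread.h>
    #include <stdio.h>

    static double s = 0.0;                               /* shared */
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        int i = *(int *)arg;      /* private: each thread has its own i */
        pthread_mutex_lock(&lk);  /* coordinate via synchronization */
        s += i;                   /* communicate via the shared variable */
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {2, 5};
        for (int k = 0; k < 2; k++)
            pthread_create(&t[k], NULL, worker, &ids[k]);
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);
        printf("s = %g\n", s);    /* prints 7 */
        return 0;
    }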

Shared Memory "Code" for Computing a Sum

    static int s = 0;

    Thread 1:
        for i = 0, (n/2)-1
            s = s + sqr(A[i])

    Thread 2:
        for i = n/2, n-1
            s = s + sqr(A[i])

The problem is a race condition on the variable s. A race condition or data race occurs when:
- two processors (or two threads) access the same variable, and at least one does a write;
- the accesses are concurrent (not synchronized), so they could happen simultaneously.

Shared Memory Code for Computing a Sum
Assume A = [3,5], f is the square function, and s = 0 initially.

Shared Memory Code for Computing a Sum

    static int s = 0;

    Thread 1:
        compute f(A[i]) and put in reg0
        reg1 = s
        reg1 = reg1 + reg0
        s = reg1

    Thread 2:
        compute f(A[i]) and put in reg0
        reg1 = s
        reg1 = reg1 + reg0
        s = reg1

Assume A = [3,5], f is the square function, and s = 0 initially. For this program to work, s should be 34 at the end, but it may be 34, 9, or 25: if both threads read s = 0 before either writes it back, one update is lost and s ends up as only 9 (f(3)) or only 25 (f(5)), depending on which write lands last.


Corrected Code for Computing a Sum

    static int s = 0;
    static lock lk;

    Thread 1:
        local_s1 = 0
        for i = 0, (n/2)-1
            local_s1 = local_s1 + sqr(A[i])
        lock(lk);
        s = s + local_s1
        unlock(lk);

    Thread 2:
        local_s2 = 0
        for i = n/2, n-1
            local_s2 = local_s2 + sqr(A[i])
        lock(lk);
        s = s + local_s2
        unlock(lk);

Since addition is associative, it's OK to rearrange the order in which the partial sums are combined. Most computation is now on private variables:
- Sharing frequency is reduced, which might improve speed.
- There would still be a race condition on the update of the shared s, so each update is wrapped in lock(lk)/unlock(lk). A runnable pthreads version of this pattern follows.
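For completeness, a hedged, runnable pthreads rendering of the corrected pattern (the array contents and size N are illustrative, not from the slide):

    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    static double A[N] = {3, 5, 1, 2, 4, 6, 0, 7};
    static double s = 0.0;                         /* shared sum */
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread sums the squares of its half of A into a private
     * local variable, then adds that into the shared s under the lock,
     * so the shared variable is updated only once per thread. */
    static void *partial_sum(void *arg) {
        int t = *(int *)arg;                       /* thread id: 0 or 1 */
        int lo = t * N / 2, hi = (t + 1) * N / 2;
        double local = 0.0;                        /* private partial sum */
        for (int i = lo; i < hi; i++)
            local += A[i] * A[i];
        pthread_mutex_lock(&lk);
        s += local;                                /* the only shared update */
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void) {
        pthread_t th[2];
        int ids[2] = {0, 1};
        for (int t = 0; t < 2; t++)
            pthread_create(&th[t], NULL, partial_sum, &ids[t]);
        for (int t = 0; t < 2; t++)
            pthread_join(th[t], NULL);
        printf("s = %g\n", s);                     /* 140 for this array */
        return 0;
    }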