COMP8330/7330/7336 Advanced Parallel and Distributed Computing Decomposition and Parallel Tasks (cont.) Dr. Xiao Qin Auburn University

Recap: Multiplying a Dense Matrix with a Vector Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and vector accessed by Task 1. Observations: while the tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other; and all tasks are of the same size in terms of the number of operations. Is this the maximum number of tasks we could decompose this problem into?
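
To make the n-task decomposition concrete, here is a minimal C sketch (the function names and the flat row-major layout are illustrative assumptions, not taken from the slides): each call to matvec_task is one task, writing a distinct y[i] while reading only row i of A and the shared vector b, so no task waits on any other.

    /* Dense matrix-vector product y = A*b decomposed into n tasks:
       task i computes y[i] from row i of A and the shared vector b. */
    void matvec_task(const double *A, const double *b, double *y, int n, int i)
    {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];   /* reads only row i and the shared b */
        y[i] = sum;                       /* writes only y[i]: no control dependency */
    }

    void matvec(const double *A, const double *b, double *y, int n)
    {
        for (int i = 0; i < n; i++)       /* each iteration is an independent task */
            matvec_task(A, b, y, n, i);
    }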

Recap: Database Query Processing The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause. The figure shows one way of decomposing the given query into a number of tasks; edges in this graph denote that the output of one task is needed to accomplish the next.

Recap The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition. The length of the longest path in a task dependency graph is called the critical path length.

Degree of Concurrency The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program:

    average degree of concurrency = total amount of work / critical path length

Q1: What is the average degree of concurrency in each decomposition?
(a) Critical path length = 27, total amount of work = 63; average degree of concurrency = 63/27 = 2.33.
(b) Critical path length = 34, total amount of work = 64; average degree of concurrency = 64/34 = 1.88.

Limits on Parallel Performance It might appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity. In practice, however, there is an inherent bound on how fine the granularity of a computation can be.

Q2: What is the upper bound on the number of concurrent tasks? For multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks (one per multiplication of a matrix element with a vector element).

Q3: Is a larger number of concurrent tasks always better? Not necessarily. Communication overhead: concurrent tasks may have to exchange data with other tasks, and the tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds.

Task Interaction Graphs Subtasks in a decomposition often exchange data with one another. The graph of tasks and their interactions/data exchanges is referred to as a task interaction graph. Note: task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies.

Q4: Can you explain this task interaction graph? Consider multiplying a sparse matrix A with a vector b. The computation of each element of the result vector is a task, and only the non-zero elements of matrix A participate in the computation. If we partition b across tasks, then the task interaction graph of the computation is identical to the graph of the matrix A.
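
One way to see this structure in code is the sketch below, which assumes the common CSR (compressed sparse row) storage format rather than anything prescribed by the slides: task i reads b[col_idx[k]] only for the non-zeros of row i, so its interaction partners are exactly the neighbors of i in the graph of the matrix A.

    /* Sparse y = A*b with A stored in CSR form (row_ptr, col_idx, val).
       Task i computes y[i]; every b[col_idx[k]] it touches corresponds to
       an edge (i, col_idx[k]) in the task interaction graph. */
    void spmv_task(const int *row_ptr, const int *col_idx, const double *val,
                   const double *b, double *y, int i)
    {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * b[col_idx[k]];   /* interaction with the owner of b[col_idx[k]] */
        y[i] = sum;
    }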

Task Interaction Graphs, Granularity, and Communication The finer the granularity of a decomposition, the larger the associated overhead becomes as a fraction of the useful work per task.

Task Interaction Graphs, Granularity, and Communication Example: each node takes unit time to process, and each interaction (edge) causes an overhead of one time unit. Viewing node 0 as an independent task then involves a useful computation of one time unit and a communication overhead of three time units. How can this problem be addressed?

Summary Average Degree of Concurrency; Task Granularity and Parallel Performance; Task Interaction Graphs

COMP7330/7336 Advanced Parallel and Distributed Computing Processes and Mapping Dr. Xiao Qin Auburn University Slides are adapted from Drs. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Processes and Mapping In general, the number of tasks in a decomposition exceeds the number of processing elements available. A parallel algorithm must therefore provide a mapping of tasks to processes.

Processes and Mapping We refer to the mapping as being from tasks to processes, as opposed to processors, because typical programming APIs do not allow easy binding of tasks to physical processors. Instead, we aggregate tasks into processes and rely on the system to map these processes to physical processors. A process (not in the UNIX sense of a process) is a collection of tasks and associated data.
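
As a rough illustration of such an aggregation (the block-mapping rule is an assumption for this sketch, not the slides' prescription), the n fine-grained row-tasks of the matrix-vector example can be grouped into p processes, each owning a contiguous block of rows:

    /* Aggregate n row-tasks into p processes with a block mapping:
       process r owns rows [r*block, min((r+1)*block, n)). */
    int owner_of_row(int i, int n, int p)
    {
        int block = (n + p - 1) / p;   /* ceil(n/p) rows per process */
        return i / block;              /* process that executes task i */
    }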

Processes and Mapping Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm. Mappings are determined by both the task dependency and the task interaction graphs. Q5: What can task dependency graphs be used to ensure? Q6: What can task interaction graphs be used to ensure? Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance). Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).

Processes and Mapping Q7: How must an appropriate mapping minimize parallel execution time? By mapping independent tasks to different processes; by assigning tasks on the critical path to processes as soon as they become available; and by minimizing interaction between processes, i.e., by mapping tasks with dense interactions to the same process. Q8: Do these criteria conflict with one another?

Processes and Mapping: Example Mapping tasks in the database query decomposition to processes (Q9: How?). View the dependency graph in terms of levels (no two nodes within a level have dependencies); tasks within a single level are then assigned to different processes, as in the sketch below.
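
A hedged sketch of this level-by-level assignment (the level arrays and the round-robin rule are illustrative assumptions): tasks in the same level have no mutual dependencies, so they can simply be dealt out across the available processes.

    /* Assign tasks to processes level by level. Within a level no two
       tasks depend on each other, so they are spread round-robin. */
    void map_by_levels(int num_levels, const int *level_size,
                       int num_procs, int *proc_of_task)
    {
        int task = 0;
        for (int lvl = 0; lvl < num_levels; lvl++)
            for (int t = 0; t < level_size[lvl]; t++, task++)
                proc_of_task[task] = t % num_procs;   /* spread each level across processes */
    }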

Decomposition Techniques So how does one decompose a task into various subtasks? Commonly used techniques include: recursive decomposition, data decomposition, exploratory decomposition, and speculative decomposition.

Recursive Decomposition Generally suited to problems that are solved using the divide-and-conquer strategy: a given problem is first decomposed into a set of sub-problems, and these sub-problems are recursively decomposed further until a desired granularity is reached.

Recursive Decomposition: Quicksort Once the list has been partitioned around the pivot, each sublist can be processed concurrently (i.e., each sublist represents an independent subtask). This can be repeated recursively.
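
A sketch of this decomposition using OpenMP tasks (the Lomuto partition and the overall structure are illustrative choices, not taken from the slides): after the sequential partition step, the two sublists become independent subtasks that may be sorted concurrently.

    #include <omp.h>

    /* Lomuto partition: places the pivot A[hi] in its final position
       and returns that index. */
    static int partition(int *A, int lo, int hi)
    {
        int pivot = A[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (A[j] < pivot) { int t = A[i]; A[i] = A[j]; A[j] = t; i++; }
        int t = A[i]; A[i] = A[hi]; A[hi] = t;
        return i;
    }

    void quicksort_tasks(int *A, int lo, int hi)
    {
        if (lo >= hi) return;
        int p = partition(A, lo, hi);     /* both subtasks depend on this step */
        #pragma omp task shared(A)        /* left sublist: independent subtask */
        quicksort_tasks(A, lo, p - 1);
        #pragma omp task shared(A)        /* right sublist: independent subtask */
        quicksort_tasks(A, p + 1, hi);
        #pragma omp taskwait
    }

For actual concurrency the initial call would be made from inside a parallel region (for example, #pragma omp parallel followed by #pragma omp single); the recursive structure itself is unchanged.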

Recursive Decomposition: Finding the Minimum Number A simple serial loop:

    procedure SERIAL_MIN(A, n)
    begin
        min := A[0];
        for i := 1 to n - 1 do
            if (A[i] < min) min := A[i];
        endfor;
        return min;
    end SERIAL_MIN

Q10: How can you rewrite the loop using recursive decomposition?

Recursive Decomposition: Finding the Minimum Number

    procedure RECURSIVE_MIN(A, n)
    begin
        if (n = 1) then
            min := A[0];
        else
            lmin := RECURSIVE_MIN(A, n/2);
            rmin := RECURSIVE_MIN(&(A[n/2]), n - n/2);
            if (lmin < rmin) then
                min := lmin;
            else
                min := rmin;
            endelse;
        endelse;
        return min;
    end RECURSIVE_MIN
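
The same recursion maps naturally onto parallel tasks. Below is a hedged C/OpenMP sketch of the same idea (the data-sharing clauses and the integer element type are illustrative assumptions): the two halves are independent subtasks, and their results are combined only after both complete.

    /* Recursive minimum: the two halves of the array are independent subtasks. */
    int recursive_min(const int *A, int n)
    {
        if (n == 1) return A[0];
        int lmin, rmin;
        #pragma omp task shared(lmin)          /* left half */
        lmin = recursive_min(A, n / 2);
        #pragma omp task shared(rmin)          /* right half */
        rmin = recursive_min(A + n / 2, n - n / 2);
        #pragma omp taskwait                   /* combine only after both finish */
        return (lmin < rmin) ? lmin : rmin;
    }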

Summary Processes and Mapping; Recursive Decomposition