Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar
Selected Topics from Chapter 3: Principles of Parallel Algorithm Design
Elements of a Parallel Algorithm
- Tasks: the pieces of work that can be done concurrently
- Mapping of the tasks onto multiple processors (processes vs. processors)
- Distributing the input, output, and intermediate results across the different processors
- Management of access to shared data (either input or intermediate)
- Synchronization of the processors at various points of the parallel execution
Finding Concurrent Pieces of Work
- Decomposition: the process of dividing the computation into smaller pieces of work called tasks
- Tasks are programmer-defined and are considered indivisible
Tasks can be of different sizes
Task-Dependency Graph
- In most cases, there are dependencies between the different tasks: certain task(s) can only start once some other task(s) have finished
- Example: producer-consumer relationships
- These dependencies are represented using a directed acyclic graph (DAG) called a task-dependency graph
Task-Dependency Graph (cont)
- A task-dependency graph is a directed acyclic graph in which the nodes represent tasks and the directed edges indicate the dependencies between them
- The task corresponding to a node can be executed only when all tasks connected to this node by incoming edges have been completed
- The number and size of the tasks that the problem is decomposed into determine the granularity of the decomposition
- Called fine-grained for a large number of small tasks
- Called coarse-grained for a small number of large tasks
Task-Dependency Graph (cont)
Key concepts derived from the task-dependency graph:
- Degree of concurrency: the number of tasks that can be executed concurrently. We are usually most concerned with the average degree of concurrency.
- Critical path: the longest vertex-weighted path in the graph. The weights inside the nodes represent the task sizes; the critical path length is the sum of the weights of the nodes along the path.
- As the granularity becomes finer, the degree of concurrency normally increases and the critical path typically becomes shorter.
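Both quantities can be read off a weighted task-dependency graph mechanically. Below is a minimal sketch, assuming a small hypothetical DAG (the edges and weights are illustrative, not from the slides), that computes the critical path length with a topological sweep and derives the average degree of concurrency as total work divided by critical path length.

    from collections import defaultdict, deque

    # Hypothetical weighted task-dependency graph; edge (u, v) means
    # task v can start only after task u finishes.
    edges = [(1, 4), (2, 4), (2, 5), (3, 5), (4, 6), (5, 6)]
    weight = {1: 10, 2: 10, 3: 10, 4: 6, 5: 6, 6: 8}

    succ = defaultdict(list)
    indeg = {t: 0 for t in weight}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # longest[v] = weight of the heaviest path ending at v (inclusive),
    # computed in topological order (Kahn's algorithm).
    longest = dict(weight)
    ready = deque(t for t in weight if indeg[t] == 0)
    while ready:
        u = ready.popleft()
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + weight[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    critical_path_length = max(longest.values())          # 10 + 6 + 8 = 24
    total_work = sum(weight.values())                     # 50
    avg_concurrency = total_work / critical_path_length   # 50 / 24 ~ 2.08
    print(critical_path_length, round(avg_concurrency, 2))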
Task-Interaction Graph
- Captures the pattern of interaction between tasks
- This graph usually contains the task-dependency graph as a subgraph, since there may be interactions between tasks even when there are no dependencies between them
- These additional interactions are usually due to accesses of shared data
Task Dependency and Interaction Graphs
- These graphs are important in developing an effective mapping of the tasks onto the different processors
- The goal is to maximize concurrency and minimize overheads
Common Decomposition Methods
- Data Decomposition
- Recursive Decomposition
- Exploratory Decomposition
- Speculative Decomposition
- Hybrid Decomposition
Recursive Decomposition
- Suitable for problems that can be solved using the divide-and-conquer paradigm
- Each of the subproblems generated by the divide step becomes a new task
Example: Quicksort
Another Example: Finding the Minimum
Note that we can obtain divide-and-conquer algorithms for problems that are usually solved by using other methods.
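As a minimal sketch of such a decomposition (illustrative code, not the textbook's pseudocode), min-finding can be recast as divide and conquer: split the array in half, find each half's minimum as an independent task, and combine the two results. The two recursive calls have no dependency on each other, so they could run concurrently.

    # The two recursive calls are independent tasks that a parallel
    # runtime could execute concurrently; min() is the combine step.
    def recursive_min(a, lo, hi):
        if hi - lo == 1:                      # base case: one element
            return a[lo]
        mid = (lo + hi) // 2
        left = recursive_min(a, lo, mid)      # task 1
        right = recursive_min(a, mid, hi)     # task 2 (independent of task 1)
        return min(left, right)

    print(recursive_min([4, 9, 1, 7, 8, 11, 2, 12], 0, 8))   # -> 1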
Recursive Decomposition
How good are the decompositions produced?
- Average concurrency?
- Length of the critical path?
How do the quicksort and min-finding decompositions measure up?
Data Decomposition
- Used to derive concurrency for problems that operate on large amounts of data
- The idea is to derive the tasks by focusing on the multiplicity of data
- Data decomposition is often performed in two steps:
  Step 1: Partition the data
  Step 2: Induce a computational partitioning from the data partitioning
Data Decomposition (cont)
- Which data should we partition: input, output, or intermediate? All of the above; this leads to the different data decomposition methods
- How do we induce a computational partitioning? Use the “owner-computes” rule
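A minimal sketch of output-data decomposition under the owner-computes rule, assuming a dense matrix-vector product y = A·x and a hypothetical row-block partitioning (none of these names come from the slides): each task owns a block of entries of the output y and performs exactly the computation that produces those entries.

    # Output-data decomposition of y = A * x: each task owns a block of
    # rows of y and computes exactly the entries it owns.
    def owner_task(A, x, rows):
        return {i: sum(A[i][j] * x[j] for j in range(len(x))) for i in rows}

    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    x = [1, 1]
    partition = [range(0, 2), range(2, 4)]    # two tasks, two output rows each
    y = {}
    for rows in partition:                    # each iteration could be a task
        y.update(owner_task(A, x, rows))
    print([y[i] for i in range(4)])           # -> [3, 7, 11, 15]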
Exploratory Decomposition
Used to decompose computations that correspond to a search of the space of solutions.
Example: 15-puzzle Problem
Exploratory Decomposition
- Not general purpose
- After sufficient branches are generated, each node can be assigned the task of exploring further down one branch
- As soon as one task finds a solution, the other tasks can be terminated
- It can result in speedup and slowdown anomalies: the work performed by the parallel formulation of an algorithm can be either smaller or greater than that performed by the serial algorithm
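A minimal sketch of this pattern, assuming a toy search problem (the branches and the solution test are hypothetical): each task explores one branch, and a shared flag lets the first task that finds a solution terminate the others early.

    # Each task explores one branch of the search space; the first task
    # to find a solution sets a shared flag that terminates the others.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    stop = threading.Event()

    def is_solution(candidate):        # hypothetical solution test
        return candidate == 42

    def explore(branch):
        for candidate in branch:
            if stop.is_set():          # another task already succeeded
                return None
            if is_solution(candidate):
                stop.set()             # tell the other tasks to terminate
                return candidate
        return None

    branches = [range(0, 100), range(100, 200), range(200, 300)]
    with ThreadPoolExecutor(max_workers=3) as pool:
        hits = [r for r in pool.map(explore, branches) if r is not None]
    print(hits)                        # e.g. [42]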
Speculative Decomposition
- Used to extract concurrency in problems in which the next step is one of many possible actions that can only be determined when the current task finishes
- This decomposition assumes a certain outcome of the currently executing task and executes some of the next steps before that outcome is known
- Similar to speculative execution in a microprocessor
Speculative Decomposition
Difference from exploratory decomposition:
- In speculative decomposition, the input at a branch leading to multiple tasks is unknown
- In exploratory decomposition, the output of the multiple tasks originating at the branch is unknown
Example: Discrete Event Simulation
Speculative Execution
If predictions are wrong:
- Work is wasted
- Work may need to be undone (state-restoring overhead in memory/computations)
However, it may be the only way to extract concurrency!
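A minimal sketch of the idea, with hypothetical task and step functions: while the current task runs, the predicted next step is started speculatively; if the prediction turns out wrong, the speculative work is discarded.

    # While the current task runs, the predicted next step is executed
    # speculatively; wrong predictions waste (and discard) that work.
    from concurrent.futures import ThreadPoolExecutor

    def current_task():
        return "branch_a"              # outcome known only when it finishes

    def step_a():                      # the step we predict will follow
        return "result of A"

    def step_b():                      # the step taken if the prediction fails
        return "result of B"

    with ThreadPoolExecutor(max_workers=1) as pool:
        speculative = pool.submit(step_a)      # start the predicted step early
        outcome = current_task()               # runs concurrently with step_a
        if outcome == "branch_a":
            result = speculative.result()      # prediction right: reuse the work
        else:
            speculative.cancel()               # prediction wrong: discard it
            result = step_b()
    print(result)                              # -> "result of A"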
Mapping Tasks to Processors
A good mapping strives to achieve the following conflicting goals:
- Reducing the amount of time processors spend interacting with each other
- Reducing the total amount of time during which some processors are idle while others are active
Good mappings attempt to reduce the parallel processing overheads:
- If Tp is the parallel runtime using p processors and Ts is the sequential runtime (for the same algorithm), then the total overhead To is p×Tp − Ts
- This is the work done by the parallel system beyond that required by the serial system
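For example, with assumed numbers (not from the slides): if Ts = 100 seconds and a run on p = 4 processors takes Tp = 30 seconds, then To = 4×30 − 100 = 20 seconds of work spent on interaction, idling, and other overheads beyond what the serial algorithm requires.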
Add Slides from Karypis Lecture Slides
Add slides here from the PDF lecture slides by Karypis for Chapter 3 of the textbook: Introduction to Parallel Computing, Second Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison Wesley, 2003. Topics covered on these slides:
- Characteristics of Tasks and Interactions
- Mapping Techniques for Load Balancing
- Methods for Containing Interaction Overheads
These slides are easiest to view by going to “View” and choosing “Full Screen”; exit from “Full Screen” using the “Esc” key.
Parallel Algorithm Models
- The Task Graph Model
  - Closely related to Foster’s Task/Channel Model
  - Includes the task-dependency graph, where dependencies usually result from communications between two tasks
  - Also includes the task-interaction graph, which captures other interactions between tasks such as data sharing
- The Work Pool Model
- The Master-Slave Model
- The Pipeline or Producer-Consumer Model
- Hybrid Models
The Task Graph Model
- The computations in a parallel algorithm can be viewed as a task-dependency graph
- Tasks are mapped to processors so that:
  - Locality is promoted
  - The volume and frequency of interactions are reduced
  - Asynchronous interaction methods are used to overlap interactions with computation
- Typically used to solve problems in which the data related to a task is rather large compared to the amount of computation
The Task Graph Model (cont.)
Examples of algorithms based on the task graph model:
- Parallel quicksort (Section 9.4.1)
- Sparse matrix factorization
- Multiple parallel algorithms derived from divide-and-conquer decompositions
Task parallelism: the type of parallelism that is expressed by the independent tasks in a task-dependency graph
The Work Pool Model (also called the “Task Pool Model”)
- Involves dynamic mapping of tasks onto processes for load balancing
- Any task may potentially be performed by any process
- The mapping of tasks to processors can be centralized or decentralized
- Pointers to tasks may be stored in a physically shared list, priority queue, hash table, or tree, or in a physically distributed data structure
The Work Pool Model (cont.)
- When work is generated dynamically and a decentralized mapping is used, a termination-detection algorithm is required
- When used with a message-passing paradigm, the data required by the tasks is normally relatively small compared to the computation, so tasks can be readily moved around without causing too much data-interaction overhead
- The granularity of tasks can be adjusted to obtain the desired tradeoff between load imbalance and the overhead of adding and extracting tasks
The Work Pool Model (cont.)
Examples of algorithms based on the Work Pool Model:
- Chunk scheduling
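A minimal work-pool sketch, assuming Python threads and a centrally shared queue (the tasks themselves are placeholders): workers repeatedly pull tasks from the pool until it is empty, so faster workers automatically end up doing more tasks.

    # Centralized work pool: workers pull tasks from a shared queue
    # until it is drained, balancing load dynamically.
    import queue, threading

    pool = queue.Queue()
    for task in range(20):             # placeholder tasks
        pool.put(task)

    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()
            except queue.Empty:        # pool drained: this worker is done
                return
            r = task * task            # stand-in for the real computation
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(sorted(results))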
Master-Slave Model (also called the Manager-Worker Model)
- One or more master processes generate work and allocate it to worker processes
- Managers can allocate tasks in advance if they can estimate the sizes of tasks, or if a random mapping can avoid load-balancing problems
- Normally, workers are assigned smaller tasks as needed
- Work can be performed in phases: work in each phase is completed, and the workers are synchronized, before the next phase is started
- Normally, any worker can do any assigned task
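A minimal master-worker sketch, with hypothetical tasks and Python threads standing in for processes: the master generates work and pushes it to a task queue, workers take smaller tasks as needed, and results flow back to the master.

    # Master-worker: the master generates tasks and collects results;
    # workers take smaller tasks as needed.
    import queue, threading

    tasks, results = queue.Queue(), queue.Queue()
    STOP = object()                    # sentinel telling a worker to exit

    def worker():
        while True:
            t = tasks.get()
            if t is STOP:
                return
            results.put((t, t + 1))    # stand-in for the real work

    workers = [threading.Thread(target=worker) for _ in range(3)]
    for w in workers: w.start()

    for t in range(10):                # master allocates work
        tasks.put(t)
    for _ in workers:                  # one STOP per worker
        tasks.put(STOP)

    collected = [results.get() for _ in range(10)]
    for w in workers: w.join()
    print(sorted(collected))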
Master-Slave Model (cont)
- Can be generalized to a multi-level manager-worker model: top-level managers feed large chunks of tasks to second-level managers, who subdivide the tasks among their workers and may also perform some of the work
- Danger of the manager becoming a bottleneck, which can happen if the tasks are too small
- The granularity of tasks should be chosen so that the cost of doing work dominates the cost of synchronization
- Waiting time may be reduced if worker requests are non-deterministic
Master-Slave Model (cont)
Examples of algorithms based on the Master-Slave Model:
- A master-slave example for centralized dynamic load balancing is mentioned in Section (page 130)
- Several examples are given in the textbook: Barry Wilkinson and Michael Allen, “Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers”, 1st or 2nd Edition, 1999 & 2005, Prentice Hall
Pipeline or Producer-Consumer Model
- Usually similar to the “linear array model” studied in Akl’s textbook
- A stream of data is passed through a succession of processes, each of which performs some task on it; called stream parallelism
- With the exception of the process initiating the work for the pipeline, the arrival of new data triggers the execution of a new task by a process in the pipeline
- Each process can be viewed as a consumer of the data items produced by the process preceding it
Pipeline or Producer-Consumer Model (cont)
- Each process in the pipeline can also be viewed as a producer of data for the process following it; the pipeline is a chain of producers and consumers
- The pipeline does not need to be a linear chain; it can be a directed graph
- Processes could form pipelines in the shape of linear or multidimensional arrays, trees, or general graphs with or without cycles
Pipeline or Producer-Consumer Model (cont)
- Load balancing is a function of task granularity
- With larger tasks, it takes longer to fill up the pipeline, which keeps tasks waiting
- Too fine a granularity increases overhead, as processes will need to receive new data and initiate a new task after only a small amount of computation
Examples of algorithms based on this model:
- A two-dimensional pipeline is used in the parallel LU factorization algorithm discussed in Section 8.3.1
- An entire chapter is devoted to this model in the previously mentioned textbook by Wilkinson & Allen
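A minimal producer-consumer pipeline sketch, assuming Python threads with a queue between consecutive stages (the stage functions are placeholders): each process consumes items produced by its predecessor and produces items for its successor, and the arrival of a new item triggers a new task.

    # A three-stage linear pipeline: each stage consumes from the queue
    # behind it and produces into the queue ahead of it.
    import queue, threading

    DONE = object()                        # end-of-stream marker

    def stage(fn, inq, outq):
        while True:
            item = inq.get()
            if item is DONE:
                outq.put(DONE)             # propagate shutdown downstream
                return
            outq.put(fn(item))             # a new arrival triggers a new task

    q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)),
        threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)),
        threading.Thread(target=stage, args=(lambda x: x - 3, q2, q3)),
    ]
    for s in stages: s.start()

    for item in range(5):                  # the producer feeding the pipeline
        q0.put(item)
    q0.put(DONE)

    out = []
    while (item := q3.get()) is not DONE:
        out.append(item)
    for s in stages: s.join()
    print(out)                             # -> [-1, 1, 3, 5, 7]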
Hybrid Models
- In some cases, more than one model may be used in designing an algorithm, resulting in a hybrid algorithm
- Parallel quicksort (Section 9.4.1) is an application for which a hybrid model is ideal