Multi-Tasking Models and Algorithms


1 Multi-Tasking Models and Algorithms
General Concepts (Part I)

2 Outline for Multi-Tasking Models
Note: Items in black are in this slide set (Part I).
Preliminaries
Common Decomposition Methods
Characteristics of Tasks and Interactions
Mapping Techniques for Load Balancing
Some Parallel Algorithm Models
  The Data-Parallel Model
  The Task Graph Model
  The Work Pool Model
  The Master-Slave Model
  The Pipeline or Producer-Consumer Model
  Hybrid Models

3 Outline (cont.)
Algorithm examples for most of the preceding algorithm models (this part is currently missing and needs to be added next time; some could be added as examples under the Task/Channel model)
Task/Channel (Computational) Model
Asynchronous Communication and Performance Evaluation
  Modeling Asynchronous Communication
  Performance Metrics and Asynchronous Communications
  The Isoefficiency Metric & Scalability
Future revision plans for the preceding material
BSP (Computational) Model (slides posted separately on the course website)

4 References
Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004. Particularly Chapters 3 and 7, plus algorithm examples; textbook slides for this book.
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003. Particularly Chapter 3 (available online) and Section 2.5 (Asynchronous Communications); slides by the authors.
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd Edition, Prentice Hall, 2005.
Ian Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison Wesley, 1995 (available online).

5 Change in Chapter Title
This chapter consists of three sets of slides. It was formerly called Strictly Asynchronous Models; the name has now been changed to Multi-Tasking Models. However, the old name still occurs regularly in the internal slides.

6 Specifying Asynchronous Algorithms
Identifying parts that can be done concurrently (tasks); mapping the tasks onto multiple processors (processes vs. processors); distributing the input, output, and intermediate results across the different processors; managing access to shared data (either input or intermediate); and synchronizing the processors at various stages of the parallel execution.

7 Finding Concurrent Pieces of Work
Decomposition: the process of dividing the computation into smaller pieces of work called tasks. Tasks are programmer-defined and are considered to be indivisible, and they may be of arbitrary sizes. Simultaneous execution of multiple tasks is the key to reducing the time required.

8 Example: Dense Matrix-Vector Multiplication
Tasks can be of different sizes; the size of a task is referred to as its granularity. A sketch of a row-wise task decomposition follows.
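
Below is a minimal C/OpenMP sketch (an illustration, not code from the textbook) of a row-wise task decomposition of dense matrix-vector multiplication: each entry of the result is a fine-grained task, and the OpenMP runtime maps the tasks to threads. The matrix size N and the test data are made-up example values.

```c
/* Sketch: row-wise task decomposition of dense matrix-vector multiplication.
 * Task i computes y[i]; with OpenMP the tasks are mapped to threads
 * automatically.  N and the test data are made-up example values. */
#include <stdio.h>
#include <omp.h>

#define N 1024

static double A[N][N], x[N], y[N];

int main(void)
{
    /* Fill A and x with simple test data. */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (double)(i + j);
    }

    /* One fine-grained task per row: task i computes y[i]. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```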

9 Task-Dependency Graph
In most cases, there are dependencies between the different tasks: certain task(s) can only start once some other task(s) have finished (for example, producer-consumer relationships). These dependencies are represented using a DAG called a task-dependency graph.

10 Task-Dependency Graph (cont)
A task-dependency graph is a directed acyclic graph in which the nodes represent tasks and the directed edges indicate the dependencies between them. The task corresponding to a node can be executed only when all tasks connected to that node by incoming edges have been completed. The number and size of the tasks into which the problem is decomposed determine the granularity of the decomposition: it is called fine-grained for a large number of small tasks and coarse-grained for a small number of large tasks.

11 Task-Dependency Graph (cont)
Key concepts derived from the task-dependency graph. Degree of concurrency: the number of tasks that can be executed concurrently; we are usually most concerned with the average degree of concurrency. Critical path: the longest vertex-weighted path in the graph, where the weights inside the nodes represent the task sizes; its length is the sum of the weights of the nodes along the path. The degree of concurrency normally increases, and the critical-path length normally decreases, as the granularity of the decomposition becomes finer. A small worked example follows.
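
The following small C sketch (an illustration, not taken from the slides) computes the critical-path length and the average degree of concurrency for a made-up four-task dependency graph whose nodes are assumed to be listed in topological order.

```c
/* Sketch: critical-path length and average degree of concurrency of a small
 * task-dependency graph.  The DAG and task weights are made up; nodes are
 * assumed to be in topological order, so one forward pass suffices. */
#include <stdio.h>

#define NTASKS 4

int main(void)
{
    /* weight[i] = work of task i; dep[i][j] = 1 if task j must finish
     * before task i can start. */
    int weight[NTASKS] = {10, 6, 11, 7};
    int dep[NTASKS][NTASKS] = {
        {0, 0, 0, 0},   /* task 0: no predecessors         */
        {1, 0, 0, 0},   /* task 1 depends on task 0        */
        {1, 0, 0, 0},   /* task 2 depends on task 0        */
        {0, 1, 1, 0},   /* task 3 depends on tasks 1 and 2 */
    };

    int finish[NTASKS];                 /* earliest finish time of each task */
    int total_work = 0, critical_path = 0;

    for (int i = 0; i < NTASKS; i++) {
        int start = 0;
        for (int j = 0; j < i; j++)     /* predecessors come earlier */
            if (dep[i][j] && finish[j] > start)
                start = finish[j];
        finish[i] = start + weight[i];
        total_work += weight[i];
        if (finish[i] > critical_path)
            critical_path = finish[i];
    }

    printf("critical path length          = %d\n", critical_path);
    printf("average degree of concurrency = %.2f\n",
           (double)total_work / critical_path);
    return 0;
}
```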

12 Task-Interaction Graph
Captures the pattern of interaction between tasks. This graph usually contains the task-dependency graph as a subgraph, since there may be interactions between tasks even when there are no dependencies between them; these interactions are usually due to accesses of shared data.

13 Task Dependency and Interaction Graphs
These graphs are important in developing an effective mapping of the tasks onto the different processors: we need to maximize concurrency and minimize overheads.

14 Processes vs Processors
Process vs. processor: considered distinct concepts in this chapter. Process: a logical computing agent that performs tasks. Processor: a hardware unit that physically performs computation. There is usually a 1:1 correspondence between processors and processes; however, the distinction provides additional flexibility. In order to obtain any speedup over sequential programming, a parallel program must have several processes active at the same time, working on different tasks.

15 Mapping Tasks to Processes
Mapping: the way that tasks are assigned to processes for execution (illustrated in Figures 3.5 and 3.7). Good mappings attempt to maximize the use of concurrency by mapping independent tasks onto different processes, to minimize the total completion time by ensuring that tasks on the critical path are executed as soon as they become available, and to map tasks with a high degree of mutual interaction to the same process.

16 Decomposition Methods
Decomposition: the technique used to split the computation into a set of tasks. Common decomposition techniques: data decomposition, recursive decomposition, exploratory decomposition, speculative decomposition, and hybrid decomposition. Data and recursive decompositions are general-purpose methods; exploratory and speculative decompositions are special-purpose task decomposition methods.

17 Recursive Decomposition
Suitable for problems that can be solved using the divide-and-conquer paradigm. Each of the subproblems generated by the divide step becomes a new task. This results in natural concurrency, as different subproblems can be solved concurrently.

18 Example: Quicksort
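
A minimal C/OpenMP sketch of the recursive quicksort decomposition: each partition step produces two independent subproblems that become tasks. The cutoff value and the test data are assumptions chosen for illustration; this is not the textbook's code.

```c
/* Sketch: recursive decomposition of quicksort.  Each partition step
 * produces two independent subproblems, expressed here as OpenMP tasks.
 * CUTOFF and the test data are made-up choices. */
#include <stdio.h>
#include <omp.h>

#define CUTOFF 1000   /* below this size, recurse serially to limit overhead */

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;

    /* Lomuto partition around the last element. */
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);

    /* The two subproblems are independent tasks. */
    #pragma omp task if (hi - lo > CUTOFF)
    quicksort(a, lo, i - 1);

    #pragma omp task if (hi - lo > CUTOFF)
    quicksort(a, i + 1, hi);

    #pragma omp taskwait
}

int main(void)
{
    int a[] = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0};
    int n = (int)(sizeof a / sizeof a[0]);

    #pragma omp parallel
    #pragma omp single
    quicksort(a, 0, n - 1);

    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```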

19 Another Example: Finding the Minimum
Note that we can obtain divide-and-conquer algorithms for problems that are usually solved by using other methods.

20 Recursive Decomposition
How good are the decompositions produced? Average Concurrency? Length of critical path? How do the quicksort and min-finding decompositions measure up?

21 Data Decomposition
Used to derive concurrency for problems that operate on large amounts of data. The idea is to derive the tasks by focusing on the multiplicity of data. Data decomposition is often performed in two steps: Step 1, partition the data; Step 2, induce a computational partitioning from the data partitioning. Which data should we partition: input, output, or intermediate? All of the above; this leads to different data decomposition methods. How do we induce a computational partitioning? Use the "owner-computes" rule.

22 Example: Matrix-Matrix Multiplication

23 Matrix-Matrix Example (cont)
Note that the set of tasks created by the previous decomposition is not unique; a sketch of one output-data decomposition is given below.
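
As a concrete illustration of one such output-data decomposition (a sketch, not the program behind Figure 3.10), the following C/OpenMP code splits C into 2x2 blocks and lets each of the four tasks compute the block it owns, i.e., the owner-computes rule. The matrix size and test data are made-up.

```c
/* Sketch: output-data decomposition of C = A*B into four tasks.  Task
 * (bi, bj) owns and computes block C[bi][bj] (owner-computes).  N is a
 * made-up example size, assumed divisible by 2. */
#include <stdio.h>
#include <omp.h>

#define N  4           /* matrix dimension */
#define BS (N / 2)     /* block size */

static double A[N][N], Bm[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j]  = i + j;
            Bm[i][j] = (i == j) ? 1.0 : 0.0;   /* identity, so C equals A */
        }

    /* Four tasks, one per output block, run by the collapsed parallel loop. */
    #pragma omp parallel for collapse(2)
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
            for (int i = bi * BS; i < (bi + 1) * BS; i++)
                for (int j = bj * BS; j < (bj + 1) * BS; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += A[i][k] * Bm[k][j];
                    C[i][j] = sum;
                }

    printf("C[0][0] = %g, C[N-1][N-1] = %g\n", C[0][0], C[N-1][N-1]);
    return 0;
}
```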

24 Partitioning Intermediate Data
The partitioning of the matrix multiplication in Figure 3.10 into four tasks can be refined further by partitioning the intermediate data (see the next slide). The intermediate matrices Di,j are not computed in the sequential algorithm, so this decomposition requires a change to the sequential algorithm. Additionally, the creation of the Di,j matrices requires additional storage space.

25

26 "Owner-Computes" Rule
Used when data decomposition is used to partition the work into tasks. This general principle requires that each partition performs all computations that involve the data it owns. This is illustrated in the next two slides.

27

28

29 Exploratory Decomposition
Used to decompose computations that correspond to a search of a space of solutions. The search space is partitioned into smaller parts, and these are searched concurrently until the desired solution is found. The next slide shows the initial configuration for the 15-puzzle and a sequence of moves leading to the final configuration. The subsequent slide shows how a state-space search leads to the solution.

30

31

32 Exploratory Decomposition
Not general purpose. After sufficient branches are generated, each branch can be assigned to a separate task that explores it further. As soon as one task finds a solution, the other tasks can be terminated. It can result in speedup and slowdown anomalies: the work performed by the parallel formulation of an algorithm can be either smaller or greater than that performed by the serial algorithm.

33 Exploratory Decomposition
Not general purpose. Can result in speedup anomalies: either a slowdown or a superlinear speedup.

34 Speculative Decomposition
Used to extract concurrency in problems in which the next step is one of several actions that can only be determined when the current task finishes. While the current task is executing, other tasks can perform the computations of the multiple branches in parallel. This decomposition method guarantees some wasteful computation. An alternate version is to explore only the most promising branch (or branches).

35 Speculative Decomposition
Difference from exploratory decomposition: in speculative decomposition, the input at a branch leading to multiple tasks is unknown, whereas in exploratory decomposition, the output of the multiple tasks originating at the branch is unknown. Speculative decomposition can lead to more, less, or the same amount of work compared to the serial program.

36 Speculative Execution
If predictions are wrong: work is wasted, and work may need to be undone, which incurs state-restoring overhead (memory/computations). However, it may be the only way to extract concurrency!

37 Characteristics of Tasks
Task generation. Static: all tasks are known before execution of the algorithm starts; data decomposition usually results in static tasks (example: matrix multiplication). Task sizes: the relative amount of time needed to complete each task. Uniform tasks: all require the same time. Non-uniform tasks: execution time varies significantly. Size of data needed by a task: the data must be available to the process performing the task, and the size and location of this data may determine the best process to perform the task.

38 Some Task Interaction Characteristics
Static vs. dynamic interactions. Static interactions occur at predetermined times and involve predetermined tasks (e.g., matrix multiplication); otherwise, the interaction is dynamic (e.g., in the 15-puzzle, tasks that finish their work can pick up an unexplored state from the queue of another busy task). Regular vs. irregular interactions. An interaction pattern is regular if it has some structure that can be exploited to obtain an efficient implementation; otherwise, it is irregular (e.g., in sparse matrix-vector multiplication, a task must scan its row of the matrix to find out which of the vector entries it needs).

39 Some Task Interactions Characteristics (cont)
Read-only vs. read-write data sharing. Read-only: a task only needs to read data shared with other tasks (e.g., matrix multiplication in Fig. 3.10). Read-write: multiple tasks need to read and write some shared data (e.g., using a heuristic search to solve the 15-puzzle).

40 Mapping Tasks to Processors
A good mapping strives to achieve the following conflicting goals: reducing the amount of time processors spend interacting with each other, and reducing the total amount of time during which some processors are idle while others are busy. Good mappings attempt to reduce the parallel processing overheads. If Tp is the parallel runtime using p processors and Ts is the sequential runtime (for the same algorithm), then the total overhead To is p×Tp - Ts; this is the work done by the parallel system beyond that required by the serial system. For example, if Ts = 100 s and Tp = 30 s on p = 4 processors, then To = 4×30 - 100 = 20 s.

41 Mapping Tasks to Processors (cont)
Two main sources of overheads: load imbalance, which results in process inactivity during execution, and inter-process communication (coordination, synchronization, data sharing). The goal of mapping tasks to processes is to minimize these overheads. The goals of minimizing the two overheads above are often in conflict with each other.

42 Why Mappings can be Complicated
Mappings need to consider the task-dependency graph: Are tasks available a priori (static vs. dynamic task generation)? What are the computation requirements: are they uniform or non-uniform, do we know them a priori, and how much data is associated with each task? Mappings also need to consider the task-interaction graph to determine the interactions between tasks: Are they static or dynamic? Do we know about them a priori? Are they data-instance dependent? Are they regular or irregular? Are they read-only or read-write? Depending on the above characteristics, different mapping techniques are required, with differing complexities and costs.

43 Simple & Complex Task Interactions Example
Consider the task-interaction graph for image dithering: the color of each pixel is determined as a weighted average of its original color and the values of neighboring pixels. If we break the image up into square regions and assign a different task to each, we have simple task interactions. Now consider sparse matrix-vector multiplication: assign the i-th row and the i-th vector value to the i-th task. If the j-th entry in the i-th row is non-zero, then the i-th task must obtain the j-th vector value from the j-th task (unless i = j). The result is a complex task-interaction graph; a small sketch of this interaction pattern follows.
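
The following tiny C sketch derives that interaction pattern from a made-up sparsity structure: task i owns row i and x[i], and it must fetch x[j] from task j for every non-zero entry (i, j) with j not equal to i.

```c
/* Sketch: deriving the task-interaction pattern for the sparse
 * matrix-vector example.  Row i and vector entry i belong to task i;
 * task i needs x[j] from task j for every non-zero A[i][j] with j != i.
 * The small sparsity pattern is made up. */
#include <stdio.h>

#define N 4

int main(void)
{
    int nz[N][N] = {              /* 1 = non-zero entry */
        {1, 1, 0, 0},
        {0, 1, 0, 1},
        {1, 0, 1, 0},
        {0, 0, 1, 1},
    };

    for (int i = 0; i < N; i++) {
        printf("task %d needs x[j] from task(s):", i);
        for (int j = 0; j < N; j++)
            if (nz[i][j] && j != i)
                printf(" %d", j);
        printf("\n");
    }
    return 0;
}
```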

44 Example: Simple & Complex Task Interactions

45 Mapping Techniques for Load Balancing
Problem: assigning tasks whose total computational requirements are the same does not automatically ensure a balanced load. Each processor below is assigned three tasks, but (a) is better than (b).

46 Load Balancing Techniques
Static mapping: the tasks are distributed among the processors prior to execution. Applicable for tasks that are generated statically and have known and/or uniform computational requirements. Finding an optimal mapping for non-uniform tasks is NP-hard, so heuristic mappings are required for acceptable solutions. Dynamic mapping: the tasks are distributed among the processors during the execution of the algorithm, i.e., tasks and data are migrated during execution. Applicable for tasks that are either generated dynamically or have unknown computational requirements.

47 Static Mapping – Array Distribution
Suitable for algorithms that use data decomposition and whose underlying data is in the form of arrays (i.e., input, output, or intermediate data). Common schemes: block distribution, cyclic distribution, block-cyclic distribution, and randomized distribution, each in 1D/2D/3D variants.

48 1D Block Distributions
Partitioning an n×m two-dimensional array along one dimension among p processes: process k can be given the k-th block of n/p consecutive rows, i.e., rows kn/p through (k+1)n/p - 1 (when p divides n). If n/p is not an integer, all processes except the last can be given a block of ⌈n/p⌉ rows and the last process the remaining rows; alternatively, some initial processes could receive ⌈n/p⌉ rows and the rest receive ⌈n/p⌉ - 1 rows. Similarly, process k can be given the k-th block of m/p consecutive columns. A sketch of this bookkeeping follows.
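
A small C sketch of this bookkeeping, assuming the ceiling-based variant described above (all processes except the last receive ceil(n/p) rows); the sizes are made-up example values.

```c
/* Sketch: 1D block distribution of n rows over p processes, where every
 * process except the last gets ceil(n/p) rows and the last gets the rest. */
#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main(void)
{
    int n = 10, p = 3;               /* made-up example: 10 rows, 3 processes */
    int rows = ceil_div(n, p);       /* ceil(n/p) rows per block */

    for (int k = 0; k < p; k++) {
        int low  = k * rows;
        int high = (k + 1) * rows;
        if (high > n) high = n;
        if (low >= n) { printf("process %d: no rows\n", k); continue; }
        printf("process %d: rows %d..%d (%d rows)\n",
               k, low, high - 1, high - low);
    }
    return 0;
}
```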

49 2D Block Distributions
We could partition along more than one dimension; with a d-dimensional array, we can partition along up to d dimensions. If we have p processes and p = p1×p2, we could partition an n×n array into p subblocks of size (n/p1)×(n/p2) and assign one to each process. The preceding 1D and 2D distributions are illustrated on the next slide, and a sketch of the 2D bookkeeping is given below.
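
And a corresponding C sketch for the 2D case, assuming n is divisible by both p1 and p2; all sizes are made-up example values.

```c
/* Sketch: 2D block distribution of an n x n array over p = p1*p2 processes.
 * Process (k1, k2) owns the (n/p1) x (n/p2) subblock starting at row
 * k1*n/p1 and column k2*n/p2.  Sizes are made-up example values. */
#include <stdio.h>

int main(void)
{
    int n = 8, p1 = 2, p2 = 2;       /* n assumed divisible by p1 and p2 */
    int br = n / p1, bc = n / p2;    /* block height and width */

    for (int k1 = 0; k1 < p1; k1++)
        for (int k2 = 0; k2 < p2; k2++)
            printf("process (%d,%d): rows %d..%d, cols %d..%d\n",
                   k1, k2,
                   k1 * br, (k1 + 1) * br - 1,
                   k2 * bc, (k2 + 1) * bc - 1);
    return 0;
}
```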

50 Example: Block Distributions

51 Examples: Block Distributions for Matrix Multiplication

52 Block-Cyclic Distribution
A variation of the block distribution method that can lead to a substantially more balanced work distribution. The central idea is to partition a k-dimensional array into many more blocks than the number of processes; the partitions are then assigned to processes (and their associated tasks) in a round-robin manner, so every process gets several non-adjacent blocks. Some blocks may require substantially more work than others, but if the partitioning is fine enough, then every process has a sampling of tasks from all parts of the original k-dimensional array. This increases the chances that the work of the processes will balance out, and it also increases the chances that each process will have a task that is ready to execute at any particular time. A block-cyclic distribution example is given next, along with a small sketch of the block-to-process assignment.
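
A minimal C sketch of a 1D block-cyclic assignment: block i goes to process i mod p, so each process receives several non-adjacent blocks. The array size, block size, and process count are made-up example values.

```c
/* Sketch: 1D block-cyclic distribution.  An array of n elements is cut into
 * blocks of size b (many more blocks than processes) and block i is assigned
 * to process i mod p in round-robin fashion. */
#include <stdio.h>

int main(void)
{
    int n = 16, b = 2, p = 4;        /* made-up example values */
    int nblocks = (n + b - 1) / b;

    for (int i = 0; i < nblocks; i++) {
        int lo = i * b;
        int hi = lo + b - 1;
        if (hi > n - 1) hi = n - 1;
        printf("block %d (elements %2d..%2d) -> process %d\n",
               i, lo, hi, i % p);
    }
    return 0;
}
```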

53 Example: Block-Cyclic Distribution

54 Randomized Block Distribution
When the distribution of work has some special pattern, block-cyclic distributions may fail to balance the computation across processes. This is illustrated on the next slide for a sparse matrix, where the shaded areas indicate non-zero entries. A random block distribution can be used in situations like this to better balance the load across processes: the array is again partitioned into many more blocks than the number of processes, and each process receives an equal number of randomly selected blocks. A sketch of this assignment follows.
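
A minimal C sketch of such a randomized assignment: the block indices are shuffled with a Fisher-Yates shuffle and then dealt out round-robin, so each process receives an equal number of randomly chosen blocks. The sizes and the fixed seed are assumptions for illustration.

```c
/* Sketch: randomized block distribution.  The array is cut into many more
 * blocks than processes; the block indices are shuffled and then dealt out
 * round-robin so each process gets an equal number of random blocks. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nblocks = 16, p = 4;                 /* made-up example values */
    int perm[16];

    for (int i = 0; i < nblocks; i++)
        perm[i] = i;

    srand(42);                               /* fixed seed for repeatability */
    for (int i = nblocks - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    for (int i = 0; i < nblocks; i++)
        printf("block %2d -> process %d\n", perm[i], i % p);
    return 0;
}
```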

55 Random Block Distributions
Sometimes the computations are performed only on certain portions of an array, as in sparse matrix-matrix multiplication.

56 Random Block Distributions
Better load balance can be achieved via a random block distribution

57 Graph Partitioning
The array-based distribution schemes are good at balancing the computations and minimizing the interactions for a wide range of algorithms. However, many algorithms operate on sparse data structures and have patterns of interaction that are irregular and data-dependent. For example, numerical simulation of physical phenomena involves computing the values of certain physical quantities at each mesh point, and the computation at a mesh point usually involves the data for that point and for points adjacent to it in the mesh. Ideally, we want to distribute the mesh points in a way that balances the load and minimizes the amount of data that each process needs to access. The next example involves the level of water contamination at each mesh node.

58 Graph Partitioning
A mapping can be achieved by directly partitioning the task-interaction graph, e.g., for finite element mesh-based computations.

59 Directly Partitioning this Graph

60 Example: Sparse Matrix-Vector
Another instance of graph partitioning

61 Dynamic Load Balancing
Needed when static mapping produces an unbalanced workload or when the task-dependency graph is dynamic. Centralized schemes: all executable tasks are maintained by a special process or a set of processes. If a special process manages the pool of available tasks, it is called the master, and the other processes, which carry out the work, are called slaves. When a process has no work, it takes a portion of the work from the central data structure or from the master; when a new task is generated, it is added to the central data structure or reported to the master. Assigning too little work at a time can make the central data structure or master a bottleneck. In chunk scheduling, a process without work is given a group of tasks; too large a chunk can create load imbalance, and chunk sizes must be reduced near the end of the run.

62 Dynamic Load Balancing (cont)
Distributed schemes: executable tasks are distributed among the processes, which exchange tasks at run time to balance the work; each process can either send work to or receive work from another process. Some important issues each scheme must handle: How are sending and receiving processes paired? Does the sender or the receiver initiate the work transfer? How much work is transferred each time (transfers must not be too small or too large)? When is a work transfer initiated (when a process runs out of work, or when it anticipates running out of work)?

63 Parallel Algorithm Models
The Data-Parallel Model
The Task Graph Model
  Closely related to Foster's Task/Channel Model: it requires the task-dependency graph that the Task/Channel model focuses on (dependencies usually result from communications between two tasks), and it also requires the task-interaction graph, which captures other interactions between tasks such as data sharing.
The Work Pool Model
The Master-Slave Model
The Pipeline or Producer-Consumer Model
Hybrid Models

64 The Data Parallel Model
One of the simplest models. Tasks are statically or semi-statically mapped to processes, and each process performs similar operations on different data; this is called data parallelism. Typically, computation is interspersed with interactions to synchronize or to get fresh data. Decomposition is usually based on data partitioning; uniform data partitioning combined with a static assignment produces a balanced load.

65 The Data Parallel Model (cont)
Can be used with both the shared-memory and message-passing paradigms. Interaction overhead can be minimized by choosing a locality-preserving decomposition and by overlapping computation and communication when possible. For most problems, the degree of parallelism increases with the size of the problem, which allows more processes to be used to solve larger problems. Example: dense matrix multiplication; all tasks in the decomposition shown in Fig. 3.10 are identical but are applied to different data.

66 The Task Graph Model
The computations in a parallel algorithm can be viewed as a task-dependency graph. Tasks are mapped to processes so that locality is promoted and the volume and frequency of interactions are reduced; tasks are usually mapped statically to help optimize the cost of data movement among them. Typically used to solve problems in which the data associated with a task is rather large compared to the amount of computation. Asynchronous interaction methods are used to overlap interactions with computation.

67 The Task Graph Model (cont.)
Examples of algorithms based on the task graph model: parallel quicksort (Section 9.4.1), sparse matrix factorization, and many parallel algorithms derived from divide-and-conquer decompositions. Task parallelism: the type of parallelism that is expressed by the independent tasks in a task-dependency graph.

68 The Work Pool Model Also called the “Task Pool Model”
Involves dynamic mapping of tasks onto processes for load balancing; any task may potentially be performed by any process. The mapping of tasks to processes can be centralized or decentralized. Pointers to tasks may be stored in a physically shared list, priority queue, hash table, or tree, or in a physically distributed data structure.

69 The Work Pool Model (cont.)
When work is generated dynamically and a decentralized mapping is used, a termination detection algorithm is required. When used with a message-passing paradigm, the data required by the tasks is normally relatively small compared to the computation, so tasks can be readily moved around without causing too much data-interaction overhead. The granularity of tasks can be adjusted to obtain the desired tradeoff between load imbalance and the overhead of adding and extracting tasks.

70 The Work Pool Model (cont.)
Examples of algorithms based on the work pool model: chunk scheduling of loop iterations, sketched below.
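
A minimal C/OpenMP sketch of chunk scheduling as a shared work pool: loop iterations play the role of tasks, and idle threads grab the next chunk from the pool at run time via OpenMP's dynamic schedule. The task count, chunk size, and artificial non-uniform work function are made-up.

```c
/* Sketch: a shared work pool realized with OpenMP's dynamic loop schedule.
 * Idle threads grab the next chunk of CHUNK tasks from the pool at run
 * time (chunk scheduling). */
#include <stdio.h>
#include <omp.h>

#define NTASKS 1000
#define CHUNK  10

static double work(int i)
{
    /* Artificially non-uniform task: cost grows with i. */
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 100; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void)
{
    double total = 0.0;

    #pragma omp parallel for schedule(dynamic, CHUNK) reduction(+:total)
    for (int i = 0; i < NTASKS; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}
```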

71 Master-Slave Model or Master-Worker
Also called the manager-worker model. One or more master processes generate work and allocate it to workers. Managers can allocate tasks in advance if they can estimate the size of the tasks or if a random mapping can avoid load-balancing problems; normally, workers are assigned smaller tasks as needed. Work can be performed in phases: the work in each phase is completed, and the workers synchronized, before the next phase is started. Normally, any worker can do any assigned task.

72 Master-Slave Model (cont)
Can be generalized to a multi-level manager-worker model in which top-level managers feed large chunks of tasks to second-level managers, who subdivide the tasks among their workers and may also perform some of the work themselves. There is a danger of the manager becoming a bottleneck, which can happen if the tasks are too small; the granularity of tasks should be chosen so that the cost of doing work dominates the cost of synchronization. Waiting time may be reduced if the workers' requests are non-deterministic. A minimal manager-worker sketch follows.
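
A minimal C/MPI sketch of the single-level manager-worker scheme described above (an illustration, not code from the referenced textbooks): rank 0 hands out one toy task at a time, and each worker returns a result and is given the next task or a stop message. The tags, task count, and placeholder work function are assumptions.

```c
/* Sketch: MPI manager-worker skeleton.  The manager (rank 0) hands out one
 * "task" (an integer id) at a time; a worker returns a result and gets the
 * next task or a stop message. */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

static int do_task(int id) { return id * id; }   /* placeholder computation */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* manager */
        int next = 0, active = 0, result;
        MPI_Status st;

        /* Prime every worker with one task (or stop it if none remain). */
        for (int w = 1; w < size; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* Hand out remaining tasks as results come back. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            printf("result %d from worker %d\n", result, st.MPI_SOURCE);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                              /* worker */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = do_task(task);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```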

73 Master-Slave Model (cont)
Examples of algorithms based on the master-slave model: a master-slave example is mentioned in the discussion of centralized dynamic load balancing (page 130), and several examples are given in Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 1st or 2nd Edition (1999 and 2005), Prentice Hall.

74 Pipeline or Producer-Consumer Model
Similar to the "linear array model" studied in Akl's textbook. A stream of data is passed through a succession of processes, each of which performs some task on it; this is called stream parallelism. With the exception of the process initiating the work for the pipeline, the arrival of new data triggers the execution of a new task by a process in the pipeline. Each process can be viewed as a consumer of the data items produced by the process preceding it.

75 Pipeline or Producer-Consumer Model (cont)
Each process in the pipeline can also be viewed as a producer of data for the process following it, so the pipeline is a chain of producers and consumers. The pipeline does not need to be a linear chain; it can be a directed graph. Processes could form pipelines in the shape of linear or multidimensional arrays, trees, or general graphs with or without cycles.

76 Pipeline or Producer-Consumer Model (cont)
Load balancing is a function of task granularity. With larger tasks, it takes longer to fill up the pipeline, which keeps some processes waiting; too fine a granularity increases overhead, as processes need to receive new data and initiate a new task after only a small amount of computation. Examples of algorithms based on this model: a two-dimensional pipeline is used in the parallel LU factorization algorithm discussed in Section 8.3.1, and an entire chapter is devoted to this model in the previously mentioned textbook by Wilkinson and Allen. A minimal linear pipeline sketch follows.
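
A minimal C/MPI sketch of a linear pipeline in the producer-consumer style described above (not the LU factorization example): rank 0 produces a stream of items, each intermediate rank consumes from its predecessor, applies a made-up stage function, and forwards to its successor, and the last rank prints the results.

```c
/* Sketch: a linear MPI pipeline.  Rank 0 produces a stream of items; each
 * later rank consumes from its predecessor, does its stage's work, and
 * produces for its successor; the last rank prints the results. */
#include <stdio.h>
#include <mpi.h>

#define NITEMS 8

static int stage(int x, int rank) { return x + rank; }   /* toy per-stage work */

int main(int argc, char **argv)
{
    int rank, size, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NITEMS; i++) {
        if (rank == 0)
            x = i;                                        /* produce */
        else
            MPI_Recv(&x, 1, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* consume */

        x = stage(x, rank);                               /* this stage's work */

        if (rank < size - 1)
            MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d leaves the pipeline as %d\n", i, x);
    }

    MPI_Finalize();
    return 0;
}
```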

77 Hybrid Models
In some cases, more than one model may be used in designing an algorithm, resulting in a hybrid algorithm. Parallel quicksort (Section 9.4.1) is an application for which a hybrid model is ideal.

