Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

Similar presentations


Presentation on theme: "Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies."— Presentation transcript:

1 Chapter 3 Parallel Algorithm Design

2 Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies Case studies

3 Task/Channel Model Parallel computation = set of tasks Parallel computation = set of tasks Task Task  Program  Local memory  Collection of I/O ports Tasks interact by sending messages through channels Tasks interact by sending messages through channels

4 Task/Channel Model Task Channel

5 Foster’s Design Methodology Partitioning Partitioning Communication Communication Agglomeration Agglomeration Mapping Mapping

6 Foster’s Methodology

7 Partitioning Dividing computation and data into pieces Dividing computation and data into pieces Domain decomposition Domain decomposition  Divide data into pieces  e.g., An array into sub-arrays (reduction); A loop into sub-loops (matrix multiplication), A search space into sub-spaces (chess)  Determine how to associate computations with the data Functional decomposition Functional decomposition  Divide computation into pieces  e.g., pipelines (floating point multiplication), workflows (pay roll processing)  Determine how to associate data with the computations

8 Example Domain Decompositions

9 Example Functional Decomposition

10 Partitioning Checklist Large Grained Tasks Large Grained Tasks  e.g, at least 10x more primitive tasks than processors in target computer Balance Load Balance Load  Primitive tasks roughly the same size Scalable Scalable  Number of tasks an increasing function of problem size

11 Communication Determine values passed among tasks Determine values passed among tasks Local communication Local communication  Task needs values from a small number of other tasks  Create channels illustrating data flow Global communication Global communication  Significant number of tasks contribute data to perform a computation  Don’t create channels for them early in design

12 Communication Checklist Balanced Balanced  Communication operations balanced among tasks Small degree: Small degree:  Each task communicates with only small group of neighbors Concurrency Concurrency  Tasks can perform communications concurrently  Task can perform computations concurrently

13 Agglomeration Grouping tasks into larger tasks Grouping tasks into larger tasks Goals Goals  Improve performance  Maintain scalability of program  Simplify programming In MPI programming, goal often to create one agglomerated task per processor In MPI programming, goal often to create one agglomerated task per processor

14 Agglomeration Can Improve Performance Eliminate communication between primitive tasks agglomerated into consolidated task Eliminate communication between primitive tasks agglomerated into consolidated task Combine groups of sending and receiving tasks Combine groups of sending and receiving tasks

15 Agglomeration Checklist Locality of parallel algorithm has increased Locality of parallel algorithm has increased Tradeoff between agglomeration and code modifications costs is reasonables Tradeoff between agglomeration and code modifications costs is reasonables Agglomerated tasks have similar computational and communications costs Agglomerated tasks have similar computational and communications costs Number of tasks increases with problem size Number of tasks increases with problem size Number of tasks suitable for likely target systems Number of tasks suitable for likely target systems

16 Mapping Process of assigning tasks to processors Process of assigning tasks to processors Centralized multiprocessor: mapping done by operating system Centralized multiprocessor: mapping done by operating system Distributed memory system: mapping done by user Distributed memory system: mapping done by user Conflicting goals of mapping Conflicting goals of mapping  Maximize processor utilization  Minimize interprocessor communication

17 Mapping Example

18 Optimal Mapping Finding optimal mapping is NP-hard Finding optimal mapping is NP-hard Must rely on heuristics Must rely on heuristics

19 Mapping Decision Tree Static number of tasks Static number of tasks  Structured communication  Constant computation time per task Agglomerate tasks to minimize commAgglomerate tasks to minimize comm Create one task per processorCreate one task per processor  Variable computation time per task Cyclically map tasks to processorsCyclically map tasks to processors  Unstructured communication Use a static load balancing algorithmUse a static load balancing algorithm Dynamic number of tasks Dynamic number of tasks

20 Mapping Strategy Static number of tasks Static number of tasks Dynamic number of tasks Dynamic number of tasks  Use a run-time task-scheduling algorithm e.g., a master slave strategye.g., a master slave strategy  Use a dynamic load balancing algorithm e.g., share load among neighboring processors; remapping periodicallye.g., share load among neighboring processors; remapping periodically

21 Mapping Checklist Considered designs based on one task per processor and multiple tasks per processor Considered designs based on one task per processor and multiple tasks per processor  If multiple task per processor chosen, ratio of tasks to processors is at least 10:1 Evaluated static and dynamic task allocation Evaluated static and dynamic task allocation If dynamic task allocation chosen, task allocator is not a bottleneck to performance If dynamic task allocation chosen, task allocator is not a bottleneck to performance

22 Case Studies Boundary value problem Boundary value problem Finding the maximum Finding the maximum The n-body problem The n-body problem Adding data input Adding data input

23 Boundary Value Problem Ice waterRodInsulation

24 Rod Cools as Time Progresses

25 Finite Difference Approximation

26 Partitioning One data item per grid point One data item per grid point Associate one primitive task with each grid point Associate one primitive task with each grid point Two-dimensional domain decomposition Two-dimensional domain decomposition

27 Communication Identify communication pattern between primitive tasks Identify communication pattern between primitive tasks Each interior primitive task has three incoming and three outgoing channels Each interior primitive task has three incoming and three outgoing channels

28 Agglomeration and Mapping Agglomeration

29 Sequential execution time  – time to update element  – time to update element n – number of elements n – number of elements m – number of iterations m – number of iterations Sequential execution time: mn  Sequential execution time: mn 

30 Parallel Execution Time p – number of processors p – number of processors – message latency – message latency Parallel execution time m(  n/p  +2 ) Parallel execution time m(  n/p  +2 )

31 Finding the Maximum Error 6.25%

32 Reduction Given associative operator  Given associative operator  a 0  a 1  a 2  …  a n-1 a 0  a 1  a 2  …  a n-1 Examples Examples  Add  Multiply  And, Or  Maximum, Minimum

33 Parallel Reduction Evolution

34

35

36 Binomial Trees Subgraph of hypercube

37 Finding Global Sum 4207 -35-6-3 8123 -446

38 Finding Global Sum 17-64 4582

39 Finding Global Sum 8-2 910

40 Finding Global Sum 178

41 Finding Global Sum 25 Binomial Tree

42 Agglomeration

43 sum

44 The n-body Problem

45

46 Partitioning Domain partitioning Domain partitioning Assume one task per particle Assume one task per particle Task has particle’s position, velocity vector Task has particle’s position, velocity vector Iteration Iteration  Get positions of all other particles  Compute new position, velocity

47 Gather

48 All-gather

49 Complete Graph for All-gather

50 Hypercube for All-gather

51 Communication Time Hypercube Complete graph

52 Adding Data Input

53 Scatter

54 Scatter in log p Steps 12345678 567812345612 7834

55 Summary: Task/channel Model Parallel computation Parallel computation  Set of tasks  Interactions through channels Good designs Good designs  Maximize local computations  Minimize communications  Scale up

56 Summary: Design Steps Partition computation Partition computation Agglomerate tasks Agglomerate tasks Map tasks to processors Map tasks to processors Goals Goals  Maximize processor utilization  Minimize inter-processor communication

57 Summary: Fundamental Algorithms Reduction Reduction Gather and scatter Gather and scatter All-gather All-gather


Download ppt "Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies."

Similar presentations


Ads by Google