1
Parallel Processing (CS526), Spring 2012 (Week 5)
2
There are no rules, only intuition, experience and imagination! We consider design techniques, particularly top-down approaches, in which we find the main structure first, using a set of useful conceptual paradigms. We also look for useful primitives to compose in a bottom-up approach.
3
A parallel algorithm emerges from a process in which a number of interlocked issues must be addressed:
Where do the basic units of computation (tasks) come from?
◦ This is sometimes called "partitioning" or "decomposition". Sometimes it is natural to think in terms of partitioning the input or output data (or both). On other occasions a functional decomposition may be more appropriate (i.e. thinking in terms of a collection of distinct, interacting activities).
4
How do the tasks interact?
◦ We have to consider the dependencies between tasks (dependency and interaction graphs). Dependencies will be expressed in implementations as communication, synchronisation and sharing (depending upon the machine model).
Are the natural tasks of a suitable granularity?
◦ Depending upon the machine, too many small tasks may incur high overheads in their interaction. Should they be agglomerated (collected together) into super-tasks? This is related to scaling down.
5
How should we assign tasks to processors?
◦ Again, in the presence of more tasks than processors, this is related to scaling down. The owner-computes rule is natural for some algorithms which have been devised with a data-oriented partitioning. We need to ensure that tasks which interact can do so as quickly as possible.
These issues are often in tension with each other.
6
Use recursive problem decomposition. Create sub-problems of the same kind, which are solved recursively. Combine sub-solutions to solve the original problem. Define a base case for direct solution of simple instances. Well-known examples include quick-sort, merge-sort, matrix multiply.
7
© 2010 Goodrich, Tamassia
Merge-sort on an input sequence S with n elements consists of three steps:
◦ Divide: partition S into two sequences S1 and S2 of about n/2 elements each
◦ Recur: recursively sort S1 and S2
◦ Conquer: merge S1 and S2 into a unique sorted sequence

Algorithm mergeSort(S, C)
  Input: sequence S with n elements, comparator C
  Output: sequence S sorted according to C
  if S.size() > 1
    (S1, S2) ← partition(S, n/2)
    mergeSort(S1, C)
    mergeSort(S2, C)
    S ← merge(S1, S2)
8
© 2010 Goodrich, Tamassia
The conquer step of merge-sort consists of merging two sorted sequences A and B into a sorted sequence S containing the union of the elements of A and B. Merging two sorted sequences, each with n/2 elements, takes O(n) time; i.e. merge-sort has a sequential complexity of Ts = O(n log n).

Algorithm merge(A, B)
  Input: sequences A and B with n/2 elements each
  Output: sorted sequence of A ∪ B
  S ← empty sequence
  while ¬A.isEmpty() ∧ ¬B.isEmpty()
    if A.first().element() < B.first().element()
      S.addLast(A.remove(A.first()))
    else
      S.addLast(B.remove(B.first()))
  while ¬A.isEmpty()
    S.addLast(A.remove(A.first()))
  while ¬B.isEmpty()
    S.addLast(B.remove(B.first()))
  return S
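As a concrete illustration of the pseudocode above, here is a minimal sequential merge-sort sketch in Python. It mirrors the slides' mergeSort/merge structure but assumes the default "<" ordering instead of an explicit comparator C; the function names are my own.

    # Sequential merge-sort sketch mirroring the slide pseudocode.
    def merge(a, b):
        s = []
        i = j = 0
        # Repeatedly take the smaller front element of the two sorted lists.
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                s.append(a[i]); i += 1
            else:
                s.append(b[j]); j += 1
        # One list is exhausted; append the remainder of the other.
        s.extend(a[i:])
        s.extend(b[j:])
        return s

    def merge_sort(s):
        if len(s) <= 1:          # base case: size 0 or 1 is already sorted
            return s
        mid = len(s) // 2        # divide
        return merge(merge_sort(s[:mid]), merge_sort(s[mid:]))  # recur + conquer

    # Example: merge_sort([7, 2, 9, 4]) returns [2, 4, 7, 9]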
9
© 2010 Goodrich, Tamassia
An execution of merge-sort is depicted by a binary tree:
◦ each node represents a recursive call of merge-sort and stores the unsorted sequence before the execution and its partition, and the sorted sequence at the end of the execution
◦ the root is the initial call
◦ the leaves are calls on subsequences of size 0 or 1
[Figure: merge-sort tree for the sequence 7 2 9 4, producing 2 4 7 9]
10
There is an obvious tree of processes to be mapped to available processors. There may be a sequential bottleneck at the root; producing an efficient algorithm may require us to parallelize it, producing nested parallelism. Small problems may not be worth distributing: there is a trade-off between distribution costs and re-computation costs.
11
Algorithm Parallel_mergeSort(S, C)
  Input: sequence S with n elements, comparator C
  Output: sequence S sorted according to C
  if S.size() > 1
    (S1, S2) ← partition(S, n/2)
    initiate a process to invoke mergeSort(S1, C)
    mergeSort(S2, C)
    ============ Sync all invoked processes ============
    S ← merge(S1, S2)
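A hedged sketch of how this might look in Python using the standard-library multiprocessing module. The function names, the process-per-half strategy, the depth cutoff, and the use of heapq.merge for the conquer step are my own illustrative choices, not part of the slides.

    # Parallel merge-sort sketch: spawn a process for one half, sort the other
    # half locally, then sync and merge.
    from heapq import merge
    from multiprocessing import Pipe, Process

    def _sort_half(s, depth, conn):
        # Worker: sort one half and send the result back to the parent.
        conn.send(parallel_merge_sort(s, depth))
        conn.close()

    def parallel_merge_sort(s, depth=2):
        if len(s) <= 1:
            return list(s)
        mid = len(s) // 2
        if depth <= 0:
            # Small sub-problems are not worth a new process: recurse sequentially.
            return list(merge(parallel_merge_sort(s[:mid], 0),
                              parallel_merge_sort(s[mid:], 0)))
        parent_end, child_end = Pipe()
        # Initiate a process to sort the first half ...
        p = Process(target=_sort_half, args=(s[:mid], depth - 1, child_end))
        p.start()
        # ... sort the second half here, then sync all invoked processes and merge.
        right = parallel_merge_sort(s[mid:], depth - 1)
        left = parent_end.recv()
        p.join()
        return list(merge(left, right))

    if __name__ == "__main__":
        print(parallel_merge_sort([7, 2, 9, 4, 1, 8, 3, 6]))
        # -> [1, 2, 3, 4, 6, 7, 8, 9]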
12
The serial runtime for merge-sort can be expressed as:
◦ Ts(n) = n log n + n = O(n log n)
Parallel runtime for merge-sort on n processors:
◦ Tp(n) = log n + n = O(n)
Speedup S = O(log n)
Cost = n × Tp(n) = O(n²)
Is it cost optimal?
13
We have a parallelism bottleneck (the merge step).
◦ Can we parallelize it? We have two sorted lists. Searching in a sorted list is best done by divide and conquer (binary search), which is O(log n). If we partition the two lists on the middle element and merge them using a binary-search technique, we could reduce the merge operation to O(log n).
14
ListA Parallel_Merge(S1, S2) {
  if (the length of either S1 or S2 == 1) {
    add the element of the shorter list to the other list and return the resulting list.
  } else {
    TS1 ← all elements of S1 from 0 to S1.length/2
    TS2 ← all elements of S1 from S1.length/2 to S1.length-1
    TS3 ← all elements of S2 from 0 to S2.length/2
    TS4 ← all elements of S2 from S2.length/2 to S2.length-1
    in parallel, merge these lists according to a comparison between the splitting elements,
    calling the Parallel_Merge function recursively for each process,
    which will return the two partitions S1, S2 sorted such that S1 < S2
    Sync
    A ← S1 + S2
  }
}
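The slide's pseudocode leaves some details open. One standard way to realise a divide-and-conquer merge is to take the median of the longer sorted list as a pivot, binary-search the other list for the matching split point, and merge the two halves independently. The Python sketch below shows that idea; the recursive calls touch disjoint ranges and are exactly the places where parallel tasks could be spawned, but they are executed sequentially here for clarity. The function name and the use of bisect are my own assumptions, not the lecture's.

    # Divide-and-conquer merge of two sorted lists.
    from bisect import bisect_left

    def dc_merge(a, b):
        if not a:
            return list(b)
        if not b:
            return list(a)
        if len(a) < len(b):          # make a the longer list
            a, b = b, a
        mid = len(a) // 2
        pivot = a[mid]
        cut = bisect_left(b, pivot)  # elements of b before cut are < pivot
        # Everything left of the split is <= pivot, everything right is >= pivot,
        # so the two sub-merges are independent (parallelisable).
        left = dc_merge(a[:mid], b[:cut])
        right = dc_merge(a[mid + 1:], b[cut:])
        return left + [pivot] + right

    # Example: dc_merge([1, 3, 5, 7, 9], [2, 4, 6, 8]) -> [1, 2, 3, 4, 5, 6, 7, 8, 9]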
15
[Figure: worked example of the parallel merge of two sorted lists]
16
The serial runtime for merge-sort can be expressed as:
◦ Ts(n) = n log n + n = O(n log n)
Parallel runtime for merge-sort on n processors:
◦ Tp(n) = log n + log n = O(log n)
Speedup S = O(n)
Cost = n log n
Is it cost optimal?
Read the paper on the website for another parallel merge-sort algorithm.
17
We ignored a very important overhead: communication cost. Think of the implementation of this algorithm if it were in a message-passing environment. The analysis of an algorithm must take into account the underlying platform on which it will operate. What about merge-sort: what is the cost if it were run on a message-passing parallel architecture?
19
In reality it is sometimes difficult or inefficient to assign tasks to processing elements at design time. The Bag of Tasks pattern suggests an approach which may be able to reduce overhead, while still providing the flexibility to express such dynamic, unpredictable parallelism. In a bag of tasks, a fixed number of worker processes/threads maintain and process a dynamic collection of homogeneous "tasks". Execution of a particular task may lead to the creation of more task instances.
20
place initial task(s) in bag;
co [w = 1 to P] {
  while (all tasks not done) {
    get a task;
    execute the task;
    possibly add new tasks to the bag;
  }
}
The pattern is naturally load-balanced: each worker will probably complete a different number of tasks, but will do roughly the same amount of work overall.
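A minimal sketch of the pattern in Python, using a shared queue.Queue as the bag and a fixed pool of worker threads. The task representation, the sentinel-based shutdown, and the choice of threads rather than processes are assumptions made for illustration only.

    # Bag-of-tasks sketch: P workers pull tasks from a shared bag; executing a
    # task may put new tasks back into the bag.
    import queue
    import threading

    P = 4                       # number of workers
    bag = queue.Queue()

    def execute(task):
        # Example task: each positive task spawns a smaller one.
        if task > 0:
            bag.put(task - 1)   # possibly add a new task to the bag
        print(f"worker {threading.current_thread().name} did task {task}")

    def worker():
        while True:
            task = bag.get()
            if task is None:    # sentinel: no more work
                bag.task_done()
                return
            execute(task)
            bag.task_done()

    # Place the initial tasks in the bag.
    for t in (3, 5, 2):
        bag.put(t)

    workers = [threading.Thread(target=worker, name=str(w)) for w in range(P)]
    for w in workers:
        w.start()
    bag.join()                  # wait until every task (old and new) is done
    for _ in workers:
        bag.put(None)           # tell the workers to stop
    for w in workers:
        w.join()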
21
The Producers-Consumers pattern arises when a group of activities generate data which is consumed by another group of activities. The key characteristic is that the conceptual data flow is all in one direction, from producer(s) to consumer(s). In general, we want to allow production and consumption to be loosely synchronized, so we will need some buffering in the system. The programming challenges are to ensure that no producer overwrites a buffer entry before a consumer has used it, and that no consumer tries to consume an entry which doesn't really exist (or re-use an already consumed entry).
22
Depending upon the model, these challenges motivate the need for various facilities. For example, with a buffer in shared address space, we may need atomic actions and condition synchronization. Similarly, in a distributed implementation we want to avoid tight synchronization between sends to the buffer and receives from it.
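As one concrete illustration of the shared-address-space case, a bounded buffer can be protected with a lock and condition synchronization. The Python sketch below shows the idea; the buffer capacity, the item values and the single producer/consumer pair are illustrative assumptions.

    # Bounded-buffer sketch with condition synchronization: producers wait while
    # the buffer is full, consumers wait while it is empty.
    import threading
    from collections import deque

    CAPACITY = 4
    buffer = deque()
    cond = threading.Condition()

    def produce(item):
        with cond:
            while len(buffer) == CAPACITY:   # don't overwrite an unconsumed entry
                cond.wait()
            buffer.append(item)
            cond.notify_all()

    def consume():
        with cond:
            while not buffer:                # don't consume an entry that doesn't exist
                cond.wait()
            item = buffer.popleft()
            cond.notify_all()
            return item

    producer = threading.Thread(target=lambda: [produce(i) for i in range(10)])
    consumer = threading.Thread(target=lambda: [print(consume()) for _ in range(10)])
    producer.start(); consumer.start()
    producer.join(); consumer.join()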
23
When one group of consumers becomes the producers for yet another group, we have a pipeline. Items of data flow from one end of the pipeline to the other, being transformed by, and/or transforming the state of, the pipeline stage processes as they go.
24
The Sieve of Eratosthenes provides a simple pipeline example, with the additional factor that we build the pipeline dynamically. The objective is to find all prime numbers in the range 2 to N. The gist of the original algorithm was to write down all integers in the range, then repeatedly remove all multiples of the smallest remaining number. After each removal phase, the new smallest remaining number is guaranteed to be prime. We will implement a message-passing pipelined parallel version by creating a generator process and a sequence of sieve processes, each of which does the work of one removal phase. The pipeline grows dynamically, creating new sieves on demand, as unsieved numbers emerge from the pipeline.
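A possible Python sketch of the pipelined sieve, with message passing between processes: each sieve stage holds one prime, filters its multiples, and spawns the next stage on demand when an unsieved number arrives. The use of multiprocessing.Queue, the None sentinel for shutdown, playing the generator role in the main process, and N = 30 are all illustrative assumptions.

    # Dynamically growing pipeline of sieve processes.
    from multiprocessing import Process, Queue

    def sieve(inbox, primes):
        prime = inbox.get()          # first number to arrive at a stage is prime
        primes.put(prime)
        outbox, successor = None, None
        while True:
            n = inbox.get()
            if n is None:            # end of stream: shut the pipeline down
                if outbox is not None:
                    outbox.put(None)
                    successor.join()
                else:
                    primes.put(None) # last stage signals the collector
                return
            if n % prime != 0:       # not a multiple: pass it down the pipeline
                if outbox is None:   # grow the pipeline dynamically
                    outbox = Queue()
                    successor = Process(target=sieve, args=(outbox, primes))
                    successor.start()
                outbox.put(n)

    if __name__ == "__main__":
        N = 30
        first, primes = Queue(), Queue()
        head = Process(target=sieve, args=(first, primes))
        head.start()
        for i in range(2, N + 1):    # generator process role, played by main here
            first.put(i)
        first.put(None)
        while (p := primes.get()) is not None:
            print(p)
        head.join()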