Slide 1: Programming Parallel Algorithms - NESL
Guy E. Blelloch
Presented by: Michael Sirivianos, Barbara Theodorides
Slide 2: Problem Statement
Why design a new language specifically for programming parallel algorithms?
- In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms.
- Over the same period there has been much less success in developing good languages for programming parallel algorithms.
- There is a large gap between languages that are too low level (details obscure the meaning of the algorithm) and languages that are too high level (performance implications are unclear).
Slide 3: NESL
Nested Data-Parallel Language, useful for teaching and implementing parallel algorithms.
- Bridges the gap: allows high-level descriptions of parallel algorithms while also having a straightforward mapping onto a performance model.
Goals when designing NESL:
- A language-based performance model that uses work and depth, rather than a machine-based model that uses running time.
- Support for nested data-parallel constructs (the ability to nest parallel calls).
Slide 4: Analyzing Performance
- Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), a function of the input size and the number of processors.
- Virtual models: higher-level models that can be mapped onto various real machines (e.g. the PRAM - Parallel Random Access Machine). They can be mapped efficiently onto more realistic machines by simulating multiple PRAM processors on a single processor of the host machine. Virtual models are easier to program.
Slide 5: Measuring Performance: Work & Depth
- Work: the total number of operations executed by a computation; it specifies the running time on a sequential processor.
- Depth: the longest chain of sequential dependencies in the computation; it represents the best possible running time assuming an ideal machine with an unlimited number of processors.
Example: summing 16 numbers using a balanced binary tree: work = 15 additions, depth = log2(16) = 4 additions.
Slide 6: How can work & depth be incorporated into a computational model? Circuit Model
- Design a circuit of logic gates.
- In the previous example, design a circuit in which the inputs are at the top, each "+" is an adder circuit, and each of the lines between adders is a bundle of wires.
- Work = circuit size (number of gates).
- Depth = longest path from an input to an output.
Slide 7: How can work & depth be incorporated into a computational model? (cont.) Vector Machine Models
- A VRAM is a sequential RAM extended with a set of instructions that operate on vectors.
- Each location in memory contains a whole vector, and vectors can vary in size during the computation.
- Vector instructions include elementwise operations (e.g. adding the corresponding elements of two vectors).
- Depth = number of instructions executed by the machine.
- Work = sum of the lengths of the vectors operated on.
Slide 8: Vector Machine Models - Example
Summation tree code: repeatedly add the two halves of the vector elementwise until a single element remains (a sketch follows below).
- Work = O(n + n/2 + n/4 + ...) = O(n)
- Depth = O(log n)
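The slide's summation code itself did not survive the export; a minimal NESL-style sketch of the idea (assuming the input length is a power of two) is:

    function vsum(a) =
      if (#a == 1) then a[0]
      else
        let h = #a / 2
        in vsum({x + y : x in take(a, h); y in drop(a, h)});

    vsum([7, 2, 9, 4, 1, 0, 5, 3]);    % => 31 %

Each level performs one elementwise add of the two halves, so the vector lengths processed sum to O(n + n/2 + ...) = O(n) work across O(log n) levels of depth.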
Slide 9: How can work & depth be incorporated into a computational model? (cont.) Language-Based Models
- Specify the costs of the primitive instructions, plus a set of rules for composing costs across program expressions.
- This lets us discuss the running time of algorithms without introducing a specific machine model.
- Using work & depth: work and depth costs are assigned to each function and scalar primitive of the language, and rules are specified for combining parallel and sequential expressions.
Roughly speaking, when executing a set of tasks in parallel:
- work = sum of the work of the tasks
- depth = maximum of the depths of the tasks
Slide 10: Why Work & Depth?
Work & depth have been used informally for many years to describe the performance of parallel algorithms:
- easier to describe
- easier to think about
- easier to analyze algorithms in terms of work & depth than in terms of running time and number of processors (the processor-based model)
Why are models based on work & depth better than processor-based models for programming and analyzing parallel algorithms? Because the performance analysis stays close to the code, and the code provides a clear abstraction of parallelism.
Slide 11: Why Work & Depth? (cont.)
To support this claim, the authors consider Quicksort.
- Sequential algorithm, average case: running time O(n log n); depth of the recursive calls O(log n).
- Parallel algorithm: see the sketch below.
Slide 12: Quicksort (cont.)
With code and analysis based on a processor-based model, the code would have to specify:
- how the sequence is partitioned across the processors
- how the subselection is implemented in parallel
- how the recursive calls get partitioned among the processors
- how the subcalls are synchronized
In the case of Quicksort this gets even more complicated, because the recursive calls are not of equal sizes.
Slide 13: Work & Depth and Running Time
Running time at the two limits:
- single processor: T = W
- unlimited number of processors: T = D
For a given number of processors P we can place upper and lower bounds on the running time:
    W/P <= T <= W/P + D
valid under assumptions about communication and scheduling costs. For example, given a memory latency L:
    W/P <= T <= W/P + L*D
since communication among processors is not unit-time, D is multiplied by a latency factor. Bandwidth is not taken into account in these bounds; if the machine's bandwidth differs significantly, W and D would each have to be scaled by appropriate bandwidth factors. A short derivation sketch of the upper bound follows.
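The upper bound is essentially Brent's scheduling bound; a minimal LaTeX sketch of the argument (our addition, not the slide's):

    % With a greedy scheduler and P processors, every time step either
    % keeps all P processors busy or completes a node on the critical path.
    % There are at most W/P fully-busy steps and at most D critical-path steps:
    T \le \frac{W}{P} + D,
    \qquad\text{while trivially}\qquad
    T \ge \max\!\left(\frac{W}{P},\, D\right).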
Slide 14: Work & Depth and Running Time (cont.) - Communication Bounds
Work & depth do not take communication costs into account:
- latency: the time between making a remote request and receiving the reply
- bandwidth: the rate at which a processor can access memory
Latency can be hidden: each processor has multiple parallel tasks (threads) to execute, and therefore has plenty to do while waiting for replies.
Bandwidth cannot be hidden: while a processor is waiting for a data transfer to complete, it cannot perform other operations and remains idle.
Slide 15: Nested Data-Parallelism and NESL
- Data-parallelism: the ability to operate in parallel over sets of data.
- Data-parallel (or collection-oriented) languages are languages based on data-parallelism; they can be either flat or nested.
Importance of nested parallelism:
- used to implement nested loops and divide-and-conquer algorithms in parallel
- existing languages, such as C, have no direct support for such nesting
NESL is a nested data-parallel language, designed to express nested parallelism in a simple way with a minimal set of constructs.
Slide 16: NESL
NESL supports data-parallelism by means of operations on sequences:
- An apply-to-each construct using set-like notation, e.g. {a * a : a in [3, -4, -9, 5]};
- Apply-to-each over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};
- Subselection of the elements of a sequence based on a filter: {a * a : a in [3, -4, -9, 5] | a > 0};
- Any function may be applied to each element of a sequence: {factorial(i) : i in [3, 1, 7]};
- A set of built-in functions on sequences, each of which can be implemented in parallel (sum, reverse, write): write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);
- Nested parallelism: sequences may be nested, and parallel functions may be used inside an apply-to-each: {sum(a) : a in [[2,3], [8,3,9], [7]]};
The values these expressions evaluate to are shown below.
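For concreteness, here are the expressions above together with the values they evaluate to (the results follow from NESL's semantics; they were not on the slide):

    {a * a : a in [3, -4, -9, 5]};                      % => [9, 16, 81, 25] %
    {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};   % => [4, -2, -6, 9]  %
    {a * a : a in [3, -4, -9, 5] | a > 0};              % => [9, 25]         %
    {factorial(i) : i in [3, 1, 7]};                    % => [6, 1, 5040]    %
    write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);
                                                        % => [0, 0, 5, 0, -2, 9, 0, 0] %
    {sum(a) : a in [[2,3], [8,3,9], [7]]};              % => [5, 20, 7]      %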
Slide 17: The Performance Model
The model defines:
- work & depth in terms of the work and depth of the primitive operations, and
- rules for composing the measures across expressions.
In most cases:
    W(e1 + e2) = 1 + W(e1) + W(e2),   where the ei are expressions.
A similar rule is used for depth. The rules for apply-to-each and if expressions are sketched below.
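The apply-to-each and if rules themselves did not survive the export; the LaTeX reconstruction below follows the composition rules in Blelloch's paper (work sums over the parallel instances, depth takes their maximum, and one unit is charged for the construct itself):

    W(\{e_1(a) : a \in e_2\}) = 1 + W(e_2) + \sum_{a \in e_2} W(e_1(a)),
    \qquad
    D(\{e_1(a) : a \in e_2\}) = 1 + D(e_2) + \max_{a \in e_2} D(e_1(a))

    W(\text{if } e_1 \text{ then } e_2 \text{ else } e_3) =
      1 + W(e_1) + \begin{cases} W(e_2) & \text{if } e_1 \text{ is true} \\
                                 W(e_3) & \text{otherwise} \end{cases}

and symmetrically for D on the conditional.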
Slide 18: The Performance Model (cont.) - Example: Factorial
Consider the evaluation of the expression e = {factorial(n) : n in a} where a = [3, 1, 5, 2].

    function factorial(n) =
      if (n == 1) then 1
      else n * factorial(n - 1);

Using the rules for work and depth (where ==, *, and - each have cost 1):
    W(e) = 1 + sum over n in a of W(factorial(n))
    D(e) = 1 + max over n in a of D(factorial(n))
Each level of the recursion contributes constant work and depth, with two extra unit constants coming from the cost of the function call and of the if-then-else statement. Hence W(factorial(n)) and D(factorial(n)) are both O(n), so W(e) is proportional to 3 + 1 + 5 + 2 = 11 and D(e) is proportional to max(a) = 5.
Slide 19: Examples of Parallel Algorithms in NESL - Principles
- An important aspect of developing a good parallel algorithm is designing one whose work is close to the running time of a good sequential algorithm for the same problem.
- Work-efficient: a parallel algorithm is called work-efficient relative to a sequential algorithm if its work is within a constant factor of the sequential algorithm's running time.
Slide 20: Examples of Parallel Algorithms in NESL (cont.) - Primes
Sieve of Eratosthenes:

    1  procedure PRIMES(n):
    2    let A be an array of length n
    3    set all but the first element of A to TRUE
    4    for i from 2 to sqrt(n)
    5    begin
    6      if A[i] is TRUE
    7        then set all multiples of i up to n to FALSE
    8    end

Line 7 is implemented by looping over the multiples, so the algorithm takes O(n log log n) time.
Slide 21: Primes (parallelized)
Parallelize line 7, "set all multiples of i up to n to FALSE":
- the multiples of a value i can be generated in parallel by [2*i:n:i], and
- they can be written into the array A in parallel with the write function.
The depth of this algorithm is O(sqrt(n)): each iteration of the loop has constant depth and there are sqrt(n) iterations. The number of multiples generated is the same as in the sequential version, so the work is the same: O(n log log n).
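As an illustration of NESL's [start:end:stride] sequence notation (this evaluation is our addition, not the slide's):

    [2*3:20:3];        % multiples of 3 up to (but excluding) 20 %
                       % => [6, 9, 12, 15, 18] %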
Slide 22: Primes - Improving Depth
If we were given all the primes from 2 up to sqrt(n), we could generate all the multiples of these primes at once with {[2*p:n:p] : p in sqr_primes}, which leads to a recursive algorithm:

    function primes(n) =
      if n == 2 then ([] int)
      else
        let sqr_primes = primes(isqrt(n));
            composites = {[2*p:n:p] : p in sqr_primes};
            flat_comps = flatten(composites);
            flags      = write(dist(true, n), {(i, false) : i in flat_comps});
            indices    = {i in [0:n]; fl in flags | fl}
        in drop(indices, 2);
Slide 23: Primes - Improving Depth: Analysis of Work & Depth
- Work: most of the work is clearly done at the top level of the recursion, which does O(n log log n) work, so the total work is O(n log log n).
- Depth: each recursion level has constant depth, so the total depth is proportional to the number of levels. The problem size at the d-th level is n^(1/2^d), which reaches the base case after d = O(log log n) levels, so the depth is O(log log n).
This algorithm remains work-efficient and greatly improves the depth.
Slide 24: Sparse Matrix Multiplication
Sparse matrices: most elements are zero. A matrix is represented in NESL as a sequence of rows, each row being a sequence of (column-index, value) pairs for its nonzero elements:

        [ 2.0 -1.0  0    0  ]        A = [[(0, 2.0), (1, -1.0)],
    A = [-1.0  2.0 -1.0  0  ]             [(0, -1.0), (1, 2.0), (2, -1.0)],
        [ 0   -1.0  2.0 -1.0]             [(1, -1.0), (2, 2.0), (3, -1.0)],
        [ 0    0   -1.0  2.0]             [(2, -1.0), (3, 2.0)]]

E.g. multiply a sparse matrix A with a dense vector x. The matrix-vector product Ax in NESL is:
    {sum({v * x[i] : (i,v) in row}) : row in A};
Let n be the number of nonzero elements in a row; then for that row
- depth of the computation = depth of the sum = O(log n), and
- work = sum of the work across the elements = O(n).
A quick numeric check follows.
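With A as above and an assumed input vector x (our example, not the slide's):

    x = [1.0, 2.0, 3.0, 4.0];
    {sum({v * x[i] : (i, v) in row}) : row in A};
    % => [0.0, 0.0, 0.0, 5.0] %
    % e.g. the last row gives -1.0 * 3.0 + 2.0 * 4.0 = 5.0 %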
Slide 25: Planar Convex Hull
Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all the points.
- An example of nested parallelism for divide-and-conquer algorithms.
- The Quickhull algorithm (similar to Quicksort): pick a pivot element, split the data based on the pivot, and recurse on each of the split sets.
- Worst-case work is O(n^2) and worst-case depth is O(n).
Slide 26: Quickhull - the hsplit subroutine
The hull is assembled as hsplit(set, A, P) ++ hsplit(set, P, A). Each call hsplit(set, A, P):
- computes the cross product of each point p with the line (A, P),
- keeps only the points above the line (elements below the line are ignored),
- picks pm, the point farthest from the line A-P, and
- recurses as hsplit(set', A, pm) and hsplit(set', pm, P).
A code sketch follows.
Slide 27: Performance Analysis of Quickhull
- Each recursive call has constant depth and O(n) work; however, since many points might be deleted at each step, the work can be significantly less in practice.
- As in Quicksort, the worst-case work is O(n^2) and the worst-case depth is O(n).
- For m hull points, the best-case costs are O(n) work and O(log m) depth.
Slide 28: Summary
- The paper formalizes a clear-cut, language-based model for analyzing performance.
- The work & depth model is defined directly through a programming language, rather than through a specific machine.
- It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs.
- NESL allows simple descriptions of parallel algorithms, making use of data-parallel constructs and the ability to nest such constructs.
Slide 29: Summary (cont.)
- NESL hides the CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism.
- The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL).
- For more information on how the NESL compiler is implemented, see "Implementation of a Portable Nested Data-Parallel Language" by Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.
Slide 30: Discussion
Parallel processing - sensor network analogy:
- Local processing -> aggregation; work corresponds to the total aggregation cost.
- Moving levels up -> collecting aggregated results from children nodes.
- Depth -> depth of the routing tree in the sensor network; it implies communication cost.
- Latency -> cost to transmit data between motes.
In parallel computation the goal is to reduce execution time; sensor networks aim instead to reduce power consumption by minimizing communication. Execution time also matters there when real-time requirements are imposed.
Slide 31: Discussion
- NESL and TAG queries?
- Can latency be hidden by assigning multiple tasks to motes?
- Can you perform different operations on an array's elements in parallel?
- Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?