Programming Parallel Algorithms - NESL Guy E. Blelloch Presented by: Michael Sirivianos Barbara Theodorides

Problem Statement Why design a new language specifically for programming parallel algorithms? Over the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms, but far less success in developing good languages for programming them. There is a large gap between languages that are too low level (details obscure the meaning of the algorithm) and languages that are too high level (the performance implications are unclear).

NESL Nested Data-Parallel Language Useful for teaching and implementing parallel algorithms. Bridges the gap: allows high-level descriptions of parallel algorithms, but also has a straightforward mapping onto a performance model. Goals when designing NESL: a language-based performance model that uses work and depth rather than a machine-based model that uses running time, and support for nested data-parallel constructs (the ability to nest parallel calls).

Analyzing performance Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), a function of the input size and the number of processors. Virtual models: higher-level models that can be mapped onto various real machines (e.g. the PRAM - Parallel Random Access Machine). They can be mapped efficiently onto more realistic machines by simulating multiple processors of the PRAM on a single processor of a host machine. Virtual models are easier to program.

Measuring performance: Work & Depth Work: the total number of operations executed by a computation; it specifies the running time on a sequential processor. Depth: the longest chain of sequential dependencies in the computation; it represents the best possible running time assuming an ideal machine with an unlimited number of processors. Example: summing 16 numbers using a balanced binary tree takes 15 additions of work and 4 levels of depth.

How can work & depth be incorporated into a computational model? Circuit model Designing a circuit of logic gates In previous example, design a circuit in which the inputs are at the top, each “+” is an adder circuit, and each of the lines between adders is a bundle of wires. Work = circuit size (number of gates) Depth = longest path from an input to an output

How can work & depth be incorporated into a computational model? (cont) Vector Machine Models VRAM is a sequential RAM extended with a set of instructions that operate on vectors. Each location in memory contains a whole vector Vectors can vary in size during the computation Vector instructions include element wise operations (adding corresponding elements) Depth = #instructions executed by the machine Work = sum of the lengths of the vectors

How can work & depth be incorporated into a computational model? (cont) Vector Machine Models Example Summation tree code Work = O ( n + n/2 + … ) = O (n) Depth = O (log n)

How can work & depth be incorporated into a computational model? (cont) Language-Based Models Specify the costs of the primitive instructions and a set of rules for composing costs across program expressions. Discuss the running time of the algorithms without introducing a specific machine model. Using work & depth: work & depth costs are assigned to each function and scalar primitive of a language and rules are specified for combining parallel and sequential expressions. Roughly speaking, when executing a set of tasks in parallel: work = sum of work of the tasks depth = maximum of the depth of the tasks

Why Work & Depth? Work & Depth: used informally for many years to describe the performance of parallel algorithms easier to describe easier to think about easier to analyze algorithms in terms of work & depth than in terms of running time and number of processors (processor-based model) Why models based on work & depth are better than processor-based models for programming and analyzing parallel algorithms? Performance analysis is closely related to the code and code provides a clear abstraction of parallelism.

Why Work & Depth? (cont) To support this claim they consider Quicksort. Sequential algorithm: Average case: run time = O ( n log n ), depth or recur. calls = O ( log n ) Parallel algorithm:

Quicksort (cont.) Code and analysis based on a processor-based model would have to specify: how the sequence is partitioned across processors, how the subselection is implemented in parallel, how the recursive calls get partitioned among the processors, and how the subcalls are synchronized. In the case of Quicksort this gets even more complicated, because the recursive calls are not of equal sizes.

Work & Depth and running time Running time at the two limits: Single processor. RT = work Unlimited number of processors. RT = depth We can place upper and lower bounds for a given number of processor. W/ P <= T <= W / P + D valid under assumptions about communication and scheduling costs. e.g. given memory latency L W/ P <= T <= W / P + L*D Communication cost among processor is not unit time thus D is multiplied by a latency factor. Bandwidth is not taken into account. In case of significantly different bandwidth W should be divided by a large B factor and D by a small B factor.

Work & Depth and running time (cont) Communication Bounds Work & depth do not take into account communication costs: latency: time between making a remote request and receiving the reply bandwidth: rate at which a processor can access memory Latency can be hidden. Each processor has multiple parallel tasks (threads) to execute and therefore has plenty to do while waiting for replies Bandwidth can not be hidden. While processor is waiting for data transfer to complete it is not able to perform other operations, and therefore remains idle..

Nested Data-Parallelism and NESL Data-parallelism: the ability to operate in parallel over sets of data. Data-parallel languages, or collection-oriented languages, are languages based on data-parallelism; they can be either flat or nested. Importance of nested parallelism: it is used to implement nested loops and divide-and-conquer algorithms in parallel. Existing languages, such as C, do not have direct support for such nesting! NESL is a nested data-parallel language, designed to express nested parallelism in a simple way with a minimal set of constructs.

NESL Supports data-parallelism by means of operations on sequences. An apply-to-each construct that uses a set-like notation, e.g. {a * a : a in [3, -4, -9, 5]}; It can be used over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]}; Elements of a sequence can be subselected based on a filter, e.g. {a * a : a in [3, -4, -9, 5] | a > 0}; Any function may be applied to each element of a sequence, e.g. {factorial(i) : i in [3, 1, 7]}; NESL provides a set of functions on sequences, each of which can be implemented in parallel (sum, reverse, write), e.g. write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]); Nested parallelism: sequences may be nested and parallel functions may be used in an apply-to-each, e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]};

The Performance Model Defines work & depth in terms of the work and depth of the primitive operations, and rules for composing the measures across expressions. In most cases: W(e1 + e2) = 1 + W(e1) + W(e2), where the ei are expressions. A similar rule is used for depth. For an apply-to-each expression {e1(a) : a in e2}, the work is one plus the sum of the work across the elements, and the depth is one plus the maximum of the depths: W({e1(a) : a in e2}) = 1 + W(e2) + sum over a of W(e1(a)) D({e1(a) : a in e2}) = 1 + D(e2) + max over a of D(e1(a)) For an if expression, the cost is the cost of evaluating the condition plus the cost of the branch that is taken.

The Performance Model (cont) Example: Factorial Consider the evaluation of the expression e = {factorial(n) : n in a} where a = [3, 1, 5, 2], and

function factorial(n) =
  if (n == 1) then 1
  else n * factorial(n-1);

Using the rules for work and depth, the primitives ==, * and - each have cost 1; two further unit constants come from the cost of the function call and the if-then-else statement.
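
Since factorial contains no internal parallelism, its depth equals its work, and both grow linearly with n. Applying the composition rules to a = [3, 1, 5, 2] gives (a sketch under the unit-cost assumptions above):

W(e) = 1 + W(factorial(3)) + W(factorial(1)) + W(factorial(5)) + W(factorial(2)) = O(3 + 1 + 5 + 2) = O(sum(a))
D(e) = 1 + max(D(factorial(3)), D(factorial(1)), D(factorial(5)), D(factorial(2))) = O(5) = O(max(a))

The four factorial calls proceed in parallel, so the depth of the whole expression is governed by the largest argument.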

Examples of Parallel Algorithms in NESL Principles: an important aspect of developing a good parallel algorithm is designing one whose work is close to the running time of a good sequential algorithm for the same problem. Work-efficient: a parallel algorithm is called work-efficient relative to a sequential algorithm if its work is within a constant factor of the sequential algorithm's time.

Examples of Parallel Algorithms in NESL (cont) Primes Sieve of Eratosthenes:
1 procedure PRIMES(n):
2   let A be an array of length n
3   set all but the first element of A to TRUE
4   for i from 2 to sqrt(n)
5   begin
6     if A[i] is TRUE
7       then set all multiples of i up to n to FALSE
8   end
Line 7 is implemented by looping over the multiples, so the algorithm takes O(n log log n) time.

Examples of Parallel Algorithms in NESL (cont) Primes (parallelized) Parallelize line 7, “set all multiples of i up to n to FALSE”: the multiples of a value i can be generated in parallel by [2*i:n:i] and written into the array A in parallel with the write function (see the sketch below). The depth of this algorithm is O(sqrt(n)), since each iteration of the loop has constant depth and there are sqrt(n) iterations. The number of multiples is the same as in the sequential version; since the algorithm does the same number of operations, the work is the same, O(n log log n).
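
A minimal sketch of one iteration of the parallelized line 7, assuming A is the boolean flag array from the pseudocode (NESL is functional, so write returns a new sequence rather than updating A in place):

write(A, {(j, false) : j in [2*i:n:i]})

This builds, in one constant-depth parallel step, a copy of A in which every multiple of i up to n has been set to false.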

Examples of Parallel Algorithms in NESL (cont) Primes: Improving depth If we are given all the primes from 2 up to sqrt(n), we can generate all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}

function primes(n) =
  if n == 2 then ([] int)
  else
    let sqr_primes = primes(isqrt(n));
        composites = {[2*p:n:p] : p in sqr_primes};
        flat_comps = flatten(composites);
        flags      = write(dist(true, n), {(i,false) : i in flat_comps});
        indices    = {i in [0:n]; fl in flags | fl}
    in drop(indices, 2);

Examples of Parallel Algorithms in NESL (cont) Primes: Improving depth Analysis of work & depth: Work: clearly most of the work is done at the top level of recursion, which does O(n log log n) work, so the total work is O(n log log n). Depth: since each recursion level has constant depth, the total depth is proportional to the number of levels. The problem size at recursion level d is n^(1/2^d), so the recursion reaches its base case after d = log log n levels, and the depth is therefore O(log log n). This algorithm remains work-efficient and greatly improves the depth.

Examples of Parallel Algorithms in NESL (cont) Sparse Matrix Multiplication Sparse matrices: most elements are zero. Representation in NESL (one sequence of (column, value) pairs per row):

A = [[(0, 2.0), (1, -1.0)],
     [(0, -1.0), (1, 2.0), (2, -1.0)],
     [(1, -1.0), (2, 2.0), (3, -1.0)],
     [(2, -1.0), (3, 2.0)]]

E.g. multiplying the sparse matrix A with a dense vector x: the product Ax in NESL is
{sum({v * x[i] : (i,v) in row}) : row in A};
Let n be the number of nonzero elements in a row; then the depth of the computation is the depth of the sum, O(log n), and the work is the sum of the work across the elements, O(n).
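
As a worked example (ours, not from the slides): with the tridiagonal matrix A above and an all-ones vector, the expression evaluates each row's dot product in parallel:

x = [1.0, 1.0, 1.0, 1.0];
{sum({v * x[i] : (i,v) in row}) : row in A};
=> [1.0, 0.0, 0.0, 1.0]

The nesting is irregular, with rows of different lengths, which is exactly what nested data-parallelism handles cleanly.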

Examples of Parallel Algorithms in NESL (cont) Planar Convex Hull Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all the points. An example of nested parallelism for divide-and-conquer algorithms. The Quickhull algorithm (similar to Quicksort): the strategy is to pick a pivot element, split the data based on the pivot, and recurse on each of the split sets. Worst-case work is O(n^2) and worst-case depth is O(n).

Examples of Parallel Algorithms in NESL (cont) (Slide diagram: one step of Quickhull.) The hull is built from hsplit(set, A, P) and hsplit(set, P, A), where A and P are the extreme points. For each point p, the cross product (p, (A, P)) measures its distance from the line A-P; pm is the point farthest from that line. The algorithm then recurses with hsplit(set', A, pm) and hsplit(set', pm, P), where set' ignores the elements below the line.
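
The transcript loses the code from this slide; below is a sketch along the lines of the paper's NESL Quickhull. It assumes a helper cross_product(p, (p1, p2)) giving the signed distance of p from the line p1-p2, and the library functions plusp, min_index, and max_index:

function hsplit(points, p1, p2) =
  let cross  = {cross_product(p, (p1, p2)) : p in points};
      packed = {p in points; c in cross | plusp(c)};
  in if (#packed < 2) then [p1] ++ packed
     else
       let pm = points[max_index(cross)];
       in flatten({hsplit(packed, fr, to) : (fr, to) in [(p1, pm), (pm, p2)]});

function quick_hull(points) =
  let xs   = {x : (x, y) in points};
      minx = points[min_index(xs)];
      maxx = points[max_index(xs)];
  in hsplit(points, minx, maxx) ++ hsplit(points, maxx, minx);

Keeping only the points with a positive cross product (packed) discards everything below the current line, and the apply-to-each over the two (from, to) pairs runs both recursive calls in parallel.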

Examples of Parallel Algorithms in NESL (cont) Performance analysis of Quickhull: each recursive call has constant depth and O(n) work. However, since many points may be deleted on each step, the work can be significantly less. As in Quicksort, the worst-case work is O(n^2) and the worst-case depth is O(n). For m hull points the best-case costs are O(n) work and O(log m) depth.

Summary The paper formalizes a clear-cut language-based model for analyzing performance. The work & depth model is defined directly through a programming language, rather than through a specific machine. It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs. NESL allows simple descriptions of parallel algorithms, making use of data-parallel constructs and the ability to nest such constructs.

Summary NESL hides the CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism. The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL). For more information on how the NESL compiler is implemented, see “Implementation of a Portable Nested Data-Parallel Language” by Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.

Discussion Parallel Processing - Sensor Network Analogy: Local processing -> aggregation; work corresponds to the total aggregation cost. Moving levels up -> collecting aggregated results from children nodes. Depth -> the depth of the routing tree in the sensor network; it implies communication cost. Latency -> the cost to transmit data between motes. In parallel computation the goal is to reduce execution time, whereas sensor networks aim to reduce power consumption by minimizing communication. Execution time is also an issue when real-time requirements are imposed.

Discussion NESL and TAG queries? Can latency be hidden by assigning multiple tasks to motes? Can you perform different operations on an array's elements in parallel? Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?