Lecture 8-1 : Parallel Algorithms (focus on sorting algorithms)
Courtesy: course note slides by Prof. Chowdhury (SUNY Stony Brook) and Prof. Grossman (UW) are used in this lecture note.


Parallel/Distributed Algorithms
Parallel program (algorithm)
- A program (algorithm) is divided into multiple processes (threads), which run on multiple processors
- The processors normally are in one machine, execute one program at a time, and have high-speed communication between them
Distributed program (algorithm)
- A program (algorithm) is divided into multiple processes, which run on multiple distinct machines
- The machines are usually connected by a network and are typically workstations running multiple programs

Parallelism idea
Example: sum the elements of a large array
Idea: have 4 threads simultaneously sum 1/4 of the array each (the partial results ans0, ans1, ans2, ans3 are added to form the final ans)
Warning: this is an inferior first approach
- Create 4 thread objects, each given a portion of the work
- Call start() on each thread object to actually run it in parallel
- Wait for the threads to finish using join()
- Add together their 4 answers for the final result
Problems?: processor utilization, subtask size. A minimal sketch of this approach appears below.
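A minimal sketch of this first approach, assuming a simple Thread subclass (the class and field names here are illustrative, not from the original slides):

// Illustrative sketch of the 4-thread approach; not the slides' exact code.
class SumRange extends java.lang.Thread {
  int lo, hi;        // sum the half-open range arr[lo, hi)
  int[] arr;
  int ans = 0;       // partial result, safe to read after join()
  SumRange(int[] a, int l, int h) { arr = a; lo = l; hi = h; }
  public void run() {
    for (int i = lo; i < hi; i++)
      ans += arr[i];
  }
}

class FirstAttempt {
  static int sum(int[] arr) throws InterruptedException {
    SumRange[] ts = new SumRange[4];
    for (int i = 0; i < 4; i++)   // create 4 thread objects, each given 1/4 of the work
      ts[i] = new SumRange(arr, (i * arr.length) / 4, ((i + 1) * arr.length) / 4);
    for (SumRange t : ts)
      t.start();                  // actually run them in parallel
    int ans = 0;
    for (SumRange t : ts) {
      t.join();                   // wait for each thread to finish
      ans += t.ans;               // add together the 4 answers
    }
    return ans;
  }
}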

A Better Approach
The solution is to use lots of threads, far more than the number of processors (partial results ans0, ans1, …, ansN are combined into ans):
1. Reusable and efficient across platforms
2. Use the processors "available to you now": hand out "work chunks" as you go
3. Load balance: in general, subproblems may take significantly different amounts of time

Naïve algorithm is poor
Suppose we create 1 thread to process every 1000 elements

int sum(int[] arr){
  …
  int numThreads = arr.length / 1000;
  SumThread[] ts = new SumThread[numThreads];
  …
}

Then combining the results requires arr.length / 1000 additions
- Linear in the size of the array (with constant factor 1/1000)
- Previously we had only 4 pieces (constant in the size of the array)
In the extreme, if we create 1 thread for every element, the loop that combines the results takes length-of-array iterations, just like the original sequential algorithm

A better idea: divide-and-conquer
This is straightforward to implement using divide-and-conquer
- Parallelize the recursive calls
The key is that divide-and-conquer parallelizes the result-combining
- If you have enough processors, the total time is the height of the tree: O(log n) (optimal; exponentially faster than the sequential O(n))
We will write all our parallel algorithms in this style

Divide-and-conquer to the rescue!
The key is to do the result-combining in parallel as well, and using recursive divide-and-conquer makes this natural. Easier to write and more efficient asymptotically!

class SumThread extends java.lang.Thread {
  int lo; int hi; int[] arr; // arguments
  int ans = 0;               // result
  SumThread(int[] a, int l, int h) { … }
  public void run(){ // override
    if(hi - lo < SEQUENTIAL_CUTOFF)
      for(int i=lo; i < hi; i++)
        ans += arr[i];
    else {
      SumThread left  = new SumThread(arr, lo, (hi+lo)/2);
      SumThread right = new SumThread(arr, (hi+lo)/2, hi);
      left.start();
      right.start();
      try {
        left.join();  // don't move this up a line – why?
        right.join();
      } catch (InterruptedException e) { } // join() throws a checked exception
      ans = left.ans + right.ans;
    }
  }
}

int sum(int[] arr){
  SumThread t = new SumThread(arr, 0, arr.length);
  t.run();
  return t.ans;
}

Being realistic
In theory, you can divide down to single elements, do all your result-combining in parallel, and get optimal speedup
- Total time O(n/numProcessors + log n)
In practice, creating all those threads and communicating swamps the savings, so:
- Use a sequential cutoff, typically around 500-1000
  - Eliminates almost all the recursive thread creation (bottom levels of the tree)
  - Exactly like quicksort switching to insertion sort for small subproblems, but more important here
- Do not create two recursive threads; create one and do the other "yourself"
  - Cuts the number of threads created by another 2x
A sketch of both optimizations appears below.
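A sketch of run() with both optimizations applied; this follows the SumThread class above but is not the slides' exact code (SEQUENTIAL_CUTOFF is a tuning constant):

public void run(){
  if(hi - lo < SEQUENTIAL_CUTOFF) {
    for(int i = lo; i < hi; i++)        // below the cutoff: plain sequential sum
      ans += arr[i];
  } else {
    SumThread left = new SumThread(arr, lo, (hi + lo) / 2);
    left.start();                       // fork only one new thread for the left half...
    SumThread right = new SumThread(arr, (hi + lo) / 2, hi);
    right.run();                        // ...and compute the right half in this thread
    try { left.join(); }                // wait for the forked half
    catch (InterruptedException e) { throw new RuntimeException(e); }
    ans = left.ans + right.ans;
  }
}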

Similar Problems
- Maximum or minimum element
- Is there an element satisfying some property (e.g., is there a 17)?
- Left-most element satisfying some property (e.g., the first 17)
- Corners of a rectangle containing all points (a bounding box)
- Counts, for example, the number of strings that start with a vowel
Computations of this form are called reductions
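For instance, turning the SumThread pattern above into a maximum reduction only changes the identity value and the combining step (an illustrative fragment, not from the slides):

// in the sequential-cutoff branch:
ans = Integer.MIN_VALUE;            // identity for max instead of 0
for(int i = lo; i < hi; i++)
  ans = Math.max(ans, arr[i]);
// ...and when combining the two halves:
ans = Math.max(left.ans, right.ans);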

Even easier: Maps (Data Parallelism)
A map operates on each element of a collection independently to create a new collection of the same size
- No combining of results
- For arrays, this is so trivial that some hardware has direct support
Canonical example: vector addition (FORALL is pseudocode for a parallel loop)

int[] vector_add(int[] arr1, int[] arr2){
  assert (arr1.length == arr2.length);
  int[] result = new int[arr1.length];
  FORALL(i=0; i < arr1.length; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return result;
}

Maps and reductions: the "workhorses" of parallel programming
- By far the two most important and common patterns
- Two more advanced patterns in the next lecture
- Learn to recognize when an algorithm can be written in terms of maps and reductions
- Use maps and reductions to describe (parallel) algorithms

Divide-and-Conquer in more detail

Divide-and-Conquer
Divide: divide the original problem into smaller subproblems that are easier to solve
Conquer: solve the smaller subproblems (perhaps recursively)
Merge: combine the solutions to the smaller subproblems to obtain a solution for the original problem
Can be extended to parallel algorithms

Divide-and-Conquer
The divide-and-conquer paradigm improves program modularity and often leads to simple and efficient algorithms
- Since the subproblems created in the divide step are often independent, they can be solved in parallel
- If the subproblems are solved recursively, each recursive divide step generates even more independent subproblems to be solved in parallel
- To obtain a highly parallel algorithm, it is often necessary to parallelize the divide and merge steps, too

Example of a Parallel Program (divide-and-conquer approach)
spawn: the subroutine can execute at the same time as its parent
sync: wait until all spawned children are done
- A procedure cannot safely use the return values of the children it has spawned until it executes a sync statement

Fibonacci(n)
1: if n < 2
2:   return n
3: x = spawn Fibonacci(n-1)
4: y = spawn Fibonacci(n-2)
5: sync
6: return x + y
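In Java, spawn/sync corresponds roughly to fork()/join() in the fork-join framework. A sketch of the same computation (illustrative, not from the slides):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class Fib extends RecursiveTask<Integer> {
  final int n;
  Fib(int n) { this.n = n; }
  protected Integer compute() {
    if (n < 2) return n;
    Fib x = new Fib(n - 1);
    x.fork();                   // "spawn": may run in parallel with this task
    Fib y = new Fib(n - 2);
    int yAns = y.compute();     // do the second call ourselves
    return x.join() + yAns;     // "sync": wait for the spawned child
  }
}

// usage: int f = new ForkJoinPool().invoke(new Fib(30));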

Analyzing algorithms
Like all algorithms, parallel algorithms should be:
- Correct
- Efficient
For our algorithms so far, correctness is "obvious", so we'll focus on efficiency
- We want asymptotic bounds
- We want to analyze the algorithm without regard to a specific number of processors

Performance Measures
T_p : the running time of the algorithm on p processors
T_1 (work): the running time of the algorithm on 1 processor
T_∞ (span): the longest time to execute the algorithm on an infinite number of processors

Performance Measures
Lower bounds on T_p:
- T_p >= T_1 / p
- T_p >= T_∞ (p processors cannot be faster than an infinite number of processors)
Speedup: T_1 / T_p, the speedup on p processors
Parallelism: T_1 / T_∞, the maximum possible parallel speedup
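For example, if an algorithm has work T_1 = 1000 and span T_∞ = 10, its parallelism is T_1 / T_∞ = 100: with p = 4 processors the running time is at least max(T_1/p, T_∞) = 250, and adding processors keeps helping until p approaches 100.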

Sorting Algorithms
- Sort an array A[1,…,n] of n keys (using p <= n processors)
- Examples of divide-and-conquer methods: merge-sort, quick-sort

Merge-Sort
Basic plan:
- Divide the array into two halves
- Recursively sort each half
- Merge the two halves to make the sorted whole

Merge-Sort Algorithm
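A standard sequential merge sort, as a reference point for the parallel version (a sketch, not necessarily the slide's exact code):

static void mergeSort(int[] A, int lo, int hi, int[] tmp) {
  if (hi - lo <= 1) return;              // 0 or 1 elements: already sorted
  int mid = (lo + hi) / 2;
  mergeSort(A, lo, mid, tmp);            // sort the left half
  mergeSort(A, mid, hi, tmp);            // sort the right half
  merge(A, lo, mid, hi, tmp);            // merge the two sorted halves
}

static void merge(int[] A, int lo, int mid, int hi, int[] tmp) {
  int i = lo, j = mid, k = lo;
  while (i < mid && j < hi)              // take the smaller head element each step
    tmp[k++] = (A[i] <= A[j]) ? A[i++] : A[j++];
  while (i < mid) tmp[k++] = A[i++];     // copy any leftovers
  while (j < hi)  tmp[k++] = A[j++];
  for (k = lo; k < hi; k++) A[k] = tmp[k];
}

// usage: mergeSort(A, 0, A.length, new int[A.length]);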

Performance analysis
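The usual recurrence, assuming the standard algorithm above: merging two halves of total size n takes Θ(n), so T(n) = 2T(n/2) + Θ(n) = Θ(n log n).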

Time Complexity Notation
Asymptotic notation: a way to describe the behavior of functions in the limit (describing a function's growth rate in terms of a simpler function as its argument grows without bound)

Time Complexity Notation
O notation (upper bound): O(g(n)) = { h(n) : ∃ positive constants c, n_0 such that 0 ≤ h(n) ≤ c·g(n), ∀ n ≥ n_0 }
Ω notation (lower bound): Ω(g(n)) = { h(n) : ∃ positive constants c, n_0 such that 0 ≤ c·g(n) ≤ h(n), ∀ n ≥ n_0 }
Θ notation (tight bound): Θ(g(n)) = { h(n) : ∃ positive constants c_1, c_2, n_0 such that 0 ≤ c_1·g(n) ≤ h(n) ≤ c_2·g(n), ∀ n ≥ n_0 }
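For example, h(n) = 3n^2 + 10n is in Θ(n^2): take c_1 = 3, c_2 = 4, and n_0 = 10, since 3n^2 ≤ 3n^2 + 10n ≤ 4n^2 whenever n ≥ 10.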

Parallel merge-sort
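In the spawn/sync notation used for quicksort below, the standard parallel merge sort is roughly as follows (a sketch; the procedure name is illustrative):

Par-MergeSort ( A[ q : r ] )
1. if q < r then
2.   m ← (q + r) / 2
3.   spawn Par-MergeSort ( A[ q : m ] )
4.   Par-MergeSort ( A[ m + 1 : r ] )
5.   sync
6.   merge A[ q : m ] and A[ m + 1 : r ] into A[ q : r ]

With a sequential merge, the work is still T_1(n) = Θ(n log n), but the span satisfies T_∞(n) = T_∞(n/2) + Θ(n) = Θ(n), since the Θ(n) merge sits on the critical path.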

Performance Analysis
Parallelism = T_1 / T_∞ = Θ(n log n) / Θ(n) = Θ(log n). Too small! We need to parallelize the merge step.

Parallel Merge
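A sketch of the usual divide-and-conquer parallel merge (the standard formulation, e.g., CLRS Ch. 27; not necessarily the slide's exact presentation): to merge two sorted subarrays, take the median x of the larger one, binary-search for x's position in the smaller one, place x in its final slot, and then recursively merge the two left pieces and the two right pieces in parallel (spawn one recursive merge, do the other, then sync). Each recursive subproblem has size at most 3n/4 and each level does a Θ(log n) binary search, so the span is T_∞(n) = T_∞(3n/4) + Θ(log n) = Θ(log^2 n), while the total work remains Θ(n).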


(Sequential) Quick-Sort algorithm
A recursive procedure:
1. Select one of the numbers as the pivot
2. Divide the list into two sublists: a "low list" containing numbers smaller than the pivot, and a "high list" containing numbers larger than the pivot
3. The low list and high list recursively repeat the procedure to sort themselves
4. The final sorted result is the concatenation of the sorted low list, the pivot, and the sorted high list

(Sequential) Quick-Sort algorithm
Given a list of numbers: {79, 17, 14, 65, 89, 4, 95, 22, 63, 11}
- The first number, 79, is chosen as the pivot
  - Low list contains {17, 14, 65, 4, 22, 63, 11}
  - High list contains {89, 95}
- For sublist {17, 14, 65, 4, 22, 63, 11}, choose 17 as the pivot
  - Low list contains {14, 4, 11}
  - High list contains {65, 22, 63}
  - ...
  - {4, 11, 14, 17, 22, 63, 65} is the sorted result of sublist {17, 14, 65, 4, 22, 63, 11}
- For sublist {89, 95}, choose 89 as the pivot
  - Low list is empty (no need for further recursion)
  - High list contains {95} (no need for further recursion)
  - {89, 95} is the sorted result of sublist {89, 95}
Final sorted result: {4, 11, 14, 17, 22, 63, 65, 79, 89, 95}

Illustration of Quick-Sort

Randomized quick-sort

Par-Randomized-QuickSort ( A[ q : r ] )
1. n ← r − q + 1
2. if n ≤ 30 then
3.   sort A[ q : r ] using any sorting algorithm
4. else
5.   select a random element x from A[ q : r ]
6.   k ← Par-Partition ( A[ q : r ], x )
7.   spawn Par-Randomized-QuickSort ( A[ q : k − 1 ] )
8.   Par-Randomized-QuickSort ( A[ k + 1 : r ] )
9.   sync

Worst-case time complexity of quick-sort: O(n^2)
Average time complexity of sequential randomized quick-sort: O(n log n) (the expected recursion depth of lines 7-8 is roughly O(log n), and the partitioning in line 6 takes O(n))

Parallel Randomized Quick-Sort
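Assuming the standard analysis of this algorithm: with a parallel partition of Θ(log n) span (via prefix sums, sketched in the following slides), the recursion depth is O(log n) with high probability, so the span is T_∞ = O(log^2 n) w.h.p. while the expected work remains T_1 = O(n log n), giving parallelism Θ(n / log n).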

Parallel Partition: recursive divide-and-conquer
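One standard way to partition in parallel, assuming the usual prefix-sums formulation: build a flag array B where B[i] = 1 if A[i] ≤ x and 0 otherwise (a map), compute the prefix sums of B (so S[i] counts how many of the first i elements belong to the low side), and then, in a second parallel pass, scatter each element to its destination: A[i] goes to slot S[i] of the low side if B[i] = 1, and to slot i − S[i] of the high side otherwise. Every step is a map or a prefix sum over n elements.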

Parallel Partition Algorithm Analysis
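Under that formulation: the flag map and the final scatter each do Θ(n) work with Θ(log n) span (the depth of the recursive spawning), and the prefix-sums step does Θ(n) work with Θ(log n) span, so Par-Partition runs in Θ(n) work and Θ(log n) span overall.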

Prefix Sums
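Given x_1, …, x_n, the prefix sums are s_i = x_1 + … + x_i. For example, the prefix sums of [3, 1, 7, 0, 4, 1, 6, 3] are [3, 4, 11, 11, 15, 16, 22, 25]. Although the s_i look inherently sequential, a two-pass divide-and-conquer computes them with Θ(n) work and Θ(log n) span: an "up" pass computes the sum of every recursive half, and a "down" pass hands each subtree the sum of everything to its left.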

Performance analysis
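Putting the pieces together, under the standard analysis: Par-Randomized-QuickSort does T_1 = O(n log n) work in expectation and achieves span T_∞ = O(log^2 n) with high probability, so its parallelism is Θ(n / log n). For comparison, parallel merge sort with the parallel merge has T_1 = Θ(n log n) and T_∞ = Θ(log^3 n).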