CSE 332: Analysis of Fork-Join Parallel Programs

CSE 332: Analysis of Fork-Join Parallel Programs
Richard Anderson, Spring 2016

New Story: Shared Memory with Threads
- The heap, holding all objects and static fields, is shared by all threads
- Each thread has its own unshared call stack and "program counter" (pc)
[Diagram: several threads, each with its own stack and pc, all pointing into one shared heap]

Fork-Join Parallelism
- Define a thread. In Java: define a subclass of java.lang.Thread and override run.
- Fork: instantiate a thread and start it executing. In Java: create the thread object and call start().
- Join: wait for a thread to terminate. In Java: call the join() method, which returns when the thread finishes.
The above uses the basic thread library built into Java. Later we'll introduce the ForkJoin framework, a better Java library designed for parallel programming.
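
A minimal sketch of these three steps with plain java.lang.Thread; the GreetThread class and its printed message are invented for this example, not part of the course code:

    // Sketch of define / fork / join using java.lang.Thread.
    class GreetThread extends java.lang.Thread {
      int id;                          // field set before the thread is forked
      GreetThread(int id) { this.id = id; }
      public void run() {              // override: the thread's work
        System.out.println("hello from thread " + id);
      }
    }

    class ThreadDemo {
      public static void main(String[] args) throws InterruptedException {
        GreetThread t = new GreetThread(1);  // define
        t.start();                           // fork: run() executes in a new thread
        t.join();                            // join: wait for the thread to finish
      }
    }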

Sum with Threads
For starters: have 4 threads simultaneously sum ¼ of the array.
[Diagram: the array split into four chunks producing ans0, ans1, ans2, ans3, which are added into ans]
- Create 4 thread objects, each given ¼ of the array
- Call start() on each thread object to run it in parallel
- Wait for the threads to finish using join()
- Add together their 4 answers for the final result

Part 1: define thread class

    class SumThread extends java.lang.Thread {
      int lo;        // fields, passed to constructor
      int hi;        // so threads know what to do
      int[] arr;
      int ans = 0;   // result
      SumThread(int[] a, int l, int h) { lo=l; hi=h; arr=a; }
      public void run() { // override; must have this signature
        for(int i=lo; i < hi; i++)
          ans += arr[i];
      }
    }

Because we must override a no-arguments/no-result run, we use fields to communicate across threads.

Part 2: sum routine

    int sum(int[] arr) { // can be a static method
      int len = arr.length;
      int ans = 0;
      SumThread[] ts = new SumThread[4];
      for(int i=0; i < 4; i++) {   // do parallel computations
        ts[i] = new SumThread(arr, i*len/4, (i+1)*len/4);
        ts[i].start();
      }
      for(int i=0; i < 4; i++) {   // combine results
        ts[i].join();              // wait for helper to finish!
        ans += ts[i].ans;
      }
      return ans;
    }

Recall: Parallel Sum
Sum up N numbers in an array. Let's implement this with threads...
[Diagram: a balanced binary tree of + operations combining the N elements]

Code looks something like this (using Java Threads):

    class SumThread extends java.lang.Thread {
      int lo; int hi; int[] arr;  // fields to know what to do
      int ans = 0;                // result
      SumThread(int[] a, int l, int h) { … }
      public void run() { // override
        if(hi - lo < SEQUENTIAL_CUTOFF)
          for(int i=lo; i < hi; i++)
            ans += arr[i];
        else {
          SumThread left = new SumThread(arr,lo,(hi+lo)/2);
          SumThread right= new SumThread(arr,(hi+lo)/2,hi);
          left.start();
          right.start();
          left.join();  // don't move this up a line (why?)
          right.join();
          ans = left.ans + right.ans;
        }
      }
    }

    int sum(int[] arr) { // just make one thread!
      SumThread t = new SumThread(arr,0,arr.length);
      t.run();
      return t.ans;
    }

The key is to do the result-combining in parallel as well. Using recursive divide-and-conquer makes this natural: easier to write and more efficient asymptotically!

Recursive problem decomposition

    Thread: sum range [0,10)
      Thread: sum range [0,5)
        Thread: sum range [0,2)
          Thread: sum range [0,1)   (return arr[0])
          Thread: sum range [1,2)   (return arr[1])
        Thread: sum range [2,5)
          Thread: sum range [2,3)   (return arr[2])
          Thread: sum range [3,5)
            Thread: sum range [3,4) (return arr[3])
            Thread: sum range [4,5) (return arr[4])
      Thread: sum range [5,10)
        Thread: sum range [5,7)
          Thread: sum range [5,6)   (return arr[5])
          Thread: sum range [6,7)   (return arr[6])
        Thread: sum range [7,10)
          Thread: sum range [7,8)   (return arr[7])
          Thread: sum range [8,10)
            Thread: sum range [8,9)  (return arr[8])
            Thread: sum range [9,10) (return arr[9])

(Each non-leaf thread adds the results from its two helper threads.)

Divide-and-conquer
- The same approach is useful for many problems beyond sum
- If you have enough processors, total time is O(log n)
- Next lecture: study the reality of P << n processors
- We will write all our parallel algorithms in this style, but using a special fork-join library engineered for this style that takes care of scheduling the computation well
- Often relies on operations being associative (like +)
[Diagram: tree of + operations]

Thread Overhead
Creating and managing threads incurs cost. Two optimizations:
- Use a sequential cutoff, typically around 500-1000. Eliminates lots of tiny threads.
- Do not create two recursive threads; create one thread and do the other piece of work "yourself". Cuts the number of threads created by another 2x.

Half the threads! The order of the last 4 lines is critical; why?

    // wasteful: don't
    SumThread left = …
    SumThread right = …
    left.start();
    right.start();
    left.join();
    right.join();
    ans = left.ans + right.ans;

    // better: do!!
    SumThread left = …
    SumThread right = …
    left.start();
    right.run();
    left.join(); // no right.join needed
    ans = left.ans + right.ans;

Note: run is a normal function call! Execution won't continue until we are done with run.
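
Putting both optimizations together, the recursive case of run() might look roughly like this; this is a sketch reusing the SumThread fields and SEQUENTIAL_CUTOFF constant from the earlier slides, not the course's official code:

    // Sketch: sequential cutoff plus "half the threads" in one run() method.
    public void run() {
      if(hi - lo < SEQUENTIAL_CUTOFF) {            // small piece: just loop
        for(int i=lo; i < hi; i++)
          ans += arr[i];
      } else {
        SumThread left  = new SumThread(arr, lo, (hi+lo)/2);
        SumThread right = new SumThread(arr, (hi+lo)/2, hi);
        left.start();   // fork a new thread for the left half
        right.run();    // do the right half ourselves (ordinary call)
        left.join();    // wait for the forked thread
        ans = left.ans + right.ans;
      }
    }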

Better Java Thread Library
Even with all this care, Java's threads are too "heavyweight": constant factors, especially space overhead. Creating 20,000 Java threads is just a bad idea.
The ForkJoin Framework is designed to meet the needs of divide-and-conquer fork-join parallelism:
- In the Java 7 standard libraries (also available for Java 6 as a downloaded .jar file)
- Section will focus on pragmatics/logistics
- Similar libraries are available for other languages: C/C++: Cilk (the inventors) and Intel's Threading Building Blocks; C#: Task Parallel Library; …

Different terms, same basic idea
To use the ForkJoin Framework: a little standard set-up code (e.g., create a ForkJoinPool), then:
- Don't subclass Thread; do subclass RecursiveTask<V>
- Don't override run; do override compute
- Don't use an ans field; do return a V from compute
- Don't call start; do call fork
- Don't just call join; do call join (which returns the answer)
- Don't call run to hand-optimize; do call compute to hand-optimize
- Don't have a topmost call to run; do create a pool and call invoke
See the web page (linked in the project 3 description): "A Beginner's Introduction to the ForkJoin Framework"

ForkJoin Framework version:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    class SumArray extends RecursiveTask<Integer> {
      int lo; int hi; int[] arr; // fields to know what to do
      SumArray(int[] a, int l, int h) { … }
      protected Integer compute() { // return answer
        if(hi - lo < SEQUENTIAL_CUTOFF) {
          int ans = 0; // local var, not a field
          for(int i=lo; i < hi; i++)
            ans += arr[i];
          return ans;
        } else {
          SumArray left = new SumArray(arr,lo,(hi+lo)/2);
          SumArray right= new SumArray(arr,(hi+lo)/2,hi);
          left.fork();                    // fork a task; its compute runs in parallel
          int rightAns = right.compute(); // call compute directly on the right half
          int leftAns  = left.join();     // get the result from the left half
          return leftAns + rightAns;
        }
      }
    }

    static final ForkJoinPool fjPool = new ForkJoinPool();

    int sum(int[] arr) {
      return fjPool.invoke(new SumArray(arr,0,arr.length));
      // invoke returns the value compute returns
    }

Parallel Sum
Sum up N numbers in an array.
[Diagram: a balanced binary tree of + operations combining the N elements]

Parallel Max?
Change all +'s to "max" operations.
[Diagram: the same combining tree, with each + replaced by max]

Reductions
The same trick works for many tasks, e.g.:
- is there an element satisfying some property (e.g., prime)?
- left-most element satisfying some property (e.g., first prime)
- smallest rectangle encompassing a set of points (project 3)
- counts: number of strings that start with a vowel
- are these elements in sorted order?
Called a reduction, or reduce operation: reduce a collection of data items to a single item. The result can be more than a single value, e.g., produce a histogram from a set of test scores. A very common parallel programming pattern.
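
For concreteness, a rough sketch of a max reduction in the same style as SumArray; the class name MaxArray and the cutoff value are invented for this example:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Sketch: reduce an array range to its maximum element (assumes a non-empty range).
    class MaxArray extends RecursiveTask<Integer> {
      static final int SEQUENTIAL_CUTOFF = 1000; // illustrative value
      int lo, hi; int[] arr;
      MaxArray(int[] a, int l, int h) { arr = a; lo = l; hi = h; }
      protected Integer compute() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
          int max = arr[lo];
          for (int i = lo + 1; i < hi; i++)
            max = Math.max(max, arr[i]);
          return max;
        } else {
          MaxArray left  = new MaxArray(arr, lo, (hi + lo) / 2);
          MaxArray right = new MaxArray(arr, (hi + lo) / 2, hi);
          left.fork();                    // left half in parallel
          int rightMax = right.compute(); // right half ourselves
          return Math.max(left.join(), rightMax);
        }
      }
    }

The only change from the sum is the combining operation: Math.max instead of +. A top-level call would look like new ForkJoinPool().invoke(new MaxArray(arr, 0, arr.length)).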

Parallel Vector Scaling
Multiply every element in the array by 2.
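
One plausible way to write this, following the same pattern as the VecAdd example on the later slide; the ScaleBy2 class and its cutoff are invented for illustration:

    import java.util.concurrent.RecursiveAction;

    // Sketch: multiply every element of arr by 2, in place.
    class ScaleBy2 extends RecursiveAction {
      static final int SEQUENTIAL_CUTOFF = 1000; // illustrative value
      int lo, hi; int[] arr;
      ScaleBy2(int[] a, int l, int h) { arr = a; lo = l; hi = h; }
      protected void compute() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
          for (int i = lo; i < hi; i++)
            arr[i] *= 2;
        } else {
          int mid = (hi + lo) / 2;
          ScaleBy2 left  = new ScaleBy2(arr, lo, mid);
          ScaleBy2 right = new ScaleBy2(arr, mid, hi);
          left.fork();      // left half in parallel
          right.compute();  // right half ourselves
          left.join();      // wait for the left half
        }
      }
    }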

Maps
Another common parallel programming pattern: a map operates on each element of a collection of data to produce a new collection of the same size. Each element is processed independently of the others, e.g.:
- vector scaling
- vector addition
- test a property of each element (is it prime?)
- uppercase to lowercase
- ...

Maps in ForkJoin Framework

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    class VecAdd extends RecursiveAction {
      int lo; int hi; int[] res; int[] arr1; int[] arr2;
      VecAdd(int l,int h,int[] r,int[] a1,int[] a2){ … }
      protected void compute() {
        if(hi - lo < SEQUENTIAL_CUTOFF) {
          for(int i=lo; i < hi; i++)
            res[i] = arr1[i] + arr2[i];
        } else {
          int mid = (hi+lo)/2;
          VecAdd left = new VecAdd(lo,mid,res,arr1,arr2);
          VecAdd right= new VecAdd(mid,hi,res,arr1,arr2);
          left.fork();
          right.compute();
          left.join();
        }
      }
    }

    static final ForkJoinPool fjPool = new ForkJoinPool();

    int[] add(int[] arr1, int[] arr2) {
      assert (arr1.length == arr2.length);
      int[] ans = new int[arr1.length];
      fjPool.invoke(new VecAdd(0,arr1.length,ans,arr1,arr2));
      return ans;
    }

Even though there is no result-combining, creating many small tasks still helps with load balancing; maybe not for vector-add, but certainly for more compute-intensive maps. The forking is O(log n), whereas other approaches to vector-add are theoretically O(1).

Maps and Reductions
Maps and reductions are the "workhorses" of parallel programming: by far the most important and common patterns. Learn to recognize when an algorithm can be written in terms of maps and reductions; it makes parallel programming easy (plug and play).

Distributed MapReduce
You may have heard of Google's MapReduce, or the open-source version called Hadoop; it powers much of Google's infrastructure.
- Idea: maps/reductions using many machines
- Same principles, applied to distributed computing
- The system takes care of distributing data and fault tolerance
- You just write code to handle one element and to reduce a collection
Co-developed by Jeff Dean (UW alum!)

Maps and Reductions on Trees
Max value in a min-heap:
- How to parallelize?
- Is this a map or a reduce?
- Complexity?
[Diagram: a min-heap containing the values 10, 20, 15, 40, 60, 85, 99, 50, 700, 65]

Analyzing Parallel Programs
Let TP be the running time on P processors. Two key measures of run-time:
- Work: how long it would take 1 processor = T1
- Span: how long it would take infinitely many processors = T∞
  - The hypothetical ideal for parallelization
  - This is the longest "dependence chain" in the computation
  - Example: O(log n) for summing an array
  - Also called "critical path length" or "computational depth"

The DAG
Fork-join programs can be modeled with a DAG:
- nodes: pieces of work
- edges: order dependencies
A fork creates two children: the new thread and the continuation of the current thread. A join makes a node with two incoming edges: one from the terminated thread and one from the continuation of the current thread.
What's T1 (work)? The total work across all nodes. What's T∞ (span)? The longest path of dependencies through the DAG.

Divide and Conquer Algorithms
Our fork and join frequently look like this: [diagram: divide at the top, base cases at the leaves, combine results on the way back up]
In this context, the span (T∞) is the longest dependence chain: the longest 'branch' in the parallel 'tree'. Example: O(log n) for summing an array; we halve the data down to our cutoff, then add back together: O(log n) steps, O(1) time for each. Also called "critical path length" or "computational depth".

Parallel Speed-up
Speed-up on P processors: T1 / TP
- If the speed-up is P, we call it perfect linear speed-up (e.g., doubling P halves the running time); hard to achieve in practice
- Parallelism is the maximum possible speed-up: T1 / T∞, i.e., the speed-up if you had infinitely many processors
Example: if T1 = 100s, T4 = 25s, and T∞ = 5s, then the speed-up on 4 processors is 100/25 = 4 (perfect linear), and the parallelism is 100/5 = 20.

Estimating TP
How to estimate TP (e.g., P = 4)? Lower bounds on TP (ignoring memory, caching, ...):
- TP ≥ T∞
- TP ≥ T1 / P
Which one is the tighter (higher) lower bound? T1 / P is higher for small P; T∞ is higher for large P.
The ForkJoin Java Framework achieves the following expected-time asymptotic bound: TP ∈ O(T∞ + T1 / P), and this bound is optimal. For example, with T1 = 100s and T∞ = 5s, the estimate for P = 4 is roughly 5 + 100/4 = 30s.

Amdahl's Law
Most programs have:
- parts that parallelize well
- parts that don't parallelize at all
The latter become bottlenecks. Examples?

Amdahl's Law
Let T1 = 1 unit of time, and let S = the proportion that can't be parallelized, so 1 = T1 = S + (1 - S).
Suppose we get perfect linear speedup on the parallel portion: TP = S + (1 - S)/P.
So the overall speed-up on P processors is (Amdahl's Law): T1 / TP = 1 / (S + (1 - S)/P), and the maximum possible speed-up is T1 / T∞ = 1 / S.
If 1/3 of your program is parallelizable (S = 2/3), the max speedup is 1 / (2/3) = 1.5.
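
To make the formula concrete, a small throwaway calculation; the class, method name, and sample values are invented for illustration:

    // Sketch: evaluate Amdahl's Law for sequential fraction s and P processors.
    class Amdahl {
      static double speedup(double s, double p) {
        return 1.0 / (s + (1.0 - s) / p);      // T1 / TP with T1 = 1
      }
      public static void main(String[] args) {
        double s = 2.0 / 3.0;                  // only 1/3 of the program parallelizes
        System.out.println(speedup(s, 2));     // 1.2
        System.out.println(speedup(s, 1e9));   // approaches 1/s = 1.5
      }
    }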

Pretty Bad News
Suppose 25% of your program is sequential. Then a billion processors won't give you more than a 4x speedup!
What portion of your program must be parallelizable to get a 10x speedup on a 1000-core GPU? Solve 10 <= 1 / (S + (1 - S)/1000): S works out to about 0.1, so roughly 90% of the program must be parallelizable.
This motivates minimizing the sequential portions of your programs.

Takeaways
- Parallel algorithms can be a big win
- Many fit standard patterns that are easy to implement
- Can't just rely on more processors to make things faster (Amdahl's Law)