CSE 332: Intro to Parallelism: Multithreading and Fork-Join


CSE 332: Intro to Parallelism: Multithreading and Fork-Join
Richard Anderson
Spring 2016

Announcements
Read the parallel computing notes by Dan Grossman, sections 2.1-3.4
Homework 5 – available Wednesday
Exams – not graded yet

Sequential Sum
Sum up N numbers in an array
Complexity?
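As a point of comparison, here is a minimal sequential sketch (the method name is illustrative, not from the slides); it does one addition per element, so it takes O(N) time:

    // Sequential sum: one pass over the array, O(N) additions
    int sequentialSum(int[] arr) {
      int ans = 0;
      for (int i = 0; i < arr.length; i++)
        ans += arr[i];
      return ans;
    }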

Parallel Sum
Sum up N numbers in an array with two processors
Divide in half, each processor does half the work -> 2x speedup

Parallel Sum
Sum up N numbers in an array with N processors?
Not Nx speedup. O(log n) is as good as you can do: combine partial sums pairwise in a binary tree, so the number of values left to add halves at each step. The next slide shows the approach.

Parallel Sum
Sum up N numbers in an array
Complexity? How many processors? Faster with infinite processors?
[Figure: the N elements combined pairwise by + operations in a binary tree, log N levels deep]

Changing a Major Assumption
So far, we have assumed: one thing happens at a time
Called sequential programming
Dominated until roughly 2005 – what changed?

A Simplified History
From roughly 1980-2005, desktop computers got exponentially faster at running sequential programs
  About twice as fast every couple of years
But nobody knows how to continue this
  Increasing clock rate generates too much heat
  Relative cost of memory access is too high
But we can keep making "wires exponentially smaller" (Moore's "Law"), so put multiple processors on the same chip ("multicore")
Writing parallel (multi-threaded) code is harder than sequential
  Especially in common languages like Java and C

Who Implements Parallelism
  User
  Application
  Operating System
  Programming Language, Compiler
  Algorithm
  Processor Hardware
We'll focus on the algorithm level

Parallelism vs. Concurrency
Parallelism: use extra resources to solve a problem faster
Concurrency: manage access to shared resources
[Figure: parallelism splits one unit of work across many resources; concurrency routes many requests to one shared resource]

An Analogy
A program is like a recipe for a cook
  Sequential: one cook who does one thing at a time
Parallelism: (Let's get the job done faster!)
  Have lots of potatoes to slice? Hire helpers, hand out potatoes and knives
  But too many chefs and you spend all your time coordinating
Concurrency: (We need to manage a shared resource)
  Lots of cooks making different things, but only 4 stove burners
  Want to allow access to all 4 burners, but not cause spills or incorrect burner settings

Shared Memory with Threads
Old story: a running program has
  One program counter (current statement executing)
  One call stack (with each stack frame holding local variables)
  Objects in the heap created by memory allocation (i.e., new)
    (nothing to do with the data structure called a heap)
  Static fields
New story: a set of threads, each with its own program counter & call stack
  No access to another thread's local variables
  Threads can share static fields / objects
    To communicate, write values to some shared location that another thread reads from
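A small sketch of the "new story" (class and field names are made up for illustration): one thread writes to a shared static field, and another thread reads it after waiting for the writer to finish.

    // Hypothetical sketch: threads share the static field, not each other's locals
    class Writer extends java.lang.Thread {
      static int message = 0;               // shared static field: visible to all threads
      public void run() {
        int local = 42;                     // local variable: invisible to other threads
        message = local;                    // communicate by writing to the shared location
      }
    }

    class SharedExample {
      public static void main(String[] args) throws InterruptedException {
        Writer w = new Writer();
        w.start();
        w.join();                           // wait for the writer before reading
        System.out.println(Writer.message); // prints 42
      }
    }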

Old Story: one call stack, one pc
Heap for all objects and static fields
Call stack with local variables
pc determines the current statement
Local variables are numbers/null or heap references
[Figure: a single call stack and program counter, with references from stack frames into the heap]

New Story: Shared Memory with Threads
Heap for all objects and static fields, shared by all threads
Threads, each with its own unshared call stack and "program counter"
[Figure: several call stacks, each with its own pc, all referencing the one shared heap]

Other Models
We will focus on shared memory, but you should know several other models exist and have their own advantages (see notes)
Message-passing: each thread has its own collection of objects. Communication is via explicitly sending/receiving messages
  Cooks working in separate kitchens, mail around ingredients
Dataflow: programmers write programs in terms of a DAG. A node executes after all of its predecessors in the graph
  Cooks wait to be handed the results of previous steps
Data parallelism: have primitives for things like "apply function to every element of an array in parallel"

Our Needs
To write a shared-memory parallel program, we need new primitives from a programming language or library
Ways to create and run multiple things at once
  Let's call these things threads
Ways for threads to share memory
  Often just have threads with references to the same objects
Ways for threads to coordinate (a.k.a. synchronize)
  For now, a way for one thread to wait for another to finish
  Other primitives when we study concurrency

Threads vs. Processors
What happens if you start 5 threads on a machine with only 4 processors?
The OS schedules threads for execution; they compete with other processes (other applications, the OS, etc.).

Threads vs. Processors
For the sum operation: with 3 processors available, using 4 threads would take 50% more time than 3 threads
Example with 12 numbers and 3 processors: 3 threads take 4 units of time (4 additions each, all in parallel); 4 threads take 2*3 = 6 units of time (3 additions each, but one processor must run two of the threads one after the other)

Fork-Join Parallelism
Define thread
  Java: define a subclass of java.lang.Thread, override run
Fork: instantiate a thread and start executing
  Java: create the thread object, call start()
Join: wait for a thread to terminate
  Java: call the join() method, which returns when the thread finishes
The above uses the basic thread library built into Java
Later we'll introduce a better ForkJoin Java library designed for parallel programming
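A minimal sketch of define/fork/join with the basic thread library (the class name and printed message are made up):

    // define: subclass Thread and override run
    class HelloThread extends java.lang.Thread {
      public void run() {                        // what the thread does
        System.out.println("hello from a helper thread");
      }
    }

    class ForkJoinDemo {
      public static void main(String[] args) throws InterruptedException {
        HelloThread t = new HelloThread();
        t.start();                               // fork: begin running in parallel
        t.join();                                // join: wait for the thread to finish
        System.out.println("helper is done");
      }
    }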

Sum with Threads
For starters: have 4 threads simultaneously sum ¼ of the array
[Figure: the array split into 4 pieces producing ans0, ans1, ans2, ans3, which are added to give ans]
Create 4 thread objects, each given ¼ of the array
Call start() on each thread object to run it in parallel
Wait for the threads to finish using join()
Add together their 4 answers for the final result

Part 1: define thread class

    class SumThread extends java.lang.Thread {
      int lo;       // fields, passed to constructor
      int hi;       // so threads know what to do
      int[] arr;
      int ans = 0;  // result

      SumThread(int[] a, int l, int h) { lo=l; hi=h; arr=a; }

      public void run() {  // override; must have this type
        for(int i=lo; i < hi; i++)
          ans += arr[i];
      }
    }

Because we must override a no-arguments/no-result run, we use fields to communicate across threads

Part 2: sum routine

    int sum(int[] arr){  // can be a static method
      int len = arr.length;
      int ans = 0;
      SumThread[] ts = new SumThread[4];
      for(int i=0; i < 4; i++){  // do parallel computations
        ts[i] = new SumThread(arr, i*len/4, (i+1)*len/4);
        ts[i].start();
      }
      for(int i=0; i < 4; i++) {  // combine results
        ts[i].join();             // wait for helper to finish!
        ans += ts[i].ans;
      }
      return ans;
    }

(As written, the calls to join() would also need to handle InterruptedException; that detail is omitted on the slides.)

Parameterizing by number of threads

    int sum(int[] arr, int numTs){
      int ans = 0;
      SumThread[] ts = new SumThread[numTs];
      for(int i=0; i < numTs; i++){
        ts[i] = new SumThread(arr, (i*arr.length)/numTs,
                              ((i+1)*arr.length)/numTs);
        ts[i].start();
      }
      for(int i=0; i < numTs; i++) {
        ts[i].join();
        ans += ts[i].ans;
      }
      return ans;
    }
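A hypothetical call site (the data and thread count are made up, and it assumes the sum routine above is a static method in scope with join's InterruptedException handled as noted earlier):

    // Illustrative usage of the parameterized sum
    int[] data = new int[1000000];
    java.util.Arrays.fill(data, 1);   // example data: all ones
    int total = sum(data, 4);         // 4 helper threads, each sums 1/4 of the array
    System.out.println(total);        // prints 1000000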

Recall: Parallel Sum
Sum up N numbers in an array
Let's implement this with threads...
[Figure: the N elements combined pairwise by + operations in a binary tree]

Code looks something like this (using Java Threads)

    class SumThread extends java.lang.Thread {
      int lo; int hi; int[] arr;  // fields to know what to do
      int ans = 0;                // result

      SumThread(int[] a, int l, int h) { … }

      public void run(){  // override
        if(hi - lo < SEQUENTIAL_CUTOFF)
          for(int i=lo; i < hi; i++)
            ans += arr[i];
        else {
          SumThread left = new SumThread(arr, lo, (hi+lo)/2);
          SumThread right= new SumThread(arr, (hi+lo)/2, hi);
          left.start();
          right.start();
          left.join();   // don't move this up a line – why?
          right.join();
          ans = left.ans + right.ans;
        }
      }
    }

    int sum(int[] arr){  // just make one thread!
      SumThread t = new SumThread(arr, 0, arr.length);
      t.run();
      return t.ans;
    }

The key is to do the result-combining in parallel as well
And using recursive divide-and-conquer makes this natural
  Easier to write and more efficient asymptotically!

Recursive problem decomposition
(each non-leaf thread adds the results from its two helper threads)

    Thread: sum range [0,10)
      Thread: sum range [0,5)
        Thread: sum range [0,2)
          Thread: sum range [0,1)  (return arr[0])
          Thread: sum range [1,2)  (return arr[1])
        Thread: sum range [2,5)
          Thread: sum range [2,3)  (return arr[2])
          Thread: sum range [3,5)
            Thread: sum range [3,4)  (return arr[3])
            Thread: sum range [4,5)  (return arr[4])
      Thread: sum range [5,10)
        Thread: sum range [5,7)
          Thread: sum range [5,6)  (return arr[5])
          Thread: sum range [6,7)  (return arr[6])
        Thread: sum range [7,10)
          Thread: sum range [7,8)  (return arr[7])
          Thread: sum range [8,10)
            Thread: sum range [8,9)  (return arr[8])
            Thread: sum range [9,10) (return arr[9])

Divide-and-conquer
The same approach is useful for many problems beyond sum
If you have enough processors, total time is O(log n)
  Next lecture: study the reality of P << n processors
We will write all our parallel algorithms in this style
  But using a special fork-join library engineered for this style
    Takes care of scheduling the computation well
  Often relies on operations being associative (like +)
[Figure: binary tree of + operations]
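To illustrate that only the combining operation needs to change, here is a sketch (made-up class name and cutoff value) of the same divide-and-conquer skeleton computing a maximum instead of a sum; max is associative, so the pattern still applies:

    // Hypothetical MaxThread: same shape as SumThread, but combines with Math.max
    class MaxThread extends java.lang.Thread {
      int lo, hi; int[] arr;
      int ans = Integer.MIN_VALUE;              // identity for max, as 0 is for +
      MaxThread(int[] a, int l, int h) { arr = a; lo = l; hi = h; }
      public void run() {
        if (hi - lo < 1000) {                   // assumed sequential cutoff
          for (int i = lo; i < hi; i++)
            ans = Math.max(ans, arr[i]);
        } else {
          MaxThread left  = new MaxThread(arr, lo, (lo + hi) / 2);
          MaxThread right = new MaxThread(arr, (lo + hi) / 2, hi);
          left.start();
          right.start();
          try { left.join(); right.join(); }
          catch (InterruptedException e) { throw new RuntimeException(e); }
          ans = Math.max(left.ans, right.ans);  // the associative combining step
        }
      }
    }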

Thread Overhead
Creating and managing threads incurs cost
Two optimizations:
  Use a sequential cutoff, typically around 500-1000
    Eliminates lots of tiny threads
  Do not create two recursive threads; create one thread and do the other piece of work "yourself"
    Cuts the number of threads created by another 2x

Half the threads!
The order of the last 4 lines is critical – why?

    // wasteful: don't
    SumThread left = …
    SumThread right = …
    left.start();
    right.start();
    left.join();
    right.join();
    ans = left.ans + right.ans;

    // better: do!!
    SumThread left = …
    SumThread right = …
    left.start();
    right.run();
    left.join();  // no right.join needed
    ans = left.ans + right.ans;

Note: run is a normal function call! Execution won't continue until we are done with run

Better Java Thread Library
Even with all this care, Java's threads are too "heavyweight"
  Constant factors, especially space overhead
  Creating 20,000 Java threads is just a bad idea
The ForkJoin Framework is designed to meet the needs of divide-and-conquer fork-join parallelism
  In the Java 7 standard libraries
    (Also available for Java 6 as a downloaded .jar file)
  Section will focus on pragmatics/logistics
Similar libraries are available for other languages
  C/C++: Cilk (inventors), Intel's Thread Building Blocks
  C#: Task Parallel Library
  …

Different terms, same basic idea
To use the ForkJoin Framework:
A little standard set-up code (e.g., create a ForkJoinPool)

    Don't subclass Thread                Do subclass RecursiveTask<V>
    Don't override run                   Do override compute
    Don't use an ans field               Do return a V from compute
    Don't call start                     Do call fork
    Don't just call join                 Do call join (which returns the answer)
    Don't call run to hand-optimize      Do call compute to hand-optimize
    Don't have a topmost call to run     Do create a pool and call invoke

See the web page (linked in the project 3 description): "A Beginner's Introduction to the ForkJoin Framework"

Fork Join Framework Version (imports shown)

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    class SumArray extends RecursiveTask<Integer> {
      int lo; int hi; int[] arr;  // fields to know what to do
      SumArray(int[] a, int l, int h) { … }
      protected Integer compute(){  // return answer
        if(hi - lo < SEQUENTIAL_CUTOFF) {
          int ans = 0;  // local var, not a field
          for(int i=lo; i < hi; i++)
            ans += arr[i];
          return ans;
        } else {
          SumArray left = new SumArray(arr, lo, (hi+lo)/2);
          SumArray right= new SumArray(arr, (hi+lo)/2, hi);
          left.fork();                    // forks a task and calls compute
          int rightAns = right.compute(); // call compute directly
          int leftAns = left.join();      // get result from left
          return leftAns + rightAns;
        }
      }
    }

    static final ForkJoinPool fjPool = new ForkJoinPool();
    int sum(int[] arr){
      return fjPool.invoke(new SumArray(arr, 0, arr.length));
      // invoke returns the value compute returns
    }
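A hypothetical end-to-end driver (the wrapper class, data, and fill value are made up; it assumes the SumArray constructor above simply stores its arguments and that SEQUENTIAL_CUTOFF is defined):

    // Illustrative driver for the ForkJoin version
    class SumDemo {
      static final java.util.concurrent.ForkJoinPool fjPool =
          new java.util.concurrent.ForkJoinPool();

      static int sum(int[] arr) {
        return fjPool.invoke(new SumArray(arr, 0, arr.length));  // blocks until compute finishes
      }

      public static void main(String[] args) {
        int[] data = new int[100000];
        java.util.Arrays.fill(data, 1);   // example data: all ones
        System.out.println(sum(data));    // prints 100000
      }
    }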