CS4230 Parallel Programming, Lecture 12: More Task Parallelism (Mary Hall, October 4, 2012)

Homework 3: Due Before Class, Thurs. Oct. 18
handin cs4230 hw3 <file>

Problem 1 (Amdahl's Law):
(i) Assuming a 10 percent overhead for a parallel computation, compute the speedup of applying 100 processors to it, assuming that the overhead remains constant as processors are added.
(ii) Given this speedup, what is the efficiency?
(iii) Using this efficiency, suppose each processor is capable of 10 Gflops peak performance. What is the best performance we can expect for this computation in Gflops?

Problem 2 (Data Dependences): Does the following code have a loop-carried dependence? If so, identify the dependence or dependences and which loop carries it/them.

  for (i=1; i < n-1; i++)
    for (j=1; j < n-1; j++)
      A[i][j] = A[i][j-1] + A[i-1][j+1];

Homework 3, cont.
Problem 3 (Locality): Sketch out how to rewrite the following code to improve its cache locality and locality in registers. Assume row-major access order.

  for (i=1; i<n; i++)
    for (j=1; j<n; j++)
      a[j][i] = a[j-1][i-1] + c[j];

Briefly explain how your modifications have improved the memory access behavior of the computation.

Homework 3, cont.
Problem 4 (Task Parallelism): Construct a producer-consumer pipelined code in OpenMP to identify the set of prime numbers in the sequence of integers from 1 to n. A common sequential solution to this problem is the sieve of Eratosthenes. In this method, a series of all integers is generated starting from 2. The first number, 2, is prime and kept. All multiples of 2 are deleted because they cannot be prime. This process is repeated with each remaining number, up until but not beyond sqrt(n). A possible sequential implementation of this solution is as follows:

  for (i=2; i<=n; i++)
    prime[i] = true;   // initialize
  for (i=2; i<=sqrt(n); i++) {
    if (prime[i]) {
      for (j=i+i; j<=n; j=j+i)
        prime[j] = false;
    }
  }

The parallel code can operate on different values of i. First, a series of consecutive numbers is generated that feeds into the first pipeline stage. This stage eliminates all multiples of 2 and passes the remaining numbers on to the second stage, which eliminates all multiples of 3, etc. The parallel code terminates when the "terminator" element arrives at each pipeline stage.

General: Task Parallelism
Recall definition: A task parallel computation is one in which parallelism is applied by performing distinct computations, or tasks, at the same time. Since the number of tasks is fixed, the parallelism is not scalable.
OpenMP support for task parallelism:
- Parallel sections: different threads execute different code
- Tasks (NEW): tasks are created and executed at separate times
Common use of task parallelism = producer/consumer:
- A producer creates work that is then processed by the consumer
- You can think of this as a form of pipelining, like in an assembly line
- The "producer" writes data to a FIFO queue (or similar) to be accessed by the "consumer"

Simple: OpenMP sections directive

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { a=...; b=...; }
      #pragma omp section
      { c=...; d=...; }
      #pragma omp section
      { e=...; f=...; }
      #pragma omp section
      { g=...; h=...; }
    } /*omp end sections*/
  } /*omp end parallel*/
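The snippet above is schematic (a=...; stands in for arbitrary work). As a minimal compilable sketch of the same pattern, with invented work in each section purely for illustration, one might write:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double a[4], b[4], c[4], d[4];

      #pragma omp parallel
      {
          /* Each section is executed exactly once, by whichever thread picks it up. */
          #pragma omp sections
          {
              #pragma omp section
              { for (int i = 0; i < 4; i++) a[i] = 1.0 * i; }
              #pragma omp section
              { for (int i = 0; i < 4; i++) b[i] = 2.0 * i; }
              #pragma omp section
              { for (int i = 0; i < 4; i++) c[i] = 3.0 * i; }
              #pragma omp section
              { for (int i = 0; i < 4; i++) d[i] = 4.0 * i; }
          } /* implicit barrier at the end of sections */
      }

      printf("a[3]=%g b[3]=%g c[3]=%g d[3]=%g\n", a[3], b[3], c[3], d[3]);
      return 0;
  }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.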

Parallel Sections, Example

  #pragma omp parallel shared(n,a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i<n; i++)
        d[i] = 1.0/c[i];
      #pragma omp section
      for (i=0; i<n-1; i++)
        b[i] = (a[i] + a[i+1])/2;
    } /*-- End of sections --*/
  } /*-- End of parallel region --*/

Simple Producer-Consumer Example

  // PRODUCER: initialize A with random data
  void fill_rand(int nval, double *A) {
    int i;
    for (i=0; i<nval; i++)
      A[i] = (double) rand()/1111111111;
  }

  // CONSUMER: sum the data in A
  double Sum_array(int nval, double *A) {
    int i;
    double sum = 0.0;
    for (i=0; i<nval; i++)
      sum = sum + A[i];
    return sum;
  }

Key Issues in Producer-Consumer Parallelism
- Producer needs to tell consumer that the data is ready
- Consumer needs to wait until data is ready
- Producer and consumer need a way to communicate data: output of producer is input to consumer
- Producer and consumer often communicate through a first-in-first-out (FIFO) queue

One Solution to Read/Write a FIFO
- The FIFO is in global memory and is shared between the parallel threads
- How do you make sure the data is updated? Need a construct to guarantee a consistent view of memory
- Flush: make sure data is written all the way back to global memory
Example:

  double A;
  A = compute();
  #pragma omp flush(A)

Solution to Producer/Consumer

  flag = 0;
  #pragma omp parallel sections
  {
    #pragma omp section      /* producer */
    {
      fill_rand(N, A);
      #pragma omp flush
      flag = 1;
      #pragma omp flush(flag)
    }
    #pragma omp section      /* consumer */
    {
      #pragma omp flush(flag)
      while (!flag) {
        #pragma omp flush(flag)
      }
      #pragma omp flush
      sum = Sum_array(N, A);
    }
  }
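The transcript's version of this slide is fragmentary, so here is a minimal self-contained sketch under stated assumptions: the array size N, the main() wrapper, and the final printf are invented for illustration, while the producer and consumer routines are the ones from the earlier slide.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define N 10000   /* hypothetical array size */

  /* PRODUCER: initialize A with random data (from the earlier slide) */
  void fill_rand(int nval, double *A) {
      int i;
      for (i = 0; i < nval; i++)
          A[i] = (double) rand() / 1111111111;
  }

  /* CONSUMER: sum the data in A (from the earlier slide) */
  double Sum_array(int nval, double *A) {
      int i;
      double sum = 0.0;
      for (i = 0; i < nval; i++)
          sum = sum + A[i];
      return sum;
  }

  int main(void) {
      double *A = malloc(N * sizeof(double));
      double sum = 0.0;
      int flag = 0;

      #pragma omp parallel sections
      {
          #pragma omp section                /* producer */
          {
              fill_rand(N, A);
              #pragma omp flush              /* push A out to memory          */
              flag = 1;
              #pragma omp flush(flag)        /* publish the "data ready" flag */
          }
          #pragma omp section                /* consumer */
          {
              #pragma omp flush(flag)
              while (!flag) {                /* spin until the producer signals */
                  #pragma omp flush(flag)
              }
              #pragma omp flush              /* get a consistent view of A */
              sum = Sum_array(N, A);
          }
      }

      printf("sum = %f\n", sum);
      free(A);
      return 0;
  }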

Reminder: What is Flush?
- Flush: specifies that all threads have the same view of memory for all shared objects
- Flush(var): only flushes the shared variable "var"

Another (trivial) producer-consumer example

  for (j=0; j<M; j++) {
    sum[j] = 0;
    for (i=0; i<size; i++) {
      // TASK 1: scale result
      out[i] = _iplist[j][i]*(2+i*j);
      // TASK 2: compute sum
      sum[j] += out[i];
    }
    // TASK 3: compute average and compare with max
    res = sum[j] / size;
    if (res > maxavg)
      maxavg = res;
  }
  return maxavg;

Another Example from Textbook
- Implement message-passing on a shared-memory system for producer-consumer
- A FIFO queue holds messages
- A thread has explicit functions to Send and Receive:
  - Send a message by enqueuing on a queue in shared memory
  - Receive a message by grabbing from the queue
  - Ensure safe access
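The transcript does not reproduce the textbook's data structure, so the following is only one plausible sketch, assuming a fixed-capacity circular buffer per destination thread; all names here (msg_queue, buf, enq_count, deq_count, CAPACITY, queue_init) are invented for illustration, not taken from the textbook. The sketches after the next few slides build on this type.

  #include <omp.h>

  #define CAPACITY 1024               /* hypothetical fixed queue size */

  /* One message queue per destination thread, held in shared memory.
   * Overflow handling is omitted for brevity. */
  struct msg_queue {
      int        buf[CAPACITY];       /* the messages themselves          */
      int        enq_count;           /* total messages enqueued so far   */
      int        deq_count;           /* total messages dequeued so far   */
      omp_lock_t lock;                /* protects the tail during Enqueue */
  };

  void queue_init(struct msg_queue *q) {
      q->enq_count = 0;
      q->deq_count = 0;
      omp_init_lock(&q->lock);
  }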

Message-Passing
(The slide's figure/code is not reproduced in the transcript.)
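Since the slide's body is missing, here is a rough, hypothetical sketch of what the per-thread driver for such a message-passing scheme might look like, continuing the msg_queue sketch above and assuming the Send_msg, Try_receive, and Done helpers sketched after the next three slides; none of this is taken verbatim from the textbook.

  /* Shared counter of threads that have finished sending
   * (see the Termination Detection slide). */
  int done_sending = 0;

  /* Per-thread driver: intended to be called from inside a parallel region,
   * e.g. with my_rank = omp_get_thread_num(). */
  void thread_work(struct msg_queue *queues, int my_rank, int thread_count,
                   int send_max) {
      for (int sent = 0; sent < send_max; sent++) {
          int dest = (my_rank + 1 + sent) % thread_count;   /* pick a destination   */
          Send_msg(&queues[dest], 100*my_rank + sent);      /* enqueue on its queue */
          (void) Try_receive(&queues[my_rank]);             /* drain my own queue   */
      }

      #pragma omp atomic
      done_sending++;                                       /* I am done producing */

      while (!Done(&queues[my_rank], thread_count))         /* keep receiving until */
          (void) Try_receive(&queues[my_rank]);             /* everyone is finished */
  }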

Sending Messages
- Use synchronization mechanisms to update the FIFO
- "Flush" happens implicitly (acquiring and releasing a lock or critical section implies a flush)
- What is the implementation of Enqueue?
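The slide asks, but does not show, how Enqueue could be implemented. A minimal sketch continuing the hypothetical msg_queue above: the producer takes the queue's lock (which supplies the implicit flushes), stores the message, and then bumps the enqueue count.

  /* Sketch of Send_msg/Enqueue for the hypothetical msg_queue above. */
  void Send_msg(struct msg_queue *q, int mesg) {
      omp_set_lock(&q->lock);                    /* mutual exclusion among producers */
      q->buf[q->enq_count % CAPACITY] = mesg;    /* store the message                */
      q->enq_count++;                            /* publish it                       */
      omp_unset_lock(&q->lock);                  /* releasing the lock flushes       */
  }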

Receiving Messages
- This thread is the only one to dequeue its messages
- Other threads may only add more messages
- Messages are added to the end and removed from the front
- Therefore, synchronization is needed only if we are on the last entry
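A sketch of the corresponding Try_receive/Dequeue, continuing the same hypothetical structure. Because only the owning thread dequeues, it can skip the lock except when exactly one message remains, since a producer might still be working on the entry right behind the front.

  /* Sketch of Try_receive/Dequeue for the hypothetical msg_queue above.
   * Returns a message, or -1 if the queue is currently empty. */
  int Try_receive(struct msg_queue *q) {
      int size, mesg;

      #pragma omp flush                          /* see producers' updates to enq_count */
      size = q->enq_count - q->deq_count;
      if (size == 0)
          return -1;                             /* nothing to receive */

      if (size == 1) {
          /* Last entry: a producer may be enqueuing right now, so take the lock. */
          omp_set_lock(&q->lock);
          mesg = q->buf[q->deq_count % CAPACITY];
          q->deq_count++;
          omp_unset_lock(&q->lock);
      } else {
          /* More than one entry: the front cannot be disturbed by producers. */
          mesg = q->buf[q->deq_count % CAPACITY];
          q->deq_count++;
      }
      return mesg;
  }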

Termination Detection
- A shared "done_sending" counter: each thread increments it after completing its for loop
- More synchronization needed on "done_sending"
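Continuing the hypothetical sketches above, the termination test might read the shared done_sending counter (incremented atomically in the driver sketch) after a flush, and declare a thread done only when every thread has finished sending and its own queue has been drained. Again, this is an illustrative sketch, not the textbook's code.

  /* Sketch of the termination test for the hypothetical msg_queue above. */
  int Done(struct msg_queue *q, int thread_count) {
      int finished_senders;

      #pragma omp flush                      /* see other threads' increments of done_sending */
      finished_senders = done_sending;

      if (finished_senders == thread_count &&
          q->enq_count == q->deq_count)      /* my queue is empty */
          return 1;
      return 0;
  }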

Task Example: Linked List Traversal

  while (my_pointer) {
    (void) do_independent_work(my_pointer);
    my_pointer = my_pointer->next;
  } // End of while loop

How to express this with parallel for?
- Must have a fixed number of iterations
- Loop-invariant loop condition and no early exits

OpenMP 3.0: Tasks!

  my_pointer = listhead;
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      while (my_pointer) {
        #pragma omp task firstprivate(my_pointer)
        {
          (void) do_independent_work(my_pointer);
        }
        my_pointer = my_pointer->next;
      }
    } // End of single - no implied barrier (nowait)
  } // End of parallel region - implied barrier here

firstprivate = private and copy initial value from global variable
lastprivate = private and copy back final value to global variable
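To make the pattern concrete, here is a minimal self-contained sketch that builds a small list and traverses it with tasks. The node type, list length, and the "work" (printing the payload) are invented for illustration; do_independent_work stands in for whatever per-node computation the application needs. Requires a compiler with OpenMP 3.0 task support, e.g. gcc -fopenmp.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  /* Hypothetical node type for illustration. */
  struct node {
      int          value;
      struct node *next;
  };

  /* Stand-in for the slide's do_independent_work(). */
  void do_independent_work(struct node *p) {
      printf("thread %d processing node %d\n", omp_get_thread_num(), p->value);
  }

  int main(void) {
      /* Build a short list: 0 -> 1 -> ... -> 9 */
      struct node *listhead = NULL;
      for (int i = 9; i >= 0; i--) {
          struct node *n = malloc(sizeof *n);
          n->value = i;
          n->next  = listhead;
          listhead = n;
      }

      struct node *my_pointer = listhead;
      #pragma omp parallel
      {
          #pragma omp single nowait                         /* one thread walks the list...  */
          {
              while (my_pointer) {
                  #pragma omp task firstprivate(my_pointer) /* ...and spawns a task per node */
                  do_independent_work(my_pointer);
                  my_pointer = my_pointer->next;
              }
          }
      }   /* implied barrier: all tasks finish here */

      /* Free the list. */
      while (listhead) {
          struct node *next = listhead->next;
          free(listhead);
          listhead = next;
      }
      return 0;
  }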