Introduction to Parallel Processing Dr. Guy Tel-Zur Lecture 10.

Agenda
- Administration
- Final presentations
- Demos
- Theory
- Next week's plan
- Home assignment #4 (last)

Final Projects
- Next Sunday: Groups 1-16 will present
- Next Monday: Groups 17+ will present
- 10-minute presentation per group
- All group members should present
- Send your presentation (to the lecturer) by midnight of the previous day
- Attendance is mandatory

Final Presentations
- The division into groups is fixed.
- A group that does not present will lose 5 points from its grade.
- Rehearse and make sure you stay within the time limit.
- The presentation should include: the project name, its goal, the challenge the problem poses for parallel computation, and approaches to a solution.
- Presentations will not be accepted during class! Make sure to send them to the lecturer ahead of time.

The Course Roadmap (diagram): Introduction; Message Passing (MPI); HTC (Condor); HPC; Shared Memory (OpenMP, Cilk++); Grid Computing; Cloud Computing; GPU Computing (new). Today's topics are marked on the roadmap.

Advanced Parallel Computing and Distributed Computing course
A new course at the department: Distributed Computing = Advanced Parallel Processing + Grid Computing + Cloud Computing.
Course Number:
If you are interested in this course, please send me an email.

Today
- Algorithms – Numerical Algorithms ("slides11.ppt")
- Introduction to Grid Computing
- Some demos
- Home assignment #4

Futuristic Asymmetric Multi-Core Chip – SACC: Sequential Accelerator

Theory: Numerical Algorithms – slides from the University of North Carolina at Charlotte, Department of Computer Science, ITCS 4145/5145 Parallel Programming, Spring 2009, Dr. Barry Wilkinson. Topics: matrix multiplication, solving a system of linear equations, iterative methods.

Demos
- Hybrid Parallel Programming – MPI + OpenMP
- Cloud Computing – setting up an HPC cluster; setting up a Condor machine (a separate presentation)
- StarHPC
- Cilk++
- GPU Computing (a separate presentation)
- Eclipse PTP
- Kepler workflow

Hybrid MPI + OpenMP Demo
Machine file:
hobbit1
hobbit2
hobbit3
hobbit4
Each hobbit has 8 cores.
Compile (mpicc provides MPI; -fopenmp enables OpenMP):
mpicc -o mpi_out mpi_test.c -fopenmp
An idea for a final project!
cd ~/mpi
Program name: hybridpi.c
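For reference, a typical way to launch the hybrid binary over such a machine file might look like this (the file name "machinefile" is assumed, and flag spellings differ slightly between MPI implementations, so treat this as a sketch rather than the exact demo commands):

export OMP_NUM_THREADS=8                          # one OpenMP thread per core on each node
mpirun -np 4 -machinefile machinefile ./mpi_out   # one MPI rank per hobbit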

MPI is not yet installed on the hobbits; in the meantime use: vdwarf5 vdwarf6 vdwarf7 vdwarf8

top -u tel-zur -H -d 0.05 (-H shows threads, -d sets the refresh delay, -u filters by user)

Hybrid MPI+OpenMP continued

Hybrid Pi (MPI + OpenMP)

#include <stdio.h>     /* header names restored; the originals were lost in extraction */
#include <mpi.h>
#include <omp.h>
#define NBIN 100000    /* number of integration bins (value assumed; original lost) */
#define MAX_THREADS 8

int main(int argc, char **argv) {
    int nbin, myid, nproc, nthreads, tid;
    double step, sum[MAX_THREADS] = {0.0}, pi = 0.0, pig;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    nbin = NBIN / nproc;             /* bins handled by each MPI rank */
    step = 1.0 / (nbin * nproc);     /* width of one bin */

    #pragma omp parallel private(tid)
    {
        int i;
        double x;
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        for (i = nbin*myid + tid; i < nbin*(myid+1); i += nthreads) {
            x = (i + 0.5) * step;
            sum[tid] += 4.0 / (1.0 + x*x);
        }
        printf("rank tid sum = %d %d %e\n", myid, tid, sum[tid]);
    }
    for (tid = 0; tid < nthreads; tid++) pi += sum[tid] * step;
    MPI_Allreduce(&pi, &pig, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (myid == 0) printf("PI = %f\n", pig);
    MPI_Finalize();
    return 0;
}

Cilk++ – simple, powerful expression of task parallelism:
- cilk_for – parallelize for loops
- cilk_spawn – specify the start of parallel execution
- cilk_sync – specify the end of parallel execution

Fibonacci example. Try:

Fibonacci numbers, serial version:

// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
// Serial version
// Credit:
long fib_serial(long n) {
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

Cilk++ Fibonacci

#include <cilk.h>      /* header names assumed; the originals were lost in extraction */
#include <stdio.h>

long fib_parallel(long n) {
    long x, y;
    if (n < 2) return n;
    x = cilk_spawn fib_parallel(n-1);   /* child may run in parallel with the parent */
    y = fib_parallel(n-2);
    cilk_sync;                          /* wait for the spawned child */
    return (x + y);
}

int cilk_main() {
    int N = 50;
    long result;
    result = fib_parallel(N);
    printf("fib of %d is %ld\n", N, result);   /* %ld because result is a long */
    return 0;
}

Adding parallelism with cilk_spawn. We are now ready to introduce parallelism into our qsort program. The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows, but does not require, parallel operation; the Cilk++ scheduler dynamically determines what actually gets executed in parallel when multiple processors are available. The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.
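For illustration, a spawned quicksort along these lines can be sketched as follows (modeled on the sample qsort that ships with the Cilk++ SDK; the partitioning details here are illustrative rather than the demo's exact source):

#include <cilk.h>
#include <algorithm>
#include <functional>

// Sort the range [begin, end); the recursion on the left half may run
// in parallel with the parent sorting the right half.
void sample_qsort(int* begin, int* end) {
    if (begin != end) {
        --end;                                           // use the last element as the pivot
        int* middle = std::partition(begin, end,
                          std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);                        // move the pivot to its final place
        cilk_spawn sample_qsort(begin, middle);          // child strand
        sample_qsort(++middle, ++end);                   // parent continues
        cilk_sync;                                       // wait for the spawned child
    }
}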

Cilkview Fn(30)
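Cilkview is run by prefixing the program's command line; assuming the Fibonacci binary is called fib, the invocation would look roughly like:

cilkview ./fib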

Strands and Knots – a Cilk++ program fragment:

...
do_stuff_1();          // execute strand 1
cilk_spawn func_3();   // spawn strand 3 at knot A
do_stuff_2();          // execute strand 2
cilk_sync;             // sync at knot B
do_stuff_4();          // execute strand 4
...

DAG with two spawns (labeled A and B) and one sync (labeled C)

Let's add labels to the strands of a more complex Cilk++ program (DAG) to indicate the number of milliseconds it takes to execute each strand. In ideal circumstances (e.g., if there is no scheduling overhead) and with an unlimited number of processors available, this program should run for 68 milliseconds.

Work and Span
Work: the total amount of processor time required to complete the program is the sum of all the numbers; we call this the work. In this DAG the work is 181 milliseconds for the 25 strands shown, so if the program is run on a single processor it should run for 181 milliseconds.
Span: another useful concept is the span, sometimes called the critical path length. The span is the most expensive path from the beginning to the end of the program. In this DAG the span is 68 milliseconds, as shown below:
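These two numbers bound the achievable speedup: the parallelism is the work divided by the span, here 181 ms / 68 ms ≈ 2.66, so even with unlimited processors this particular DAG cannot run more than about 2.7 times faster than on a single processor.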

cilk_for uses a divide-and-conquer strategy (shown here: 8 threads and 8 iterations). By contrast, here is the DAG for a serial loop that spawns each iteration: in this case the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync.
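For reference, a minimal cilk_for sketch (the loop body is an assumption, purely for illustration):

#include <cilk.h>

int cilk_main() {
    int a[8];
    // cilk_for divides the iteration range recursively among the workers,
    // instead of spawning each iteration one by one.
    cilk_for (int i = 0; i < 8; ++i) {
        a[i] = i * i;
    }
    return 0;
}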

Race conditions Check the “qsort-race” program with cilkscreen:
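Like cilkview, cilkscreen is run by prefixing the program's command line; assuming the binary is named qsort-race, the check would look roughly like:

cilkscreen ./qsort-race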

StarHPC on the Cloud Will be ready for PP201X?

Eclipse PTP Parallel Tools Platform Will be ready for PP201X?

Recursion in OpenMP

long fib_parallel(long n) {
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x, n)
    { x = fib_parallel(n - 1); }
    y = fib_parallel(n - 2);
    #pragma omp taskwait
    return (x + y);
}

#pragma omp parallel
#pragma omp single
{ r = fib_parallel(n); }

Reference:
Use the taskwait pragma to specify a wait for the child tasks generated by the current task to complete. The task pragma can be useful for parallelizing irregular algorithms, such as recursive algorithms, for which other OpenMP worksharing constructs are inadequate.
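A self-contained version of the snippet above can be sketched as follows (the headers, main(), and the value of n are assumptions added for illustration). It compiles with the same -fopenmp flag used earlier, e.g. gcc -fopenmp fib_task.c (file name assumed):

#include <stdio.h>
#include <omp.h>

long fib_parallel(long n) {
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x, n)
    { x = fib_parallel(n - 1); }      /* child task computes fib(n-1) */
    y = fib_parallel(n - 2);          /* parent computes fib(n-2) */
    #pragma omp taskwait              /* wait for the child task before using x */
    return x + y;
}

int main(void) {
    long n = 30, r;                   /* n = 30 is an assumed test value */
    #pragma omp parallel
    #pragma omp single                /* one thread starts the recursion; tasks spread the work */
    r = fib_parallel(n);
    printf("fib(%ld) = %ld\n", n, r);
    return 0;
}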

Intel® Parallel Studio
- Use Parallel Composer to create and compile a parallel application
- Use Parallel Inspector to improve reliability by finding memory and threading errors
- Use Parallel Amplifier to improve parallel performance by tuning threaded code

Intel® Parallel Studio

Parallel Studio adds new features to Visual Studio

Intel’s Parallel Amplifier – Execution Bottlenecks

Intel’s Parallel Inspector – Threading Errors

Error – Data Race

Intel Parallel Studio – Composer. The installation of this part failed for me, probably because I had not installed Intel's C++ compiler first. Sorry, I can't give a demo here…