Cilk and OpenMP (Lecture 20, cs262a)


Cilk and OpenMP (Lecture 20, cs262a) Ion Stoica, UC Berkeley, April 4, 2018

Today’s papers: “Cilk: An Efficient Multithreaded Runtime System” by Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou, http://supertech.csail.mit.edu/papers/PPoPP95.pdf; “OpenMP: An Industry-Standard API for Shared-Memory Programming” by Leonardo Dagum and Ramesh Menon, https://ucbrise.github.io/cs262a-spring2018/

Message passing vs. shared memory. [Figure: clients exchanging messages via send(msg)/recv(msg) vs. clients communicating through a shared-memory region.] Message passing: exchange data explicitly via IPC; application developers define the protocol, the exchange format, the number of participants, and each exchange. Shared memory: allows multiple processes to share data via memory; applications must locate and map shared-memory regions to exchange data.

Architectures (orthogonal to the programming model): Uniform Shared Memory (UMA), e.g., Cray 2; Non-Uniform Shared Memory (NUMA), e.g., SGI Altix 3700; massively parallel distributed memory, e.g., Blue Gene/L.

Motivation. Multicore CPUs are everywhere: servers with over 100 cores today; even smartphone CPUs have 8 cores. Multithreading is the natural programming model: all processors share the same memory; threads in a process see the same address space; many shared-memory algorithms have been developed.

But… Multithreading is hard Lots of expertise necessary Deadlocks and race conditions Non-deterministic behavior makes it hard to debug

Example. Parallelize the following code using threads:

for (i=0; i<n; i++) {
    sum = sum + sqrt(sin(data[i]));
}

Why hard? Need a mutex to protect the accesses to sum; different code for the serial and parallel versions; no built-in tuning (# of processors?).
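For comparison, a minimal pthreads sketch of what the manual version involves (names such as partial_sum, NTHREADS, and the array size are made up; link with -lpthread -lm): the iterations must be partitioned by hand, the shared accumulator needs a mutex, and the thread count is hard-coded.

#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double sum = 0.0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct { int begin, end; } range_t;

static void *partial_sum(void *arg) {
    range_t *r = (range_t *)arg;
    double local = 0.0;                      /* accumulate locally... */
    for (int i = r->begin; i < r->end; i++)
        local += sqrt(sin(data[i]));
    pthread_mutex_lock(&sum_lock);           /* ...then update the shared sum once */
    sum += local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    range_t r[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        r[t].begin = t * N / NTHREADS;       /* static partitioning of the iterations */
        r[t].end   = (t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %f\n", sum);
    return 0;
}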

Cilk Based on slides available at http://supertech.csail.mit.edu/cilk/lecture-1.pdf

Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors Examples: dense and sparse matrix computations n-body simulation heuristic search graphics rendering …

Cilk in one slide: a simple extension to C with just three basic keywords; Cilk programs maintain serial semantics; abstracts away parallel execution, load balancing, and scheduling; parallelism is processor-oblivious; supports speculative execution; provides performance “guarantees”.

Example: Fibonacci

C (serial elision):
int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}

Cilk:
cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

Cilk basic keywords:
cilk: identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: the named child Cilk procedure can execute in parallel with the parent caller.
sync: control cannot pass this point until all spawned children have returned.

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Dynamic multithreading, example: fib(4)

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Execution starts with the initial thread and the computation unfolds dynamically. The execution is represented as a graph G = (V, E): each vertex v ∈ V represents a (Cilk) thread, i.e., a maximal sequence of instructions not containing parallel control (spawn, sync, return); every edge e ∈ E is either a spawn, a continue, or a return edge. [Figure: the original slides build up the fib(4) DAG step by step, adding spawn edges as children are created, continue edges within a procedure, and return edges back to the parent, ending with the final thread.]

Cactus stack. A stack pointer can be passed from parent to child, but not from child to parent. Supports several views of the stack: each spawned procedure sees the stack frames of its ancestors. Cilk also supports malloc. [Figure: procedures A through E and the stack view each of them sees.]

Algorithmic complexity

TP = execution time on P processors
T1 = total work
TC = critical path length (span)

Lower bounds: TP >= T1/P and TP >= TC

Speedup. T1/TP = speedup on P processors (always <= P). If T1/TP = Θ(P), linear speedup; if T1/TP = P, perfect linear speedup.

Parallelism. Since TP >= TC, T1/TC is the maximum speedup. T1/TC = parallelism, the average amount of work per step along the span.

Example: fib(4). Assume each thread takes 1 time unit. Work: T1 = 17; Span: TC = 8; Parallelism: T1/TC = 17/8 ≈ 2.1. Little sense to use more than 2 processors!

Example: vector addition

void vadd (real *A, real *B, int n) {
    int i;
    for (i=0; i<n; i++) A[i]+=B[i];
}

Key idea: convert loops to recursion:

void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i]+=B[i];
    } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
    }
}

Example: vector addition (Cilk)

cilk void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i]+=B[i];
    } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
    }
}

Example: vector addition, analysis. Assume BASE = Θ(1). Work: T1 = Θ(n); Span: TC = Θ(log n); Parallelism: T1/TC = Θ(n/log n).
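These bounds follow from simple recurrences; a sketch in LaTeX notation (BASE treated as a constant, T_C written for the span TC):

\begin{align*}
T_1(n) &= 2\,T_1(n/2) + \Theta(1) = \Theta(n)    && \text{both recursive halves count toward the work} \\
T_C(n) &= T_C(n/2) + \Theta(1) = \Theta(\log n)  && \text{the halves run in parallel, so only one counts toward the span} \\
T_1(n)/T_C(n) &= \Theta(n/\log n)
\end{align*}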

Scheduling. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. [Figure: P processors connected by a network to memory and I/O.]

Greedy scheduling (illustrated in the slides with P = 3)

Key idea: do as much as possible on every step.
Definition: a thread is ready if all its predecessors have executed.
Complete step: if # ready threads >= P, run any P ready threads.
Incomplete step: if # ready threads < P, run all ready threads.

Greedy-Scheduling Theorem [Graham ’68 & Brent ’75]: any greedy scheduler achieves TP ≤ T1/P + TC.

Proof sketch: the number of complete steps is ≤ T1/P, since each complete step performs P work; the number of incomplete steps is ≤ TC, since each incomplete step reduces the span of the unexecuted DAG by 1.

Optimality of greedy Corollary: Any greedy scheduler is within a factor of 2 of optimal Proof: Let TP* be execution time produced by optimal scheduler Since TP* ≥ max{T1/P, TC} (lower bounds), we have TP ≤ T1/P + TC ≤ 2 max{T1/P, TC} ≤ 2 TP*

Linear speedup. Corollary: any greedy scheduler achieves near-perfect linear speedup whenever P << T1/TC. Proof: from P << T1/TC we get TC << T1/P; the Greedy-Scheduling Theorem then gives TP ≤ T1/P + TC ≈ T1/P; thus, the speedup is T1/TP ≈ P.

Work stealing. Each processor has a queue (in Cilk, a deque) of threads to run; a spawned thread is put on the local processor’s queue; when a processor runs out of work, it looks at the queues of other processors and “steals” their work items; the victim processor to steal from is typically picked at random.
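To make the mechanics concrete, here is a minimal mutex-based sketch in C with pthreads (all names are hypothetical, and this is not how Cilk implements it: Cilk-5 uses a carefully engineered lock-free deque protocol and steals whole procedure frames, not plain function-pointer tasks). Each worker owns a deque: the owner pushes and pops at the tail, and an idle worker steals the oldest task from the head of a randomly chosen victim.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define CAP      1024
#define NTASKS   64

typedef struct { void (*fn)(long); long arg; } task_t;

typedef struct {
    task_t buf[CAP];
    int head, tail;                  /* steal at head, push/pop at tail */
    pthread_mutex_t lock;
} deque_t;

static deque_t dq[NWORKERS];
static atomic_int pending;           /* tasks not yet executed */

static void push(int w, task_t t) {  /* owner adds work at the tail */
    pthread_mutex_lock(&dq[w].lock);
    dq[w].buf[dq[w].tail++ % CAP] = t;
    pthread_mutex_unlock(&dq[w].lock);
}

/* Owner pops the newest task from the tail; a thief steals the oldest from the head. */
static int take(int owner, int self, task_t *t) {
    int ok = 0;
    pthread_mutex_lock(&dq[owner].lock);
    if (dq[owner].tail > dq[owner].head) {
        *t = (owner == self) ? dq[owner].buf[--dq[owner].tail % CAP]
                             : dq[owner].buf[dq[owner].head++ % CAP];
        ok = 1;
    }
    pthread_mutex_unlock(&dq[owner].lock);
    return ok;
}

static void work(long i) { printf("task %ld\n", i); }

static void *worker(void *p) {
    int w = (int)(long)p;
    task_t t;
    while (atomic_load(&pending) > 0) {          /* busy-wait until all tasks have run (fine for a sketch) */
        int victim = rand() % NWORKERS;          /* rand() used for brevity; pick a random victim */
        if (take(w, w, &t) || take(victim, w, &t)) {
            t.fn(t.arg);
            atomic_fetch_sub(&pending, 1);
        }
    }
    return NULL;
}

int main(void) {
    pthread_t th[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) pthread_mutex_init(&dq[i].lock, NULL);
    atomic_store(&pending, NTASKS);
    for (long i = 0; i < NTASKS; i++) push(0, (task_t){ work, i });  /* all work starts on worker 0 */
    for (long i = 0; i < NWORKERS; i++) pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++) pthread_join(th[i], NULL);
    return 0;
}

Seeding all the work on worker 0 shows the point of stealing: the other workers immediately pull tasks from its deque instead of sitting idle.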

Cilk performance. Cilk’s “work-stealing” scheduler achieves TP = T1/P + O(TC) expected time (provably) and TP ≈ T1/P + O(TC) time (empirically); near-perfect linear speedup if P << T1/TC. Instrumentation in Cilk allows T1 and TC to be measured accurately. The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.

Summary. Cilk is a C extension for multithreaded programs, now also available for C++ (Cilk Plus from Intel). Simple: only three keywords, cilk, spawn, sync (Cilk Plus actually has only two: cilk_spawn, cilk_sync). Equivalent to the serial program on a single core. Abstracts away parallelism, load balancing, and scheduling. Leverages the recursion pattern; you might need to rewrite programs to fit the pattern (see the vector addition example).
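For reference, a sketch of fib in the Cilk Plus syntax mentioned above (assumes a Cilk Plus-capable compiler, e.g., an older GCC built with -fcilkplus support; newer compilers have dropped Cilk Plus):

#include <cilk/cilk.h>
#include <stdio.h>

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);   /* child may run in parallel with the continuation */
    int y = fib(n - 2);
    cilk_sync;                       /* wait for the spawned child before using x */
    return x + y;
}

int main(void) {
    printf("fib(30) = %d\n", fib(30));
    return 0;
}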

Based on the “Introduction to OpenMP” presentation: (https://webcourse.cs.technion.ac.il/236370/Spring2009/ho/WCFiles/OpenMPLecture.ppt)

OpenMP A language extension with constructs for parallel programming: Critical sections, atomic access, private variables, barriers Parallelization is orthogonal to functionality If the compiler does not recognize OpenMP directives, the code remains functional (albeit single-threaded) Industry standard: supported by Intel, Microsoft, IBM, HP

OpenMP execution model, fork and join: the master thread spawns a team of worker threads as needed; each parallel region begins with a FORK and ends with a JOIN, after which only the master thread continues. [Figure: the master thread forking a team of worker threads at each parallel region and joining them afterwards.]
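A minimal sketch of the fork/join model (compile with an OpenMP flag such as gcc -fopenmp; the number of worker threads is whatever the runtime chooses):

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("master thread, before the parallel region\n");
    #pragma omp parallel                       /* FORK: a team of threads executes this block */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                          /* JOIN: implicit barrier; only the master continues */
    printf("master thread, after the parallel region\n");
    return 0;
}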

OpenMP memory model Shared memory model Threads communicate by accessing shared variables The sharing is defined syntactically Any variable that is seen by two or more threads is shared Any variable that is seen by one thread only is private Race conditions possible Use synchronization to protect from conflicts Change how data is stored to minimize the synchronization
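A small illustrative sketch of these rules (the variable names total and mine are made up): total is visible to all threads and therefore shared, mine is declared inside the region and therefore private, and the atomic directive is the synchronization that keeps the shared updates from racing:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int total = 0;                       /* declared before the region -> shared */
    #pragma omp parallel
    {
        int mine = omp_get_thread_num(); /* declared inside the region -> private */
        #pragma omp atomic               /* without this, the += is a race condition */
        total += mine;
    }
    printf("total = %d\n", total);
    return 0;
}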

OpenMP: work sharing example

answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { … }

How to parallelize? The two computations are independent, so each can run in its own section, e.g., with the combined parallel sections construct:

#pragma omp parallel sections
{
    #pragma omp section
    answer1 = long_computation_1();
    #pragma omp section
    answer2 = long_computation_2();
}
if (answer1 != answer2) { … }

OpenMP: work sharing example

Sequential code:

for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }

(Semi) manual parallelization: launch nt threads; each thread uses the id and nt variables to operate on a different segment of the arrays:

#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int i_start = id*N/nt, i_end = (id+1)*N/nt;
    for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}

Automatic parallelization of the for loop using #parallel for:

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
}

The loop must be in canonical form: initialization: var = init; comparison: var op last, where op is one of <, >, <=, >=; increment: var++, var--, var += incr, var -= incr; a single signed loop variable (“i”).
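The directive version as a complete program (the array size and initial data are made up for illustration); compiled without OpenMP support the pragma is simply ignored and the loop runs serially:

#include <stdio.h>

#define N 1000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }  /* made-up input data */

    #pragma omp parallel for schedule(static)    /* iterations are split among the team */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}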

Challenges of #parallel for. Load balancing: if all iterations execute at the same speed, the processors are used optimally; if some iterations are faster, some processors may go idle, reducing the speedup; we don’t always know the distribution of work and may need to redistribute it dynamically. Granularity: thread creation and synchronization take time; assigning work to threads at per-iteration resolution may take more time than the execution itself; the work needs to be coalesced into coarse chunks to overcome the threading overhead. There is a trade-off between load balancing and granularity of parallelism.

Schedule: controlling work distribution. schedule(static [, chunksize]): default is chunks of approximately equal size, one per thread; if there are more chunks than threads, they are assigned round-robin to the threads; why might one want chunks of different sizes? schedule(dynamic [, chunksize]): threads receive chunk assignments dynamically; default chunk size is 1. schedule(guided [, chunksize]): start with large chunks; threads receive chunks dynamically, and the chunk size decreases exponentially, down to chunksize.
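A sketch of when the schedule clause matters: below, iteration i does work proportional to i (expensive() and the chunk size 16 are illustrative choices), so equal static chunks would leave the threads holding the low-numbered iterations idle, while dynamic chunks rebalance at run time:

#include <stdio.h>

#define N 10000

static double out[N];

static double expensive(int i) {
    double x = 0.0;
    for (int k = 0; k < i; k++)          /* cost grows with i -> imbalanced iterations */
        x += (double)k / (i + 1);
    return x;
}

int main(void) {
    #pragma omp parallel for schedule(dynamic, 16)   /* try static vs. dynamic and compare timings */
    for (int i = 0; i < N; i++)
        out[i] = expensive(i);
    printf("out[%d] = %f\n", N - 1, out[N - 1]);
    return 0;
}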

OpenMP: data environment. Shared-memory programming model: most variables (including locals of the enclosing function) are shared by threads, e.g.:

{
    int sum = 0;
    #pragma omp parallel for
    for (int i=0; i<N; i++) sum += i;    /* sum is shared; the loop variable i is private */
}

Global variables are shared. Some variables can be private: variables declared inside the parallel statement block, variables in called functions, and variables explicitly declared as private.

Overriding storage attributes.
private: a copy of the variable is created for each thread; there is no connection between the original variable and the private copies; the same effect can be achieved with variables declared inside { }.
firstprivate: same, but the initial value of the variable is copied from the main copy.
lastprivate: same, but the last value of the variable is copied back to the main copy.

int i;
#pragma omp parallel for private(i)
for (i=0; i<n; i++) { … }

int idx=1;
int x = 10;
#pragma omp parallel for firstprivate(x) lastprivate(idx)
for (i=0; i<n; i++) {
    if (data[i] == x) idx = i;
}

Reduction

for (j=0; j<N; j++) {
    sum = sum + a[j]*b[j];
}

How to parallelize this code? sum is not private, but accessing it atomically is too expensive. Instead, have a private copy of sum in each thread, then add them up: use the reduction clause:

#pragma omp parallel for reduction(+: sum)

Any associative operator can be used: +, -, ||, |, *, etc. The private value is initialized automatically (to 0, 1, ~0, …).

#pragma omp reduction

float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Summary. OpenMP: a framework for code parallelization. Available for C, C++, and Fortran. Provides control over parallelism. Implementations from a wide selection of vendors. Relatively easy to use: write (and debug!) the code first, parallelize later; parallelization can be incremental; parallelization can be turned off at runtime or compile time; the code is still correct for a serial machine.

Cilk vs. OpenMP.
Cilk: simpler; more natural for unstructured programs.
OpenMP: provides more control to the developer (e.g., setting the chunk size); more natural for parallelizing for() statements; wider support from more vendors; also extensions for Fortran.

“If your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups. If your control structures are more involved, in particular, involving nested parallelism, you may find that OpenMP isn’t quite up to the job” -- http://www.cilk.com/multicore-blog/bid/8583/Comparing-Cilk-and-OpenMP