Introduction to CILK

Some slides are from: Kathy Yelick, yelick@cs.berkeley.edu, www.cs.berkeley.edu/~yelick/cs194f07. The rest of the slides are mine (Guy Tel-Zur).
1/3/2019 CS267 Lecture 2

Cilk

A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.

Example applications:
- virus shell assembly
- graphics rendering
- n-body simulation
- heuristic search
- dense and sparse matrix computations
- friction-stir welding simulation
- artificial evolution

© 2006 Charles E. Leiserson

Guy: http://software.intel.com/en-us/intel-cilk-plus

Guy: http://cilkplus.org/

Cilk integration with VS2008

17/8/2011

A newer version exists.

Fibonacci. Try: http://www.wolframalpha.com/input/?i=fibonacci+number

Fibonacci numbers, serial version

// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
// Serial version
// Credit: http://myxman.org/dp/node/182
long fib_serial(long n) {
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

Cilk++ Fibonacci

#include <cilk.h>
#include <stdio.h>

long fib_parallel(long n)
{
    long x, y;
    if (n < 2) return n;
    x = cilk_spawn fib_parallel(n-1);
    y = fib_parallel(n-2);
    cilk_sync;
    return (x+y);
}

int cilk_main()
{
    int N = 50;
    long result;
    result = fib_parallel(N);
    printf("fib of %d is %ld\n", N, result);
    return 0;
}

Cilk++

Simple, powerful expression of task parallelism:
cilk_for – parallelize for loops
cilk_spawn – specify the start of parallel execution
cilk_sync – specify the end of parallel execution

http://software.intel.com/en-us/articles/intel-cilk-plus/

Adding parallelism using cilk_spawn

The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows but does not require parallel operation: the Cilk++ scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available. The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.

Fibonacci Example: Creating Parallelism

C elision:

int fib (int n) {
    if (n<2) return (n);
    else {
        int x, y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}

Cilk code:

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Elision = omission. Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

Basic Cilk Keywords

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

cilk identifies a function as a Cilk procedure, capable of being spawned in parallel. spawn: the named child Cilk procedure can execute in parallel with the parent caller. sync: control cannot pass this point until all spawned children have returned.

Dynamic Multithreading

cilk int fib (int n) {
    if (n<2) return (n);
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Example: fib(4). The computation DAG unfolds dynamically as procedures are spawned; processors are virtualized.

Demo: /users/agnon/misc/tel-zur/cilk

-bash-4.1$ gcc -o fib ./fib.c
-bash-4.1$ cilk++ -o fib_cilk ./fib_cilk.c
-bash-4.1$ ./fib
Fibonacci of 45 is 1134903170
-bash-4.1$ ./fib_cilk
fibonacci of 45 is 1134903170

Algorithmic Complexity Measures

TP = execution time on P processors
T1 = work
T∞ = span*

Lower bounds:
TP ≥ T1/P
TP ≥ T∞

*Also called critical-path length or computational depth.

Definition: T1/TP = speedup on P processors.
If T1/TP = Θ(P), we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.
T1/T∞ = parallelism = the average amount of work per step along the span.

Strands and Knots

A Cilk++ program fragment:

...
do_stuff_1();        // execute strand 1
cilk_spawn func_3(); // spawn strand 3 at knot A
do_stuff_2();        // execute strand 2
cilk_sync;           // sync at knot B
do_stuff_4();        // execute strand 4
...

(Figure: DAG with two spawns (labeled A and B) and one sync (labeled C).)

A more complex Cilk++ program (DAG): label each strand with the number of milliseconds it takes to execute. In ideal circumstances (e.g., no scheduling overhead), and with an unlimited number of processors available, this program should run for 68 milliseconds.

Work and Span

Work: the total amount of processor time required to complete the program, i.e., the sum of all the strand costs. In this DAG, the work is 181 milliseconds for the 25 strands shown; run on a single processor, the program should take 181 milliseconds.

Span: sometimes called the critical-path length; the most expensive path from the beginning to the end of the program. In this DAG, the span is 68 milliseconds.

cilk_for uses a divide-and-conquer strategy (shown here: 8 threads and 8 iterations). Compare the DAG for a serial loop that spawns each iteration: there the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync.

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125

Using many more than 2 processors makes little sense.

Cilkview Fn(30): show demo in Visual Studio 2008. C:\Users\telzur\Documents\BGU\Teaching\ParallelProcessing\PP2011A\Lectures\10\fibonachi\fibonachi\fibonachi.vcproj

Parallelizing Vector Addition

void vadd (real *A, real *B, int n) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
}

Parallelizing Vector Addition

C, with the loop converted to recursion:

void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
    }
}

Parallelization strategy: Convert loops to recursion.

Parallelizing Vector Addition

Cilk, with keywords inserted:

cilk void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
    }
}

Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. Side benefit: divide-and-conquer is generally good for caches!

Vector Addition

cilk void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
    }
}

Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)

This presentation ends here.

Cilk's solution to mutual exclusion is no better than anybody else's. Cilk provides a library of spin locks declared with Cilk_lockvar. To avoid deadlock with the Cilk scheduler, a lock should be held only within a single Cilk thread; i.e., spawn and sync should not be executed while a lock is held. Fortunately, Cilk's control parallelism often mitigates the need for extensive locking.