Parallelizing C Programs Using Cilk
Mahdi Javadi

Cilk Language

Cilk is a language for multithreaded parallel programming based on C. The programmer does not have to worry about scheduling the computation to run efficiently; the Cilk runtime system handles that. Cilk adds three keywords to C: cilk, spawn, and sync.

Example: Fibonacci

Serial C version:

    int fib(int n) {
        int x, y;
        if (n < 2) return n;
        x = fib(n - 1);
        y = fib(n - 2);
        return x + y;
    }

Cilk version:

    cilk int fib(int n) {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return x + y;
    }

Performance Measures

T_P = execution time on P processors.
T_1 is called the work.
T_∞ is called the span.

Obvious lower bounds:

    T_P >= T_1 / P    (work law)
    T_P >= T_∞        (span law)

The ratio p = T_1 / T_∞ is called the parallelism. Using more than p processors makes little sense.

Cilk Compiler

The file extension should be ".cilk". Example:

    > cilkc -O3 fib.cilk -o fib

To find the 30th Fibonacci number using 4 CPUs:

    > fib --nproc 4 30

To collect timings for each processor and compute the span (the instrumentation is not efficient):

    > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib

Example: Matrix Multiplication

Suppose we want to multiply two n by n matrices. We can formulate the problem recursively:

    ( C11  C12 )   ( A11  A12 )   ( B11  B12 )
    (          ) = (          ) . (          )
    ( C21  C22 )   ( A21  A22 )   ( B21  B22 )

                   ( A11 B11 + A12 B21    A11 B12 + A12 B22 )
                 = (                                        )
                   ( A21 B11 + A22 B21    A21 B12 + A22 B22 )

i.e. one n by n matrix multiplication reduces to 8 multiplications and 4 additions of (n/2) by (n/2) submatrices.

Multiplication Procedure

    Mult(C, A, B, n)
        if (n == 1)
            C[1,1] = A[1,1] . B[1,1]
        else {
            spawn Mult(C11, A11, B11, n/2);
            ...
            spawn Mult(C22, A21, B12, n/2);
            spawn Mult(T11, A12, B21, n/2);
            ...
            spawn Mult(T22, A22, B22, n/2);
            sync;
            Add(C, T, n);
        }

Addition Procedure

    Add(C, T, n)
        if (n == 1)
            C[1,1] = C[1,1] + T[1,1];
        else {
            spawn Add(C11, T11, n/2);
            ...
            spawn Add(C22, T22, n/2);
            sync;
        }

T_1 (work) for addition = O(n^2).
T_∞ (span) for addition = O(log(n)).

Complexity of Multiplication

This matrix multiplication algorithm does O(n^3) work, hence T_1 = O(n^3).
For the span: M_∞(n) = M_∞(n/2) + O(log(n)) = O(log^2(n)).
Parallelism: p = T_1 / T_∞ = O(n^3 / log^2(n)).
To multiply two 1000 by 1000 matrices: p ≈ 10^7 (a lot of CPUs!).
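The O(log^2 n) span comes from unrolling the recurrence: each of the lg n recursion levels contributes the span of one half-size Add call, and those per-level logarithms sum to a quadratic in log n:

```latex
M_\infty(n) = M_\infty(n/2) + O(\log n)
            = \sum_{i=0}^{\lg n - 1} O\!\left(\log \frac{n}{2^i}\right)
            = O\!\left(\sum_{k=1}^{\lg n} k\right)
            = O(\log^2 n)
```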

Discrete Fourier Transform

Serial version:

    DFT(n, w, p, ...)
        ...
        t = w^2 mod p
        DFT(n/2, t, p, ...);
        ...
        w1 = 1;
        for (i = 0; i < n/2; i++) {
            ...
            a[i] = ...
            w1 = w1 . w mod p;
        }

Cilk version (the serial combining loop is replaced by the recursive procedure ParCom so that the combining step can also run in parallel):

    cilk DFT(n, w, p, ...)
        ...
        t = w^2 mod p
        spawn DFT(n/2, t, p, ...);
        sync;
        ...
        spawn ParCom(n, a, p, 1, ...);

    cilk ParCom(n, a, p, m, ...)
        if (n <= 512)
            ...
        spawn ParCom(n/2, a, p, 1, ...);
        m' = m . w^(n/2) mod p;
        spawn ParCom(n/2, a + n/2, p, m', ...);
        sync;

Complexity of ParCom

The sequential combining loop does n/2 multiplications.
T_∞ (span) for ParCom: T_∞(n) = T_∞(n/2) + O(log(n)), so T_∞(n) = O(log^2(n)).
Parallelism: p = O(n / log^2(n)).
We ran the FFT on "stan", which has 4 CPUs, so parallelism beyond p = 4 does not help. We therefore cut off the parallelism at some level of the recursion, falling back to serial code, to speed up the program.

Timings

[Table: number of processors vs. parallel time (ms) and speedup, with the sequential FFT time (ms) for comparison; the measurements did not survive transcription.]