1 CS 140 : Jan 27 – Feb 3, 2010
Multicore (and Shared Memory) Programming with Cilk++
∙ Multicore and NUMA architectures
∙ Multithreaded Programming
∙ Cilk++ as a concurrency platform
∙ Divide and conquer paradigm for Cilk++
Thanks to Charles E. Leiserson for some of these slides

2 Multicore Architecture
[Figure: a chip multiprocessor (CMP) — multiple cores, each with a private cache ($$$), connected by an on-chip network to shared memory and I/O]

3 cc-NUMA Architectures
[Figure: AMD 8-way Opteron server — each processor (CMP) has 2/4 cores, with a memory bank local to each processor]

4 cc-NUMA Architectures
∙ No Front Side Bus
∙ Integrated memory controller
∙ On-die interconnect among CMPs
∙ Main memory is physically distributed among CMPs (i.e., each piece of memory has an affinity to a CMP)
∙ NUMA: Non-Uniform Memory Access
 For multi-socket servers only
 Your desktop is safe (well, for now at least)
 Triton nodes are also NUMA!

5 Desktop Multicores Today
This is your AMD Shanghai or Intel Core i7 (Nehalem)!
[Figure: quad-core die with on-die interconnect; caches are private, so cache coherence is required]

6 Multithreaded Programming
∙ A thread of execution is a fork of a computer program into two or more concurrently running tasks.
∙ POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE.
∙ It is the "assembly language" of shared memory programming: the programmer has to manually
 Create and terminate threads
 Wait for threads to complete
 Manage the interaction between threads using mutexes, condition variables, etc.

7 Concurrency Platforms
∙ Programming directly on PThreads is painful and error-prone.
∙ With PThreads, you either sacrifice memory usage or load balance among processors.
∙ A concurrency platform provides linguistic support and handles load balancing. Examples:
 Threading Building Blocks (TBB)
 OpenMP
 Cilk++

8 Cilk vs. PThreads
How will the following code execute in PThreads? In Cilk?

for (i=1; i< ; i++) {
    spawn-or-fork foo(i);
}
sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? This difference is a liveness property:
∙ Cilk threads are spawned lazily: "may" parallelism
∙ PThreads are spawned eagerly: "must" parallelism

9 Cilk vs. OpenMP
∙ Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
∙ Cilk++ has serial semantics.
∙ Cilk++ has a solution for global variables (a construct called "hyperobjects").
∙ Cilk++ has nested parallelism that works and provides guaranteed speed-up.
∙ Cilk++ has a race detector for debugging and software release.

10 Great, how do we program it?
∙ Cilk++ is a faithful extension of C++.
∙ Programmers implement algorithms mostly in the divide-and-conquer (DAC) paradigm. Two hints to the compiler:
 cilk_spawn: the following function can run in parallel with the caller.
 cilk_sync: all spawned children must return before program execution can continue.
∙ A third keyword is for programmer convenience only (the compiler converts it to spawns/syncs under the covers):
 cilk_for

11 Nested Parallelism
Example: Quicksort
∙ cilk_spawn: the named child function may execute in parallel with the parent caller.
∙ cilk_sync: control cannot pass this point until all spawned children have returned.

12 Cilk++ Loops
∙ A cilk_for loop's iterations execute in parallel.
∙ The index must be declared in the loop initializer.
∙ The end condition is evaluated exactly once, at the beginning of the loop.
∙ Loop increments should be a const value.

Example: Matrix transpose

cilk_for (int i=1; i<n; ++i) {
    cilk_for (int j=0; j<i; ++j) {
        B[i][j] = A[j][i];
    }
}

13 Serial Correctness
[Diagram: Cilk++ source → Cilk++ compiler → linker (with Cilk++ runtime library) → binary; the serialization of the same source goes through a conventional compiler and conventional regression tests, yielding reliable single-threaded code]

Cilk++ source:
int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
    }
}

Serialization:
int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}

The serialization is the code with the Cilk++ keywords replaced by null or C++ keywords. Serial correctness can be debugged and verified by running the multithreaded code on a single processor.

14 Serialization
How to seamlessly switch between serial C++ and parallel Cilk++ programs?

Add to the beginning of your program:
#ifdef CILKPAR
#include 
#else
#define cilk_for for
#define cilk_main main
#define cilk_spawn
#define cilk_sync
#endif

Compile:
 cilk++ -DCILKPAR –O2 –o parallel.exe main.cpp
 g++ –O2 –o serial.exe main.cpp

15 Parallel Correctness
[Diagram: Cilk++ source → linker → binary → Cilkscreen race detector and parallel regression tests → reliable multithreaded code]

Parallel correctness can be debugged and verified with the Cilkscreen race detector, which guarantees to find inconsistencies with the serial code quickly.

16 Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:
int x = 0;
cilk_for (int i=0; i<2; ++i) {
    x++;
}
assert(x == 2);

Dependency graph: A (int x = 0) → B (x++) and C (x++) in parallel → D (assert(x == 2))

17 Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Each x++ compiles to a load–increment–store sequence, so the two parallel updates can interleave:
x = 0;
r1 = x;        r2 = x;
r1++;          r2++;
x = r1;        x = r2;
assert(x == 2);   // fails: both increments read x == 0, so x ends up 1

18 Types of Races
Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B). If A and B are both reads, there is no race; if exactly one of them writes, it is a read race; if both write, it is a write race. Two sections of code are independent if they have no determinacy races between them.

19 Avoiding Races
 All the iterations of a cilk_for should be independent.
 Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.

Ex.
cilk_spawn qsort(begin, middle);
qsort(max(begin + 1, middle), end);
cilk_sync;

20 Cilkscreen
∙ Cilkscreen runs off the binary executable:
 Compile your program with the –fcilkscreen option to include debugging information.
 Go to the directory with your executable and execute cilkscreen your_program [options]
 Cilkscreen prints information about any races it detects.
∙ For a given input, Cilkscreen mathematically guarantees to localize a race if there exists a parallel execution that could produce results different from the serial execution.
∙ It runs about 20 times slower than real-time.

21 Complexity Measures
T_P = execution time on P processors
T_1 = work
T_∞ = span*   (*also called critical-path length or computational depth)

Work Law: T_P ≥ T_1 / P
Span Law: T_P ≥ T_∞

22 Series Composition
A followed by B:
Work: T_1(A∪B) = T_1(A) + T_1(B)
Span: T_∞(A∪B) = T_∞(A) + T_∞(B)

23 Parallel Composition
A in parallel with B:
Work: T_1(A∪B) = T_1(A) + T_1(B)
Span: T_∞(A∪B) = max{T_∞(A), T_∞(B)}

24 Speedup
Def. T_1/T_P = speedup on P processors.
If T_1/T_P = Θ(P), we have linear speedup;
 = P, we have perfect linear speedup;
 > P, we have superlinear speedup, which is not possible in this performance model because of the Work Law T_P ≥ T_1/P.

25 Parallelism
Because the Span Law dictates that T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = parallelism = the average amount of work per step along the span.

26 Three Tips on Parallelism
1. Minimize the span to maximize parallelism. Try to generate 10 times more parallelism than processors for near-perfect linear speedup.
2. If you have plenty of parallelism, try to trade some of it off for reduced work overheads.
3. Use divide-and-conquer recursion or parallel loops rather than spawning one small thing off after another.

Do this:
cilk_for (int i=0; i<n; ++i) {
    foo(i);
}

Not this:
for (int i=0; i<n; ++i) {
    cilk_spawn foo(i);
}
cilk_sync;

27 Three Tips on Overheads
1. Make sure that work/#spawns is not too small. Coarsen by using function calls and inlining near the leaves of recursion rather than spawning.
2. Parallelize outer loops if you can, not inner loops. If you must parallelize an inner loop, coarsen it, but not too much. 500 iterations should be plenty coarse for even the most meager loop. Fewer iterations should suffice for "fatter" loops.
3. Use reducers only in sufficiently fat loops.

28 Sorting
∙ Sorting is possibly the most frequently executed operation in computing!
∙ Quicksort is the fastest sorting algorithm in practice, with an average running time of O(N log N) (but O(N²) worst-case performance).
∙ Mergesort has worst-case performance of O(N log N) for sorting N elements.
∙ Both are based on the recursive divide-and-conquer paradigm.

29 QUICKSORT
∙ Basic quicksort of an array S works as follows:
 If the number of elements in S is 0 or 1, then return.
 Pick any element v in S. Call this the pivot.
 Partition the set S−{v} into two disjoint groups:
  ♦ S₁ = {x ∈ S−{v} | x ≤ v}
  ♦ S₂ = {x ∈ S−{v} | x ≥ v}
 Return quicksort(S₁) followed by v followed by quicksort(S₂)

30 QUICKSORT Select Pivot

31 QUICKSORT Partition around Pivot

32 QUICKSORT Quicksort recursively