CS 240A: Shared Memory & Multicore Programming with Cilk++

CS 240A: Shared Memory & Multicore Programming with Cilk++
Outline: Multicore and NUMA architectures; multithreaded programming; Cilk++ as a concurrency platform; work and span.
Thanks to Charles E. Leiserson for some of these slides.

Multicore Architecture
[Figure: a chip multiprocessor (CMP): several cores, each with a private cache ($), sharing memory and I/O over an on-chip network.]

cc-NUMA Architectures
Example: an AMD 8-way Opteron server (neumann@cs.ucsb.edu):
A processor (CMP) with 2/4 cores
A memory bank local to each processor
Point-to-point interconnect

cc-NUMA Architectures
No front-side bus; integrated memory controller; on-die interconnect among CMPs.
Main memory is physically distributed among CMPs (i.e., each piece of memory has an affinity to a CMP).
NUMA: non-uniform memory access. This applies to multi-socket servers only; your desktop is safe (well, for now at least), and the Triton nodes are not NUMA either.

Desktop Multicores Today
This is your AMD Barcelona or Intel Core i7!
On-die interconnect; private caches, so cache coherence is required.

Multithreaded Programming
POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE; it is the "assembly language" of shared-memory programming. The programmer has to manually:
Create and terminate threads
Wait for threads to complete
Manage interaction between threads using mutexes, condition variables, etc.
(A minimal sketch of these steps appears below.)
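For concreteness, here is a minimal, hedged Pthreads sketch of the manual steps just listed (the worker function and counter are hypothetical; this is not from the original slides). It creates threads, waits for them with join, and protects a shared counter with a mutex.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);      /* manage interaction between threads */
        counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; ++i)
            pthread_create(&t[i], NULL, worker, NULL);   /* create threads */
        for (int i = 0; i < 4; ++i)
            pthread_join(t[i], NULL);                    /* wait for completion */
        printf("counter = %ld\n", counter);
        return 0;
    }

Even this tiny example already needs explicit thread management and locking, which is exactly the burden the next slide's concurrency platforms take over.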

Concurrency Platforms
Programming directly on Pthreads is painful and error-prone; with Pthreads you sacrifice either memory usage or load balance among processors. A concurrency platform provides linguistic support and handles load balancing. Examples:
Threading Building Blocks (TBB)
OpenMP
Cilk++

Cilk vs Pthreads
How will the following code execute in Pthreads? In Cilk?

    for (i=1; i<1000000000; i++) {
        spawn-or-fork foo(i);
    }
    sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two have different liveness properties:
Cilk threads are spawned lazily: "may" parallelism.
Pthreads are spawned eagerly: "must" parallelism.

Cilk vs OpenMP
Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
Cilk++ has a solution for global variables (called "reducers" / "hyperobjects").
Cilk++ has nested parallelism that works and provides guaranteed speed-up; indeed, the Cilk scheduler is provably optimal.
Cilk++ has a race detector (Cilkscreen) for debugging and software release.
Keep in mind that platform comparisons are (and always will be) subject to debate.

Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span (also called critical-path length or computational depth)
Work Law: TP ≥ T1/P
Span Law: TP ≥ T∞

Series Composition (A followed by B)
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = T∞(A) + T∞(B)

Parallel Composition (A in parallel with B)
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Speedup
Def. T1/TP = speedup on P processors.
If T1/TP = Θ(P), we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in this performance model because of the Work Law TP ≥ T1/P.

Scheduling
A strand is a sequence of instructions that doesn't contain any parallel constructs.
Cilk++ allows the programmer to express potential parallelism in an application. The Cilk++ scheduler maps strands onto processors dynamically at runtime. Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.
[Figure: P processors, each with a cache ($), connected to memory and I/O over a network.]

Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
With P = 3 processors:
Complete step: ≥ P strands ready. Run any P of them.
Incomplete step: < P strands ready. Run all of them.

Analysis of Greedy
Theorem: Any greedy scheduler achieves TP ≤ T1/P + T∞. (P = 3 in the illustration.)
Proof. The number of complete steps is ≤ T1/P, since each complete step performs P work. The number of incomplete steps is ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} by the Work and Span Laws, we have TP ≤ T1/P + T∞ ≤ 2⋅max{T1/P, T∞} ≤ 2TP*. ■

Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us TP ≤ T1/P + T∞ ≈ T1/P. Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity T1/(P⋅T∞) is called the parallel slackness. (A worked example follows.)
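A worked example with illustrative numbers (not from any particular program): suppose T1 = 1000 and T∞ = 10, so the parallelism is T1/T∞ = 100. On P = 8 processors the parallel slackness is T1/(P⋅T∞) = 12.5 ≫ 1, and the Work and Span Laws give TP ≥ max{1000/8, 10} = 125. A greedy scheduler guarantees TP ≤ 1000/8 + 10 = 135, so the speedup T1/TP lies between roughly 7.4 and 8, i.e., near-perfect linear speedup.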

Cilk++ Runtime System
Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack: each spawn or call pushes a frame onto the bottom, and each return pops one off.
When a worker runs out of work, it steals from the top of a random victim's deque.
Theorem: With sufficient parallelism, workers steal infrequently, yielding linear speed-up.
[The original slides animate a sequence of spawn, call, return, and steal operations on four workers' deques; a simplified sketch of such a deque follows.]
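To make the deque discipline concrete, here is a simplified, mutex-based sketch of one worker's deque. This is not the actual Cilk++ runtime (which uses a lock-free "THE" protocol and never materializes strands as std::function objects); the class and method names are illustrative only.

    #include <deque>
    #include <functional>
    #include <iostream>
    #include <mutex>

    using Strand = std::function<void()>;

    class WorkDeque {
        std::deque<Strand> dq_;
        std::mutex m_;
    public:
        void push_bottom(Strand s) {              // owner: on spawn or call
            std::lock_guard<std::mutex> g(m_);
            dq_.push_back(std::move(s));
        }
        bool try_pop_bottom(Strand &out) {        // owner: LIFO, like a stack
            std::lock_guard<std::mutex> g(m_);
            if (dq_.empty()) return false;
            out = std::move(dq_.back());
            dq_.pop_back();
            return true;
        }
        bool try_steal_top(Strand &out) {         // thief: take the oldest strand
            std::lock_guard<std::mutex> g(m_);
            if (dq_.empty()) return false;
            out = std::move(dq_.front());
            dq_.pop_front();
            return true;
        }
    };

    int main() {
        WorkDeque d;
        for (int i = 0; i < 3; ++i)
            d.push_bottom([i] { std::cout << "strand " << i << "\n"; });
        Strand s;
        if (d.try_steal_top(s)) s();              // a thief would run strand 0
        while (d.try_pop_bottom(s)) s();          // the owner runs strands 2, then 1
        return 0;
    }

The owner works at the bottom (newest, smallest pieces of work), while thieves take from the top, where the oldest and typically largest subcomputations sit; this is what keeps steals rare when parallelism is plentiful.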

Great, how do we program it?
Cilk++ is a faithful extension of C++; one often uses divide-and-conquer.
Three (really two) hints to the compiler:
cilk_spawn: this function can run in parallel with the caller.
cilk_sync: all spawned children must return before execution can continue.
cilk_for: all iterations of this loop can run in parallel.
The compiler translates cilk_for into cilk_spawn & cilk_sync under the covers (roughly as sketched below).
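The following is only a rough illustration of that translation, not actual compiler output: the helper name cilk_for_helper is hypothetical, the real compiler also chooses a grain size and handles general index types, and the lambda in the comment is merely a modern-C++ way to package the loop body.

    template <typename F>
    void cilk_for_helper(int lo, int hi, F body) {
        if (hi - lo <= 1) {                           // base case: at most one iteration
            if (lo < hi) body(lo);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        cilk_spawn cilk_for_helper(lo, mid, body);    // left half may run in parallel
        cilk_for_helper(mid, hi, body);               // right half runs in this strand
        cilk_sync;                                    // wait for the spawned half
    }

    // cilk_for (int i = 0; i < n; ++i) foo(i);   becomes, roughly:
    // cilk_for_helper(0, n, [](int i) { foo(i); });

The divide-and-conquer shape matters: it gives the loop logarithmic span in the number of iterations, rather than the linear span of spawning iterations one after another.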

Nested Parallelism
Example: Quicksort. The named child function may execute in parallel with the parent caller; control cannot pass the cilk_sync until all spawned children have returned.

    template <typename T>
    void qsort(T begin, T end) {
        if (begin != end) {
            T middle = partition(begin, end,
                bind2nd(less<typename iterator_traits<T>::value_type>(), *begin));
            cilk_spawn qsort(begin, middle);     // child may run in parallel with the parent
            qsort(max(begin + 1, middle), end);
            cilk_sync;                           // wait for the spawned child
        }
    }

Cilk++ Loops
A cilk_for loop's iterations execute in parallel. Example: matrix transpose.

    cilk_for (int i=1; i<n; ++i) {
        cilk_for (int j=0; j<i; ++j) {
            B[i][j] = A[j][i];
        }
    }

The index must be declared in the loop initializer. The end condition is evaluated exactly once at the beginning of the loop. Loop increments should be a const value.

Serial Correctness
The serialization is the code with the Cilk++ keywords replaced by null or C++ keywords. Serial correctness can be debugged and verified by running the multithreaded code on a single processor: the Cilk++ compiler links the source against the Cilk++ runtime library to produce the parallel binary, while a conventional compiler applied to the serialization produces an ordinary binary that can be checked with conventional regression tests, yielding reliable single-threaded code.

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Its serialization:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Serialization
How do we seamlessly switch between serial C++ and parallel Cilk++ programs? Add this to the beginning of your program:

    #ifdef CILKPAR
    #include <cilk.h>
    #else
    #define cilk_for for
    #define cilk_main main
    #define cilk_spawn
    #define cilk_sync
    #endif

Then compile:

    cilk++ -DCILKPAR -O2 -o parallel.exe main.cpp
    g++ -O2 -o serial.exe main.cpp

Parallel Correctness
Parallel correctness can be debugged and verified with the Cilkscreen race detector, which guarantees to find inconsistencies with the serial code. Running the Cilk++ binary under Cilkscreen, together with parallel regression tests, yields reliable multi-threaded code.
[The original slide shows the same toolflow as before (Cilk++ source → Cilk++ compiler → linker → binary), with Cilkscreen applied to the binary.]

Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.
Example:

    int x = 0;                        // A
    cilk_for (int i=0; i<2; ++i) {
        x++;                          // B and C: the two parallel increments
    }
    assert(x == 2);                   // D

In the dependency graph, A precedes the two parallel increments B and C, which both precede D; B and C race on x.

Race Bugs
The race becomes visible once each x++ is expanded into its load, increment, and store instructions:

    1  x = 0;
    2  r1 = x;        4  r2 = x;
    3  r1++;          5  r2++;
    7  x = r1;        6  x = r2;
    8  assert(x == 2);

If the two columns interleave in the numbered order, both loads read 0 before either store executes, one increment is lost, x ends up 1, and the assertion fails.

Types of Races
Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

    A        B        Race type
    read     read     none
    read     write    read race
    write    read     read race
    write    write    write race

Two sections of code are independent if they have no determinacy races between them.

Avoiding Races
All the iterations of a cilk_for should be independent.
Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
Example:

    cilk_spawn qsort(begin, middle);
    qsort(max(begin + 1, middle), end);
    cilk_sync;

Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs (illustrated below).
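A tiny illustration of that note (foo is a hypothetical function): because the argument is read in the parent before the child is spawned, the parent's later write does not race with the argument evaluation.

    int k = 0;
    cilk_spawn foo(k);   // k is evaluated here, in the parent, before the spawn;
                         // foo receives 0
    k = 1;               // no race: this write happens after the argument was read
    cilk_sync;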

Cilk++ Reducers
Hyperobjects: reducers, holders, splitters. Primarily designed as a solution to global variables, but they have broader applications.

    int result = 0;
    cilk_for (size_t i = 0; i < N; ++i) {
        result += MyFunc(i);            // data race!
    }

    #include <reducer_opadd.h>
    …
    cilk::hyperobject<cilk::reducer_opadd<int> > result;
    cilk_for (size_t i = 0; i < N; ++i) {
        result() += MyFunc(i);          // race free!
    }

This uses one of the predefined reducers, but you can also write your own reducer easily.

Hyperobjects under the covers
A reducer hyperobject<T> includes an associative binary operator ⊗ and an identity element.
The Cilk++ runtime system gives each thread a private view of the global variable.
When threads synchronize, their private views are combined with ⊗ (a runtime-agnostic sketch of this idea follows).
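The sketch below is runtime-agnostic and does not use the real Cilk++ hyperobject API: the SumMonoid type and the per-"worker" vector of views are illustrative stand-ins for what the runtime manages internally. Each view starts at the identity, updates go to a private view, and the views are folded together with the associative operator at a sync point.

    #include <iostream>
    #include <vector>

    struct SumMonoid {
        using View = long;
        static View identity() { return 0; }                   // identity element
        static View combine(View a, View b) { return a + b; }  // associative ⊗
    };

    int main() {
        const int workers = 4;
        // one private view per (hypothetical) worker, each starting at the identity
        std::vector<SumMonoid::View> views(workers, SumMonoid::identity());
        for (int i = 0; i < 1000; ++i)
            views[i % workers] = SumMonoid::combine(views[i % workers], i);
        // at a sync point, the runtime folds the private views back together
        SumMonoid::View total = SumMonoid::identity();
        for (SumMonoid::View v : views) total = SumMonoid::combine(total, v);
        std::cout << total << "\n";   // 499500, the same result as a serial sum
        return 0;
    }

Because ⊗ is associative and each view starts at the identity, the folded result matches the serial execution no matter how the iterations were partitioned among workers.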

Cilkscreen
Cilkscreen runs off the binary executable:
Compile your program with -fcilkscreen.
Go to the directory with your executable and say cilkscreen your_program [options].
Cilkscreen prints info about any races it detects.
Cilkscreen guarantees to report a race if there exists a parallel execution that could produce results different from the serial execution. It runs about 20 times slower than single-threaded real time.

Parallelism
Because the Span Law dictates that TP ≥ T∞, the maximum possible speedup given T1 and T∞ is T1/T∞ = parallelism = the average amount of work per step along the span.

Three Tips on Parallelism
Minimize span to maximize parallelism. Try to generate 10 times more parallelism than processors for near-perfect linear speedup.
If you have plenty of parallelism, try to trade some of it off for reduced work overheads.
Use divide-and-conquer recursion or parallel loops rather than spawning one small thing off after another.
Do this:

    cilk_for (int i=0; i<n; ++i) {
        foo(i);
    }

Not this:

    for (int i=0; i<n; ++i) {
        cilk_spawn foo(i);
    }
    cilk_sync;

Three Tips on Overheads
Make sure that work/#spawns is not too small. Coarsen by using function calls and inlining near the leaves of recursion rather than spawning.
Parallelize outer loops if you can, not inner loops (otherwise you'll have high burdened parallelism, which includes runtime and scheduling overhead). If you must parallelize an inner loop, coarsen it, but not too much: 500 iterations should be plenty coarse for even the most meager loop, and fewer iterations should suffice for "fatter" loops (a coarsening sketch follows).
Use reducers only in sufficiently fat loops.
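One hedged way to coarsen a loop by hand (GRAIN and foo are illustrative names, and 512 is just an example value in the spirit of the guidance above): parallelize over chunks and run each chunk serially, so the work per parallel iteration stays large relative to the scheduling overhead.

    const int GRAIN = 512;                   // illustrative grain size
    cilk_for (int c = 0; c < n; c += GRAIN) {
        int hi = (c + GRAIN < n) ? c + GRAIN : n;
        for (int i = c; i < hi; ++i)         // serial inner chunk
            foo(i);                          // hypothetical per-element work
    }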