
CILK: An Efficient Multithreaded Runtime System

People
- Project at MIT & now at UT Austin
  - Bobby Blumofe (now UT Austin, Akamai)
  - Chris Joerg
  - Brad Kuszmaul (now Yale)
  - Charles Leiserson (MIT, Akamai)
  - Keith Randall (Bell Labs)
  - Yuli Zhou (Bell Labs)

Outline
- Introduction
- Programming environment
- The work-stealing thread scheduler
- Performance of applications
- Modeling performance
- Proven properties
- Conclusions

Introduction
- Why multithreading? To implement dynamic, asynchronous, concurrent programs.
- The Cilk programmer optimizes:
  - total work
  - critical path
- A Cilk computation is viewed as a dynamic, directed acyclic graph (dag).

Introduction...

- A Cilk program is a set of procedures.
- A procedure is a sequence of threads.
- Cilk threads are:
  - represented by nodes in the dag
  - non-blocking: they run to completion, with no waiting or suspension; they are atomic units of execution

Introduction...
- Threads can spawn child threads.
  - Downward edges connect a parent to its children.
- A child & parent can run concurrently.
  - Non-blocking threads ⇒ a child cannot return a value to its parent.
  - The parent spawns a successor thread that receives values from its children.

Introduction...
- A thread & its successor are parts of the same Cilk procedure.
  - They are connected by horizontal arcs.
- Children's returned values are received before their successor begins:
  - They constitute data dependencies.
  - They are connected by curved arcs.

Introduction...

Introduction: Execution Time
- Execution time of a Cilk program using P processors depends on:
  - Work (T1): the time for the Cilk program to complete on 1 processor.
  - Critical path (T∞): the time to execute the longest directed path in the dag.
  - TP >= T1 / P (not true for some searches)
  - TP >= T∞
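
These two lower bounds can be checked with a few lines of C. The timing numbers in the usage note are invented purely for illustration:

```c
#include <assert.h>

/* Lower bound on P-processor execution time: T_P >= max(T_inf, T_1/P).
   t1 = total work, tinf = critical-path length, p = processor count. */
double lower_bound(double t1, double tinf, int p)
{
    double work_bound = t1 / p;   /* the T_1 / P term */
    return work_bound > tinf ? work_bound : tinf;
}
```

With a hypothetical T1 = 1000 and T∞ = 10, the bound is 125 at P = 8 (work-limited) but stays at 10 for P = 200 (critical-path-limited): more processors cannot beat the critical path.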

Introduction: Scheduling
- Cilk uses a run-time scheduling technique called work stealing.
- It works well on dynamic, asynchronous, MIMD-style programs.
- For "fully strict" programs, Cilk achieves asymptotic optimality for space, time, & communication.

Introduction: Language
- Cilk is an extension of C.
- Cilk programs are:
  - preprocessed to C
  - linked with a runtime library

Programming Environment
- Declaring a thread:
  thread T ( args ) { body }
- T is preprocessed into a C function of 1 argument and return type void.
- That 1 argument is a pointer to a closure.

Environment: Closure
- A closure is a data structure that has:
  - a pointer to the C function for T
  - a slot for each argument (inputs & continuations)
  - a join counter: a count of the missing argument values
- A closure is ready when its join counter == 0; otherwise it is waiting.
- Closures are allocated from a runtime heap.

Environment: Continuation
- A Cilk continuation is a data type, denoted by the keyword cont:
  cont int x;
- It is a global reference to an empty slot of a closure.
- It is implemented as 2 items:
  - a pointer to the closure (what thread)
  - an int value: the slot number (what input)
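
The closure/continuation machinery on these two slides can be sketched in plain C. This is an illustrative model only, not the actual Cilk runtime; all names and field layouts here are made up:

```c
#include <assert.h>

#define MAX_ARGS 4

/* Hypothetical sketch of a Cilk-1 closure: a pointer to the thread's C
   function, argument slots, and a join counter of missing arguments. */
struct closure {
    void (*thread_fn)(struct closure *); /* the C function for thread T */
    int  args[MAX_ARGS];                 /* argument slots */
    int  join_counter;                   /* # of missing argument values */
};

/* A continuation names an empty slot: which closure, which slot. */
struct cont {
    struct closure *clo;  /* what thread */
    int             slot; /* what input */
};

/* send_argument fills the slot and decrements the join counter. */
void send_argument(struct cont k, int value)
{
    k.clo->args[k.slot] = value;
    k.clo->join_counter--;
}

/* A closure is ready when its join counter reaches 0. */
int is_ready(const struct closure *c)
{
    return c->join_counter == 0;
}
```

For example, a sum closure waiting on two children would start with join_counter = 2 and become ready only after both children's send_argument calls arrive.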

Environment: Closure

Environment: spawn
- To spawn a child, a thread creates the child's closure:
  spawn T ( args )
  - creates the child's closure
  - sets the available arguments
  - sets the join counter
- To specify a missing argument, prefix it with a "?":
  spawn T (k, ?x);

Environment: spawn_next
- A successor thread is spawned the same way as a child, except that the keyword spawn_next is used:
  spawn_next T (k, ?x)
- Children typically have no missing arguments; successors do.

Explicit continuation passing
- Non-blocking threads ⇒ a parent cannot block on its children's results.
- Instead, it spawns a successor thread.
- This communication paradigm is called explicit continuation passing.
- Cilk provides a primitive to send a value from one closure to another.

send_argument
- Cilk provides the primitive
  send_argument( k, value )
  which sends value to the argument slot of a waiting closure specified by continuation k.

(figure: a parent spawns a child and spawn_next's a successor; the child passes its result to the successor via send_argument)

Cilk procedure for computing a Fibonacci number:

thread fib ( cont int k, int n )
{
    if ( n < 2 )
        send_argument( k, n );
    else {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}

thread sum ( cont int k, int x, int y )
{
    send_argument ( k, x + y );
}
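
For contrast, here is the same computation in ordinary direct-style C. The runtime stack does implicitly what the sum successor and send_argument do explicitly in the Cilk version:

```c
/* Direct-style Fibonacci: the caller simply waits for its children's
   results -- exactly what a non-blocking Cilk thread cannot do. */
int fib(int n)
{
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);  /* "+" plays the role of sum */
}
```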

Nonblocking Threads: Advantages
- Shallow call stack. (for us: fault tolerance)
- Simplified runtime system: completed threads leave the C runtime stack empty.
- Portable runtime implementation.

Nonblocking Threads: Disadvantages
- Burdens the programmer with explicit continuation passing.

Work-Stealing Scheduler
- The concept of work-stealing goes back at least to the early 1980s.
- Work-stealing:
  - a process with no work selects a victim from which to get work.
  - it gets the shallowest thread in the victim's spawn tree.
- In Cilk, thieves choose victims randomly.

Thread Level

Stealing Work: The Ready Deque
- Each closure has a level:
  - level( child ) = level( parent ) + 1
  - level( successor ) = level( parent )
- Each processor maintains a ready deque:
  - It contains ready closures.
  - The Lth element contains the list of all ready closures whose level is L.

Ready deque

if ( ! readyDeque.isEmpty() )
    take deepest thread from own readyDeque
else
    steal shallowest thread from the readyDeque of a randomly selected victim
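
A minimal sketch of this leveled-deque policy in C. Integers stand in for closures, and the structure and names are invented for illustration, not taken from the Cilk runtime:

```c
#include <assert.h>

#define MAX_LEVEL     16
#define MAX_PER_LEVEL  8

/* Illustrative leveled ready deque: count[L] closures at level L. */
struct ready_deque {
    int items[MAX_LEVEL][MAX_PER_LEVEL];
    int count[MAX_LEVEL];
};

void push_closure(struct ready_deque *d, int level, int closure_id)
{
    d->items[level][d->count[level]++] = closure_id;
}

/* Local work: take from the deepest nonempty level. */
int take_deepest(struct ready_deque *d)
{
    for (int L = MAX_LEVEL - 1; L >= 0; L--)
        if (d->count[L] > 0)
            return d->items[L][--d->count[L]];
    return -1; /* empty: time to steal */
}

/* A thief steals from the shallowest nonempty level of its victim. */
int steal_shallowest(struct ready_deque *victim)
{
    for (int L = 0; L < MAX_LEVEL; L++)
        if (victim->count[L] > 0)
            return victim->items[L][--victim->count[L]];
    return -1; /* victim has no work either */
}
```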

Why Steal the Shallowest Closure?
- Shallow threads probably produce more work, and therefore reduce communication.
- Shallow threads are more likely to be on the critical path.

Readying a Remote Closure
- If a send_argument makes a remote closure ready, the closure is put on the sending processor's readyDeque.
  - ⇒ extra communication.
  - This is done to make the scheduler provably good.
  - Putting it on the local readyDeque works well in practice.

Performance of Applications
- Tserial = time for the C program
- T1 = time for the 1-processor Cilk program
- Tserial / T1 = efficiency of the Cilk program
  - Efficiency is close to 1 for programs with moderately long threads: the Cilk overhead is small.

Performance of Applications
- T1 / TP = speedup
- T1 / T∞ = average parallelism
- If the average parallelism is large, then speedup is nearly perfect.
- If the average parallelism is small, then speedup is much smaller.

Performance Data

Performance of Applications
- Application speedup = efficiency × speedup
  = ( Tserial / T1 ) × ( T1 / TP ) = Tserial / TP

Modeling Performance
- TP >= max( T∞, T1 / P )
- A good scheduler should come close to these lower bounds.

Modeling Performance
- Empirical data suggests that for Cilk:
  TP ≈ c1 · T1 / P + c∞ · T∞,
  where c1 and c∞ are small constants determined empirically.
- If T1 / T∞ > 10·P, then the critical path does not affect TP.
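
The model is easy to evaluate as code. Since the slide leaves c1 and c∞ unspecified, the constants used in the usage note below are placeholders, not the paper's measured values:

```c
#include <assert.h>

/* Empirical model: T_P ~= c1*(T1/P) + cinf*Tinf.
   c1 and cinf are fitted constants supplied by the caller. */
double predicted_tp(double t1, double tinf, int p, double c1, double cinf)
{
    return c1 * (t1 / p) + cinf * tinf;
}
```

For instance, with placeholder constants c1 = 1 and c∞ = 1.5, work T1 = 1000, and critical path T∞ = 1, the model predicts TP ≈ 126.5 at P = 8; the work term dominates, matching the slide's T1/T∞ > 10P rule of thumb.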

Proven Property: Time
- Including scheduling overhead, TP = O( T1/P + T∞ ), which is asymptotically optimal.

Conclusions
- We can predict the performance of a Cilk program by observing machine-independent characteristics:
  - work
  - critical path
  when the program is fully strict.
- Cilk's usefulness is unclear for other kinds of programs (e.g., iterative programs).

Conclusions...
- Explicit continuation passing is a nuisance.
- It was subsequently removed (with more clever preprocessing).

Conclusions...
- Great systems research has a theoretical underpinning.
- Such research identifies important properties:
  - of the systems themselves, or
  - of our ability to reason about them formally.
- Cilk identified 3 significant system properties:
  - fully strict programs
  - non-blocking threads
  - randomly choosing a victim

END

The Cost of Spawns
- A spawn is about an order of magnitude more costly than a C function call.
- Spawned threads running on the parent's processor can be implemented more efficiently than remote spawns.
  - This usually is the case.
- Compiler techniques can exploit this distinction.

Communication Efficiency
- A request is an attempt to steal work (the victim may not have work).
- Requests/processor & steals/processor both grow as the critical path grows.

Proven Properties: Space
- A fully strict program's threads send arguments only to its parent's successors.
- For such programs, space, time, & communication bounds are proven.
- Space: SP <= S1 · P.
  - There exists a P-processor execution for which this bound is asymptotically optimal.

Proven Properties: Communication
- The expected # of bits communicated in a P-processor execution is O( T∞ · P · Smax ), where Smax denotes the size of the program's largest closure.
- There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c · T∞ · P · Smax, for some constant c.