1
CILK: An Efficient Multithreaded Runtime System
2
People
• Project at MIT & now at UT Austin
  – Bobby Blumofe (now UT Austin, Akamai)
  – Chris Joerg
  – Brad Kuszmaul (now Yale)
  – Charles Leiserson (MIT, Akamai)
  – Keith Randall (Bell Labs)
  – Yuli Zhou (Bell Labs)
3
Outline
• Introduction
• Programming environment
• The work-stealing thread scheduler
• Performance of applications
• Modeling performance
• Proven properties
• Conclusions
4
Introduction
• Why multithreading? To implement dynamic, asynchronous, concurrent programs.
• The Cilk programmer optimizes:
  – total work
  – critical path
• A Cilk computation is viewed as a dynamic, directed acyclic graph (dag).
5
Introduction...
6
• A Cilk program is a set of procedures.
• A procedure is a sequence of threads.
• Cilk threads are:
  – represented by nodes in the dag
  – non-blocking: they run to completion (no waiting or suspension) and are atomic units of execution
7
Introduction...
• Threads can spawn child threads.
  – Downward edges connect a parent to its children.
• A child & parent can run concurrently.
  – Because threads are non-blocking, a child cannot return a value to its parent.
  – The parent spawns a successor that receives values from its children.
8
Introduction...
• A thread & its successor are parts of the same Cilk procedure.
  – They are connected by horizontal arcs.
• Children's returned values are received before their successor begins:
  – They constitute data dependencies.
  – They are connected by curved arcs.
9
Introduction...
10
Introduction: Execution Time
• The execution time of a Cilk program on P processors depends on:
  – Work (T_1): the time for the Cilk program to complete on 1 processor.
  – Critical path (T_∞): the time to execute the longest directed path in the dag.
  – T_P ≥ T_1 / P (not true for some searches)
  – T_P ≥ T_∞
11
Introduction: Scheduling
• Cilk uses run-time scheduling called work stealing.
• It works well on dynamic, asynchronous, MIMD-style programs.
• For "fully strict" programs, Cilk achieves asymptotic optimality for space, time, & communication.
12
Introduction: Language
• Cilk is an extension of C.
• Cilk programs are:
  – preprocessed to C
  – linked with a runtime library
13
Programming Environment
• Declaring a thread:
    thread T (arg-decls) { stmts }
• T is preprocessed into a C function of 1 argument and return type void.
• The 1 argument is a pointer to a closure.
14
Environment: Closure
• A closure is a data structure that has:
  – a pointer to the C function for T
  – a slot for each argument (inputs & continuations)
  – a join counter: a count of the missing argument values
• A closure is ready when its join counter == 0.
• A closure is waiting otherwise.
• Closures are allocated from a runtime heap.
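The closure layout above can be sketched in plain C. This is a hypothetical layout for illustration (field and function names are invented; the real runtime records more bookkeeping):

```c
#include <stdlib.h>

#define MAX_ARGS 4  /* arbitrary bound for this sketch */

/* A closure: the thread's compiled C function, one slot per
   argument, and a join counter of still-missing arguments. */
typedef struct Closure {
    void (*fn)(struct Closure *);  /* C function for thread T */
    int args[MAX_ARGS];            /* argument slots */
    int join_counter;              /* missing argument values */
} Closure;

/* Allocated from a runtime heap, as the slide says. */
Closure *closure_alloc(void (*fn)(Closure *), int missing) {
    Closure *c = calloc(1, sizeof *c);
    c->fn = fn;
    c->join_counter = missing;
    return c;
}

/* Ready when no arguments are missing; waiting otherwise. */
int closure_is_ready(const Closure *c) {
    return c->join_counter == 0;
}
```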
15
Environment: Continuation
• A Cilk continuation is a data type, declared with the keyword cont:
    cont int x;
• It is a global reference to an empty slot of a closure.
• It is implemented as 2 items:
  – a pointer to the closure (what thread)
  – an int value: the slot number (what input)
16
Environment: Closure
17
Environment: spawn
• To spawn a child, a thread creates the child's closure:
    spawn T (args)
  – creates the child's closure
  – sets the available arguments
  – sets the join counter
• To specify a missing argument, prefix it with a "?":
    spawn T (k, ?x);
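The three steps above can be sketched as one C helper. This is only an illustration of the bookkeeping, not the real preprocessor output; the MISSING sentinel stands in for a "?" argument, and all names are hypothetical:

```c
#include <stdlib.h>

#define MAX_ARGS 4
#define MISSING  (-1)  /* stand-in for a '?' argument in this sketch */

typedef struct {
    void *fn;            /* C function for the spawned thread */
    int args[MAX_ARGS];  /* available argument values */
    int join_counter;    /* number of '?' (missing) arguments */
} Closure;

/* spawn T(a0, a1, ...): build the child's closure, copy the
   available arguments, and count the missing ones. */
Closure *spawn_sketch(void *fn, const int *argv, int argc) {
    Closure *c = calloc(1, sizeof *c);
    c->fn = fn;
    c->join_counter = 0;
    for (int i = 0; i < argc; i++) {
        if (argv[i] == MISSING)
            c->join_counter++;   /* filled later by send_argument */
        else
            c->args[i] = argv[i];
    }
    return c;
}
```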
18
Environment: spawn_next
• A successor thread is spawned the same way as a child, except the keyword spawn_next is used:
    spawn_next T (k, ?x)
• Children typically have no missing arguments; successors do.
19
Explicit Continuation Passing
• Because threads are non-blocking, a parent cannot block on its children's results.
• Instead, it spawns a successor thread.
• This communication paradigm is called explicit continuation passing.
• Cilk provides a primitive to send a value from one closure to another.
20
send_argument
• Cilk provides the primitive
    send_argument( k, value )
  which sends value to the argument slot of a waiting closure specified by continuation k.
[Diagram: a parent spawns a child and spawn_nexts a successor; the child's send_argument fills the successor's waiting slot.]
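Under the closure model above, send_argument can be sketched as a slot write plus a join-counter decrement. A minimal sketch with hypothetical names (the real primitive also handles closures on remote processors):

```c
#define MAX_ARGS 4

typedef struct {
    int args[MAX_ARGS];  /* argument slots */
    int join_counter;    /* still-missing arguments */
} Closure;

/* A continuation: which waiting closure, and which slot. */
typedef struct {
    Closure *closure;
    int slot;
} Cont;

/* send_argument(k, value): fill the slot named by k and report
   whether the closure just became ready (join counter hit 0). */
int send_argument(Cont k, int value) {
    k.closure->args[k.slot] = value;
    return --k.closure->join_counter == 0;
}
```

In the fib example that follows, the sum successor waits with join counter 2 until both children have sent their results this way.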
21
Cilk procedure for computing a Fibonacci number

thread fib ( cont int k, int n )
{ if ( n < 2 )
    send_argument( k, n );
  else
  { cont int x, y;
    spawn_next sum ( k, ?x, ?y );
    spawn fib ( x, n - 1 );
    spawn fib ( y, n - 2 );
  }
}

thread sum ( cont int k, int x, int y )
{ send_argument ( k, x + y );
}
22
Nonblocking Threads: Advantages
• Shallow call stack. (for us: fault tolerance)
• Simplified runtime system: completed threads leave the C runtime stack empty.
• Portable runtime implementation.
23
Nonblocking Threads: Disadvantages
• Burdens the programmer with explicit continuation passing.
24
Work-Stealing Scheduler
• The concept of work stealing goes at least as far back as 1981.
• Work stealing:
  – A process with no work selects a victim from which to get work.
  – It gets the shallowest thread in the victim's spawn tree.
• In Cilk, thieves choose victims randomly.
25
Thread Level
26
Stealing Work: The Ready Deque
• Each closure has a level:
  – level( child ) = level( parent ) + 1
  – level( successor ) = level( parent )
• Each processor maintains a ready deque:
  – It contains ready closures.
  – The Lth element contains the list of all ready closures whose level is L.
27
Ready deque

if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from the readyDeque of a randomly selected victim
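The two-sided policy above can be sketched with a toy per-level deque. Closures are plain ints here and the names are invented; real Cilk keeps a list of ready closures at each spawn-tree level:

```c
#define LEVELS 8
#define WIDTH  16

/* Ready deque: one stack of ready 'closures' per spawn-tree level. */
typedef struct {
    int items[LEVELS][WIDTH];
    int count[LEVELS];
} Deque;

void push_at_level(Deque *d, int level, int closure) {
    d->items[level][d->count[level]++] = closure;
}

/* Local work: take from the deepest non-empty level. */
int pop_deepest(Deque *d) {
    for (int l = LEVELS - 1; l >= 0; l--)
        if (d->count[l] > 0)
            return d->items[l][--d->count[l]];
    return -1;  /* deque empty: time to steal */
}

/* A thief takes from the victim's shallowest non-empty level. */
int steal_shallowest(Deque *victim) {
    for (int l = 0; l < LEVELS; l++)
        if (victim->count[l] > 0)
            return victim->items[l][--victim->count[l]];
    return -1;  /* victim had no work */
}
```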
28
Why Steal the Shallowest Closure?
• Shallow threads probably produce more work and therefore reduce communication.
• Shallow threads are more likely to be on the critical path.
29
Readying a Remote Closure
• If a send_argument makes a remote closure ready, the closure is put on the sending processor's readyDeque.
  – This costs extra communication.
  – It is done to make the scheduler provably good.
  – Putting it on the local readyDeque works well in practice.
30
Performance of Applications
• T_serial = the time for the C program
• T_1 = the time for the 1-processor Cilk program
• T_serial / T_1 = the efficiency of the Cilk program
  – Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.
31
Performance of Applications
• T_1 / T_P = speedup
• T_1 / T_∞ = average parallelism
• If the average parallelism is large, then speedup is nearly perfect.
• If the average parallelism is small, then speedup is much smaller.
32
Performance Data
33
Performance of Applications
• Application speedup = efficiency × speedup
  = ( T_serial / T_1 ) × ( T_1 / T_P ) = T_serial / T_P
34
Modeling Performance
• T_P ≥ max( T_∞, T_1 / P )
• A good scheduler should come close to these lower bounds.
35
Modeling Performance
Empirical data suggests that for Cilk:
  T_P ≈ c_1 · T_1 / P + c_∞ · T_∞, where c_1 ≈ 1.067 & c_∞ ≈ 1.042.
If T_1 / T_∞ > 10·P, then the critical path does not affect T_P.
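Plugging the empirical constants into the model gives a quick predictor. A small sketch (function names are invented; c_1 and c_∞ are the values quoted above):

```c
/* Empirical model: T_P ≈ c1*T1/P + cinf*Tinf,
   with c1 ≈ 1.067 and cinf ≈ 1.042 measured for Cilk. */
double predicted_tp(double t1, double tinf, double p) {
    const double c1 = 1.067, cinf = 1.042;
    return c1 * t1 / p + cinf * tinf;
}

/* Rule of thumb from the slide: when average parallelism
   T1/Tinf exceeds 10*P, the critical-path term is negligible. */
int critical_path_negligible(double t1, double tinf, double p) {
    return t1 / tinf > 10.0 * p;
}
```

For example, with T_1 = 1000, T_∞ = 1, and P = 8, the predicted T_P is about 134.4, barely above the work-based lower bound T_1 / P = 125.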
36
Proven Property: Time
• Time: including overhead, T_P = O( T_1 / P + T_∞ ), which is asymptotically optimal.
37
Conclusions
• When a program is fully strict, we can predict its performance by observing machine-independent characteristics:
  – work
  – critical path
• Cilk's usefulness is unclear for other kinds of programs (e.g., iterative programs).
38
Conclusions...
• Explicit continuation passing is a nuisance.
• It was subsequently removed (with more clever preprocessing).
39
Conclusions...
• Great systems research has a theoretical underpinning.
• Such research identifies important properties:
  – of the systems themselves, or
  – of our ability to reason about them formally.
• Cilk identified 3 significant system properties:
  – fully strict programs
  – non-blocking threads
  – randomly choosing a victim
40
END
41
The Cost of Spawns
• A spawn is about an order of magnitude more costly than a C function call.
• Spawned threads running on the parent's processor can be implemented more efficiently than remote spawns.
  – This is usually the case.
• Compiler techniques can exploit this distinction.
42
Communication Efficiency
• A request is an attempt to steal work (the victim may not have work).
• Requests/processor & steals/processor both grow as the critical path grows.
43
Proven Properties: Space
• A fully strict program's threads send arguments only to their parent's successors.
• For such programs, space, time, & communication bounds are proven.
• Space: S_P ≤ S_1 · P.
  – There exists a P-processor execution for which this is asymptotically optimal.
44
Proven Properties: Communication
• Communication: the expected number of bits communicated in a P-processor execution is
  O( T_∞ · P · S_max ), where S_max denotes the largest closure.
• There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c · T_∞ · P · S_max, for some constant c.