1
CILK: An Efficient Multithreaded Runtime System
2
People
• Project at MIT & now at UT Austin
  – Bobby Blumofe (now UT Austin, Akamai)
  – Chris Joerg
  – Brad Kuszmaul (now Yale)
  – Charles Leiserson (MIT, Akamai)
  – Keith Randall (Bell Labs)
  – Yuli Zhou (Bell Labs)
3
Outline
• Introduction
• Programming environment
• The work-stealing thread scheduler
• Performance of applications
• Modeling performance
• Proven properties
• Conclusions
4
Introduction
• Why multithreading? To implement dynamic, asynchronous, concurrent programs.
• The Cilk programmer optimizes:
  – total work
  – critical path
• A Cilk computation is viewed as a dynamic, directed acyclic graph (dag).
5
Introduction...
6
• A Cilk program is a set of procedures.
• A procedure is a sequence of threads.
• Cilk threads are:
  – represented by nodes in the dag
  – non-blocking: they run to completion (no waiting or suspension) and are atomic units of execution
7
Introduction...
• Threads can spawn child threads.
  – Downward edges connect a parent to its children.
• A child & parent can run concurrently.
  – Because threads are non-blocking, a child cannot return a value to its parent.
  – The parent spawns a successor that receives values from its children.
8
Introduction...
• A thread & its successor are parts of the same Cilk procedure.
  – They are connected by horizontal arcs.
• Children's returned values are received before their successor begins:
  – They constitute data dependencies.
  – They are connected by curved arcs.
9
Introduction...
10
Introduction: Execution Time
• The execution time of a Cilk program on P processors depends on:
  – Work (T_1): the time for the Cilk program to complete on 1 processor.
  – Critical path (T_∞): the time to execute the longest directed path in the dag.
  – T_P ≥ T_1 / P (not true for some searches)
  – T_P ≥ T_∞
11
Introduction: Scheduling
• Cilk uses run-time scheduling called work stealing.
• It works well on dynamic, asynchronous, MIMD-style programs.
• For "fully strict" programs, Cilk achieves asymptotic optimality for space, time, & communication.
12
Introduction: Language
• Cilk is an extension of C.
• Cilk programs are:
  – preprocessed to C
  – linked with a runtime library
13
Programming Environment
• Declaring a thread:
    thread T (arg-decls) { stmts }
• T is preprocessed into a C function of 1 argument and return type void.
• The 1 argument is a pointer to a closure.
14
Environment: Closure
• A closure is a data structure that has:
  – a pointer to the C function for T
  – a slot for each argument (inputs & continuations)
  – a join counter: a count of the missing argument values
• A closure is ready when its join counter == 0.
• A closure is waiting otherwise.
• Closures are allocated from a runtime heap.
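The closure layout above can be sketched in plain C. This is a hypothetical layout for illustration (field and function names are invented; the real runtime records more bookkeeping):

```c
#include <stdlib.h>

#define MAX_ARGS 4  /* arbitrary bound for this sketch */

/* A closure: the thread's compiled C function, one slot per
   argument, and a join counter of still-missing arguments. */
typedef struct Closure {
    void (*fn)(struct Closure *);  /* C function for thread T */
    int args[MAX_ARGS];            /* argument slots */
    int join_counter;              /* missing argument values */
} Closure;

/* Allocated from a runtime heap, as the slide says. */
Closure *closure_alloc(void (*fn)(Closure *), int missing) {
    Closure *c = calloc(1, sizeof *c);
    c->fn = fn;
    c->join_counter = missing;
    return c;
}

/* Ready when no arguments are missing; waiting otherwise. */
int closure_is_ready(const Closure *c) {
    return c->join_counter == 0;
}
```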
15
Environment: Continuation
• A Cilk continuation is a data type, declared with the keyword cont:
    cont int x;
• It is a global reference to an empty slot of a closure.
• It is implemented as 2 items:
  – a pointer to the closure (what thread)
  – an int value: the slot number (what input)
16
Environment: Closure
17
Environment: spawn
• To spawn a child, a thread creates the child's closure:
    spawn T (args)
  – creates the child's closure
  – sets the available arguments
  – sets the join counter
• To specify a missing argument, prefix it with a "?":
    spawn T (k, ?x);
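The three steps above can be sketched as one C helper. This is only an illustration of the bookkeeping, not the real preprocessor output; the MISSING sentinel stands in for a "?" argument, and all names are hypothetical:

```c
#include <stdlib.h>

#define MAX_ARGS 4
#define MISSING  (-1)  /* stand-in for a '?' argument in this sketch */

typedef struct {
    void *fn;            /* C function for the spawned thread */
    int args[MAX_ARGS];  /* available argument values */
    int join_counter;    /* number of '?' (missing) arguments */
} Closure;

/* spawn T(a0, a1, ...): build the child's closure, copy the
   available arguments, and count the missing ones. */
Closure *spawn_sketch(void *fn, const int *argv, int argc) {
    Closure *c = calloc(1, sizeof *c);
    c->fn = fn;
    c->join_counter = 0;
    for (int i = 0; i < argc; i++) {
        if (argv[i] == MISSING)
            c->join_counter++;   /* filled later by send_argument */
        else
            c->args[i] = argv[i];
    }
    return c;
}
```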
18
Environment: spawn_next
• A successor thread is spawned the same way as a child, except the keyword spawn_next is used:
    spawn_next T (k, ?x)
• Children typically have no missing arguments; successors do.
19
Explicit Continuation Passing
• Because threads are non-blocking, a parent cannot block on its children's results.
• Instead, it spawns a successor thread.
• This communication paradigm is called explicit continuation passing.
• Cilk provides a primitive to send a value from one closure to another.
20
send_argument
• Cilk provides the primitive
    send_argument( k, value )
  which sends value to the argument slot of a waiting closure specified by continuation k.
[Diagram: a parent spawns a child and spawn_nexts a successor; the child's send_argument fills the successor's waiting slot.]
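Under the closure model above, send_argument can be sketched as a slot write plus a join-counter decrement. A minimal sketch with hypothetical names (the real primitive also handles closures on remote processors):

```c
#define MAX_ARGS 4

typedef struct {
    int args[MAX_ARGS];  /* argument slots */
    int join_counter;    /* still-missing arguments */
} Closure;

/* A continuation: which waiting closure, and which slot. */
typedef struct {
    Closure *closure;
    int slot;
} Cont;

/* send_argument(k, value): fill the slot named by k and report
   whether the closure just became ready (join counter hit 0). */
int send_argument(Cont k, int value) {
    k.closure->args[k.slot] = value;
    return --k.closure->join_counter == 0;
}
```

In the fib example that follows, the sum successor waits with join counter 2 until both children have sent their results this way.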
21
Cilk procedure for computing a Fibonacci number

thread fib ( cont int k, int n )
{ if ( n < 2 )
    send_argument( k, n );
  else
  { cont int x, y;
    spawn_next sum ( k, ?x, ?y );
    spawn fib ( x, n - 1 );
    spawn fib ( y, n - 2 );
  }
}

thread sum ( cont int k, int x, int y )
{ send_argument ( k, x + y );
}
22
Nonblocking Threads: Advantages
• Shallow call stack. (for us: fault tolerance)
• Simplified runtime system: completed threads leave the C runtime stack empty.
• Portable runtime implementation.
23
Nonblocking Threads: Disadvantages
• Burdens the programmer with explicit continuation passing.
24
Work-Stealing Scheduler
• The concept of work stealing goes at least as far back as 1981.
• Work stealing:
  – A process with no work selects a victim from which to get work.
  – It gets the shallowest thread in the victim's spawn tree.
• In Cilk, thieves choose victims randomly.
25
Thread Level
26
Stealing Work: The Ready Deque
• Each closure has a level:
  – level( child ) = level( parent ) + 1
  – level( successor ) = level( parent )
• Each processor maintains a ready deque:
  – It contains ready closures.
  – The Lth element contains the list of all ready closures whose level is L.
27
Ready deque

if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from the readyDeque of a randomly selected victim
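The two-sided policy above can be sketched with a toy per-level deque. Closures are plain ints here and the names are invented; real Cilk keeps a list of ready closures at each spawn-tree level:

```c
#define LEVELS 8
#define WIDTH  16

/* Ready deque: one stack of ready 'closures' per spawn-tree level. */
typedef struct {
    int items[LEVELS][WIDTH];
    int count[LEVELS];
} Deque;

void push_at_level(Deque *d, int level, int closure) {
    d->items[level][d->count[level]++] = closure;
}

/* Local work: take from the deepest non-empty level. */
int pop_deepest(Deque *d) {
    for (int l = LEVELS - 1; l >= 0; l--)
        if (d->count[l] > 0)
            return d->items[l][--d->count[l]];
    return -1;  /* deque empty: time to steal */
}

/* A thief takes from the victim's shallowest non-empty level. */
int steal_shallowest(Deque *victim) {
    for (int l = 0; l < LEVELS; l++)
        if (victim->count[l] > 0)
            return victim->items[l][--victim->count[l]];
    return -1;  /* victim had no work */
}
```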
28
Why Steal the Shallowest Closure?
• Shallow threads probably produce more work and therefore reduce communication.
• Shallow threads are more likely to be on the critical path.
29
Readying a Remote Closure
• If a send_argument makes a remote closure ready, the closure is put on the sending processor's readyDeque.
  – This costs extra communication.
  – It is done to make the scheduler provably good.
  – Putting it on the local readyDeque works well in practice.
30
Performance of Applications
• T_serial = the time for the C program
• T_1 = the time for the 1-processor Cilk program
• T_serial / T_1 = the efficiency of the Cilk program
  – Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.
31
Performance of Applications
• T_1 / T_P = speedup
• T_1 / T_∞ = average parallelism
• If the average parallelism is large, then speedup is nearly perfect.
• If the average parallelism is small, then speedup is much smaller.
32
Performance Data
33
Performance of Applications
• Application speedup = efficiency × speedup
  = ( T_serial / T_1 ) × ( T_1 / T_P ) = T_serial / T_P
34
Modeling Performance
• T_P ≥ max( T_∞, T_1 / P )
• A good scheduler should come close to these lower bounds.
35
Modeling Performance
Empirical data suggests that for Cilk:
  T_P ≈ c_1 · T_1 / P + c_∞ · T_∞, where c_1 ≈ 1.067 & c_∞ ≈ 1.042.
If T_1 / T_∞ > 10·P, then the critical path does not affect T_P.
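Plugging the empirical constants into the model gives a quick predictor. A small sketch (function names are invented; c_1 and c_∞ are the values quoted above):

```c
/* Empirical model: T_P ≈ c1*T1/P + cinf*Tinf,
   with c1 ≈ 1.067 and cinf ≈ 1.042 measured for Cilk. */
double predicted_tp(double t1, double tinf, double p) {
    const double c1 = 1.067, cinf = 1.042;
    return c1 * t1 / p + cinf * tinf;
}

/* Rule of thumb from the slide: when average parallelism
   T1/Tinf exceeds 10*P, the critical-path term is negligible. */
int critical_path_negligible(double t1, double tinf, double p) {
    return t1 / tinf > 10.0 * p;
}
```

For example, with T_1 = 1000, T_∞ = 1, and P = 8, the predicted T_P is about 134.4, barely above the work-based lower bound T_1 / P = 125.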
36
Proven Property: Time
• Time: including overhead, T_P = O( T_1 / P + T_∞ ), which is asymptotically optimal.
37
Conclusions
• When a program is fully strict, we can predict its performance by observing machine-independent characteristics:
  – work
  – critical path
• Cilk's usefulness is unclear for other kinds of programs (e.g., iterative programs).
38
Conclusions...
• Explicit continuation passing is a nuisance.
• It was subsequently removed (with more clever preprocessing).
39
Conclusions...
• Great systems research has a theoretical underpinning.
• Such research identifies important properties:
  – of the systems themselves, or
  – of our ability to reason about them formally.
• Cilk identified 3 significant system properties:
  – fully strict programs
  – non-blocking threads
  – randomly choosing a victim
40
END
41
The Cost of Spawns
• A spawn is about an order of magnitude more costly than a C function call.
• Spawned threads running on the parent's processor can be implemented more efficiently than remote spawns.
  – This is usually the case.
• Compiler techniques can exploit this distinction.
42
Communication Efficiency
• A request is an attempt to steal work (the victim may not have work).
• Requests/processor & steals/processor both grow as the critical path grows.
43
Proven Properties: Space
• A fully strict program's threads send arguments only to their parent's successors.
• For such programs, space, time, & communication bounds are proven.
• Space: S_P ≤ S_1 · P.
  – There exists a P-processor execution for which this is asymptotically optimal.
44
Proven Properties: Communication
• Communication: the expected number of bits communicated in a P-processor execution is
  O( T_∞ · P · S_max ), where S_max denotes the largest closure.
• There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c · T_∞ · P · S_max, for some constant c.