Communicating Runtimes in CnC
Zoran Budimlić (Rice University), Kath Knobe (Rice University), Frank Schlimbach (Intel Corporation)
How it all started
[Figure: two CnC graphs, G1.cnc running on runtime RT1 and G2.cnc running on runtime RT2, with a question mark where the connection between them should be]
What it evolved into
- Software engineering: reuse, encapsulation, hierarchical (de)composition
- Heterogeneous execution: different runtimes are good at doing different things
- Distributed implementation: free!
- Incorporating specialized, optimized (possibly non-CnC) components into CnC
- Graph optimizations
Unified CnC
[Figure: existing implementations feeding into Unified CnC: iCnC, CnC-Babel, CnC-OCR, CnC-Scala, CnC-HC, CnC-Qthreads, CnC-HJ, CnC-Haskell]
Unified CnC-OCR and iCnC
- Generalize the CnC-OCR framework
- Increase graph specification expressiveness
- Generate iCnC scaffolding
- Remove OCR abstractions from the API
Why unification?
- Debugging
- Hierarchy
- Performance
- Portability
- Features
- Heterogeneity
- Communication
Goals
- Enable composition of large CnC applications from smaller components
- Allow A, B, and G to be specified in separate .cnc files
- Specify how A, B, and G are connected in separate .comm files
- Respect encapsulation: A and B shouldn’t know anything about G, and G shouldn’t know anything about the implementations of A and B
[Figure: outer graph {G} (G.cnc) containing inner components {A} (A.cnc) and {B} (B.cnc)]
How the inner graph sees its I/O collections
- {IG} behaves like an ordinary CnC graph
- As far as {IG} understands, the environment produces parts of [X] and consumes parts of [Y]
- Both [X] and [Y] use collection data structures that {IG} understands
[Figure: {IG} with input collection [X] and output collection [Y]]
How the outer graph sees IG’s I/O collections
- G incorporates a graph-like inner node called {IG}
- As far as G is concerned, {IG} consumes [A] and produces [B]
- Both [A] and [B] use collection data structures that G understands
[Figure: within G, node {IG} consuming [A] and producing [B]]
These are specified in a separate .comm file
- One part of the communication layer “converts” [A] to [X]
- Another part of the communication layer “converts” [Y] to [B]
[Figure: outer collections [A] and [B] bridged to inner collections [X] and [Y] by the communication layer]
Scoping
- Every component introduces its own scope
- Creating “instances” of components is simple:
    { A @ A.cnc : AtoG.comm }
    { B @ A.cnc : BtoG.comm }
  Two instances of the same graph defined in A.cnc, which can potentially be executed by different runtimes
- The .comm spec defines exactly how the collections inside the A and B components are connected to the collections in the outer graph
- Collections in the outer scope can be produced/consumed by a component:
    {A} <- [X] -> [Y] -> <Z>
.comm specification
- No $initialize and $finalize functions for the components; that is now the role of the outer graph
- “Mirroring” of item and control collections:
    [A] == [X]
  [A] and [X] are just two names for the same collection
    [A:t1] == [X:t2] $when (f(t1,t2))
  [A] is a view on [X] that only “sees” the items that satisfy the condition f(t1,t2)
- Enable the “prescribing with data” capability:
    (onPut_B:tag) <::- [B:tag]
  Prescribe a step (onPut_B:tag) for every item that appears in [B]
Example

G.cnc:
    {IG @ MM.cnc : IGtoG.comm }
    {IG} <- [A] <- [B] -> [C]
    (P) -> [A]
    (Q) -> [B]
    (S) <- [C]

IGtoG.comm:
    [A:i,j] $when(i<j) == [X:i,j]
    (onPut_B:i,j) <::- [B:i,j]
    (onPut_Z) <::- [Z]

[Figure: in G.cnc, step (P) does A.put(…), step (Q) does B.put(…), and step (S) gets from [C]; the inner graph MM.cnc contains collections [X], [Y], and [Z], wrapped as component {IG}]

Compiler-generated stubs:
    G_onPut_B(tag, value) {
        if (tag.i < tag.j) IG.send("put", "B", tag, value);
    }
    G_IG_onPut_B(tag, value) {
        Y.put(tag, value);
    }
    G_IG_onPut_IG_Z(tag, value) {
        if (isPrime(value)) G.send("put", "Z", tag, value);
    }
    G_onPut_IG_Z(tag, value) {
        C.put(tag, value);
    }
Possible APIs
Each runtime needs to implement simple “onPut” APIs. For the mirroring [A] == [X]:

“Inner” graph IG:
- “What to do when a message is received from the outer graph G that a put into collection A with tag T and value V has happened”
  Execute G_IG_onPut_A(T, V); default: X.put(T, V)
- “What to communicate to the outer graph G when a put into collection X with tag T and value V has happened”
  Execute G_IG_onPut_X(T, V); default: G.send(“put”, “X”, T, V)

“Outer” graph G:
- “What to do when I want to put into a collection A”
  Execute G_onPut_A(T, V); default: IG.send(“put”, “A”, T, V)
- “What to do when a message is received from the inner graph that a put into collection X has happened”
  Execute G_onPut_X(T, V); default: A.put(T, V)

A sketch of these defaults in code follows.
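As a concrete illustration, here is a minimal C++ sketch of the four default hooks for [A] == [X]. The Tag, Value, Collection, and Channel types, the send signature, and the global variables are hypothetical stand-ins for whatever the host runtimes actually provide.

    #include <string>

    struct Tag   { int i, j; };              // hypothetical tag type
    struct Value { long payload; };          // hypothetical value type

    struct Collection {                      // a local collection in some runtime
        void put(const Tag&, const Value&) { /* runtime-specific put */ }
    };

    struct Channel {                         // a link to the other runtime
        void send(const std::string& op, const std::string& coll,
                  const Tag&, const Value&) { /* marshal and transmit */ }
    };

    Collection A, X;                         // [A] lives in G, [X] lives in IG
    Channel toIG, toG;

    // Outer graph G: a local put into [A] is forwarded to the inner graph.
    void G_onPut_A(const Tag& t, const Value& v)    { toIG.send("put", "A", t, v); }

    // Outer graph G: a message that [X] was written lands in the local [A].
    void G_onPut_X(const Tag& t, const Value& v)    { A.put(t, v); }

    // Inner graph IG: a message that [A] was written lands in the local [X].
    void G_IG_onPut_A(const Tag& t, const Value& v) { X.put(t, v); }

    // Inner graph IG: a local put into [X] is forwarded to the outer graph.
    void G_IG_onPut_X(const Tag& t, const Value& v) { toG.send("put", "X", t, v); }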
Default APIs
- Default APIs are simple to implement
- Compiler-generated or configured at startup:
    [A:i,j] == [X:i,j] $when (i < j)
  e.g., S is only interested in the upper triangular part of [A]
  “What to communicate to the outer graph G when a put into collection X with tag <i,j> and value V has happened”:
    if (i < j) G.send(“put”, “X”, <i,j>, V)
- User defined (potentially data dependent):
    if (isPrime(V)) G.send(“put”, “X”, T, V)
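A sketch of how the generated $when filter and a user-defined, data-dependent filter might look on the inner graph’s outgoing hook, reusing the hypothetical Tag, Value, and Channel definitions from the sketch above.

    // Generated from: [A:i,j] == [X:i,j] $when (i < j)
    void G_IG_onPut_X(const Tag& t, const Value& v) {
        if (t.i < t.j)                       // only the upper triangular part
            toG.send("put", "X", t, v);
    }

    // A user-defined, data-dependent variant: forward only prime values.
    static bool isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    void G_IG_onPut_X_userDefined(const Tag& t, const Value& v) {
        if (isPrime(v.payload))
            toG.send("put", "X", t, v);
    }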
Default APIs (continued)
- “What to do when a message is received from the inner graph that a put into collection X has happened”
  Default: A.put(tag, value)
  - An explicit collection A within the outer graph, with all the hidden semantics of what a “put” means in the runtime that’s executing the outer graph
  - It can be something smarter/faster depending on how A is used
- “What to do when I want to put into a collection A”
  Default: IG.send(“put”, “A”, T, V)
  - If A is not used within the outer graph, there is no need for an explicit collection
Connecting two components

G.cnc:
    {A @ G1.cnc : AtoG.comm }
    {B @ G2.cnc : BtoG.comm }
    {A} -> [C]
    {B} <- [C]

AtoG.comm:
    [C] == [X]

BtoG.comm:
    [C] == [Y]

[Figure: {A} (G1.cnc, producing [X]) and {B} (G2.cnc, consuming [Y]) connected through [C] in G]
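When A, B, and G each run in a different runtime, the two .comm specs compose into a forwarding chain: a put into [X] inside A hops through [C] in G and lands in [Y] inside B. A sketch, with hypothetical hook and variable names following the prefixing scheme above and the types from the earlier sketches:

    Collection C, Y;                          // [C] lives in G, [Y] lives in B
    Channel toB;                              // plus toG from the earlier sketch

    // Inside A's runtime: a put into [X] is reported to G.
    void G_A_onPut_X(const Tag& t, const Value& v) { toG.send("put", "X", t, v); }

    // Inside G's runtime: the put lands in [C], whose own hook forwards it to B.
    void G_onPut_X(const Tag& t, const Value& v)   { C.put(t, v); }
    void G_onPut_C(const Tag& t, const Value& v)   { toB.send("put", "C", t, v); }

    // Inside B's runtime: the put finally lands in the local [Y].
    void G_B_onPut_C(const Tag& t, const Value& v) { Y.put(t, v); }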
Where does a collection “live”?
- If RT(G) = RT(A) = RT(B):
  Only one physical copy; it “lives” in RT(G,A,B). All onPut functions are null.
- If RT(G) ≠ RT(A) ≠ RT(B):
  Three copies of the same collection: X lives in RT(A), C lives in RT(G), Y lives in RT(B). Some can be lightweight or implicit; C may be just forwarding the onPut calls (a sketch follows).
- If RT(G) = RT(A) ≠ RT(B):
  Two copies. [C & X] lives in RT(G,A), Y lives in RT(B).
- If RT(A) = RT(B) ≠ RT(G):
  Two copies. [X & Y] lives in RT(A,B), C lives in RT(G). C may be lightweight, implicit, or nonexistent.
[Figure: {G} containing {A} and {B}, with collections X, Y, and C]
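A sketch of the “C may be just forwarding” case: if no step in G actually reads [C], the generated hook can relay A’s puts straight to B without materializing a local collection (hypothetical names, as above).

    // Inside G's runtime: relay A's puts directly to B; [C] stays implicit.
    void G_onPut_X(const Tag& t, const Value& v) {
        toB.send("put", "C", t, v);           // no C.put(t, v), no local storage
    }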
What about control?
- “Bring back” control collections!
- Explicit control collections allow us to treat them the same as item collections; otherwise, we’d have to “piggyback” on prescriptions from other components
- (S) == (Q)
  When a step in (S) is prescribed, so is a step in (Q), with the same tag. But what about custom onPut functions?
- <T> == <U>
  Much more intuitive, and treated in exactly the same way as item collections; custom onPut functions are implemented as for item collections (a sketch follows)
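Since a control collection carries only tags, mirroring <T> == <U> reduces to the same hook machinery with a tag-only put. A minimal sketch with hypothetical types, following the earlier sketches:

    struct ControlCollection {
        void put(const Tag& t) { /* prescribe all steps controlled by this collection */ }
    };
    ControlCollection U;                      // <U> lives in the inner graph

    // Outer graph G: a tag put into <T> is forwarded to the inner graph.
    void G_onPut_T(const Tag& t)    { toIG.send("put", "T", t, Value{}); }

    // Inner graph IG: the received tag lands in <U>, prescribing its steps.
    void G_IG_onPut_T(const Tag& t) { U.put(t); }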
Compilation
- Start from the top of the hierarchy tree
- Calculate the prefix for each component (Cholesky.TU, for example)
  - Object-oriented notation in C++- and Java-based implementations
  - Underscores in C-based implementations (see the naming sketch below)
- Parse the .comm file for the component
- Generate the default onPut functions
- Generate the stubs for the custom onPut functions
- Recursively compile the component
- Each set of files (user step implementations, user onPut implementations, generated runtime files, generated onPut functions) is compiled by the appropriate compiler
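A sketch of the two prefixing conventions for a component TU nested inside Cholesky, for a hook on collection [A]; the names are illustrative.

    // C-based implementations: underscores build a flat function name.
    void Cholesky_TU_onPut_A(const Tag& t, const Value& v);

    // C++/Java-based implementations: object-oriented scoping.
    namespace Cholesky { namespace TU {
        void onPut_A(const Tag& t, const Value& v);
    }}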
Runtime extensions
- Need a “daemon” in each runtime to monitor all the events that need to be communicated to other runtimes
  - In CnC-OCR, that’s the communication worker
  - In iCnC, that may be a dedicated thread
  - In CnC-HJ, that would be the communication worker
- Extend the runtime to always call the appropriate onPut function (if present) when a put happens (see the sketch below)
- Data serialization
  - Standard C++ serialization for C-based runtimes
  - Need to define a standard set of data types for the more heterogeneous cases (Java, Python, Haskell)
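A self-contained sketch of the put-path extension: the collection’s put stores the item as usual and then invokes a registered onPut hook, if any. The types here are hypothetical; each real runtime would hook its own put path instead.

    #include <functional>
    #include <map>
    #include <utility>

    struct Tag { int i, j; };
    using Value = double;
    using OnPut = std::function<void(const Tag&, const Value&)>;

    struct ItemCollection {
        std::map<std::pair<int, int>, Value> items;
        OnPut onPut;                          // empty if no hook is registered

        void put(const Tag& t, const Value& v) {
            items[std::make_pair(t.i, t.j)] = v;  // the runtime's normal put
            if (onPut) onPut(t, v);               // notify the communication layer
        }
    };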
Communication between runtimes
- In general, all runtimes need to use the same communication layer (MPI, GASNet, …) and run in separate processes
- Runtimes that use the same platform (CnC-OCR, iCnC, and CnC-HC; or Java, Scala, and Jython) may be able to use shared-memory access and avoid communication
  - On a case-by-case basis
Optimizations
- Components of the same type on the same node can be executed by the same runtime
  - No need for communication
- Components on the same node can use the communication layer’s optimizations
  - e.g., MPI shared-memory communication
Optimizations (continued)
[Figure: example deployment with OCR, Intel CnC, and two Java runtimes (Java1, Java2) on different nodes, connected via MPI]
Conclusions
- Unified CnC offers a unique opportunity for a standard way to create CnC programs regardless of the platform
- Treating both control and data as tuples allows simple connection of different runtimes running different CnC programs
  - A fundamental capability for enabling hierarchical and distributed execution
Future work:
- Implement the communication layer on the Rice and Intel CnC runtimes
- Evaluate the heterogeneous approach, with different runtimes better optimized for different platforms
- Include highly optimized non-CnC inner graphs with CnC interfaces
Backup slides
Unified CnC translator process
[Figure: CnC graph specification (*.cnc) → graph parser (pyparsing) → internal AST → template engine (jinja2) → CnC skeleton project]