Tristan Aubrey-Jones Bernd Fischer Synthesizing MPI Implementations from Functional Data-Parallel Programs
People need to program distributed memory architectures: GPUs Future many-core architectures? Server farms/compute clusters Big Data/HPC; MapReduce/HPF; Ethernet/Infiniband Graphics/GPGPU CUDA/OpenCL Memory levels: thread-local, block-shared, device-global. Our focus: Programming shared- nothing clusters
Our Aim To automatically synthesize distributed memory implementations (C++/MPI). We want this not just for arrays (i.e. other collection types & disk backed collections). Many distributions possible Single threadedParallel shared-memoryDistributed memory
Our Approach We define Flocc, a data-parallel DSL, where parallelism is expressed via combinator functions. We develop a code generator that searches for well performing implementations, and generates C++/MPI implementations. We use extended types to carry data distribution information, and type inference to infer data distribution information.
FLOCC A data-parallel DSL (Functional language on compute clusters)
Flocc syntax Polyadic lambda calculus, standard call-by-value semantics.
Matrix multiplication in Flocc let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in … eqJoin :: ((k ₁,v ₁ ) → x, (k ₂,v ₂ ) → x, Map k ₁ v ₁, Map k ₂ v ₂ ) → Map (k ₂,k ₂ ) (v ₂,v ₂ ) SELECT * FROM A JOIN B ON A.j = B.i; Data-parallelism via combinator functions (abstracts away from individual element access)
Matrix multiplication in Flocc let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in let R2 = map (id, \(_,(av,bv)) -> mul (av,bv), R1) in … eqJoin :: ((k ₁,v ₁ ) → x, (k ₂,v ₂ ) → x, Map k ₁ v ₁, Map k ₂ v ₂ ) → Map (k ₂,k ₂ ) (v ₂,v ₂ ) map :: (k ₁ → k ₂, (k ₁,v ₁ ) →v ₂, Map k ₁ v ₁ ) → Map k ₂ v ₂ SELECT A.i as i, B.j as j, A.v * B.v as v FROM A JOIN B ON A.j = B.i;
Matrix multiplication in Flocc let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in let R2 = map (id, \(_,(av,bv)) -> mul (av,bv), R1) in groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), snd, add, R2) eqJoin :: ((k ₁,v ₁ ) → x, (k ₂,v ₂ ) → x, Map k ₁ v ₁, Map k ₂ v ₂ ) → Map (k ₂,k ₂ ) (v ₂,v ₂ ) map :: (k ₁ → k ₂, (k ₁,v ₁ ) →v ₂, Map k ₁ v ₁ ) → Map k ₂ v ₂ groupReduce :: ((k ₁,v ₁ ) →k ₂, (k ₁,v ₁ ) →v ₂, (v ₂,v ₂ ) →v ₂, Map k ₁ v ₁ ) → Map k ₂ v ₂ SELECT A.i as i, B.j as j, sum (A.v * B.v) as v FROM A JOIN B ON A.j = B.i GROUP BY A.i, B.j;
DISTRIBUTION TYPES Distributed data layout (DDL) types
let R = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), \(_,(av,bv)) -> mul (av,bv), add, R) map folded into groupReduce Deriving distribution plans eqJoin1 / eqJoin2 / eqJoin3 groupReduce1 / groupReduce 2 Different distributed implementations of each combinator. Enumerates different choices of combinator implementations. Each combinator implementation has a DDL type. Use type inference to check if a set of choices is sound, insert redistributions to make sound, and infer data distributions. Have backend template for each combinator implementation. Hidden from user
A Distributed Map type The DMap type extends the basic Map k v type, to symbolically describe how the Map should be distributed on the cluster. Key and value types t1 and t2. Partition function f: a function taking key-value pairs to node coordinates i.e. f : (t1,t2) N Partition dimension identifier d1: specifies which nodes the coordinates map onto. Mirror dimension identifier d2: specifies which nodes to mirror the partitions over. Also includes local layout mode and key, omitted for brevity. Also works for Arrays (DArr) and Lists (DList).
Some example distributions 012 0A0A0 A1A1 A2A ((x,y),z)fst.fst (…)HashPart ((1,0),1.0)11A1A1 ((1,1),2.3)11A1A1 ((2,1),1.5)22A2A2 ((3,1),3.5)30A0A0 ((3,2),1.0)30A0A0 ((4,0),3.1)41A1A1 ((4,4),0.5)41A1A1 ((5,1),0.0)52A2A2 ((6,2),5.7)60A0A0 ((6,3),2.1)60A0A0 2D node topology dim0 dim1
Some example distributions 012 0A0A0 A1A1 A2A2 1A0A0 A1A1 A2A2 2A0A0 A1A1 A2A2 3A0A0 A1A1 A2A2 ((x,y),z)fst.fst (…)HashPart ((1,0),1.0)11A1A1 ((1,1),2.3)11A1A1 ((2,1),1.5)22A2A2 ((3,1),3.5)30A0A0 ((3,2),1.0)30A0A0 ((4,0),3.1)41A1A1 ((4,4),0.5)41A1A1 ((5,1),0.0)52A2A2 ((6,2),5.7)60A0A0 ((6,3),2.1)60A0A0 2D node topology dim0 dim1
Some example distributions 012 0A0A0 A1A1 A2A2 1A0A0 A1A1 A2A2 2A0A0 A1A1 A2A2 3A0A0 A1A1 A2A2 ((x,y),z)snd (…)HashPart ((1,0),1.0)1.01A1A1 ((1,1),2.3)2.30A0A0 ((2,1),1.5)1.52A2A2 ((3,1),3.5)3.51A1A1 ((3,2),1.0)1.01A1A1 ((4,0),3.1)3.11A1A1 ((4,4),0.5)0.51A1A1 ((5,1),0.0)0.00A0A0 ((6,2),5.7)5.70A0A0 ((6,3),2.1)2.10A0A0 2D node topology dim0 dim1
Some example distributions 012 0A (0,0) A (1,0) A (2,0) 1A (0,1) A (1,1) A (2,1) 2A (0,2) A (1,2) A (2,2) 3A (0,3) A (1,3) A (2,3) ((x,y),z)fstHashPart ((1,0),1.0)(1,0) A (1,0) ((1,1),2.3)(1,1) A (1,1) ((2,1),1.5)(2,1) A (2,1) ((3,1),3.5)(3,1)(0,1)A (0,1) ((3,2),1.0)(3,2)(0,2)A (0,2) ((4,0),3.1)(4,0)(1,0)A (1,0) ((4,4),0.5)(4,4)(1,0)A (1,0) ((5,1),0.0)(5,1)(2,1)A (2,1) ((6,2),5.7)(6,2)(0,2)A (0,2) ((6,3),2.1)(6,3)(0,3)A (0,3) 2D node topology dim0 dim1
Distribution types for group reduces Input collection is partitioned using any projection function f –must exchange values between nodes before reduction Node1 Node2 Node3 In OutRe-partitionLocal greduce Result is always co-located by the result Map’s keys.
Distribution types for group reduces Input is partitioned using the groupReduce’s key projection function f –result keys are already co-located Node1 Node2 Node3 In OutLocal group reduce No inter-node communication is necessary (but constrains input distribution) Result is always co-located by the result Map’s keys.
Dependent type schemes Two different binders/schemes: ∀ binds a type variable create fresh variable when instantiated for term can be replaced by any type when unifying (flexible) used to specify that a collection is distributed by any partition function/arbitrarily similar to standard type schemes Π binds a concrete term in AST lifts the argument’s AST into the types when instantiated must match concrete term when unifying (rigid) used to specify that a collection is distributed using a specific projection function (non-standard dep type syntax)
Extending Damas-Milner inference We use a constraint based implementation of Algorithm W and extend it to: –Lift parameter functions into types at function apps (see DA PP ). –Compare embedded functions (extends U NIFY ). Undecidable in general. Currently use an under-approximation which simplifies functions and tests for syntactic equality. Future work solves equations involving projection functions.
Distribution types for joins Inputs are partitioned using the join’s key emitter functions –no inter-node communication, equal keys already co-located Node1 Node2 Node3 In Out Local equi-joins No inter-node communication Constrains inputs and outputs No mirroring of inputs (Like sort-merge join) Partitioned by f1 and f2 (i.e. aligned)
EXAMPLE Deriving a distributed memory matrix-multiply
let R = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), \(_,(av,bv)) -> mul (av,bv), add, R) map folded into groupReduce Deriving a distributed matrix-multiply eqJoin1 / eqJoin2 / eqJoin3 groupReduce1 / groupReduce 2 The program again:
Distributed matrix multiplication - #1 A: Partitioned by row along d1, mirrored along d2 B: Partitioned by column along d2, mirrored along d1 C: Partitioned by (row, column) along (d1, d2) Must partition and mirror A and B at beginning of computation.
Distributed matrix multiplication - #1 A: Partitioned by row along d1, mirrored along d2 B: Partitioned by column along d2, mirrored along d1 C: Partitioned by (row, column) along (d1, d2) A common solution. eqJoin3 groupReduce2 = R (1,1) R (1,2) R (1,3) R (2,1) R (2,2) R (2,3) R (3,1) R (3,2) R (3,3) A1A1 A2A2 A3A3 B1B1 B2B2 B3B3 = C (1,1) C (1,2) C (1,3) C (2,1) C (2,2) C (2,3) C (3,1) C (3,2) C (3,3) d1 d2 (3-by-3 = 9nodes)
Distributed matrix multiplication - #2 A: Partitioned by col along d B: Partitioned by row along d (aligned with A) C: Partitioned by (row, column) along d Must exchange R during groupReduce1.
Distributed matrix multiplication - #2 A by col, B by row, so join is in-place but group must exchange values to group by key. Efficient for sparse matrices, where A and B are large, but R is small. A1A1 B1B1 A2A2 B2B2 A3A3 B3B3 RaRa RbRb RcRc R a1 R b1 R c1 ++= C1C1 R a2 R b2 R c2 ++= C2C2 R a3 R c3 ++= C3C3 eqJoin1groupReduce1 Node 1 Node 2 Node 3 = = = regrouped by col from B d (3 nodes)
IMPLEMENTATION & RESULTS Generated implementations
Prototype Implementation Uses a genetic search to explore candidate implementations. For each candidate we automatically insert redistributions to make it type check. We evaluate each candidate by code generating, compiling, and running it on some test data. Generates C++ using MPI. (~20k lines of Haskell)
Example programs Performance comparable with hand-coded versions: PLINQ comparisons run on quad-core (Intel Xeon W3520/2.67GHz) x64 desktop with 12GB memory. C++/MPI comparisons run on Iridis3&4: a 3 rd gen cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22GB of RAM, over an InfiniBand network. Speedups compared to sequential, averaged over 1,2,3,4,8,9,16,32 nodes.Iridis3&4
CONCLUSIONS So what?
Benefits Multiple distributed collections: Maps, Arrays, Lists... Can support in-memory and disk backed collections (e.g. for Big Data). Generates distributed algorithms automatically. Flexible communication Concise input language Abstracts away from complex details Limitations Currently data parallel algorithms must be expressed via combinators Data distributions are limited to those that can be expressed in the types The Pros and Cons…
Future work Improve prototype implementation. More distributed collection types and combinators. Better function equation solving during unification. Support more distributed memory architectures (GPUs). Retrofit into an existing functional language. Similar type inference for imperative languages? Suggestions?
QUESTIONS?