1 Frédéric Gava Bulk-Synchronous Parallel ML Implementation of the
Parallel Superposition

2 BSML Background
Parallel programming, from implicit to explicit: automatic parallelization, skeletons, data-parallelism, parallel extensions, concurrent programming.

3 Projects
2002-2004: ACI Grid (LIFO, LACL, PPS, INRIA). Design of parallel and Grid libraries for OCaml.
ACI « Young researchers » (LIFO, LACL). Production of a programming environment in which certified parallel programs can be written and safely executed.

4 Outline
The BSML language
Multi-programming (superposition)
Implementation of the superposition
Conclusion and future work

5 The BSML language

6 The BSML « spirit »
Bugs grow faster than Moore’s law. (G. Berry)
High-level language ⇒ fewer lines of code ⇒ fewer bugs. Certified library ⇒ fewer bugs.
Small is beautiful. (R. H. Bisseling)
BSML uses only 5 primitives…
Who would drive a non-deterministic car? (G. Berry)
The semantics of BSML is confluent.
French proverb: « All roads lead to Rome », but the best way is to take the shortest one.
BSP costs can be given to BSML programs.
Unlike concurrent programming: cost and confluence.

7 The BSP model
BSP architecture: processor/memory pairs (P/M), a network, and a unit of synchronization.
Characterized by:
p: number of processors
r: processor speed
L: cost of a global synchronization
g: cost of a communication phase in which each processor sends or receives at most 1 word

8 Model of execution
[Figure: timeline of super-steps i and i+1.]
Each super-step i starts with local computing on each processor (w_i), continues with global (collective) communications between processors (h_i g), and ends with a global synchronization (L), after which the exchanged data are available for the next super-step.
$$\mathrm{Cost}(i) = \Bigl(\max_{0 \le x < p} w_x^i\Bigr) + h_i\,g + L$$
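The cost formula of this slide can be turned into a few lines of OCaml. This is only an illustrative sketch: the names `superstep_cost` and `program_cost`, the use of floats, and the sample numbers are assumptions, not part of BSML.

```ocaml
(* Cost of one super-step: largest local work, plus h words exchanged at
   g per word, plus one synchronization barrier l. *)
let superstep_cost ~g ~l ~h (work : float array) : float =
  let w_max = Array.fold_left max 0.0 work in
  w_max +. (h *. g) +. l

(* Cost of a whole program: the sum of the costs of its super-steps,
   each given as a pair (h, local work per processor). *)
let program_cost ~g ~l (steps : (float * float array) list) : float =
  List.fold_left
    (fun acc (h, work) -> acc +. superstep_cost ~g ~l ~h work)
    0.0 steps

(* Hypothetical machine: g = 2, l = 10; three processors. *)
let example = superstep_cost ~g:2.0 ~l:10.0 ~h:3.0 [| 4.0; 7.0; 5.0 |]
```

With these sample values the slowest processor contributes 7, the communication 3 × 2 and the barrier 10.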

9 Example: broadcast
Direct broadcast (one super-step): BSP cost = p n g + L
Broadcast with 2 super-steps: BSP cost = 2 n g + 2L
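The two cost formulas make it possible to predict which broadcast is cheaper on a given machine. The sketch below simply encodes them; the function names and the machine parameters used in the example are hypothetical.

```ocaml
(* BSP cost of the direct broadcast of n words, as given on the slide. *)
let direct_cost ~p ~n ~g ~l = (p *. n *. g) +. l

(* BSP cost of the broadcast in 2 super-steps, as given on the slide. *)
let two_phase_cost ~n ~g ~l = (2.0 *. n *. g) +. (2.0 *. l)

(* The second method pays one more barrier but sends far fewer words
   per processor, so it wins once p*n*g dominates l. *)
let two_phase_wins ~p ~n ~g ~l =
  two_phase_cost ~n ~g ~l < direct_cost ~p ~n ~g ~l

(* Hypothetical settings: a large message on 16 processors,
   then a small message on 4 processors. *)
let large = two_phase_wins ~p:16.0 ~n:1000.0 ~g:2.0 ~l:5000.0
let small = two_phase_wins ~p:4.0 ~n:10.0 ~g:1.0 ~l:1000.0
```

For small messages the extra synchronization dominates and the direct method remains the better choice.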

10 The BSML language
[Diagram relating the λ-calculus, ML, the BSλ-calculus and BSML through parallel constructions and parallel primitives.]
Structured parallelism as an explicit parallel extension of ML
Functional language with BSP cost predictions
Allows the implementation of skeletons
Implemented as a parallel library for the "Objective Caml" language
Uses a parallel data structure called parallel vector

11 A BSML program
[Figure: a BSML program with a sequential part, a replicated part, and a parallel part made of parallel vectors <f0, f1, …, fp-1> and <g0, g1, …, gp-1>.]

12 Parallel primitives of BSML
Asynchronous primitives:
Creation of a vector (creation of local values): mkpar : (int → α) → α par
Parallel pointwise application: apply : (α → β) par → α par → β par
Synchronous and communication primitives:
Communications: put : (int → α) par → (int → α) par
Projection of local values (to be replicated): proj : α par → (int → α)
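To get a feel for these signatures, the primitives can be simulated sequentially, with a parallel vector represented as an array holding one value per process. This is only a sketch under that assumption: it is not the BSML library, there is no real communication or synchronization, and the fixed process count of 4 is arbitrary.

```ocaml
(* Sequential simulation: one array cell per process. *)
let bsp_p () = 4 (* assumed number of processes *)

type 'a par = 'a array

(* mkpar f builds the vector <f 0, ..., f (p-1)>. *)
let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* apply <f0, ..., fp-1> <v0, ..., vp-1> = <f0 v0, ..., fp-1 vp-1>. *)
let apply (fs : ('a -> 'b) par) (vs : 'a par) : 'b par =
  Array.init (bsp_p ()) (fun i -> fs.(i) vs.(i))

(* put: process src holds a function giving the value meant for each
   destination; afterwards process dst holds a function giving the
   value received from each source. *)
let put (msgs : (int -> 'a) par) : (int -> 'a) par =
  Array.init (bsp_p ()) (fun dst -> fun src -> msgs.(src) dst)

(* proj turns a vector into a replicated function from pid to value. *)
let proj (v : 'a par) : int -> 'a = fun i -> v.(i)

let squares = apply (mkpar (fun _ -> fun x -> x * x)) (mkpar (fun i -> i))
let received = put (mkpar (fun src -> fun dst -> (10 * src) + dst))
```

In this simulation `received.(2) 1` is the value that process 1 addressed to process 2.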

13 Semantics
Natural semantics: programming model, easy for proofs (Coq)
Small-step semantics: programming model, easy for costs
Distributed semantics: execution model, makes asynchronous steps appear, close to a real implementation

14 Natural semantics
Semantics = set of axioms and inference rules
Easy to understand, makes proofs easier
Example:

15 Small-step semantics
Semantics = set of rewriting rules
Uses contexts for the strategy
Easier understanding of costs and errors
Example: local costs, global cost
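The idea of rewriting rules applied under evaluation contexts, with a cost attached to each step, can be illustrated on a toy language. The sketch below is not the BSML semantics: the expression type, the single addition rule and the unit cost per step are assumptions chosen for brevity.

```ocaml
(* Toy expression language: integers and addition. *)
type expr =
  | Int of int
  | Add of expr * expr

(* One rewriting step. The recursive cases play the role of evaluation
   contexts: they fix the strategy (leftmost redex first). *)
let rec step (e : expr) : expr option =
  match e with
  | Int _ -> None
  | Add (Int a, Int b) -> Some (Int (a + b))
  | Add (Int a, e2) -> Option.map (fun e2' -> Add (Int a, e2')) (step e2)
  | Add (e1, e2) -> Option.map (fun e1' -> Add (e1', e2)) (step e1)

(* Rewrite to a normal form, charging one unit of cost per step. *)
let eval (e : expr) : expr * int =
  let rec loop cost e =
    match step e with
    | None -> (e, cost)
    | Some e' -> loop (cost + 1) e'
  in
  loop 0 e

let result = eval (Add (Add (Int 1, Int 2), Add (Int 3, Int 4)))
```

Because every step is explicit, the cost of an evaluation is simply the number of rules applied, here three additions.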

16 Distributed semantics
Semantics = set of parallel rewriting rules
SPMD style
[Figure: natural evaluation, where one program (Prog) acts on the parallel vector, versus distributed evaluation, where each copy of the program acts on its part of the parallel vector. The program shown is:]

    let scan op vec =
      let rec scan' fst lst op vec =
        if fst >= lst then vec
        else
          let mid = (fst + lst) / 2 in
          let vec' = mix mid (super (fun () -> scan' fst mid op vec)
                                    (fun () -> scan' (mid + 1) lst op vec)) in
          let com = ... (* send wm to processes m+1…p-1 *) in
          let op' = ... (* applies op to wm and wi, m<i<p *) in
          parfun2 op' com vec'
      in scan' 0 (bsp_p () - 1) op vec

17 Multi-programming

18 Parallel composition
Several programs on the same machine
Primitive of parallel composition: superposition
Divide-and-conquer BSP algorithms

19 Parallel Superposition
super : (unit → α) → (unit → β) → α × β
super E1 E2 = (E1 (), E2 ())
Fusion of communications/synchronisations using super-threads
Preserves the BSP model
Pure functional semantics
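From the programmer's point of view, the functional meaning of `super` is just the pairing of two delayed computations. The sketch below models only that result: it runs the two thunks one after the other and says nothing about super-threads or the fusion of communications, which is what the real primitive adds.

```ocaml
(* Result-only model of the superposition: evaluate both delayed
   computations and pair their results. The explicit lets fix the
   evaluation order, which OCaml leaves unspecified for tuples. *)
let super (e1 : unit -> 'a) (e2 : unit -> 'b) : 'a * 'b =
  let r1 = e1 () in
  let r2 = e2 () in
  (r1, r2)

(* Two unrelated computations composed into one program. *)
let pair =
  super
    (fun () -> List.fold_left ( + ) 0 [ 1; 2; 3 ])
    (fun () -> String.length "bsml")
```

Delaying both arguments with `fun () -> …` is what lets an implementation decide how to interleave them.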

20 Parallel Superposition

21 Implementation of the superposition

22 Semantics (1)
Natural semantics:
Small-step semantics:
Solution, the super-threads:

23 Semantics (2)
Management of the communications:
Management of the superposition:

24 Semantics-based implementation
The semantics brings out 3 low-level primitives:
Send, to send the data of the communication environment
Rcv, to receive them
Wait, to let a super-thread wait for its sibling
BSML primitives are thus simple calls to them (as in the small-step semantics)
Super-threads can be implemented using threads
A scheduler for these threads is thus needed for the special management of our super-threads
The communication environment is just a Hashtable with the pids of super-threads as keys
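A communication environment of this kind might look as follows in OCaml. This is a hypothetical sketch, not the actual implementation: the `message` record, the `post` and `flush` functions and the string payloads are all invented for illustration. It only shows how pending messages of several super-threads can be collected in one table and emitted together.

```ocaml
(* Hypothetical message: a destination process and some payload. *)
type message = { dest : int; payload : string }

(* Communication environment: pending messages, keyed by the pid of
   the super-thread that produced them. *)
let env : (int, message list) Hashtbl.t = Hashtbl.create 16

(* A super-thread registers a message instead of sending it at once. *)
let post ~thread_id (msg : message) : unit =
  let pending = Option.value (Hashtbl.find_opt env thread_id) ~default:[] in
  Hashtbl.replace env thread_id (msg :: pending)

(* When every super-thread is waiting, the whole environment is emptied
   in a single exchange; here it is just returned, sorted by pid. *)
let flush () : (int * message list) list =
  let all =
    Hashtbl.fold (fun id msgs acc -> (id, List.rev msgs) :: acc) env []
  in
  Hashtbl.reset env;
  List.sort compare all

let () =
  post ~thread_id:1 { dest = 0; payload = "a" };
  post ~thread_id:2 { dest = 3; payload = "b" };
  post ~thread_id:1 { dest = 2; payload = "c" }

let batch = flush ()
```

Grouping the messages by super-thread keeps the data of each one separate while still allowing a single communication phase.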

25 Example: prefix computation
scan : (α → α → α) → α par → α par
scan (+) <v0, …, vp-1> = <v0, v0+v1, …, v0+v1+…+vp-1>
scan (+) <v0, …, vm, …> = <w0, …, wm, …>
scan (+) <…, vm+1, …, vp-1> = <…, wm+1, …, wp-1>
<w0, …, wm, wm+wm+1, …, wm+wp-1> = <v0, v0+v1, v0+…+vm, v0+…+vm+1, …, v0+…+vp-1>
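The divide-and-conquer decomposition above can be checked with a sequential version on arrays. This sketch assumes one array cell per process and runs the two halves one after the other, whereas the BSML program superposes them; it is meant only to illustrate the recurrence.

```ocaml
(* Sequential divide-and-conquer prefix computation on an array. *)
let scan (op : 'a -> 'a -> 'a) (vec : 'a array) : 'a array =
  let w = Array.copy vec in
  let rec scan' fst lst =
    if fst < lst then begin
      let mid = (fst + lst) / 2 in
      (* Solve both halves independently. *)
      scan' fst mid;
      scan' (mid + 1) lst;
      (* Combine: the last prefix of the left half, wm, is applied to
         every prefix of the right half. *)
      let wm = w.(mid) in
      for i = mid + 1 to lst do
        w.(i) <- op wm w.(i)
      done
    end
  in
  scan' 0 (Array.length w - 1);
  w

let prefix_sums = scan ( + ) [| 1; 2; 3; 4; 5 |]
```

The operator is assumed to be associative, as with `( + )` here, which is what makes the combination step valid.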

26 Benchmarks
[Figure: time (s) against the size of the polynomials for three methods: direct method (BSML+MPI), D-a-C method with superposition, D-a-C method with juxtaposition.]

27 Conclusion and future work

28 Conclusion
BSML = BSP + ML
Superposition = primitive of parallel composition
Small-step semantics of the superposition
Distributed semantics in the same small-step style
Superposition implemented using threads, as in the small-step semantics

29 Future work
Implementation using continuations (source-code transformation with the help of a type checker) and proof of equivalence using our semantics
Implementation of bigger algorithms for better benchmarks of BSML and its superposition
Implementation of parallel skeletons (management of tasks) using the superposition?
BSP model-checking of high-level Petri nets (M-nets). The main difficulty: finding a non-trivial algorithm, as the concurrent programming community does. Possible, but needs more theoretical optimisations…

30 Thanks for your attention

