Frédéric Gava Bulk-Synchronous Parallel ML Semantics and Implementation of the Parallel Juxtaposition
Background. Parallel programming, from implicit to explicit: automatic parallelization, skeletons, data-parallelism, parallel extensions (BSML), concurrent programming.
Projects. 2002-2004, ACI Grid (LIFO, LACL, PPS, INRIA): design of parallel and Grid libraries for OCaml. 2004-2007, ACI « Young researchers » (LIFO, LACL): production of a programming environment in which certified parallel programs can be written and safely executed.
Outline. The BSML language. Parallel compositions. Superposition: types and semantics. Juxtaposition: types and semantics. Implementation of the juxtaposition. Conclusion and future works.
The BSML language
The BSP model. BSP architecture: processor/memory pairs (P/M), a network, and a unit of synchronization. Characterized by: p, the number of processors; r, the processor speed; L, the cost of a global synchronization; g, the cost of a phase of communication in which at most 1 word is sent or received by each processor.
Model of execution: $T(s) = \left(\max_{0 \le i < p} w_i\right) + h \cdot g + L$
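The cost of one superstep (the largest local work, plus h times g, plus L) can be sketched in OCaml as follows. This is a minimal illustration, not part of BSML: the record `bsp_params` and the function `superstep_cost` are hypothetical names, and `work.(i)` is assumed to hold the local computation time of processor i.

```ocaml
(* Hypothetical helper, not a BSML function: cost of one BSP superstep,
   T(s) = max_i w_i + h * g + L. *)
type bsp_params = {
  p : int;    (* number of processors *)
  g : float;  (* cost of a communication phase with 1 word per processor *)
  l : float;  (* cost of a global synchronization *)
}

let superstep_cost params ~work ~h =
  (* work.(i) is the local computation time w_i of processor i *)
  assert (Array.length work = params.p);
  let w_max = Array.fold_left max 0.0 work in
  w_max +. float_of_int h *. params.g +. params.l

let () =
  let params = { p = 4; g = 2.0; l = 10.0 } in
  let t = superstep_cost params ~work:[| 3.0; 7.0; 5.0; 1.0 |] ~h:6 in
  Printf.printf "T(s) = %.1f\n" t
```

The slowest processor alone determines the computation term, which is why balancing the w_i matters.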
Example: broadcast. Direct broadcast: cost = $p \cdot n \cdot g + L$. Broadcast with 2 phases: cost = $2 \cdot n \cdot g + 2 \cdot L$.
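The two broadcast cost formulas can be compared numerically to see which variant the model predicts to be cheaper. A minimal sketch: the function names are hypothetical, and n is assumed to be the size of the broadcast value in words.

```ocaml
(* Hypothetical helpers: the two cost formulas of the slide. *)
let direct_cost ~p ~n ~g ~l = float_of_int (p * n) *. g +. l

let two_phase_cost ~n ~g ~l = 2.0 *. float_of_int n *. g +. 2.0 *. l

(* Pick the variant with the lower predicted cost. *)
let best_broadcast ~p ~n ~g ~l =
  if direct_cost ~p ~n ~g ~l <= two_phase_cost ~n ~g ~l
  then `Direct
  else `Two_phase

let () =
  let show = function `Direct -> "direct" | `Two_phase -> "two phases" in
  Printf.printf "p=16, n=1000: %s\n"
    (show (best_broadcast ~p:16 ~n:1000 ~g:1.0 ~l:100.0));
  Printf.printf "p=2, n=1: %s\n"
    (show (best_broadcast ~p:2 ~n:1 ~g:1.0 ~l:100.0))
```

Under these formulas the two-phase version pays one extra synchronization, so it only wins once the message is large enough relative to L.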
The BSML language. λ-calculus plus parallel constructions gives the BSλ-calculus; ML plus parallel primitives gives BSML. Structured parallelism as an explicit parallel extension of ML. Functional language with BSP cost predictions. Allows the implementation of skeletons. Implemented as a parallel library for the "Objective Caml" language. Uses a parallel data structure called parallel vector.
A BSML program (diagram): parallel vectors <f0, f1, …, fp-1> and <g0, g1, …, gp-1>, with the labels: replicated part, parallel part, sequential part.
Parallel primitives of BSML. Asynchronous primitives: creation of a vector, mkpar : (int → α) → α par; parallel point-wise application, apply : (α → β) par → α par → β par. Synchronous and communication primitives: communications, put : (int → α option) par → (int → α option) par; projection of values, proj : α option par → (int → α option).
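To make the role of the four primitives concrete, here is a purely sequential model of them. It is a sketch under stated assumptions, not the BSML library: a parallel vector is modelled as an array with one cell per processor, `p` is fixed at 4, and `bcast_direct` is a hypothetical example built on top of the model.

```ocaml
(* Sequential model, NOT the real BSML library. Assumption: a parallel
   vector is an array with one cell per processor, and p is fixed. *)
let p = 4

type 'a par = 'a array

(* mkpar f builds the vector <f 0, ..., f (p-1)>. *)
let mkpar (f : int -> 'a) : 'a par = Array.init p f

(* apply <f0, ...> <v0, ...> = <f0 v0, ...>: no communication. *)
let apply (fs : ('a -> 'b) par) (vs : 'a par) : 'b par =
  Array.init p (fun i -> fs.(i) vs.(i))

(* put: processor i gives a function telling what it sends to each j;
   processor j gets a function telling what it received from each i. *)
let put (sends : (int -> 'a option) par) : (int -> 'a option) par =
  let table = Array.init p (fun i -> Array.init p (fun j -> sends.(i) j)) in
  Array.init p (fun j -> fun i ->
    if 0 <= i && i < p then table.(i).(j) else None)

(* proj turns a vector of optional values into a replicated function. *)
let proj (v : 'a option par) : int -> 'a option =
  fun i -> if 0 <= i && i < p then v.(i) else None

(* Hypothetical example: direct broadcast of the value held by root. *)
let bcast_direct (root : int) (v : 'a par) : 'a par =
  let sends =
    apply
      (mkpar (fun i x ->
         if i = root then (fun _ -> Some x) else (fun _ -> None)))
      v
  in
  let received = put sends in
  apply
    (mkpar (fun _ from ->
       match from root with
       | Some x -> x
       | None -> failwith "bcast_direct: nothing received from root"))
    received
```

For instance, `bcast_direct 1 (mkpar (fun i -> i * 10))` yields a vector holding the value of processor 1 everywhere. In this model `put` is the only place where data crosses processor boundaries, matching the split between asynchronous and synchronous primitives.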
Semantics. Programming model: natural semantics (easy for proofs, Coq) and small-steps semantics (easy for costs). Execution model: distributed semantics (makes asynchronous steps appear, close to a real implementation).
Parallel compositions
Multi-programming: several programs on the same machine. New primitives of parallel composition: superposition, and juxtaposition (implemented with the superposition). Application: divide-and-conquer BSP algorithms.
Parallel Superposition. super : (unit → α) → (unit → β) → α × β. super E1 E2 = (E1 (), E2 ()). Fusion of communications/synchronisations using super-threads. Keeps the BSP model. Pure functional semantics.
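From the point of view of results, the superposition behaves like building a pair from two delayed computations. The sketch below models only that functional behaviour in plain OCaml; it assumes nothing about super-threads, and the fusion of communications and synchronisations is deliberately not modelled.

```ocaml
(* Sequential model of super: only the result is modelled, not the
   interleaving of the two computations in super-threads. *)
let super (e1 : unit -> 'a) (e2 : unit -> 'b) : 'a * 'b =
  let a = e1 () in
  let b = e2 () in
  (a, b)

let () =
  let sum, len =
    super
      (fun () -> List.fold_left (+) 0 [1; 2; 3])
      (fun () -> String.length "bsml")
  in
  Printf.printf "%d %d\n" sum len
```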
Parallel Superposition
Parallel juxtaposition. juxta : int → (unit → α par) → (unit → α par) → α par. If the first computation yields <v0, …, vm-1> on the first m processors and the second yields <v'0, …, v'p-1-m> on the remaining p-m processors, then juxta m gives <v0, …, vm-1, v'0, …, v'p-1-m>. Fusion of communications/synchronisations on each sub-machine. Keeps the BSP model. Side-effect on the number of processors.
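The result of a juxtaposition, and its side-effect on the number of processors, can be modelled sequentially. This is a sketch with assumed names: vectors are arrays, `nprocs` is a hypothetical reference holding the size of the current (sub-)machine, and `bsp_p ()` reads it.

```ocaml
(* Sequential model, not the BSML implementation. *)
let nprocs = ref 4                 (* size of the current machine *)
let bsp_p () = !nprocs

type 'a par = 'a array

let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* juxta m e1 e2: e1 sees a machine of m processors, e2 a machine of
   p - m processors; the two results are concatenated. *)
let juxta m (e1 : unit -> 'a par) (e2 : unit -> 'a par) : 'a par =
  let p = bsp_p () in
  if m < 0 || m > p then invalid_arg "juxta: m out of range";
  let run size e =
    nprocs := size;                (* the side-effect on bsp_p *)
    let v = e () in
    nprocs := p;                   (* back to the enclosing machine *)
    v
  in
  let v1 = run m e1 in
  let v2 = run (p - m) e2 in
  Array.append v1 v2

let () =
  let v =
    juxta 3 (fun () -> mkpar (fun i -> i)) (fun () -> mkpar (fun i -> 10 + i))
  in
  Array.iter (Printf.printf "%d ") v;
  print_newline ()
```

Inside each argument, `bsp_p ()` returns the size of the sub-machine, so a program written for p processors runs unchanged on a part of the machine.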
Parallel juxtaposition (diagram): E3 = (juxta 3 E1 E2), with E1, E2 and E3 each shown with their communication and synchronisation phases.
Distributed semantics. Semantics = set of parallel rewriting rules. SPMD style: parallel vector → parts of the parallel vector. Diagram: natural evaluation and distributed evaluation of a program Prog; the distributed evaluation is confluent and equivalent to the natural one. Example program:

    let scan op vec =
      let rec scan' fst lst op vec =
        if fst>=lst then vec
        else
          let mid=(fst+lst)/2 in
          let vec'= mix mid (super (fun()->scan' fst mid op vec)
                                   (fun()->scan'(mid+1) lst op vec)) in
          let com = ...(* send wm to processes m+1…p+1 *)
          let op' = ...(* applies op to wm and wi, m<i<p *) in
          parfun2 op' com vec'
      in scan' 0 (bsp_p()-1) op vec
Implementation of the juxtaposition
Use of the superposition. Two references hold the number of processors of the current sub-machine and the real PID of virtual processor 0 of that sub-machine. Creation of incomplete vectors. Each sub-machine runs in a super-thread.
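The scheme described here (two references, incomplete vectors, one computation per sub-machine combined by the superposition) can be sketched sequentially as follows. Everything in this block is an assumption made for illustration: `total_p`, `sub_size`, `sub_first` and `merge` are hypothetical names, an incomplete vector is modelled with `None` outside the sub-machine, and `super` is a sequential stand-in. With real super-threads the two references would have to be managed per thread.

```ocaml
(* Sequential sketch, not the actual BSML code. *)
let total_p = 4
let sub_size = ref total_p   (* processors of the current sub-machine *)
let sub_first = ref 0        (* real pid of its virtual processor 0 *)
let bsp_p () = !sub_size

(* Incomplete vector: None on processors outside the sub-machine. *)
type 'a par = 'a option array

let mkpar (f : int -> 'a) : 'a par =
  Array.init total_p (fun pid ->
    let vpid = pid - !sub_first in
    if 0 <= vpid && vpid < !sub_size then Some (f vpid) else None)

(* Sequential stand-in for the superposition. *)
let super e1 e2 =
  let a = e1 () in
  let b = e2 () in
  (a, b)

(* Complete one incomplete vector with the cells of the other. *)
let merge (v1 : 'a par) (v2 : 'a par) : 'a par =
  Array.init total_p (fun pid ->
    match v1.(pid) with Some _ as x -> x | None -> v2.(pid))

let juxta m (e1 : unit -> 'a par) (e2 : unit -> 'a par) : 'a par =
  let size = !sub_size and first = !sub_first in
  if m < 0 || m > size then invalid_arg "juxta: m out of range";
  let on_sub s f e () =
    sub_size := s; sub_first := f;          (* enter the sub-machine *)
    let v = e () in
    sub_size := size; sub_first := first;   (* leave it *)
    v
  in
  let v1, v2 =
    super (on_sub m first e1) (on_sub (size - m) (first + m) e2)
  in
  merge v1 v2
```

Because each sub-machine restores the two references on exit, juxtapositions nest: an inner `juxta` splits only the processors of the sub-machine it runs on.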
Example, parallel prefixes. scan : (α → α → α) → α par → α par. scan (+) <v0, …, vp-1> = <v0, v0+v1, …, v0+v1+…+vp-1>. Diagram: values a to h spread over the processors and combined pairwise with op (op a b, op c d, op e f, op g h).
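The divide-and-conquer structure of the parallel prefix can be checked on plain arrays. In this sketch, which is a sequential model and not the BSML program, each half of the array plays the role of a sub-machine, and propagating the last value of the left half stands for the communication step; `scan_direct` and `scan_dac` are hypothetical names.

```ocaml
(* Reference behaviour: <v0, v0 op v1, ..., v0 op ... op v(p-1)>. *)
let scan_direct op v =
  let out = Array.copy v in
  for i = 1 to Array.length v - 1 do
    out.(i) <- op out.(i - 1) v.(i)
  done;
  out

(* Divide and conquer: scan both halves independently, then combine
   the last result of the left half with every cell of the right half. *)
let rec scan_dac op v =
  let n = Array.length v in
  if n <= 1 then Array.copy v
  else begin
    let m = n / 2 in
    let left = scan_dac op (Array.sub v 0 m) in
    let right = scan_dac op (Array.sub v m (n - m)) in
    let w = left.(m - 1) in
    Array.append left (Array.map (fun x -> op w x) right)
  end

let () =
  let v = [| 1; 2; 3; 4; 5 |] in
  assert (scan_dac (+) v = scan_direct (+) v);
  Array.iter (Printf.printf "%d ") (scan_dac (+) v);
  print_newline ()
```

The operator only needs to be associative: the left operand always comes from lower-numbered cells, so non-commutative operators such as string concatenation also work.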
Juxta versus Super. Code of a direct method: 12 lines. Code with superposition: 8 lines. Code with juxtaposition: 6 lines.
Performances (plot): time (s) against the size of the polynomials, for the direct method (BSML+MPI), the D-a-C method with superposition and the D-a-C method with juxtaposition.
Conclusion and future works
Conclusion. BSML = BSP + ML. Superposition = primitive of parallel composition. Juxtaposition is easier for divide-and-conquer algorithms. Distributed semantics of the juxtaposition. Juxtaposition implemented using superposition. Similar performances.
Future works. Proofs of the implementation using semantics. Implementation of bigger algorithms. BSP model-checking of high-level Petri-nets (M-nets).
Thanks for your attention