Functional approaches to parallel programming and meta-computers: semantics, implementations and certification. Frédéric Gava, under the supervision of Frédéric Loulergue
Background
  Parallel programming, from implicit to explicit: automatic parallelization, skeletons, data-parallelism, parallel extensions, concurrent programming
Projects
  2002-2004, ACI Grid (4 partners): design of parallel and grid libraries of primitives for OCaml, with applications to distributed DBMS and numeric computations
  2004-2007, ACI Young researchers: production of a programming environment in which certified parallel programs can be written, proved and safely executed
Outline
  Introduction
  Semantics of BSML and certification
  Extensions: new primitives (parallel composition and parallel IO); library of parallel data structures
  Globalized operations
  Conclusion and future work
Introduction
The BSP model
  BSP architecture: processor/memory pairs (P/M), a network and a synchronization unit
  Characterized by:
    p: number of processors
    r: processor speed
    L: global synchronization
    g: communication phase (at most 1 word sent or received by each processor)
BSP model of execution
  $T(s) = \left(\max_{0 \le i < p} w_i\right) + h\,g + L$
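The cost formula above can be illustrated with a short worked sketch. The record fields and function names below are illustrative assumptions, not part of BSML; w is assumed to hold the local computation times w_i of one super-step, and h the largest number of words sent or received by any processor during that super-step.

```ocaml
(* Hypothetical BSP machine parameters: g and l (the L of the formula). *)
type bsp_params = { g : float; l : float }

(* One super-step: local work of each processor (w_i) and the value h. *)
type superstep = { w : float array; h : float }

(* T(s) = (max_i w_i) + h * g + L *)
let superstep_cost params s =
  let max_w = Array.fold_left max 0.0 s.w in
  max_w +. (s.h *. params.g) +. params.l

(* Cost of a whole program, taken as the sum of its super-step costs. *)
let program_cost params steps =
  List.fold_left (fun acc s -> acc +. superstep_cost params s) 0.0 steps
```

With g = 2, L = 10, local works 3, 5 and 4, and h = 4, the super-step costs 5 + 8 + 10 = 23 time units.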
The BSML language
  λ-calculus / ML; BSλ-calculus (parallel constructions) / BSML (parallel primitives)
  Structured parallelism as an explicit parallel extension of ML
  Functional language with BSP cost predictions
  Allows the implementation of skeletons
  Implemented as a parallel library for the "Objective Caml" language
  Using a parallel data structure called parallel vector
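A BSML program
  [Figure: a BSML program with its sequential part, its replicated part and its parallel part, the latter holding parallel vectors <f_0, f_1, …, f_{p-1}> and <g_0, g_1, …, g_{p-1}>.]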
Asynchronous primitives
  mkpar: (int → α) → α par
    mkpar f = <(f 0), (f 1), …, (f (p-1))>
  apply: (α → β) par → α par → β par
    apply <f_0, f_1, …, f_{p-1}> <v_0, v_1, …, v_{p-1}> = <(f_0 v_0), (f_1 v_1), …, (f_{p-1} v_{p-1})>
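The two asynchronous primitives can be mimicked sequentially, which may help when reading their types. This is only a sketch under stated assumptions: a parallel vector is modeled as an OCaml array with one cell per processor, the number of processors is fixed to 4, and parfun is a derived helper written here for illustration. It is not the BSMLlib implementation.

```ocaml
(* Assumption: a parallel vector is simulated by an array, one cell per processor. *)
type 'a par = 'a array

(* Hypothetical number of processors for the simulation. *)
let bsp_p () = 4

(* mkpar f = <(f 0), (f 1), ..., (f (p-1))> *)
let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* apply <f_0, ..., f_{p-1}> <v_0, ..., v_{p-1}> = <(f_0 v_0), ..., (f_{p-1} v_{p-1})> *)
let apply (fs : ('a -> 'b) par) (vs : 'a par) : 'b par =
  Array.mapi (fun i f -> f vs.(i)) fs

(* Derived helper: apply the same function on every processor. *)
let parfun f v = apply (mkpar (fun _ -> f)) v

(* Processor i holds i + 1, then every processor squares its local value. *)
let squares = parfun (fun x -> x * x) (mkpar (fun pid -> pid + 1))
```

Both operations are purely local in this model: no cell reads another cell, which matches the absence of communication in the asynchronous primitives.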
Synchronous primitives
  put: (int → α option) par → (int → α option) par
    [Figure: processors send values (Some v1, …, Some v5) to each other; None means that no value is sent.]
  proj: α option par → (int → α option)
    proj <v_0, v_1, …, v_{p-1}> = f such that (f i) = v_i
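The behaviour of the two synchronous primitives can be sketched sequentially as follows. The array model of parallel vectors, the fixed number of processors and the choice of returning None for an out-of-range processor number are assumptions made for this illustration; the real primitives perform communications and a global synchronization, which this sketch does not model.

```ocaml
(* Assumption: a parallel vector is simulated by an array, one cell per processor. *)
type 'a par = 'a array

(* Hypothetical number of processors for the simulation. *)
let bsp_p () = 3

(* put: processor i holds f_i, and (f_i j) is the optional message for j.
   After the exchange, processor j holds g_j such that (g_j i) = (f_i j). *)
let put (fs : (int -> 'a option) par) : (int -> 'a option) par =
  let p = bsp_p () in
  (* All messages are computed once, before any processor reads them. *)
  let msgs = Array.init p (fun i -> Array.init p (fun j -> fs.(i) j)) in
  Array.init p (fun j ->
      (* Assumption: asking for a non-existent sender gives None. *)
      fun i -> if 0 <= i && i < p then msgs.(i).(j) else None)

(* proj <v_0, ..., v_{p-1}> = f such that (f i) = v_i *)
let proj (vs : 'a option par) : int -> 'a option =
  fun i -> if 0 <= i && i < Array.length vs then vs.(i) else None

(* Example: every processor sends its own number to processor 0 only. *)
let received =
  put (Array.init (bsp_p ()) (fun pid ->
           fun dst -> if dst = 0 then Some pid else None))
```

In this example, processor 0 can read (received.(0)) 2 and obtain Some 2, while the other processors receive None from everyone.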
Semantics and certification
Outline
  Programming model:
    Natural semantics: easy for proofs
    Small-step semantics: easy for costs
  Execution model:
    Distributed semantics: makes asynchronous steps appear
    Abstract machine: close to a real implementation
Mini language
  Expressions of our mini language:
    e ::= λ.e | (e e) | …     functional core language
        | (mkpar e)           parallel primitives
        | <e, e, …, e>        parallel vector
        | (e)[s]              substitution
        | λ.e[s]              closure
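The grammar of the mini language can be written as an OCaml datatype, which is one possible way to start experimenting with its rewriting rules. The constructor names, the use of de Bruijn indices for variables, the integer constants and the representation of substitutions as lists are all assumptions introduced for this sketch; only the shape of the expressions comes from the grammar.

```ocaml
type expr =
  | Var of int               (* assumption: variables as de Bruijn indices *)
  | Const of int             (* assumption: integer constants, covered by the ellipsis *)
  | Lam of expr              (* abstraction *)
  | App of expr * expr       (* application (e e) *)
  | Mkpar of expr            (* parallel primitive (mkpar e) *)
  | Vector of expr list      (* parallel vector <e, e, ..., e> *)
  | Subst of expr * subst    (* substitution (e)[s] *)
  | Closure of expr * subst  (* closure of an abstraction with a substitution *)

(* Assumption: a substitution is represented as a list of expressions. *)
and subst = expr list

(* One rewriting step for mkpar on p processors:
   (mkpar f) --> <(f 0), (f 1), ..., (f (p-1))>.
   Any other expression is returned unchanged. *)
let step_mkpar p = function
  | Mkpar f -> Vector (List.init p (fun i -> App (f, Const i)))
  | e -> e
```

For instance, step_mkpar 2 applied to Mkpar (Lam (Var 0)) builds a two-component vector in which the identity function is applied to the constants 0 and 1.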
Natural semantics
  Semantics = set of axioms and inference rules
  Easy to understand, makes proofs easier
  Example: [Figure: an inference rule of the natural semantics.]
  Confluent
Small-step semantics
  Semantics = set of rewriting rules
  Using contexts for the strategy
  Easier understanding of costs and errors
  Example: [Figure: rewriting rules annotated with local costs and a global cost.]
  Confluent (costs and values)
  Equivalent to the previous semantics
Distributed semantics
  Semantics = set of parallel rewriting rules (small steps)
  SPMD style: the parallel vector is split into parts, one part of the parallel vector per processor
  Distributed evaluation: the same program is evaluated on every processor, for example:
    let scan op vec =
      let rec scan' fst lst op vec =
        if fst >= lst then vec
        else
          let mid = (fst + lst) / 2 in
          let vec' = mix mid (super (fun () -> scan' fst mid op vec)
                                    (fun () -> scan' (mid + 1) lst op vec)) in
          let com = ... (* send wm to processes m+1 … p-1 *) in
          let op' = ... (* applies op to wm and wi, m<i<p *) in
          parfun2 op' com vec'
      in scan' 0 (bsp_p () - 1) op vec
  Confluent
  Equivalent to the previous semantics
Abstract machine
  BSP-CAM = p × CAM + BSP instructions (SPMD style)
  CAM instructions: PUSH, SWAP, CONS, APP, …
  Communication instructions:
    PID: PID of the machine, for mkpar
    SEND: synchronous instruction, for put
  Minimal set of parallel instructions
  Equivalence with the distributed semantics
Certification of BSML programs
  The Coq proof assistant:
    Typed λ-calculus with dependent types
    Specification = term (goal)
    Language of tactics to build a proof of this goal
    Extraction of the proof (certified program)
  BSML and Coq:
    Axiomatization of the semantics of the primitives in Coq
    Proofs of BSML programs as usual proofs of ML programs
  Certification and extraction of BSML programs: broadcast, total exchange, …; prefixes; sort
Example: replicate
  Specification of replicate (proof script):
    intros T a.
    exists (mkpar T (fun pid: Z => a)).
    rewrite mkpar_def.
  Certified extraction:
    let replicate a = mkpar (fun pid -> a)
Extensions and parallel data structures
Outline
  Parallel composition: new primitive; divide-and-conquer. Properties: confluent semantics; two equivalent semantics
  Parallel data structures: implemented with BSML; simplify programming; OCaml interfaces; load-balancing
  External memory (IO): new primitives; new cost model. Property: confluent semantics
Multiprogramming
  Several programs on the same machine
  New primitives for parallel composition:
    Superposition
    Juxtaposition (implemented with the superposition)
  Divide-and-conquer BSP algorithms
Parallel superposition
  super: (unit → α) → (unit → β) → α × β
  super E1 E2 = (E1 (), E2 ())
  Fusion of communications/synchronizations
  Preserves the BSP model
  Pure functional semantics
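The value computed by the superposition, and its use in a divide-and-conquer scheme, can be sketched sequentially. This sketch only reproduces the equation super E1 E2 = (E1 (), E2 ()); the fusion of communications and synchronizations performed by the real primitive is not modeled, and dc_sum is a hypothetical example function.

```ocaml
(* Value semantics only: super e1 e2 = (e1 (), e2 ()).
   The real primitive also fuses the communications and synchronizations
   of the two computations, which this sequential sketch does not model. *)
let super (e1 : unit -> 'a) (e2 : unit -> 'b) : 'a * 'b =
  let v1 = e1 () in
  let v2 = e2 () in
  (v1, v2)

(* Hypothetical divide-and-conquer example: sum of a.(lo) to a.(hi),
   with the two halves evaluated under super. *)
let rec dc_sum a lo hi =
  if lo > hi then 0
  else if lo = hi then a.(lo)
  else
    let mid = (lo + hi) / 2 in
    let left, right =
      super (fun () -> dc_sum a lo mid) (fun () -> dc_sum a (mid + 1) hi)
    in
    left + right
```

The two arguments are passed as functions of type unit -> α so that their evaluation is delayed until super decides how to run them.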
Parallel superposition
  [Figure: semantics of the superposition, with labels: confluent, BSP, equivalence.]
Example: parallel prefixes
  [Plot: time (s) against the size of the polynomials, for the direct version (BSML+MPI), the superposition version and the juxtaposition version.]
Parallel data structures
  Observations:
    Data structures are as important as algorithms
    Symbolic computations use these data structures massively
  A parallel implementation of data structures:
    Interfaces as close as possible to the sequential ones
    Modular implementation, for straightforward maintenance
    Load-balancing of the data
Parallel data structures
  5 modules: Set, Map, Stack, Queue, Hashtable
  Interfaces:
    Same as in OCaml
    With some specific parallel functions such as parallel reductions
  A parallel data structure = one data structure on each processor
  Manual or automatic load-balancing:
    To get similar sizes of the local data structures
    Better performance for parallel iterations
    A two super-step algorithm using histograms
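The idea of "one data structure on each processor" can be sketched for sets as follows. This is a sequential simulation under stated assumptions, not the library described on the slide: the type pset, the modulo placement in of_list and the functions reduce, cardinal and histogram are hypothetical names chosen for the illustration, and the histogram here is only the list of local sizes that a balancing step might start from.

```ocaml
module IntSet = Set.Make (struct
  type t = int
  let compare = compare
end)

(* Assumption: a parallel set is simulated by one local set per processor. *)
type pset = IntSet.t array

(* Hypothetical placement: element x goes to processor (x mod p). *)
let of_list p xs : pset =
  let local = Array.make p IntSet.empty in
  List.iter
    (fun x ->
      let i = ((x mod p) + p) mod p in
      local.(i) <- IntSet.add x local.(i))
    xs;
  local

(* Parallel reduction: compute locally, then combine the partial results. *)
let reduce ~local ~combine ~init (s : pset) =
  Array.fold_left (fun acc set -> combine acc (local set)) init s

(* Same meaning as the sequential cardinal, computed by reduction. *)
let cardinal s = reduce ~local:IntSet.cardinal ~combine:( + ) ~init:0 s

(* Sizes of the local sets, the kind of information a balancing step needs. *)
let histogram (s : pset) = Array.map IntSet.cardinal s
```

Keeping cardinal under the same name as in the sequential Set module follows the stated goal of interfaces close to the sequential ones.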
Example
  Computation of the "nth" nearest neighbor atom in a molecule
  [Plot: time (s) against the number of atoms, for the sequential version and the parallel version (BSML+PUB).]
Example with load balancing
  [Plot: time (s) against the number of atoms, without balancing and with balancing.]
External memories
  Motivation: [Plot: measured and predicted time (s) against the number of elements.]
The EM-BSP model
  [Figure: each processor has a memory and, through a bus, local disks 1 to D; the processor/memory pairs (P/M) are connected by a network.]
  We add to the BSP model:
    D = the number of disks
    B = the size of the blocks
    O = the latency of the disks
    G = the time to read/write a byte
Shared disks
  [Figure: shared disks 1 to M and processor/memory pairs (P/M), connected by a network.]
  We add shared disks to the BSP model, with parameters similar to those of the local disks
External memory in BSML
  For safety, two kinds of files: local and global ones
  New primitives to manipulate these files (IO primitives)
  New semantics (confluent)
  EM-BSP cost of the primitives
Modular implementation
  [Figure: layers of the BSMLlib. Top: primitives, standard library, parallel data structures. Modules: Comm, Super, IO. Lower level: PUB, MPI, TCP/IP, threads.]
Cost prediction
  [Plot: time (s) against the number of elements, for lists and arrays, with predicted (max) and predicted (avg) curves.]
IO cost prediction
  [Plot: time (s) against the number of elements: predicted BSML, predicted BSML-IO and measured BSML-IO.]
Globalized operations
Outline
  BSML: semantics, cost models, implementations
  MSPML (desynchronization of BSML): semantics, cost models, implementations
  BSML + MSPML: DMML
MSPML
  Using the MPM model (parameters similar to those of BSP), but with a different execution model
  Same language as BSML (parallel vector), but with new communication primitives: put, mget
MSPML
  Programming model (similar to BSML):
    Natural semantics: easy for proofs
    Small-step semantics: easy for costs
  Execution model (very different):
    Distributed semantics: makes asynchronous steps appear
Asynchronous communications
  [Figure: processors 0, 1 and 2, each with an environment of communications (entries 0,v; 0,v'; 0,v''; or empty). Labels: local computation; a bit later; request; get v; communication.]
Asynchronous communications
  [Figure: processors 0, 1 and 2 with environments of communications holding the entries 0,v; 0,v'; 0,v''; 1,w; 1,w'; 2,w''; or empty. Labels: request; not ready.]
Departmental meta-computing
  [Figure: several BSML units connected through an intranet, with MSPML between the units.]
Departmental Meta-computing ML
  BSML + MSPML-like for coordination
  Two kinds of vectors:
    parallel vectors: α par
    departmental vectors: α dep
  Operational semantics (confluent)
  Performance model (the DMM model)
  Implementation
Example: departmental prefixes
  Computation of the prefixes where each processor contains a value
  Naive method: each processor sends its value to the other processors
  Better method:
    Each BSP unit computes a parallel prefix
    One processor of each BSP unit receives the values of the other units
    Each BSP unit finishes its computation with this value
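The three steps of the better method can be sketched sequentially. The sketch assumes that a departmental vector is modeled as an array of BSP units, each unit being an array with one value per processor, and that op is associative; the names prefix and departmental_prefix are hypothetical. The exchange between units is replaced here by a running carry, so communications and costs are not modeled.

```ocaml
(* Inclusive prefix of an array under op, as one BSP unit would compute it. *)
let prefix op a =
  let r = Array.copy a in
  for i = 1 to Array.length r - 1 do
    r.(i) <- op r.(i - 1) r.(i)
  done;
  r

(* Assumption: units.(k) holds the values of the processors of BSP unit k. *)
let departmental_prefix op units =
  (* Step 1: each BSP unit computes its own prefix. *)
  let result = Array.map (prefix op) units in
  (* Step 2: the total of the previous units reaches each unit (a carry here). *)
  let carry = ref None in
  for k = 0 to Array.length result - 1 do
    (* Step 3: the unit finishes its computation with the received value. *)
    (match !carry with
     | None -> ()
     | Some c -> result.(k) <- Array.map (fun x -> op c x) result.(k));
    let n = Array.length result.(k) in
    if n > 0 then carry := Some result.(k).(n - 1)
  done;
  result
```

Compared with the naive method, only one value per unit crosses the boundary between units, which is the point of the two-level scheme.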
Experiments
  [Plot: time (s) against the size of the polynomials, for the naive algorithm, the BSP algorithm (one cluster) and the better algorithm.]
Conclusion and future work
Conclusion
  Semantics of BSML:
    Confluent and equivalent semantics
    Abstract machine
    Proof of BSML programs
  Expressivity:
    Parallel composition
    Parallel data structures
    Parallel IO
  Meta-computing:
    Desynchronization of BSML (MSPML)
    Departmental Meta-computing ML (DMML)
    Semantics, cost models, implementations
Future work in the Propac project
  Cost prediction:
    Static analysis of the programs
    Cost prediction of certified programs
  Proofs of BSP imperative programs:
    Coq and program correctness; BSML, IMP, ML
    Extension with BSP operations
    Extension of the logical assertions
Vérification efficace par Interaction de Techniques (VITE)
  Design of parallel model checkers for high-level Petri nets
  Using BSML to implement a toolkit:
    Using the BSP model to dynamically load-balance
    Using a modular and generic implementation to ease the use of this toolkit
    Using the Propac tools to certify this implementation
Thank you for your attention
BSML and MSPML
  [Figure: BSML (BSP model) and MSPML (MPM model) side by side. Programming model: natural semantics, with proofs of programs (with Coq), and small-step semantics, useful for costs. Execution model: distributed semantics and CAM, over PUB, MPI and TCP/IP.]
Petri nets
  [Figure: a Petri net, with labels: state, place, transition, token, arc.]
Propac
  [Figure: overview of Propac. High-level semantics and parallel semantics of BSML (natural, small-step, distributed); distributed evaluation; abstract machines and design of the BSP-CAM; sequential implementation and parallel implementation; Coq axiomatisation and proofs of BSML programs; performance model and dynamic cost analysis.]