Functional approaches to parallel programming


Functional approaches to parallel programming. Frédéric Gava, supervised by Frédéric Loulergue. Thesis: Functional approaches to parallel programming and metacomputing: semantics, implementations and certification (Approches fonctionnelles de la programmation parallèle et des méta-ordinateurs : sémantiques, implantations et certification).

Background: parallel programming approaches, from implicit to explicit: automatic parallelization, skeletons, data-parallelism, parallel extensions, concurrent programming.

Projects. 2002-2004, ACI Grid (4 partners): design of parallel and grid libraries of primitives for OCaml, with applications to distributed databases and numerical computations. 2004-2007, ACI Young Researchers: production of a programming environment in which certified parallel programs can be written, proved and safely executed.

Outline: Introduction; Semantics of BSML and certification; Extensions (new primitives: parallel composition and parallel IO; library of parallel data structures); Globalized operations; Conclusion and future work.

Introduction

The BSP model. A BSP machine is a set of processor/memory pairs (P/M) connected by a network, with a global synchronization unit. It is characterized by: p, the number of processors; r, the processor speed; L, the cost of a global synchronization; g, the cost of a communication phase in which each processor sends or receives at most one word.

T(s) = (max0i<p wi) + hg + L BSP model of execution T(s) = (max0i<p wi) + hg + L

The BSML language. Just as ML is built on the λ-calculus, BSML is built on the BSλ-calculus (the λ-calculus with parallel constructions), adding parallel primitives. Structured parallelism as an explicit parallel extension of ML; a functional language with BSP cost predictions; allows the implementation of skeletons; implemented as a parallel library for the "Objective Caml" language, using a parallel data structure called the parallel vector.

A BSML program combines a replicated part, a parallel part (parallel vectors such as < f_0, f_1, …, f_{p-1} > and < g_0, g_1, …, g_{p-1} >) and a sequential part.

Asynchronous primitives:
mkpar : (int -> 'a) -> 'a par, where (mkpar f) evaluates to the parallel vector < f 0, f 1, …, f (p-1) >.
apply : ('a -> 'b) par -> 'a par -> 'b par, where apply < f_0, …, f_{p-1} > < v_0, …, v_{p-1} > evaluates to < f_0 v_0, …, f_{p-1} v_{p-1} >.
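A minimal sketch (not from the slides) of how these two BSMLlib primitives are used, with the types given above:

  (* On each processor i, hold the value i: *)
  let this = mkpar (fun pid -> pid)

  (* A parallel vector of functions: processor i holds (fun x -> x + i): *)
  let shift = mkpar (fun pid x -> x + pid)

  (* Pointwise application: processor i computes (fun x -> x + i) i = 2*i *)
  let doubled = apply shift this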

Synchronous primitives:
put : (int -> 'a option) par -> (int -> 'a option) par, where each processor provides a function giving the optional value (None meaning no message) it sends to each destination; after the exchange, each processor holds the function giving the value it received from each source.
proj : 'a option par -> (int -> 'a option), where proj < v_0, …, v_{p-1} > evaluates to a function f such that (f i) = v_i.
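As a sketch (using only the primitives shown above) of a total exchange, where each processor sends its local value to all the others:

  let total_exchange vec =
    (* each processor offers its local value v to every destination dst *)
    let msgs = apply (mkpar (fun _ v dst -> Some v)) vec in
    (* put performs the exchange and ends the superstep; afterwards each
       processor holds a function from source pid to the received value *)
    put msgs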

Semantics and certification

Outline: the natural semantics (programming model, easy for proofs), the small-steps semantics (easy for costs), the distributed semantics (makes the asynchronous steps appear) and the abstract machine (execution model, close to a real implementation).

Mini language. Expressions of our mini language:
e ::= λx.e                   functional core language
    | (e e) | …
    | (mkpar e) | …          parallel primitives
    | < e, e, …, e >         parallel vector
    | (e)[s]                 substitution
    | (λx.e)[s]              closure

Natural semantics. Semantics = a set of axioms and inference rules; easy to understand and makes proofs easier. The semantics is confluent. (An example rule is shown on the slide.)

Small-steps semantics. Semantics = a set of rewriting rules, using contexts for the reduction strategy; gives an easier understanding of costs (local and global) and of errors. Confluent (for both costs and values) and equivalent to the previous semantics. (An example is shown on the slide.)

Distributed semantics. Semantics = a set of parallel rewriting rules in SPMD style: the parallel vectors of the program are split into the parts held by each processor, and distributed evaluation proceeds by small steps on each part. Confluent and equivalent to the previous semantics.
Example program (parallel prefix computed by divide-and-conquer with the superposition):
  let scan op vec =
    let rec scan' fst lst op vec =
      if fst >= lst then vec
      else
        let mid = (fst + lst) / 2 in
        let vec' = mix mid (super (fun () -> scan' fst mid op vec)
                                  (fun () -> scan' (mid + 1) lst op vec)) in
        let com = ... (* send w_mid to processors mid+1 … p-1 *) in
        let op' = ... (* applies op to w_mid and w_i, mid < i < p *) in
        parfun2 op' com vec'
    in scan' 0 (bsp_p () - 1) op vec
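For instance (a sketch, assuming the helpers mix and parfun2 used above are provided by the library), the prefix sums of the processor identifiers can be computed as:

  let prefix_pids = scan (+) (mkpar (fun pid -> pid))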

Abstract machine. BSP-CAM = p copies of the CAM plus BSP instructions (SPMD style): the usual CAM instructions (PUSH, SWAP, CONS, APP, …) extended with PID (the identifier of the machine, used for mkpar) and a synchronous SEND instruction for the communications of put. A minimal set of parallel instructions; equivalence with the distributed semantics.

Certification of BSML programs. The Coq proof assistant: a typed λ-calculus with dependent types; a specification is a term (a goal); a language of tactics to build a proof of this goal; extraction of the proof as a certified program. BSML and Coq: axiomatization of the semantics of the primitives in Coq; proofs of BSML programs as usual proofs of ML programs. Certification and extraction of BSML programs: broadcast, total exchange, prefixes, sorting.

Example: replicate. Specification of replicate, proof script: intros T a. exists (mkpar T (fun pid: Z => a)). rewrite mkpar_def. Certified extraction: let replicate a = mkpar (fun pid -> a)

Extensions and parallel data structures

Outline. Parallel composition: a new primitive for divide-and-conquer; properties: two equivalent confluent semantics, implemented with BSML. Parallel data structures: simplify programming, OCaml interfaces, load-balancing. External memory (IO): new primitives, new cost model, confluent semantics.

Multiprogramming: several programs on the same machine. New primitives for parallel composition: the superposition and the juxtaposition (implemented with the superposition). They enable divide-and-conquer BSP algorithms.

Parallel superposition. super : (unit -> 'a) -> (unit -> 'b) -> 'a * 'b, with super E1 E2 = (E1 (), E2 ()). Fusion of the communications and synchronizations of the two computations; preserves the BSP model; pure functional semantics.
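A sketch (not from the slides) of two independent exchanges evaluated under super, so that their communications and barrier are fused into a single superstep:

  let pids, doubled_pids =
    super (fun () -> put (mkpar (fun pid dst -> Some pid)))
          (fun () -> put (mkpar (fun pid dst -> Some (2 * pid))))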

Parallel superposition: confluence, BSP costs, and equivalence of the semantics.

Example: parallel prefixes. Comparison of the direct version (BSML+MPI), the superposition version and the juxtaposition version (plot: time in seconds vs. size of the polynomials).

Parallel data structures. Observations: data structures are as important as algorithms, and symbolic computations use them massively. A parallel implementation of data structures should offer: interfaces as close as possible to the sequential ones; a modular implementation for straightforward maintenance; load-balancing of the data.

Parallel data structures: 5 modules (Set, Map, Stack, Queue, Hashtable). Interfaces: the same as in OCaml, with some specific parallel functions such as parallel reductions. A parallel data structure = one data structure on each processor. Manual or automatic load-balancing (a two-superstep algorithm using histograms) to get local data structures of similar sizes and better performance for parallel iterations.
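For illustration only, a hypothetical signature in the spirit of the description above (the names below are assumptions, not the actual interface of the library): a parallel set mirroring OCaml's Set, one local set per processor, plus parallel-specific operations.

  (* Hypothetical sketch, not the library's real interface *)
  module type PAR_SET = sig
    type elt                          (* type of the elements *)
    type t                            (* one local set per processor *)
    val empty   : t
    val add     : elt -> t -> t
    val mem     : elt -> t -> bool
    (* parallel-specific functions, as mentioned on the slide: *)
    val balance : t -> t                               (* redistribute data *)
    val reduce  : (elt -> 'a -> 'a) -> 'a -> t -> 'a   (* parallel reduction *)
  end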

Example: computation of the n-th nearest-neighbor atoms in a molecule; sequential version vs. parallel version (BSML+PUB) (plot: time in seconds vs. number of atoms).

Example with load balancing: without balancing vs. with balancing (plot: time in seconds vs. number of atoms).

External memories. Motivation: measured vs. predicted times (plot: time in seconds vs. number of elements).

The EM-BSP model: each processor/memory pair is connected by a bus to D local disks. We add to the BSP model: D, the number of disks; B, the size of the blocks; O, the latency of the disks; G, the time to read or write one byte.
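As a rough illustration (an assumption about the form of the cost, not stated on the slide): reading or writing n bytes striped evenly over the D local disks takes on the order of ceil(n / (D·B)) · (O + B·G), i.e., one latency O plus B·G per block, with D blocks transferred in parallel.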

Shared disks: besides the local disks, shared disks are accessible through the network; we add to the BSP model parameters similar to those of the local disks.

External memory in BSML. For safety, two kinds of files: local ones and global ones. New primitives to manipulate these files (IO primitives); a new confluent semantics; EM-BSP costs for the primitives.

Modular implementation of BSMLlib: the primitives and the standard library, together with the Comm, Super, IO and parallel data structure modules, are built on a lower level that can rely on PUB, MPI, TCP/IP or threads.

Cost prediction: measured times for lists and arrays compared with the predicted (max) and predicted (avg) costs (plot: time in seconds vs. number of elements).

IO cost prediction: predicted BSML, predicted BSML-IO and measured BSML-IO (plot: time in seconds vs. number of elements).

Globalized operations

Outline: BSML is desynchronized into MSPML; BSML and MSPML are then combined into DMML. For each: semantics, cost models, implementations.

MSPML: based on the MPM model (parameters similar to those of BSP) but with a different execution model. Same language as BSML (parallel vectors), but with a new communication primitive: put is replaced by mget.

MSPML semantics: the natural semantics (programming model, easy for proofs) and the small-steps semantics (easy for costs) are similar to those of BSML; the distributed semantics (execution model, makes the asynchronous steps appear) is very different.

Asynchronous communications: each processor keeps an environment of communications holding the values it has made available at successive steps; a processor that needs a remote value sends a request and receives the value a bit later, during local computation and without any global synchronization (diagram on the slide).

Asynchronous communications (continued): if the requested value is not yet available in the environment of the remote processor, the request is served later, once that processor has produced the value (diagram on the slide).

Departmental metacomputing: several BSML machines (clusters) connected through an intranet and coordinated in the MSPML style.

Departmental Metacomputing ML: BSML plus an MSPML-like layer for coordination. Two kinds of vectors: parallel vectors ('a par) and departmental vectors ('a dep). Confluent operational semantics; performance model (the DMM model); implementation.

Example: departmental prefixes. Computation of the prefixes where each processor holds a value. Naive method: each processor sends its value to all the other processors. Better method: each BSP unit computes a parallel prefix; one processor of each BSP unit receives the values of the other units; each BSP unit then finishes its computation with this value.

Experiments: naive algorithm, BSP algorithm (one cluster) and the better algorithm (plot: time in seconds vs. size of the polynomials).

Conclusion and future work

Conclusion. Semantics of BSML: confluent and equivalent semantics, an abstract machine, proofs of BSML programs. Expressivity: parallel composition, parallel data structures, parallel IO. Metacomputing: desynchronization of BSML (MSPML) and Departmental Metacomputing ML (DMML), with semantics, cost models and implementations.

Future work in the Propac project. Cost prediction: static analysis of the programs; cost prediction of certified programs. Proofs of BSP imperative programs: program correction in Coq, extending an imperative core (IMP) and ML with BSP operations towards BSML, and extending the logical assertions.

Vérification efficace par Interaction de Techniques (VITE, efficient verification through the interaction of techniques): design of parallel model checkers for high-level Petri nets. Using BSML to implement a toolkit: the BSP model for dynamic load-balancing; a modular and generic implementation to ease the use of the toolkit; the Propac tools to certify the implementation.

Thank you for your attention.

BSML and MSPML (summary). BSML relies on the BSP model, MSPML on the MPM model. Both have a natural semantics (programming model; proofs of programs with Coq), a small-steps semantics (useful for costs) and a distributed semantics, with the CAM-based abstract machine as execution model; the implementations run on top of PUB, MPI or TCP/IP.

Petri nets (diagram): state, place, transition, token, arc.

Overview (Propac): high-level semantics (natural, small-steps, distributed) and parallel semantics of BSML; distributed evaluation; sequential implementation; Coq axiomatisation and proofs of BSML programs; abstract machines and the design of the BSP-CAM; parallel implementation; performance model and dynamic cost analysis.