Frédéric Gava Bulk-Synchronous Parallel ML


Frédéric Gava Bulk-Synchronous Parallel ML Semantics and Implementation of the Parallel Juxtaposition

Background
Approaches to parallel programming:
- Implicit: automatic parallelization, skeletons
- Explicit: concurrency, data-parallelism, parallel extensions

Projects
2002-2004, ACI Grid (LIFO, LACL, PPS, INRIA): design of parallel and Grid libraries for OCaml.
2004-2007, ACI « Young researchers » (LIFO, LACL): production of a programming environment in which certified parallel programs can be written and safely executed.

Outline
- The BSML language
- Parallel compositions
- Superposition: types and semantics
- Juxtaposition: types and semantics
- Implementation of the juxtaposition
- Conclusion and future work

The BSML language

The BSP model
A BSP architecture: p processor/memory pairs (P/M) connected by a network, with a unit of global synchronization.
Characterized by:
- p: number of processors
- r: processor speed
- g: cost of a phase of communication (at most 1 word sent or received by each processor)
- L: cost of a global synchronization

T(s) = (max0i<p wi) + hg + L Model of execution T(s) = (max0i<p wi) + hg + L

Example: broadcast of an n-word message.
- Direct broadcast: cost = p·n·g + L
- Broadcast with 2 phases: cost = 2·n·g + 2·L
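These two costs follow from the superstep formula, assuming the computation time is negligible and the message holds n words:

```latex
\text{direct:}\quad h = (p-1)\,n \;\Rightarrow\; T \approx p\,n\,g + L \\
\text{two phases:}\quad
T = \underbrace{(n\,g + L)}_{\text{scatter } n/p \text{ words to each processor}}
  + \underbrace{(n\,g + L)}_{\text{total exchange of the pieces}}
  = 2\,n\,g + 2L
```

For large p the two-phase version wins whenever the extra barrier L is cheaper than the saved communication (p-2)·n·g.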

The BSML language -calculus ML BS-calculus Parallel constructions BSML Parallel primitives Structured parallelism as an explicit parallel extension of ML Functional language with BSP cost predictions Allows the implementation of skeletons Implemented as a parallel library for the "Objective Caml" language Using a parallel data structure called parallel vector

A BSML program mixes three kinds of code: a replicated part executed identically by every processor, parallel parts held in parallel vectors (e.g. <f_0, f_1, …, f_{p-1}> and <g_0, g_1, …, g_{p-1}>, one value per processor), and purely sequential parts.

Parallel primitives of BSML
Asynchronous primitives:
- Creation of a vector: mkpar : (int -> 'a) -> 'a par
- Parallel point-wise application: apply : ('a -> 'b) par -> 'a par -> 'b par
Synchronous and communication primitives:
- Communications: put : (int -> 'a option) par -> (int -> 'a option) par
- Projection of values: proj : 'a option par -> (int -> 'a option)
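As a concrete illustration of these signatures, a small hedged sketch (assuming the BSML library is loaded and that bsp_p : unit -> int gives the number of processors, as in the talk's scan example):

```ocaml
(* Each processor i holds its own pid: <0, 1, ..., p-1> *)
let pids : int par = mkpar (fun i -> i)

(* Point-wise application: square every component *)
let squares : int par =
  apply (mkpar (fun _ -> fun x -> x * x)) pids

(* Communication: every processor offers its square to its right
   neighbour (i+1) mod p; put performs the exchange and the barrier *)
let from_left : (int -> int option) par =
  put (apply (mkpar (fun i -> fun v dst ->
                 if dst = (i + 1) mod (bsp_p ()) then Some v else None))
             squares)

(* Projection: read component 0 back into the replicated part *)
let sq0 : int option =
  proj (apply (mkpar (fun _ -> fun v -> Some v)) squares) 0
```

Note how mkpar and apply are purely local (asynchronous), while put and proj end the current superstep.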

Semantics
Programming model:
- Natural semantics: easy for proofs (Coq)
- Small-step semantics: easy for costs
Execution model:
- Distributed semantics: makes the asynchronous steps appear; close to a real implementation

Parallel compositions

Multi-programming
Several programs run on the same machine.
New primitives of parallel composition:
- Superposition
- Juxtaposition (implemented with the superposition)
They enable divide-and-conquer BSP algorithms.

Parallel Superposition
super : (unit -> 'a) -> (unit -> 'b) -> 'a * 'b
super E1 E2 = (E1 (), E2 ())
- Fusion of the communications/synchronisations of E1 and E2 using super-threads
- Keeps the BSP model
- Pure functional semantics
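For instance, two independent BSP computations can be run with their supersteps fused; sum_vec and max_vec below are hypothetical functions of type int par -> int par, not part of the talk:

```ocaml
(* The two evaluations proceed as interleaved super-threads: their
   communication phases and synchronisation barriers are merged, so
   the pair needs fewer supersteps than running them in sequence. *)
let (sums, maxima) : int par * int par =
  super (fun () -> sum_vec v1) (fun () -> max_vec v2)
```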

Parallel Superposition

Parallel juxtaposition
juxta : int -> (unit -> 'a par) -> (unit -> 'a par) -> 'a par
- Fusion of communications/synchronisations on each sub-machine
- Keeps the BSP model
- Side-effect on the number of processors
juxta m E1 E2 evaluates E1 on the first m processors, yielding <v_0, …, v_{m-1}>, and E2 on the remaining p-m processors, yielding <v'_0, …, v'_{p-1-m}>; the result is the vector <v_0, …, v_{m-1}, v'_0, …, v'_{p-1-m}>.

Parallel juxtaposition
In (juxta 3 E1 E2), the supersteps (communications and synchronisations) of E1, on the first 3 processors, and of E2, on the others, run side by side and are fused into common global supersteps.
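A hedged divide-and-conquer skeleton built on juxta; trivial, divide and conquer are hypothetical helpers, and bsp_p () is assumed to return the number of processors of the current sub-machine (which juxta changes by side-effect):

```ocaml
let rec dac problem =
  if bsp_p () = 1 then trivial problem
  else
    let m = bsp_p () / 2 in
    let left, right = divide problem in
    (* Each half of the machine recurses independently; juxta glues
       the two resulting sub-vectors into one vector of full width *)
    conquer (juxta m (fun () -> dac left) (fun () -> dac right))
```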

Distributed semantics
Semantics = a set of parallel rewriting rules. SPMD style: each processor rewrites its own part of the parallel vectors.
The same program is evaluated on every processor, e.g. a divide-and-conquer scan:

  let scan op vec =
    let rec scan' fst lst op vec =
      if fst >= lst then vec
      else
        let mid = (fst + lst) / 2 in
        let vec' = mix mid (super (fun () -> scan' fst mid op vec)
                                  (fun () -> scan' (mid+1) lst op vec)) in
        let com = ... (* send wm to processes m+1 … p+1 *) in
        let op' = ... (* applies op to wm and wi, m < i < p *) in
        parfun2 op' com vec'
    in scan' 0 (bsp_p () - 1) op vec

The distributed evaluation is confluent and equivalent to the natural semantics.

Implementation of the juxtaposition

Use of the superposition
- Two references contain the number of processors of a sub-machine and the real pid of the virtual processor 0 (on a sub-machine)
- Creation of uncompleted vectors
- Each sub-machine runs in a super-thread
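A reconstruction of this scheme, not the actual library code: merge is a hypothetical function gluing the two uncompleted vectors, and the save/restore of the references is simplified (the real implementation must also handle the interleaving of the super-threads):

```ocaml
(* The two references described above *)
let sub_size = ref (bsp_p ())  (* processors of the current sub-machine *)
let sub_root = ref 0           (* real pid of its virtual processor 0 *)

let juxta m e1 e2 =
  let size = !sub_size and root = !sub_root in
  (* Run e on a sub-machine [root', root'+size') and restore after *)
  let on_sub root' size' e () =
    sub_root := root'; sub_size := size';
    let v = e () in            (* builds an uncompleted vector *)
    sub_root := root; sub_size := size;
    v
  in
  let v1, v2 =
    super (on_sub root m e1) (on_sub (root + m) (size - m) e2)
  in
  (* Processor i keeps v1's value if i - root < m, v2's otherwise *)
  merge m v1 v2
```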

Example: parallel prefixes
scan : ('a -> 'a -> 'a) -> 'a par -> 'a par
scan (+) <v_0, …, v_{p-1}> = <v_0, v_0+v_1, …, v_0+v_1+…+v_{p-1}>
[Diagram: values a, b, …, h combined pairwise with op across the processors.]

Juxta versus Super
- Code of a direct method: 12 lines
- Code with superposition: 8 lines
- Code with juxtaposition: 6 lines

Performances
[Graph: execution time (s) versus the size of the polynomials, comparing the direct method (BSML+MPI), the divide-and-conquer method with superposition, and the divide-and-conquer method with juxtaposition.]

Conclusion and future work

Conclusion
- BSML = BSP + ML
- Superposition: a primitive of parallel composition
- Juxtaposition is easier for divide-and-conquer algorithms
- Distributed semantics of the juxtaposition
- Juxtaposition implemented using the superposition, with similar performances

Future work
- Proofs of the implementation using the semantics
- Implementation of bigger algorithms
- BSP model-checking of high-level Petri nets (M-nets)

Thanks for your attention