Tristan Aubrey-Jones, Bernd Fischer: Synthesizing MPI Implementations from Functional Data-Parallel Programs

Presentation transcript:

Tristan Aubrey-Jones Bernd Fischer Synthesizing MPI Implementations from Functional Data-Parallel Programs

People need to program distributed-memory architectures: GPUs (graphics/GPGPU; CUDA/OpenCL; memory levels: thread-local, block-shared, device-global), possibly future many-core architectures, and server farms/compute clusters (Big Data/HPC; MapReduce/HPF; Ethernet/InfiniBand). Our focus: programming shared-nothing clusters.

Our Aim: to automatically synthesize distributed-memory implementations (C++/MPI). We want this not just for arrays, but also for other collection types and disk-backed collections. Many distributions are possible: single-threaded, parallel shared-memory, and distributed-memory.

Our Approach: we define Flocc, a data-parallel DSL in which parallelism is expressed via combinator functions. We develop a code generator that searches for well-performing implementations and generates C++/MPI code. We use extended types to carry data-distribution information, and type inference to infer it.

FLOCC A data-parallel DSL (Functional language on compute clusters)

Flocc syntax: a polyadic lambda calculus with standard call-by-value semantics.
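For concreteness, here is a minimal Haskell sketch of what an abstract syntax for such a polyadic (tuple-argument), call-by-value lambda calculus could look like; the constructor names are illustrative assumptions, not the actual Flocc front end.

```haskell
-- Illustrative AST for a polyadic, call-by-value lambda calculus (assumed names).
data Exp
  = Var String         -- variables, including combinators such as eqJoin
  | Lit Double         -- literals
  | Tup [Exp]          -- tuple construction
  | Lam [Pat] Exp      -- polyadic lambda: binds a tuple of patterns
  | App Exp [Exp]      -- application to a tuple of arguments
  | Let String Exp Exp -- let x = e1 in e2
  deriving Show

data Pat = PVar String | PTup [Pat] | PWild
  deriving Show
```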

Matrix multiplication in Flocc (1):
let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in …
eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)
Equivalent SQL: SELECT * FROM A JOIN B ON A.j = B.i;
Data-parallelism is expressed via combinator functions, which abstract away from individual element access.

Matrix multiplication in Flocc (2):
let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul (av,bv), R1) in …
eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)
map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂
Equivalent SQL: SELECT A.i as i, B.j as j, A.v * B.v as v FROM A JOIN B ON A.j = B.i;

Matrix multiplication in Flocc (3):
let R1 = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul (av,bv), R1) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), snd, add, R2)
eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)
map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂
groupReduce :: ((k₁,v₁) → k₂, (k₁,v₁) → v₂, (v₂,v₂) → v₂, Map k₁ v₁) → Map k₂ v₂
Equivalent SQL: SELECT A.i as i, B.j as j, sum(A.v * B.v) as v FROM A JOIN B ON A.j = B.i GROUP BY A.i, B.j;
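As a reference for what these combinators compute, here is a minimal sequential (single-node) Haskell model over Data.Map. It only mirrors the signatures above (curried here, whereas Flocc combinators take tuples) and is not the distributed code that Flocc generates.

```haskell
import qualified Data.Map as M

-- eqJoin: join two maps on the values of the two key-emitter functions.
eqJoin :: (Eq x, Ord k1, Ord k2)
       => ((k1,v1) -> x) -> ((k2,v2) -> x)
       -> M.Map k1 v1 -> M.Map k2 v2 -> M.Map (k1,k2) (v1,v2)
eqJoin f1 f2 a b = M.fromList
  [ ((k1,k2),(v1,v2)) | (k1,v1) <- M.toList a, (k2,v2) <- M.toList b
                      , f1 (k1,v1) == f2 (k2,v2) ]

-- map: transform keys and values (named mapC to avoid clashing with Prelude.map).
mapC :: Ord k2 => (k1 -> k2) -> ((k1,v1) -> v2) -> M.Map k1 v1 -> M.Map k2 v2
mapC fk fv = M.fromList . map (\(k,v) -> (fk k, fv (k,v))) . M.toList

-- groupReduce: project out new keys and values, combining values that share a key.
groupReduce :: Ord k2
            => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> (v2 -> v2 -> v2)
            -> M.Map k1 v1 -> M.Map k2 v2
groupReduce fk fv comb =
  M.fromListWith comb . map (\kv -> (fk kv, fv kv)) . M.toList

-- Usage, mirroring the Flocc program above (a, b :: M.Map (Int,Int) Double):
-- matMul a b = groupReduce (\(((ai,_),(_,bj)),_) -> (ai,bj)) snd (+)
--                (mapC id (\(_,(av,bv)) -> av * bv)
--                  (eqJoin (\((_,aj),_) -> aj) (\((bi,_),_) -> bi) a b))
```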

DISTRIBUTION TYPES Distributed data layout (DDL) types

Deriving distribution plans (hidden from the user). The program, with map folded into groupReduce:
let R = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), \(_,(av,bv)) -> mul (av,bv), add, R)
Each combinator has several distributed implementations (eqJoin1 / eqJoin2 / eqJoin3, groupReduce1 / groupReduce2), and each implementation has a DDL type and a backend template. The generator enumerates different choices of combinator implementations, uses type inference to check whether a set of choices is sound, inserts redistributions to make it sound, and infers the data distributions.
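A toy sketch of the enumeration step (the representation is an assumption; the combinator names are those above): each call site offers a handful of distributed implementations, and a candidate plan picks one per call site.

```haskell
-- One distributed implementation choice per combinator call site.
data Impl = EqJoin1 | EqJoin2 | EqJoin3 | GroupReduce1 | GroupReduce2
  deriving (Eq, Show)

-- All candidate plans for the two call sites of the matrix-multiply program:
-- the Cartesian product of the per-site choices (6 plans here).
candidatePlans :: [[Impl]]
candidatePlans = sequence [ [EqJoin1, EqJoin2, EqJoin3]
                          , [GroupReduce1, GroupReduce2] ]
```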

A Distributed Map type. The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed on the cluster. It records: the key and value types t1 and t2; a partition function f taking key-value pairs to node coordinates, i.e. f : (t1,t2) → N; a partition dimension identifier d1, specifying which nodes the coordinates map onto; and a mirror dimension identifier d2, specifying which nodes to mirror the partitions over. It also includes a local layout mode and key, omitted for brevity. Analogous types exist for Arrays (DArr) and Lists (DList).
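One way such a type could be represented inside the code generator, as a rough Haskell sketch (constructor and field names are assumptions, not the actual Flocc data types; the local layout mode and key are omitted as on the slide):

```haskell
-- Element types.
data Ty = TInt | TFloat | TPair Ty Ty | TVar String
  deriving (Eq, Show)

-- Symbolic projection functions that can appear in types (e.g. fst, snd, fst . fst).
data Proj = PFst | PSnd | PComp Proj Proj | PFunVar String
  deriving (Eq, Show)

-- DMap t1 t2 f d1 d2: a Map t1 t2 partitioned by f onto topology dimension d1
-- and mirrored over dimension d2 (Nothing = no mirroring). DArr/DList analogous.
data DistTy = DMap
  { keyTy   :: Ty
  , valTy   :: Ty
  , partFun :: Proj        -- f : (t1,t2) -> node coordinate
  , partDim :: Int         -- d1
  , mirrDim :: Maybe Int   -- d2
  } deriving (Eq, Show)
```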

Some example distributions (1). [Table: a map of pairs ((x,y),z) partitioned by fst.fst, i.e. by the key's first component x; HashPart maps each x to a coordinate along one dimension of a 2D node topology (dim0 by dim1), so x = 1, 4 go to partition A1, x = 2, 5 to A2, and x = 3, 6 to A0. No mirroring: each partition lives on a single node.]

Some example distributions (2). [Table: the same partitioning by fst.fst and HashPart as before, but now each partition A0, A1, A2 is mirrored along the other dimension of the node topology, so every row of the 2D grid holds a copy of every partition.]

Some example distributions (3). [Table: the same map partitioned by snd, i.e. by the value z; HashPart maps each value to a coordinate (e.g. z = 2.3, 0.0, 5.7, 2.1 go to A0; z = 1.0, 3.5, 3.1, 0.5 to A1; z = 1.5 to A2), again mirrored along the other dimension of the 2D node topology.]

Some example distributions (4). [Table: the same map partitioned by fst, i.e. by the whole key (x,y); HashPart maps each key to a 2D coordinate, so the partitions A(i,j) are spread over both dimensions of the node grid (e.g. key (3,1) goes to A(0,1) and key (4,4) to A(1,0)), with no mirroring.]

Distribution types for group reduces (groupReduce1): the input collection is partitioned using any projection function f, so values must be exchanged between nodes before the reduction. [Diagram: the input fragments on Node1, Node2 and Node3 are re-partitioned by output key and then locally group-reduced to produce the output.] The result is always co-located by the result Map's keys.

Distribution types for group reduces (groupReduce2): the input is partitioned using the groupReduce's own key projection function f, so the result keys are already co-located. [Diagram: Node1, Node2 and Node3 each perform a local group-reduce from their input fragment to their output fragment.] No inter-node communication is necessary, but this constrains the input distribution. The result is again co-located by the result Map's keys.
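A minimal single-process model of the difference, assuming node fragments are plain lists and `owner` hashes an output key to a node id (all names here are illustrative, not the generated MPI code): groupReduce1 re-partitions by output key before the final local reduce, while groupReduce2 skips the exchange because the inputs are already partitioned by the group key.

```haskell
import qualified Data.Map as M

type Frag a = [a]   -- the fragment of a collection held by one node

-- groupReduce1: inputs partitioned by an arbitrary projection, so pairs are
-- exchanged (modelled here as re-bucketing by owner) before each node reduces.
groupReduce1 :: Ord k2
             => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> (v2 -> v2 -> v2)
             -> (k2 -> Int) -> [Frag (k1,v1)] -> [Frag (k2,v2)]
groupReduce1 fk fv comb owner frags =
  let tagged    = [ (fk kv, fv kv) | frag <- frags, kv <- frag ]
      exchanged = [ [ p | p@(k,_) <- tagged, owner k == n ]
                  | n <- [0 .. length frags - 1] ]    -- the all-to-all exchange
  in map (M.toList . M.fromListWith comb) exchanged   -- local group-reduce

-- groupReduce2: inputs already partitioned by fk, so each node reduces locally
-- with no exchange; the result is co-located by output keys by construction.
groupReduce2 :: Ord k2
             => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> (v2 -> v2 -> v2)
             -> [Frag (k1,v1)] -> [Frag (k2,v2)]
groupReduce2 fk fv comb =
  map (\frag -> M.toList (M.fromListWith comb [ (fk kv, fv kv) | kv <- frag ]))
```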

Dependent type schemes. Two different binders/schemes:
∀ binds a type variable: a fresh variable is created when it is instantiated for a term; it can be replaced by any type when unifying (flexible); it is used to specify that a collection is distributed by any partition function, i.e. arbitrarily; similar to standard type schemes.
Π binds a concrete term in the AST: it lifts the argument's AST into the types when instantiated; it must match the concrete term when unifying (rigid); it is used to specify that a collection is distributed using a specific projection function (non-standard dependent-type syntax).
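A rough sketch of how these two binders might be represented in the inference engine (the names are assumptions; `ty` stands for an underlying DDL type such as the DMap sketch above):

```haskell
-- Type schemes with the two binders.
data Scheme ty
  = Forall String (Scheme ty)  -- ∀ a . σ: flexible, instantiated with a fresh type
                               -- variable that unification may bind to any type
  | Pi String (Scheme ty)      -- Π f . σ: rigid, instantiated with the concrete
                               -- projection function lifted from the call site's AST
  | Mono ty                    -- the underlying distributed (DDL) type
  deriving (Eq, Show)
```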

Extending Damas-Milner inference. We use a constraint-based implementation of Algorithm W and extend it to:
– lift parameter functions into types at function applications (see DApp);
– compare embedded functions (extends Unify). This is undecidable in general; we currently use an under-approximation which simplifies functions and tests for syntactic equality. Future work will solve equations involving projection functions.
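For the embedded-function comparison, a simplistic version of such an under-approximation could look like this (purely illustrative; the real system works on richer function terms): normalise both functions and accept only syntactic equality, which is sound but rejects some semantically equal pairs.

```haskell
-- Projection-function terms as they might appear in types (assumed representation).
data PF = PfId | PfFst | PfSnd | PfComp PF PF | PfVar String
  deriving (Eq, Show)

-- Simplify: drop identity compositions and recurse; a very small normaliser.
simplify :: PF -> PF
simplify (PfComp PfId g) = simplify g
simplify (PfComp f PfId) = simplify f
simplify (PfComp f g)    = PfComp (simplify f) (simplify g)
simplify f               = f

-- Under-approximate equality used during unification: equal after simplification
-- implies equal, but inequality here does not prove the functions differ.
unifyFuns :: PF -> PF -> Bool
unifyFuns f g = simplify f == simplify g
```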

Distribution types for joins: the inputs are partitioned using the join's key-emitter functions f1 and f2 (i.e. aligned), so equal keys are already co-located and no inter-node communication is needed. [Diagram: Node1, Node2 and Node3 each perform a local equi-join from their input fragments to their output fragment.] This constrains the inputs and outputs, and the inputs are not mirrored (like a sort-merge join).
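A per-node sketch of this communication-free case (list fragments and names are illustrative): when A is partitioned by its key emitter f1 and B by f2 onto the same dimension, matching rows are already on the same node, so each node joins only its own fragments.

```haskell
-- Local equi-join of one node's fragments of A and B; valid only under the
-- alignment constraint described above (partitioned by f1/f2, no mirroring).
localEqJoin :: Eq x
            => ((k1,v1) -> x) -> ((k2,v2) -> x)
            -> [(k1,v1)] -> [(k2,v2)] -> [((k1,k2),(v1,v2))]
localEqJoin f1 f2 as bs =
  [ ((k1,k2),(v1,v2)) | (k1,v1) <- as, (k2,v2) <- bs
                      , f1 (k1,v1) == f2 (k2,v2) ]
```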

EXAMPLE Deriving a distributed memory matrix-multiply

Deriving a distributed matrix multiply. The program again, with map folded into groupReduce:
let R = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), \(_,(av,bv)) -> mul (av,bv), add, R)
Candidate implementations: eqJoin1 / eqJoin2 / eqJoin3 and groupReduce1 / groupReduce2.

Distributed matrix multiplication #1. A: partitioned by row along d1, mirrored along d2. B: partitioned by column along d2, mirrored along d1. C: partitioned by (row, column) along (d1, d2). A and B must be partitioned and mirrored at the beginning of the computation.

Distributed matrix multiplication #1 (continued): a common solution, using eqJoin3 and groupReduce2. [Diagram: on a 3-by-3 grid of 9 nodes (d1 by d2), node (i,j) holds row block A_i of A and column block B_j of B, joins them locally into R(i,j), and reduces that to the output block C(i,j).]

Distributed matrix multiplication #2. A: partitioned by column along d. B: partitioned by row along d (aligned with A). C: partitioned by (row, column) along d. R must be exchanged during groupReduce1.

Distributed matrix multiplication #2 (continued): with A partitioned by column and B by row, the join is in place, but the group-reduce must exchange values to group by key. This is efficient for sparse matrices, where A and B are large but R is small. [Diagram: on 3 nodes along dimension d, node k locally joins A_k with B_k (eqJoin1) to produce R_k; the R fragments are then regrouped by output key across the nodes and summed (groupReduce1) into the output blocks C_1, C_2, C_3.]

IMPLEMENTATION & RESULTS Generated implementations

Prototype implementation. It uses a genetic search to explore candidate implementations. For each candidate we automatically insert redistributions to make it type-check. We evaluate each candidate by generating code for it, compiling it, and running it on some test data. The generator emits C++ using MPI and is written in roughly 20k lines of Haskell.
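A generic skeleton of one generation of such a search, with the concrete pieces passed in as parameters (this is an assumption about the structure, not the actual ~20k-line Haskell implementation): candidates are repaired with redistributions, scored by generating, compiling and timing them, and the fitter half is kept and re-bred.

```haskell
import Data.List (sortOn)

-- One generation of the genetic search over candidate implementations.
generation :: Monad m
           => (c -> c)          -- insert redistributions so the candidate type-checks
           -> (c -> m Double)   -- generate C++/MPI, compile, run on test data; runtime
           -> ([c] -> m [c])    -- mutate/crossover survivors into new candidates
           -> [c] -> m [c]
generation fixup score breed pop = do
  scored <- mapM (\c -> do { t <- score (fixup c); return (c, t) }) pop
  let survivors = map fst (take (length pop `div` 2) (sortOn snd scored))
  children <- breed survivors
  return (survivors ++ children)
```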

Example programs. Performance is comparable with hand-coded versions. PLINQ comparisons were run on a quad-core (Intel Xeon W3520, 2.67 GHz) x64 desktop with 12 GB memory. C++/MPI comparisons were run on Iridis 3 & 4, a 3rd-generation cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22 GB of RAM, over an InfiniBand network. Speedups are relative to sequential execution, averaged over 1, 2, 3, 4, 8, 9, 16 and 32 nodes.

CONCLUSIONS So what?

The pros and cons. Benefits: multiple distributed collections (Maps, Arrays, Lists, ...); support for in-memory and disk-backed collections (e.g. for Big Data); distributed algorithms are generated automatically; flexible communication; a concise input language; abstraction away from complex details. Limitations: currently data-parallel algorithms must be expressed via combinators, and data distributions are limited to those that can be expressed in the types.

Future work: improve the prototype implementation; more distributed collection types and combinators; better solving of function equations during unification; support for more distributed-memory architectures (e.g. GPUs); retrofitting the approach into an existing functional language; similar type inference for imperative languages? Suggestions?

QUESTIONS?