Slide-1: Multicore Programming in pMatlab using Distributed Arrays
Jeremy Kepner, MIT Lincoln Laboratory
This work is sponsored by the Department of Defense under Air Force Contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide-2: Goal: Think Matrices not Messages
[Figure: programmer effort (hours, days, weeks, months) vs. performance/speedup, comparing expert and novice programmers against the "acceptable" level and the hardware limit]
In the past, writing well-performing parallel programs has required a lot of code and a lot of expertise.
pMatlab distributed arrays eliminate the coding burden.
–However, making programs run fast still requires expertise.
This talk illustrates the key math concepts experts use to make parallel programs perform well.

Slide-3: Outline
Parallel Design (this section: Serial Program, Parallel Execution, Distributed Arrays, Explicitly Local)
Distributed Arrays
Concurrency vs Locality
Execution
Summary

Slide-4: Serial Program
Matlab is a high-level language.
Allows mathematical expressions to be written concisely.
Multi-dimensional arrays are fundamental to Matlab.
Math:
    Y = X + 1,  X,Y : N x N
Matlab:
    X = zeros(N,N);
    Y = zeros(N,N);
    Y(:,:) = X + 1;

Slide-5: Parallel Execution
[Figure: N_P copies of the same program, labeled Pid = 0, 1, ..., Np-1]
Run N_P (or Np) copies of the same program.
–Single Program Multiple Data (SPMD)
Each copy has a unique P_ID (or Pid).
Every array is replicated on each copy of the program.
Math:
    Y = X + 1,  X,Y : N x N
pMatlab:
    X = zeros(N,N);
    Y = zeros(N,N);
    Y(:,:) = X + 1;
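A minimal sketch of the SPMD model in action, assuming the Pid and Np variables listed later on Slide-27 are provided by the pMatlab environment and the script is launched with pRUN:

    % Every one of the Np copies runs this identical script.
    X = zeros(4,4);        % replicated: each copy holds its own full 4 x 4 array
    Y = X + 1;             % each copy performs the identical computation on its replica
    disp(['Copy ' num2str(Pid) ' of ' num2str(Np) ' finished']);

Because the arrays are replicated, no copy is doing unique work yet; the distributed-array map introduced on the next slide is what divides the data.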

Slide-6: Distributed Array Program
[Figure: N_P copies of the program, labeled Pid = 0, 1, ..., Np-1, each holding one block of the distributed array]
Use P() notation (or map) to make a distributed array.
Tells the program which dimension to distribute the data over.
Each program implicitly operates on only its own data (owner-computes rule).
Math:
    Y = X + 1,  X,Y : P(N) x N
pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    Y(:,:) = X + 1;

Slide-7: Explicitly Local Program
Use .loc notation (or the local function) to explicitly retrieve the local part of a distributed array.
The operation is the same as the serial program, but with different data on each processor (recommended approach).
Math:
    Y.loc = X.loc + 1,  X,Y : P(N) x N
pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    Xloc = local(zeros(N,N,XYmap));
    Yloc = local(zeros(N,N,XYmap));
    Yloc(:,:) = Xloc + 1;
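If the locally computed result must go back into a distributed array (for example, before a later redistribution), the put_local function listed on Slide-27 can be used. This is only a sketch; the put_local(Y, Yloc) calling convention is assumed, not shown on the slides:

    XYmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    Xloc = local(X);          % pull out this processor's block
    Yloc = Xloc + 1;          % purely local computation, no communication
    Y = put_local(Y, Yloc);   % assumed signature: write the local block back into Y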

Slide-8: Outline
Parallel Design
Distributed Arrays (this section: Maps, Redistribution)
Concurrency vs Locality
Execution
Summary

Slide-9: Parallel Data Maps
A map is a mapping of array indices to processors.
Can be block, cyclic, block-cyclic, or block with overlap.
Use P() notation (or map) to set which dimension to split among processors.
Math / Matlab:
    P(N) x N       Xmap = map([Np 1],{},0:Np-1)
    N x P(N)       Xmap = map([1 Np],{},0:Np-1)
    P(N) x P(N)    Xmap = map([Np/2 2],{},0:Np-1)
[Figure: each map shown as an array laid over the computer's processor grid, with the Pid owning each block]
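A minimal sketch that builds all three maps above and a distributed array on each (assuming Np is even so the Np/2-by-2 grid is valid):

    N = 8;                                   % any size that the grids divide evenly
    rowMap  = map([Np 1],  {}, 0:Np-1);      % P(N) x N : rows split among processors
    colMap  = map([1 Np],  {}, 0:Np-1);      % N x P(N) : columns split among processors
    gridMap = map([Np/2 2],{}, 0:Np-1);      % P(N) x P(N) : 2-D block grid
    A = zeros(N,N,rowMap);
    B = zeros(N,N,colMap);
    C = zeros(N,N,gridMap);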

Slide-10: Maps and Distributed Arrays
A processor map for a numerical array is an assignment of blocks of data to processing elements.
    Amap = map([Np 1],{},0:Np-1);
        Processor grid: [Np 1]
        Distribution: {} = default = block
        List of processors: 0:Np-1
pMatlab constructors are overloaded to take a map as an argument and return a distributed array.
    A = zeros(4,6,Amap);
[Figure: the 4 x 6 array A split row-wise into blocks owned by P0, P1, P2, P3]
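A small sketch of how this block assignment can be inspected from inside the program, assuming local (listed later on Slide-27) returns this copy's block and Pid is provided by the environment:

    Amap = map([Np 1],{},0:Np-1);
    A    = zeros(4,6,Amap);            % the slide's 4 x 6 example array
    Aloc = local(A);                   % e.g., a 1 x 6 block when Np = 4
    disp(['Copy ' num2str(Pid) ' holds a ' mat2str(size(Aloc)) ' block of A']);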

Slide-11: Advantages of Maps
Maps are scalable. Changing the number of processors or distribution does not change the application:
    %Application
    A = rand(M,N,map);
    B = fft(A);
Maps support different algorithms. Different parallel algorithms have different optimal mappings (the figure pairs an FFT along columns and a matrix multiply with the maps MAP1 and MAP2):
    map1 = map([Np 1],{},0:Np-1)
    map2 = map([1 Np],{},0:Np-1)
Maps allow users to set up pipelines in the code (implicit task parallelism).
[Figure: a four-stage pipeline foo1 → foo2 → foo3 → foo4, each stage given its own map on its own processors, e.g. map([2 2],{},0:3), map([2 2],{},0), map([2 2],{},1), map([2 2],{},2), map([2 2],{},3)]

Slide-12: Redistribution of Data
Different distributed arrays can have different maps.
Assignment between arrays with the "=" operator causes data to be redistributed.
The underlying library determines all the messages to send.
Math:
    Y = X + 1,  X : P(N) x N,  Y : N x P(N)
pMatlab:
    Xmap = map([Np 1],{},0:Np-1);
    Ymap = map([1 Np],{},0:Np-1);
    X = zeros(N,N,Xmap);
    Y = zeros(N,N,Ymap);
    Y(:,:) = X + 1;
[Figure: X distributed by rows over P0-P3 and Y distributed by columns over P0-P3, showing the data each processor must send during the assignment]

Slide-13: Outline
Parallel Design
Distributed Arrays
Concurrency vs Locality (this section: Definition, Example, Metrics)
Execution
Summary

Slide-14: Definitions
Parallel Concurrency: the number of operations that can be done in parallel (i.e., no dependencies). Measured with: degrees of parallelism.
Parallel Locality: whether the data for the operations is local to the processor. Measured with the ratio: Computation/Communication = (Work)/(Data Moved).
Concurrency is ubiquitous; "easy" to find.
Locality is harder to find, but is the key to performance.
Distributed arrays derive concurrency from locality.

Slide-15: Serial
Concurrency: max degrees of parallelism = N^2
Locality:
–Work = N^2
–Data Moved: depends upon map
Math:
    X,Y : N x N
    for i=1:N, for j=1:N:  Y(i,j) = X(i,j) + 1
Matlab:
    X = zeros(N,N);
    Y = zeros(N,N);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Slide-16: 1D distribution
Concurrency: degrees of parallelism = min(N, N_P)
Locality: Work = N^2, Data Moved = 0
Computation/Communication = Work/(Data Moved) → ∞
Math:
    X,Y : P(N) x N
    for i=1:N, for j=1:N:  Y(i,j) = X(i,j) + 1
pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Slide-17: 2D distribution
Concurrency: degrees of parallelism = min(N^2, N_P)
Locality: Work = N^2, Data Moved = 0
Computation/Communication = Work/(Data Moved) → ∞
Math:
    X,Y : P(N) x P(N)
    for i=1:N, for j=1:N:  Y(i,j) = X(i,j) + 1
pMatlab:
    XYmap = map([Np/2 2],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Slide-18: 2D Explicitly Local
Concurrency: degrees of parallelism = min(N^2, N_P)
Locality: Work = N^2, Data Moved = 0
Computation/Communication = Work/(Data Moved) → ∞
Math:
    X,Y : P(N) x P(N)
    for i=1:size(X.loc,1), for j=1:size(X.loc,2):  Y.loc(i,j) = X.loc(i,j) + 1
pMatlab:
    XYmap = map([Np/2 2],{},0:Np-1);
    Xloc = local(zeros(N,N,XYmap));
    Yloc = local(zeros(N,N,XYmap));
    for i=1:size(Xloc,1)
      for j=1:size(Xloc,2)
        Yloc(i,j) = Xloc(i,j) + 1;
      end
    end

Slide-19: 1D with Redistribution
Concurrency: degrees of parallelism = min(N, N_P)
Locality: Work = N^2, Data Moved = N^2
Computation/Communication = Work/(Data Moved) = 1
Math:
    X : P(N) x N,  Y : N x P(N)
    for i=1:N, for j=1:N:  Y(i,j) = X(i,j) + 1
pMatlab:
    Xmap = map([Np 1],{},0:Np-1);
    Ymap = map([1 Np],{},0:Np-1);
    X = zeros(N,N,Xmap);
    Y = zeros(N,N,Ymap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Slide-20: Outline
Parallel Design
Distributed Arrays
Concurrency vs Locality
Execution (this section: Four Step Process, Speedup, Amdahl's Law, Performance vs Effort, Portability)
Summary

Slide-21: Running
Four steps to taking a serial Matlab program and making it a parallel Matlab program:
Start Matlab
–Type: cd examples/AddOne
Run dAddOne
–Edit pAddOne.m and set: PARALLEL = 0;
–Type: pRUN('pAddOne',1,{})
Repeat with: PARALLEL = 1;
Repeat with: pRUN('pAddOne',2,{});
Repeat with: pRUN('pAddOne',2,{'cluster'});
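The pAddOne.m shipped with the pMatlab examples is not reproduced on the slide; the following is only a hypothetical sketch of the pattern such a script follows, built from the PARALLEL flag above and the constructors shown on Slides 4-6:

    % Hypothetical pAddOne-style script (illustration only, not the distribution file)
    PARALLEL = 1;                      % 0 = plain Matlab arrays, 1 = distributed arrays
    N = 1024;
    XYmap = 1;                         % with map = 1, zeros(N,N,1) is an ordinary N x N array
    if PARALLEL
      XYmap = map([Np 1],{},0:Np-1);   % distribute rows across the Np Matlab copies
    end
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    Y(:,:) = X + 1;                    % owner computes: each copy updates only its own block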

Slide-22: Parallel Debugging Processes
Simple four-step process for debugging a parallel program:
[Figure: Serial Matlab → (Step 1: Add DMATs) Serial pMatlab → (Step 2: Add Maps) Mapped pMatlab → (Step 3: Add Matlabs) Parallel pMatlab → (Step 4: Add CPUs) Optimized pMatlab, checking functional correctness, pMatlab correctness, parallel correctness, and performance respectively]
Step 1: Add distributed matrices without maps, verify functional correctness. PARALLEL=0; pRUN('pAddOne',1,{});
Step 2: Add maps, run on 1 processor, verify parallel correctness, compare performance with Step 1. PARALLEL=1; pRUN('pAddOne',1,{});
Step 3: Run with more processes, verify parallel correctness. PARALLEL=1; pRUN('pAddOne',2,{});
Step 4: Run with more processors, compare performance with Step 2. PARALLEL=1; pRUN('pAddOne',2,{'cluster'});
Always debug at the earliest step possible (it takes less time).

Slide-23: Timing
Run the program while doubling the number of processors, and record the execution time each run:
Run dAddOne: pRUN('pAddOne',1,{'cluster'}); record processing_time.
Repeat with: pRUN('pAddOne',2,{'cluster'}); record processing_time.
Repeat with: pRUN('pAddOne',4,{'cluster'}); record processing_time.
Repeat with: pRUN('pAddOne',8,{'cluster'}); record processing_time.
Repeat with: pRUN('pAddOne',16,{'cluster'}); record processing_time.
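The slides do not show how processing_time is produced; a minimal sketch of one way to time the computation from inside the script with standard Matlab tic/toc, assuming Pid is provided by the environment (the variable name matches the slide):

    t0 = tic;                           % start the clock just before the measured operation
    Y(:,:) = X + 1;                     % the computation being timed
    processing_time = toc(t0);          % elapsed wall-clock seconds on this copy
    disp(['processing_time = ' num2str(processing_time) ' s on copy ' num2str(Pid)]);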

Slide-24: Computing Speedup
[Figure: speedup vs. number of processors, showing measured speedup rising and then saturating]
Speedup formula: Speedup(N_P) = Time(N_P = 1) / Time(N_P)
Goal is linear speedup.
All programs saturate at some value of N_P.
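A small sketch applying the formula to the times recorded on Slide-23 (the numbers below are placeholders for illustration, not measurements):

    Np_list = [1 2 4 8 16];
    T       = [8.0 4.1 2.2 1.3 0.9];                          % recorded processing_time values (illustrative only)
    speedup = T(1) ./ T;                                      % Speedup(N_P) = Time(N_P=1)/Time(N_P)
    loglog(Np_list, speedup, 'o-', Np_list, Np_list, '--');   % measured speedup vs. linear speedup
    xlabel('Number of Processors'); ylabel('Speedup');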

Slide-25: Amdahl's Law
[Figure: speedup vs. processors for serial fraction w_| = 0.1, reaching S_max/2 at N_P = w_|^-1 and leveling off at S_max, compared with linear speedup]
Divide the work into parallel (w_||) and serial (w_|) fractions.
The serial fraction sets the maximum speedup: S_max = w_|^-1.
Likewise: Speedup(N_P = w_|^-1) = S_max/2.
Example: for w_| = 0.1, S_max = 10, and the speedup on N_P = 10 processors is only about 5.
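A short sketch of the curve behind the figure, using the standard Amdahl form S(N_P) = 1 / (w_| + (1 - w_|)/N_P):

    w_ser   = 0.1;                                   % serial fraction w_|
    Np_list = 1:256;
    S       = 1 ./ (w_ser + (1 - w_ser) ./ Np_list); % Amdahl speedup curve
    Smax    = 1 / w_ser;                             % asymptotic maximum speedup
    semilogx(Np_list, S, Np_list, Smax*ones(size(Np_list)), '--');
    xlabel('Processors'); ylabel('Speedup');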

Slide-26: HPC Challenge Speedup vs Effort
[Figure: speedup vs. relative code size (relative to serial C) for the HPC Challenge benchmarks (STREAM, FFT, HPL, Random Access) implemented in C + MPI, Matlab, and pMatlab; ideal speedup = 128]
The ultimate goal is speedup with minimum effort.
HPC Challenge benchmark data shows that pMatlab can deliver high performance with a low code size.

Slide-27: Portable Parallel Programming
Universal parallel Matlab programming: pMatlab runs in all parallel Matlab environments.
Only a few functions are needed:
–Np
–Pid
–map
–local
–put_local
–global_index
–agg
–SendMsg/RecvMsg
Example:
    Amap = map([Np 1],{},0:Np-1);
    Bmap = map([1 Np],{},0:Np-1);
    A = rand(M,N,Amap);
    B = zeros(M,N,Bmap);
    B(:,:) = fft(A);
[Reference: Jeremy Kepner, Parallel Programming in pMatlab]
Only a small number of distributed array functions are necessary to write nearly all parallel programs.
Restricting programs to a small set of functions allows parallel programs to run efficiently on the widest range of platforms.
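The slide's example exercises map, rand, zeros, and fft; a minimal sketch of two more functions from the list, local and agg, whose exact behavior is assumed here rather than shown on the slides:

    M = 8; N = 8;                      % small sizes for illustration
    Amap = map([Np 1],{},0:Np-1);
    A    = rand(M,N,Amap);             % distributed array, as in the slide's example
    Aloc = local(A);                   % this copy's block of A
    Aall = agg(A);                     % assumed: gathers the full M x N array on the leader copy
    if Pid == 0
      disp(size(Aall));                % only the leader inspects the aggregated result
    end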

Slide-28: Summary
[Figure: the four-step process from Slide-22: Serial MATLAB → (Step 1: Add DMATs, functional correctness) Serial pMatlab → (Step 2: Add Maps, pMatlab correctness) Mapped pMatlab → (Step 3: Add Matlabs, parallel correctness) Parallel pMatlab → (Step 4: Add CPUs, performance) Optimized pMatlab; motto: get it right, then make it fast]
Distributed arrays eliminate most of the parallel coding burden.
Writing well-performing programs still requires expertise.
Experts rely on several key concepts:
–Concurrency vs Locality
–Measuring Speedup
–Amdahl's Law
Four-step process for developing programs:
–Minimizes debugging time
–Maximizes performance