Multicore Programming in pMatlab using Distributed Arrays

Presentation transcript:

Multicore Programming in pMatlab using Distributed Arrays
Jeremy Kepner, MIT Lincoln Laboratory

This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Goal: Think Matrices not Messages

- In the past, writing well-performing parallel programs has required a lot of code and a lot of expertise
- pMatlab distributed arrays eliminate the coding burden
- However, making programs run fast still requires expertise
- This talk illustrates the key math concepts experts use to make parallel programs perform well

[Figure: performance speedup (0.1 to 100, up to the acceptable level and the hardware limit) vs programmer effort (hours, days, weeks, months) for expert and novice programmers]

The goal of the presentation is to show the techniques for writing well-performing parallel Matlab programs.

Outline

- Parallel Design
  - Serial Program
  - Parallel Execution
  - Distributed Arrays
  - Explicitly Local
- Distributed Arrays
- Concurrency vs Locality
- Execution
- Summary

Serial Program

Math:
    X,Y : NxN
    Y = X + 1

Matlab:
    X = zeros(N,N);
    Y = zeros(N,N);
    Y(:,:) = X + 1;

Mathematical notation and the corresponding Matlab code for a simple calculation.

- Matlab is a high-level language
- Allows mathematical expressions to be written concisely
- Multi-dimensional arrays are fundamental to Matlab

Parallel Execution

Math:
    X,Y : NxN
    Y = X + 1

pMatlab (the same code runs on every copy, Pid = 0, 1, ..., Np-1):
    X = zeros(N,N);
    Y = zeros(N,N);
    Y(:,:) = X + 1;

Mathematical notation and the corresponding Matlab code for a simple parallel calculation with replicated arrays.

- Run Np copies of the same program: Single Program Multiple Data (SPMD)
- Each copy has a unique Pid
- Every array is replicated on each copy of the program
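To make the SPMD model concrete, here is a minimal sketch, assuming the Pid and Np values that pMatlab provides to each copy (as described above):

    % Every one of the Np copies runs this same line (SPMD);
    % only the value of Pid differs from copy to copy.
    disp(['Copy ' num2str(Pid) ' of ' num2str(Np) ' computed Y = X + 1']);

Since every array is replicated, each copy computes the full Y = X + 1 redundantly; the distributed arrays on the next slide remove that redundancy.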

Distributed Array Program

Math:
    X,Y : P(N)xN
    Y = X + 1

pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    Y(:,:) = X + 1;

Mathematical notation and the corresponding Matlab code for a simple parallel calculation with distributed arrays.

- Use P() notation (or map) to make a distributed array
- Tells the program which dimension to distribute the data over
- Each program implicitly operates on only its own data (owner-computes rule)

Explicitly Local Program

Math:
    X,Y : P(N)xN
    Y.loc = X.loc + 1

pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    Xloc = local(zeros(N,N,XYmap));
    Yloc = local(zeros(N,N,XYmap));
    Yloc(:,:) = Xloc + 1;

Mathematical notation and the corresponding pMatlab code for a simple parallel calculation with distributed, explicitly local arrays.

- Use .loc notation (or the local function) to explicitly retrieve the local part of a distributed array
- The operation is the same as in the serial program, but each processor works on different data (recommended approach)
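A minimal sketch of how one might check this behavior, assuming the map and local functions shown above (the disp output is purely illustrative):

    % Each copy owns a different block of rows but runs identical serial code
    XYmap = map([Np 1],{},0:Np-1);
    Xloc  = local(zeros(N,N,XYmap));    % this copy's rows only
    disp(size(Xloc));                   % roughly [N/Np, N] on every Pid
    Yloc  = Xloc + 1;                   % same serial operation, different data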

Outline

- Parallel Design
- Distributed Arrays
  - Maps
  - Redistribution
- Concurrency vs Locality
- Execution
- Summary

Parallel Data Maps

Mathematical notation and the corresponding Matlab code for different parallel data distributions:

    Math         Matlab
    P(N)xN       Xmap = map([Np 1],{},0:Np-1)
    NxP(N)       Xmap = map([1 Np],{},0:Np-1)
    P(N)xP(N)    Xmap = map([Np/2 2],{},0:Np-1)

[Figure: each distribution laid out across a computer with processors Pid = 0, 1, 2, 3]

- A map is a mapping of array indices to processors
- Maps can be block, cyclic, block-cyclic, or block with overlap
- Use P() notation (or map) to set which dimension to split among processors
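A minimal sketch putting the three maps side by side, assuming the constructors shown above (note that the P(N)xP(N) grid map assumes Np is even):

    % Three distributions of the same N x N array
    XmapRow  = map([Np 1],{},0:Np-1);     % P(N)xN   : rows split across Pids
    XmapCol  = map([1 Np],{},0:Np-1);     % NxP(N)   : columns split across Pids
    XmapGrid = map([Np/2 2],{},0:Np-1);   % P(N)xP(N): 2-D block grid (Np even)
    Xrow  = zeros(N,N,XmapRow);
    Xcol  = zeros(N,N,XmapCol);
    Xgrid = zeros(N,N,XmapGrid);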

Maps and Distributed Arrays

A processor map for a numerical array is an assignment of blocks of data to processing elements:

    Amap = map([Np 1],{},0:Np-1);
    %          |       |   |
    %          |       |   list of processors
    %          |       distribution ({} = default = block)
    %          processor grid

pMatlab introduces a new datatype, the distributed array, which:
1. Duplicates the "look and feel" of the Matlab matrix
2. Adds the capability to automatically parallelize a matrix

A map is an object that concisely describes how a matrix is partitioned into blocks and how the blocks are assigned to different processors. pMatlab constructors are overloaded to take a map as an argument and return a distributed array:

    A = zeros(4,6,Amap);

[Figure: the 4x6 array A split into four 1x6 row blocks, owned by P0, P1, P2, P3]
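A minimal sketch of the slide's 4x6 example, assuming a run with Np = 4 (the comments describe the block assignment pictured above):

    Amap = map([Np 1],{},0:Np-1);   % grid [Np 1], block distribution, processors 0:Np-1
    A = zeros(4,6,Amap);            % with Np = 4, each Pid owns one 1x6 row block
    Aloc = local(A);                % P0 holds A(1,:), P1 holds A(2,:), and so on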

Advantages of Maps

- Maps are scalable. Changing the number of processors or the distribution does not change the application:

      % Application
      A = rand(M,map<i>);
      B = fft(A);

      map1 = map([Np 1],{},0:Np-1)
      map2 = map([1 Np],{},0:Np-1)

  Scalability is achieved because the task of writing the application is separated from the task of mapping the application (see the sketch below).

- Maps support different algorithms. Different parallel algorithms have different optimal mappings, e.g. a matrix multiply using map([2 2],{},0:3) vs an FFT along columns using map([2 2],{},[0 2 1 3]).

- Maps allow users to set up pipelines in the code (implicit task parallelism), e.g. stages foo1, foo2, foo3, foo4 mapped to map([2 2],{},0), map([2 2],{},1), map([2 2],{},2), map([2 2],{},3).
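A minimal sketch of the scalability point, with map1/map2 standing in for the slide's map<i> placeholder (M and N are illustrative dimensions): the application lines stay fixed while only the chosen map changes.

    % Application (unchanged as the mapping varies)
    map1 = map([Np 1],{},0:Np-1);   % rows distributed
    map2 = map([1 Np],{},0:Np-1);   % columns distributed
    A = rand(M,N,map1);             % swap in map2 here; nothing else changes
    B = fft(A);                     % FFT along columns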

Redistribution of Data

Math:
    X : P(N)xN
    Y : NxP(N)
    Y = X + 1

pMatlab:
    Xmap = map([Np 1],{},0:Np-1);
    Ymap = map([1 Np],{},0:Np-1);
    X = zeros(N,N,Xmap);
    Y = zeros(N,N,Ymap);
    Y(:,:) = X + 1;

[Figure: X split by rows across P0-P3, Y split by columns across P0-P3, and the data sent between each pair of processors]

Mathematical notation and the corresponding Matlab code for redistributing data in a simple parallel program.

- Different distributed arrays can have different maps
- Assignment between arrays with the "=" operator causes data to be redistributed
- The underlying library determines all the messages to send
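A minimal sketch of how one might verify the redistribution end to end, assuming the agg function (listed later in this talk) gathers the full array on the leader process:

    Xmap = map([Np 1],{},0:Np-1);
    Ymap = map([1 Np],{},0:Np-1);
    X = rand(N,N,Xmap);
    Y = zeros(N,N,Ymap);
    Y(:,:) = X + 1;        % the library sends all the required messages
    Xall = agg(X);         % gather both arrays on the leader
    Yall = agg(Y);
    if Pid == 0
      disp(max(max(abs(Yall - (Xall + 1)))));   % expect 0
    end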

Outline

- Parallel Design
- Distributed Arrays
- Concurrency vs Locality
  - Definition
  - Example
  - Metrics
- Execution
- Summary

Definitions

- Parallel Concurrency: the number of operations that can be done in parallel (i.e., no dependencies). Measured with: degrees of parallelism
- Parallel Locality: whether the data for the operations is local to the processor. Measured with the ratio: Computation/Communication = (Work)/(Data Moved)

Definitions of parallel concurrency and parallel locality.

- Concurrency is ubiquitous; "easy" to find
- Locality is harder to find, but is the key to performance
- Distributed arrays derive concurrency from locality

Serial

Math:
    X,Y : NxN
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1

Matlab:
    X = zeros(N,N);
    Y = zeros(N,N);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Concurrency and locality calculation for a simple serial program.

- Concurrency: max degrees of parallelism = N^2
- Locality: Work = N^2; Data Moved depends upon the map

1D Distribution

Math:
    X,Y : P(N)xN
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1

pMatlab:
    XYmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Concurrency and locality calculation for a simple 1D parallel program.

- Concurrency: degrees of parallelism = min(N,Np)
- Locality: Work = N^2, Data Moved = 0
- Computation/Communication = Work/(Data Moved) → ∞

2D Distribution

Math:
    X,Y : P(N)xP(N)
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1

pMatlab:
    XYmap = map([Np/2 2],{},0:Np-1);
    X = zeros(N,N,XYmap);
    Y = zeros(N,N,XYmap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Concurrency and locality calculation for a simple 2D parallel program.

- Concurrency: degrees of parallelism = min(N^2,Np)
- Locality: Work = N^2, Data Moved = 0
- Computation/Communication = Work/(Data Moved) → ∞

2D Explicitly Local

Math:
    X,Y : P(N)xP(N)
    for i=1:size(X.loc,1)
      for j=1:size(X.loc,2)
        Y.loc(i,j) = X.loc(i,j) + 1

pMatlab:
    XYmap = map([Np/2 2],{},0:Np-1);
    Xloc = local(zeros(N,N,XYmap));
    Yloc = local(zeros(N,N,XYmap));
    for i=1:size(Xloc,1)
      for j=1:size(Xloc,2)
        Yloc(i,j) = Xloc(i,j) + 1;
      end
    end

Concurrency and locality calculation for a simple 2D explicitly local parallel program.

- Concurrency: degrees of parallelism = min(N^2,Np)
- Locality: Work = N^2, Data Moved = 0
- Computation/Communication = Work/(Data Moved) → ∞

1D with Redistribution

Math:
    X : P(N)xN
    Y : NxP(N)
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1

pMatlab:
    Xmap = map([Np 1],{},0:Np-1);
    Ymap = map([1 Np],{},0:Np-1);
    X = zeros(N,N,Xmap);
    Y = zeros(N,N,Ymap);
    for i=1:N
      for j=1:N
        Y(i,j) = X(i,j) + 1;
      end
    end

Concurrency and locality calculation for a simple 1D parallel program with redistribution.

- Concurrency: degrees of parallelism = min(N,Np)
- Locality: Work = N^2, Data Moved = N^2
- Computation/Communication = Work/(Data Moved) = 1

Outline

- Parallel Design
- Distributed Arrays
- Concurrency vs Locality
- Execution
  - Four Step Process
  - Speedup
  - Amdahl's Law
  - Performance vs Effort
  - Portability
- Summary

Running

- Start Matlab
- Type: cd examples/AddOne
- Edit pAddOne.m and set: PARALLEL = 0;
- Type: pRUN('pAddOne',1,{})
- Repeat with: PARALLEL = 1;
- Repeat with: pRUN('pAddOne',2,{})
- Repeat with: pRUN('pAddOne',2,{'cluster'})

Example Matlab code for the four-step parallel debugging process: four steps take a serial Matlab program to a parallel Matlab program. A sketch of what the script itself might look like follows below.
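The slides do not list pAddOne.m itself; the following is a hypothetical sketch of what such a script could contain, assuming the PARALLEL flag convention used above (passing 1 instead of a map leaves zeros as an ordinary Matlab call):

    % pAddOne.m -- hypothetical sketch of the AddOne example
    PARALLEL = 1;                        % set to 0 for the first debugging step
    N = 1024;
    XYmap = 1;                           % no map: plain serial Matlab arrays
    if PARALLEL
      XYmap = map([Np 1],{},0:Np-1);     % distribute rows across processors
    end
    X = zeros(N,N,XYmap);                % zeros(N,N,1) is just an N x N matrix
    Y = zeros(N,N,XYmap);
    Y(:,:) = X + 1;                      % owner computes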

Parallel Debugging Processes

A simple four-step process (methodology) for debugging a parallel application. Always debug at the earliest step possible: it takes less time.

Step 1: Add DMATs (Serial Matlab → Serial pMatlab)
    Add distributed matrices without maps; verify functional correctness.
    PARALLEL=0; pRUN('pAddOne',1,{});

Step 2: Add Maps (→ Mapped pMatlab)
    Add maps, run on 1 processor, verify pMatlab correctness, compare performance with Step 1.
    PARALLEL=1; pRUN('pAddOne',1,{});

Step 3: Add Matlabs (→ Parallel pMatlab)
    Run with more processes; verify parallel correctness.
    PARALLEL=1; pRUN('pAddOne',2,{});

Step 4: Add CPUs (→ Optimized pMatlab)
    Run with more processors; compare performance with Step 2.
    PARALLEL=1; pRUN('pAddOne',2,{'cluster'});

Timing

- Run dAddOne: pRUN('pAddOne',1,{'cluster'}); record processing_time
- Repeat with: pRUN('pAddOne',2,{'cluster'});
- Repeat with: pRUN('pAddOne',4,{'cluster'});
- Repeat with: pRUN('pAddOne',8,{'cluster'});
- Repeat with: pRUN('pAddOne',16,{'cluster'});

Example Matlab code for timing a parallel code: run the program while doubling the number of processors, and record the execution time of each run.

Computing Speedup

[Figure: speedup vs number of processors for the timed runs]

Computing speedup for a parallel program.

- Speedup formula: Speedup(Np) = Time(Np=1)/Time(Np)
- The goal is linear speedup
- All programs saturate at some value of Np
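A minimal sketch of the speedup bookkeeping, assuming the processing_time values recorded on the Timing slide (t1..t16 are illustrative variable names):

    np   = [1 2 4 8 16];           % processor counts from the timed runs
    time = [t1 t2 t4 t8 t16];      % processing_time recorded for each run
    speedup = time(1) ./ time;     % Speedup(Np) = Time(Np=1)/Time(Np)
    loglog(np, speedup, 'o-', np, np, '--');   % measured vs linear speedup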

Amdahl's Law

[Figure: speedup vs processors, showing the linear-speedup line and saturation at Smax for a serial fraction w| = 0.1, with Smax/2 reached at Np = w|^-1]

Definition and illustration of Amdahl's Law.

- Divide the work into a parallel fraction (w||) and a serial fraction (w|)
- The serial fraction sets the maximum speedup: Smax = w|^-1
- Likewise: Speedup(Np = w|^-1) = Smax/2
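A minimal sketch of the law itself, using the slide's serial fraction w| = 0.1 (written w_ser below for a valid variable name):

    w_ser = 0.1;                           % serial fraction w|
    np = 1:128;
    S = 1 ./ (w_ser + (1 - w_ser)./np);    % Speedup(Np)
    S_max = 1/w_ser;                       % maximum speedup = 10
    % At Np = 1/w_ser = 10, S is about 5.3, i.e. roughly S_max/2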

HPC Challenge Speedup vs Effort

[Figure: speedup (0.0001 to 1000, ideal speedup = 128) vs relative code size for the HPC Challenge benchmarks STREAM, FFT, HPL, and RandomAccess, implemented in C + MPI, serial C, serial Matlab, and pMatlab]

Examples of the performance achieved vs the code size for different parallel codes.

- The ultimate goal is speedup with minimum effort
- HPC Challenge benchmark data show that pMatlab can deliver high performance with a low code size

Portable Parallel Programming

Universal API for parallel Matlab programming:

    Amap = map([Np 1],{},0:Np-1);
    Bmap = map([1 Np],{},0:Np-1);
    A = rand(M,N,Amap);
    B = zeros(M,N,Bmap);
    B(:,:) = fft(A);

- pMatlab runs in all parallel Matlab environments
- Only a few functions are needed: Np, Pid, map, local, put_local, global_index, agg, SendMsg/RecvMsg
- Only a small number of distributed array functions are necessary to write nearly all parallel programs
- Restricting programs to a small set of functions allows parallel programs to run efficiently on the widest range of platforms
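A minimal sketch exercising several of the core functions listed above, assuming the semantics described in these slides (with agg gathering the full array on the leader process):

    Xmap = map([Np 1],{},0:Np-1);
    X = zeros(N,N,Xmap);
    Xloc = local(X);            % this copy's block of rows
    Xloc(:,:) = Pid;            % mark each block with its owner's id
    X = put_local(X,Xloc);      % write the block back into the distributed array
    Xall = agg(X);              % on the leader: row blocks filled with 0..Np-1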

Summary

- Distributed arrays eliminate most of the parallel coding burden
- Writing well-performing programs still requires expertise
- Experts rely on several key concepts:
  - Concurrency vs Locality
  - Measuring Speedup
  - Amdahl's Law
- The four-step process for developing programs minimizes debugging time and maximizes performance:

    Serial MATLAB
      → Step 1: Add DMATs   → Serial pMatlab    (functional correctness)
      → Step 2: Add Maps    → Mapped pMatlab    (pMatlab correctness)
      → Step 3: Add Matlabs → Parallel pMatlab  (parallel correctness)
      → Step 4: Add CPUs    → Optimized pMatlab (performance)

    Get It Right → Make It Fast