Outline

- Speeding up Matlab Computations
- Symmetric Multi-Processing with Matlab
- Accelerating Matlab computations with GPUs
- Running Matlab in distributed memory environments
    - Using the Parallel Computing Toolbox
    - Using the Matlab Distributed Compute Engine Server
    - Using pMatlab
- Mixing Matlab and Fortran/C code
- Compiling MEX code from C/Fortran
- Turning Matlab routines into C code

Symmetric Multi-Processing

By default Matlab uses all cores on a given node for operations that can be threaded.

Arrays and matrices
- Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN
- Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE)
- Array operations: PROD, SUM
- Array manipulation: BSXFUN, SORT

Linear algebra
- Matrix Analysis: DET, RCOND
- Linear Equations: CHOL, INV, LDL, LINSOLVE, LU, QR
- Eigenvalues and singular values: EIG, HESS, SCHUR, SVD, QZ

Elementary math
- Trigonometric: ATAN2, COS, CSC, HYPOT, SEC, SIN, TAN, including variants for radians, degrees, hyperbolics, and inverses
- Exponential: EXP, POW2, SQRT
- Complex: ABS
- Rounding and remainder: CEIL, FIX, FLOOR, MOD, REM, ROUND
- LOG, LOG2, LOG10, LOG1P, EXPM1, SIGN, BITAND, BITOR, BITXOR

Special Functions
- ERF, ERFC, ERFCINV, ERFCX, ERFINV, GAMMA, GAMMALN

Data Analysis
- CONV2, FILTER, FFT and IFFT of multiple columns or long vectors, FFTN, IFFTN

Symmetric Multi-Processing

To be sure you only use the resources you request, you should either request an entire node and all of its CPUs:

    salloc -t 60 -c 24
    module load matlab
    srun matlab

or request a single CPU and specify that Matlab should only use a single thread:

    salloc -t 60
    module load matlab
    srun matlab -singleCompThread

Using GPUs with Matlab

Matlab can use GPUs to do calculations, provided a GPU is available on the node Matlab is running on.

    salloc -t 60 --gres=gpu:1
    module load matlab
    module load cuda
    matlab

We can query the connected GPUs from within Matlab using

    gpuDeviceCount()
    gpuDevice()

and obtain a list of GPU supported functions using

    methods('gpuArray')

Using GPUs with Matlab

So there is a 2D FFT, but no Hilbert function:

    H=hilb(1000);
    H_=gpuArray(H);          % Distribute data to GPU
    Z_=fft2(H_);             % FFT performed on GPU
    Z=gather(Z_);            % Gather data from GPU onto CPU
    imagesc(log(abs(Z)));

We could do the log and abs functions on the GPU as well:

    H=hilb(1000);
    H_=gpuArray(H);
    Z_=fft2(H_);
    imagesc(gather(log(abs(Z_))));

Using GPUs with Matlab

For our example, doing the FFT on the GPU is 7x faster (4x if you include moving the data to the GPU and back).

    >> H=hilb(5000);
    >> tic; A=gather(gpuArray(H)); toc
    Elapsed time is seconds.
    >> tic; A=gather(fft2(gpuArray(H))); toc
    Elapsed time is seconds.
    >> tic; A=fft2(H); toc
    Elapsed time is seconds.

Using GPUs with Matlab

Matlab has no built-in hilb() function that can run on the GPU, but we can write our own function (kernel) in CUDA and save it to hilbert.cu:

    __global__ void HilbertKernel(
        double * const out,
        size_t const numRows,
        size_t const numCols)
    {
        const int rowIdx = blockIdx.x * blockDim.x + threadIdx.x;
        const int colIdx = blockIdx.y * blockDim.y + threadIdx.y;
        if ( rowIdx >= numRows ) return;
        if ( colIdx >= numCols ) return;
        size_t linearIdx = rowIdx + colIdx*numRows;
        out[linearIdx] = 1.0 / (double)(1+rowIdx+colIdx) ;
    }

And compile it with nvcc to generate a Parallel Thread eXecution (PTX) file:

    nvcc -ptx hilbert.cu
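As a way to check the indexing used in the kernel above, here is a minimal CPU sketch in C of the same mapping: each (row, column) pair is written to the column-major offset row + col*numRows with the value 1/(1+row+col), using 0-based indices. The function name hilbert_fill_colmajor is hypothetical and is not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for the HilbertKernel index mapping.
 * Fills a rows-by-cols matrix stored in column-major order (as Matlab does)
 * with Hilbert values 1/(1+row+col), using 0-based row and col. */
void hilbert_fill_colmajor(double *out, size_t rows, size_t cols)
{
    assert(out != NULL);
    for (size_t col = 0; col < cols; col++) {
        for (size_t row = 0; row < rows; row++) {
            size_t linear_idx = row + col * rows;  /* same formula as the kernel */
            out[linear_idx] = 1.0 / (double)(1 + row + col);
        }
    }
}
```

On the GPU the two loops disappear: each thread computes one (row, col) pair from its block and thread indices.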

Using GPUs with Matlab

We have to initialize the kernel and specify the grid size before executing the kernel.

    H_=gpuArray.zeros(1000);
    hilbert_kernel=parallel.gpu.CUDAKernel('hilbert.ptx','hilbert.cu');
    hilbert_kernel.GridSize=size(H_);
    H_=feval(hilbert_kernel, H_, 1000,1000);
    Z_=fft2(H_);
    imagesc(gather(log(abs(Z_))));

The default for Matlab kernels is 1 thread per block, but we could create fewer blocks that were each 10 x 10 threads:

    hilbert_kernel.ThreadBlockSize=[10,10,1];
    hilbert_kernel.GridSize=[100,100];
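The grid size follows from the matrix size and the block size by ceiling division. A minimal sketch in C, with blocks_needed as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Number of thread blocks needed to cover n elements along one dimension
 * when each block holds threads_per_block threads (ceiling division). */
size_t blocks_needed(size_t n, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For 1000 rows and 10 threads per block this gives 100 blocks, matching the GridSize of [100,100] above. When the size is not a multiple of the block size, the last block is only partly used, which is why the kernel returns early for rowIdx >= numRows and colIdx >= numCols.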

Parallel Computing Toolbox

As an alternative you can also use the Parallel Computing Toolbox. This supports parallelism via MPI.

You should create a pool that is the same size as the number of processors you requested in your job submission.

MathWorks also sells licenses for a Distributed Computing Server, which allows for matlabpools that use more than just the local node.

Parallel Computing Toolbox

You can achieve parallelism in several ways:
- parfor loops: execute for loops in parallel
- spmd: execute instructions in parallel (using 'labindex' or 'drange')
- pmode: interactive version of spmd
- distributed arrays: very similar to gpuArrays

Parallel Computing Toolbox

Example using a parfor loop:

    matlabpool(4)
    parfor n=1:100
      H=hilb(n);
      Z=fft2(H);
      f=figure('Visible','off');
      imagesc(log(abs(Z)));
      print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
    end
    matlabpool close

Parallel Computing Toolbox

Example using spmd with 'drange':

    matlabpool(4)
    spmd
      for n=drange(1:100)
        H=hilb(n);
        Z=fft2(H);
        f=figure('Visible','off');
        imagesc(log(abs(Z)));
      end
    end
    matlabpool close

Example using spmd with 'labindex':

    matlabpool(4)
    spmd
      for n=labindex:numlabs:100
        H=hilb(n);
        Z=fft2(H);
        f=figure('Visible','off');
        imagesc(log(abs(Z)));
      end
    end
    matlabpool close
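The loop for n=labindex:numlabs:100 hands out iterations round-robin: lab 1 takes n = 1, 5, 9, ..., lab 2 takes n = 2, 6, 10, ..., and so on. The following C sketch illustrates that arithmetic only; the function names cyclic_owner and cyclic_count are hypothetical, and indices are 1-based to match Matlab.

```c
#include <assert.h>

/* Which lab (1-based) executes iteration n (1-based) of
 *   for n = labindex:numlabs:N                                  */
int cyclic_owner(int n, int numlabs)
{
    assert(n >= 1 && numlabs >= 1);
    return (n - 1) % numlabs + 1;
}

/* How many of the iterations 1..total a given lab executes. */
int cyclic_count(int total, int numlabs, int labindex)
{
    assert(numlabs >= 1 && labindex >= 1 && labindex <= numlabs);
    if (labindex > total) {
        return 0;  /* more labs than iterations: this lab is idle */
    }
    return (total - labindex) / numlabs + 1;
}
```

With 100 iterations on 4 labs every lab gets 25 iterations. Because hilb(n) and fft2 get more expensive as n grows, a round-robin split spreads the expensive iterations across all labs.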

Parallel Computing Toolbox

Example using pmode. Start the interactive session from the client:

    pmode start 4

Commands typed in the pmode window run on every lab:

    n=labindex;
    H=hilb(n);
    Z=fft2(H);
    f=figure('Visible','off');
    imagesc(log(abs(Z)));
    print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));

Copy H from lab 3 to the client as H3, display it, and close the session:

    pmode lab2client H 3 H3
    H3
    pmode close

Parallel Computing Toolbox

Example using gpuArray:

    H=hilb(1000);
    H_=gpuArray(H);
    Z_=fft2(H_);
    Z=gather(Z_);
    imagesc(log(abs(Z)));

Example using distributed arrays:

    matlabpool(8)
    H=hilb(1000);
    H_=distributed(H);
    Z_=fft(fft(H_,[],1),[],2);
    Z=gather(Z_);
    imagesc(log(abs(Z)));
    matlabpool close

Parallel Computing Toolbox

What about building the Hilbert matrix in parallel?

    matlabpool(4)
    spmd
      % Define partition
      codist=codistributor1d(1,[250,250,250,250],[1000,1000]);
      % Get local indices in x-direction
      [i_lo, i_hi]=codist.globalIndices(1);
      % Allocate space for local part
      H_local=zeros(250,1000);
      % Initialize local array with Hilbert values
      for i=i_lo:i_hi
        for j=1:1000
          H_local(i-i_lo+1,j)=1/(i+j-1);
        end
      end
      % Assemble codistributed array
      H_ = codistributed.build(H_local, codist);
    end
    % Now it's distributed like before!
    Z_=fft(fft(H_,[],1),[],2);
    Z=gather(Z_);
    imagesc(log(abs(Z)));
    matlabpool close
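The index bookkeeping in this example (which global rows a lab owns, and where global row i lands in H_local) can be sketched in C as follows. This assumes equal contiguous blocks, as in the [250,250,250,250] partition; block_range and global_to_local are hypothetical names, and indices are 1-based to match Matlab.

```c
#include <assert.h>

/* Global row range [*lo, *hi] (1-based, inclusive) owned by a lab when
 * `total` rows are split into equal contiguous blocks.
 * Assumes numlabs divides total evenly. */
void block_range(int total, int numlabs, int labindex, int *lo, int *hi)
{
    assert(numlabs >= 1 && labindex >= 1 && labindex <= numlabs);
    assert(total % numlabs == 0);
    int block = total / numlabs;
    *lo = (labindex - 1) * block + 1;
    *hi = labindex * block;
}

/* Local (1-based) row index of global row i on the lab whose block starts
 * at lo; this is the i-i_lo+1 used when filling H_local. */
int global_to_local(int i, int lo)
{
    assert(i >= lo);
    return i - lo + 1;
}
```

For 1000 rows on 4 labs, lab 3 owns global rows 501 to 750, which map to local rows 1 to 250.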

Using pMatlab

pMatlab is an alternative method to get distributed Matlab functionality without relying on Matlab's Distributed Computing Server.

It is built on top of MatlabMPI (an MPI implementation for Matlab, written in Matlab, that uses file I/O for communication).

It supports various operations on distributed arrays (up to 4D):
- Remapping, aggregating, finding non-zero entries, transposing, ghosting
- Elementary math functions (trig, exponential, complex, remainder/rounding)
- 2D convolutions, FFTs, Discrete Cosine Transform
- FFTs need to be properly mapped (the array cannot be distributed along the transform dimension)

It does not have as much functionality as the Parallel Computing Toolbox, but it does support ghosting and more flexible partitioning!

Using pMatlab

Since pMatlab works by launching other Matlab instances, we need them to start up with pMatlab functionality. To do so we need to add a few lines to the startup.m file in our Matlab path:

    addpath('/software/pMatlab/MatlabMPI/src');
    addpath('/software/pMatlab/src');
    rehash;
    pMatlabGlobalsInit;

Running pMatlab in Batch Mode

To submit a job in batch mode we need to create a batch script (sample_script.pbs):

    #!/bin/bash
    #SBATCH -J Matlab
    #SBATCH -p standard
    #SBATCH -t 60
    #SBATCH -N 2
    #SBATCH --ntasks-per-node=8
    module load matlab
    matlab -nodisplay -r "pmatlab_launcher"

And a Matlab script to launch the pMatlab script (pmatlab_launcher.m):

    % getenv returns a string, so convert it to a number
    nProcs=str2double(getenv('SLURM_NTASKS'));
    [sreturn, machines]=system('nodelist');
    machines=regexp(machines, '\n', 'split');
    machines=machines(1:size(machines,2)-1);
    eval(pRUN('pmatlab_script',nProcs,machines));

Running pMatlab in Batch Mode

And finally we have our pMatlab script (pmatlab_script.m):

    % Map for distributing array
    Xmap=map([Np 1],{},0:Np-1);
    % Distributed matrix constructor
    H_=zeros(1000,1000,Xmap);
    % Indices for local portion of array
    [I1,I2]=global_block_range(H_);
    % Allocate and populate local portion of array with Hilbert matrix values
    H_local=zeros(I1(2)-I1(1)+1,I2(2)-I2(1)+1);
    for i=I1(1):I1(2)
      for j=I2(1):I2(2)
        H_local(i-I1(1)+1,j-I2(1)+1)=1/(i+j-1);
      end
    end
    % Copy local values into distributed array
    H_=put_local(H_,H_local);
    % Do y-fft and do x-fft. Z_ has different map
    Z_=fft(fft(H_,[],2),[],1);
    % Collect resulting matrix onto 'leader'
    Z=agg(Z_);
    % Plot result from 'leader' Matlab process
    if (pMATLAB.my_rank == pMATLAB.leader)
      f=figure('Visible','off');
      imagesc(log(abs(Z)));
      print('-dpdf','-r300', 'fig1.pdf');
    end

The distributed FFT step can also be written out explicitly on the local parts:

    X = put_local(X, fft(local(X),[],2));
    Z = transpose_grid(X);
    Z = put_local(Z, fft(local(Z),[],1));

Compiling Mex Code

C, C++, or Fortran routines can be called from within Matlab. For example, in hilbert.F90:

    #include "fintrf.h"
    subroutine mexfunction(nlhs, plhs, nrhs, prhs)
      mwPointer :: plhs(*), prhs(*)
      integer :: nlhs, nrhs
      mwPointer :: mxGetPr
      mwPointer :: mxCreateDoubleMatrix
      real(8) :: mxGetScalar
      mwPointer :: pr_out
      integer :: n
      n = nint(mxGetScalar(prhs(1)))
      plhs(1) = mxCreateDoubleMatrix(n,n, 0)
      pr_out = mxGetPr(plhs(1))
      call compute(%VAL(pr_out),n)
    end subroutine mexfunction

    subroutine compute(h, n)
      integer :: n
      real(8) :: h(n,n)
      integer :: i,j
      do i=1,n
        do j=1,n
          h(i,j)=1d0/(i+j-1d0)
        end do
      end do
    end subroutine compute

Compile it and call it from Matlab:

    mex hilbert.F90
    >> H=hilbert(10)

Turning Matlab code into C

First we create a log_abs_fft_hilb.m function:

    function result = log_abs_fft_hilb(n)
    result=log(abs(fft2(hilb(n))));

And then we run

    >> codegen log_abs_fft_hilb.m -args {uint32(0)}

This will produce a mex file that we can test:

    >> A=log_abs_fft_hilb_mex(uint32(16));
    >> B=log_abs_fft_hilb(16);
    >> max(max(abs(A-B)))
    ans = e-16

We could have specified the type of 'n' in our Matlab function instead:

    function result = log_abs_fft_hilb(n)
    assert(isa(n,'uint32'));
    result=log(abs(fft2(hilb(n))));

Turning Matlab code into C

Now we can also export a static library that we can link to:

    >> codegen log_abs_fft_hilb.m -config coder.config('lib') -args {uint32(0)}

This will create a subdirectory codegen/lib/log_abs_fft_hilb that will have the source files ('.c' and '.h') as well as compiled object files ('.o') and a library 'log_abs_fft_hilb.a'.

The source files are portable to any platform with a C compiler (e.g. BlueStreak). We can rebuild the library on BlueStreak by running

    mpixlc -c *.c
    ar rcs log_abs_fft_hilb.a *.o

Turning Matlab code into C

To use the function, we still need to write a main routine that links to it. This requires working with Matlab's variable types (which include dynamically resizable arrays).

    #include "stdio.h"
    /* Matlab type definitions */
    #include "rtwtypes.h"
    #include "log_abs_fft_hilb_types.h"

    int main(void) {
      uint32_T n=64;               /* Argument to Matlab function */
      emxArray_real_T *result;     /* Return value of Matlab function */
      int32_T i,j;
      /* Initialize Matlab array to have rank 2 */
      emxInit_real_T(&result, 2);
      /* Call Matlab function */
      log_abs_fft_hilb(n, result);
      /* Output result in column major order */
      for(i=0;i<result->size[0];i++) {
        for(j=0;j<result->size[1];j++) {
          printf("%f ",result->data[i+result->size[0]*j]);
        }
        printf("\n");
      }
      /* Free up memory associated with return array */
      emxFree_real_T(&result);
      return 0;
    }

Turning Matlab code into C

And here is another example of calling 2D FFTs on real data:

    int main(void) {
      int32_T q0;
      int32_T i,j;
      int32_T n=8;
      emxArray_creal_T *result;
      emxArray_real_T *input;
      emxInit_creal_T(&result, 2);
      emxInit_real_T(&input, 2);
      /* Resize the input array to n x n */
      q0 = input->size[0] * input->size[1];
      input->size[0]=n;
      input->size[1]=n;
      emxEnsureCapacity((emxArray__common *)input, q0, (int32_T)sizeof(real_T));
      /* Fill the input with Hilbert matrix values (column major) */
      for(j=0;j<input->size[1];j++) {
        for(i=0;i<input->size[0];i++) {
          input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1);
        }
      }
      my_fft(input, result);
      /* Print real and imaginary parts of each entry */
      for(i=0;i<result->size[0];i++) {
        for(j=0;j<result->size[1];j++) {
          printf("[% 10.4f,% 10.4f] ",
                 result->data[i+result->size[0]*j].re,
                 result->data[i+result->size[0]*j].im);
        }
        printf("\n");
      }
      emxFree_creal_T(&result);
      emxFree_real_T(&input);
      return 0;
    }

Turning Matlab code into C

- Exported FFTs only work on vectors of length 2^N.
- Error checking is disabled in exported C code.
- Mex code will have the same functionality as exported C code, but will also have error checking. It will warn about doing FFTs on arbitrary length vectors, etc.
- Always test your mex code!
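Since error checking is disabled in the exported C code, the caller may want to check the length before calling an exported FFT. A minimal sketch in C; the helper names is_power_of_two and next_power_of_two are hypothetical and are not generated by codegen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if n is a power of two (1, 2, 4, ...), otherwise 0. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Smallest power of two that is >= n, for n >= 1. */
size_t next_power_of_two(size_t n)
{
    assert(n >= 1);
    assert(n <= (SIZE_MAX >> 1) + 1);  /* result must fit in size_t */
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}
```

A length such as 1000 fails the check, and 1024 is the next usable length. Whether zero-padding up to that length is acceptable depends on the application, because padding changes the transform.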

Matlab code is not that different from C code

Or you can write your own C code that uses open source mathematical libraries (e.g. FFTW).

    #include <stdio.h>
    #include <math.h>
    #include <complex.h>
    #include <fftw3.h>

    int main(void) {
      int n=4096;
      int i,j;
      /* These are large automatic arrays; the stack limit may need raising */
      double complex temp[n][n], input[n][n];
      double result[n][n];
      fftw_plan p;
      p=fftw_plan_dft_2d(n, n, &input[0][0], &temp[0][0], FFTW_FORWARD, FFTW_ESTIMATE);
      /* Fill the input with Hilbert matrix values */
      for (i=0;i<n;i++){
        for(j=0;j<n;j++) {
          input[i][j]=(double complex)(1.0/(double)(i+j+1));
        }
      }
      fftw_execute(p);
      /* log(abs(...)) of the transform */
      for (i=0;i<n;i++){
        for(j=0;j<n;j++) {
          result[i][j]=log(cabs(temp[i][j]));
        }
      }
      for (i=0;i<n;i++){
        for(j=0;j<n;j++) {
          printf("%f ",result[i][j]);
        }
      }
      fftw_destroy_plan(p);
      return 0;
    }