INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX Kadin Tseng Boston University Scientific Computing and Visualization.

Log On To PC For Hands-on Practice
- Log on with your BU userid and Kerberos password.
- If you don't have a BU userid, use userid tuta1 ... tuta18 with password SCVsummer12. The number after "tuta" should match the number affixed to the front of your PC tower.
- Start MATLAB.

M-Files on T Drive
There are some files that you can copy over to your local folder for hands-on practice:

    >> copyfile('T:\kadin\pct\*', '.\')

What is the PCT?
The Parallel Computing Toolbox (PCT) is a MATLAB toolbox. It provides parallel utility functions that enable users to run MATLAB operations or procedures in parallel to reduce processing time.

Where To Run The PCT?
- On a desktop or laptop
  - MATLAB must be installed on the local machine.
  - Starting with R2011b, up to 12 processors can be used; older versions allow up to 8.
  - The machine must be multi-core to gain speedup. The thin client you are using has a dual-core processor.
  - Requires a BU userid (to access the MATLAB and PCT licenses).
- On a Katana Cluster node (as if it were a multi-core desktop)
  - Requires an SCV userid.
- On multiple Katana nodes (for up to 32 processors)
  - Requires an SCV userid.

Types of Parallel Jobs
There are two types of parallel applications:
- Distributed jobs: task parallel
- Parallel jobs: data parallel

Distributed Jobs
This type of parallel processing consists of multiple tasks running independently on multiple workers, with no information passed among them. On Katana, a distributed job is a series of single-processor batch jobs. Such jobs are also known as task-parallel, or "embarrassingly parallel", jobs.
- Examples of distributed jobs: Monte Carlo simulations, image processing
- Parallel utility function: dfeval
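
A Monte Carlo estimate of pi is a typical distributed job: each task below draws its own random points and never exchanges data with the others, and the client only combines the counts at the end. This is a minimal sketch, assuming four tasks of 10^5 samples each (illustrative choices, not from the slides). It uses parfor, which MATLAB runs as an ordinary serial loop when the PCT is not available.

```matlab
% Task-parallel sketch: independent Monte Carlo tasks estimate pi.
% No data is exchanged between tasks; the client combines the results.
ntasks   = 4;          % number of independent tasks (illustrative)
nsamples = 1e5;        % random points per task (illustrative)
hits = zeros(1, ntasks);
parfor t = 1:ntasks
    x = rand(nsamples, 1);
    y = rand(nsamples, 1);
    hits(t) = sum(x.^2 + y.^2 <= 1);   % points inside the quarter circle
end
piEstimate = 4*sum(hits)/(ntasks*nsamples);
fprintf('pi is approximately %.4f\n', piEstimate);
```

Because the tasks share nothing, the same work could equally be split into separate batch jobs, which is what a distributed job does on Katana.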

Parallel Jobs
A parallel job is a single task running concurrently on multiple workers that may communicate with each other. On Katana, this results in one batch job with multiple processors. This is also known as a data-parallel job.
- Examples of parallel jobs include many linear algebra applications: matrix multiply, solvers for linear algebraic systems of equations, and eigensolvers. Some may run efficiently in parallel and others may not; it depends on the underlying algorithms and operations.
- This also includes jobs that mix serial and parallel processing.
- Parallel utility functions: spmd, drange, parfor, ...

How Much Work To Parallelize My Code?
It is code dependent.
- Many MATLAB functions are overloaded to handle vector or parallel operations based on the variables' data type, i.e., scalars, vectors, or distributed arrays. For some applications, you may just have to turn on parallel capability. [No effort to a few lines of code modification]
- If arrays have been declared distributed (i.e., parallel), subsequent operations on these arrays using built-in or user functions will be processed in parallel without additional instructions. Arrays not declared as parallel are also promoted to parallel if they are used with distributed arrays. [Moderate effort: code and algorithm changes]
- For custom parallelization, MPI-like utilities are available. [More extensive effort: code and algorithm changes]

Distributed Jobs: dfeval
Example: run a dfeval job "interactively". It computes a 1x4 and a 3x2 random array, either locally or on Katana (with the pre-defined batch configuration 'SGE'). The first argument is the application (here, rand); the cell arrays supply the inputs to rand for each task.

    >> y = dfeval(@rand, {1 3}, {4 2}, 'Configuration', 'local')   % or 'SGE'
    Submitting task 1
    Job output will be written to: /usr1/scv/kadin/Job6_Task1.out
    QSUB output: Your job ("Job6.1") has been submitted
    Submitting task 2
    Job output will be written to: /usr1/scv/kadin/Job6_Task2.out
    QSUB output: Your job ("Job6.2") has been submitted
    y =
        [1x4 double]    [3x2 double]

The QSUB messages above come from a run with the 'SGE' configuration. The job runs in batch or in the background, and the output returns to the client workspace.
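
To see how the cell-array arguments are paired with tasks, the serial cellfun call below performs the same evaluations on the client: element k of each cell array becomes an argument of task k, giving rand(1,4) and rand(3,2). This is only an illustration of the argument mapping, not a replacement for dfeval; no jobs are submitted.

```matlab
% Serial illustration of the argument mapping in the dfeval example:
% task k evaluates rand(x1{k}, x2{k}).
x1 = {1 3};   % first input to rand for tasks 1 and 2
x2 = {4 2};   % second input to rand for tasks 1 and 2
y = cellfun(@rand, x1, x2, 'UniformOutput', false);
disp(size(y{1}))   % size of task 1's result
disp(size(y{2}))   % size of task 2's result
```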

Distributed Jobs: SCV Script
For task-parallel applications on the Katana Cluster, we strongly recommend using an SCV script instead of dfeval. This script does not use the PCT. For details, see cluster/runningjobs/multiple_matlab_tasks/
If your application fits the description of a distributed job, you don't need to know anything beyond this point.

How Do I Run Parallel Jobs?
There are two ways to run parallel jobs: pmode or matlabpool. Either one turns parallelism on (off) and allocates (deallocates) resources.
- pmode is a special mode of operation, useful for learning the PCT and for prototyping parallel programs interactively.
- matlabpool is the general mode of operation; it can be used for both interactive and batch processing.

pmode

    >> pmode start local   % assuming 4 workers are available

The screenshot on this slide shows a MATLAB window on a Katana node; starting pmode spawns a separate Parallel Command Window (similarly on Windows). In Room B27, each thin client has 2 cores (workers).

pmode

    >> pmode start local   % use 4 workers in parallel mode

- Any command issued at the "P>>" prompt is executed on all workers. Enter labindex to query for each worker's ID.
- Use an if conditional with labindex to issue instructions to specific workers, like this:

    P>> if labindex==1, numlabs, end

PCT terminology:
- worker = processor
- labindex: processor number
- numlabs: number of processors

pmode: Replicated, Variant, and Private Arrays
Replicated array:

    P>> A = magic(3);   % A is replicated on every worker

Variant array:

    P>> A = magic(3) + labindex - 1;   % labindex = 1, 2, 3, 4

      LAB 1      LAB 2      LAB 3      LAB 4
      8  1  6    9  2  7   10  3  8   11  4  9
      3  5  7    4  6  8    5  7  9    6  8 10
      4  9  2    5 10  3    6 11  4    7 12  5

Private array:

    P>> if labindex==2, A = magic(3) + labindex - 1; end

      LAB 1      LAB 2      LAB 3      LAB 4
    undefined    9  2  7  undefined  undefined
                 4  6  8
                 5 10  3
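
On the client, each worker's copy of a variant array is reached with cell-style indexing (for example, A{1} for lab 1). The loop below is a serial sketch, not PCT code: it mimics that layout with an ordinary cell array holding one cell per lab, assuming 4 labs, and builds the same variant values as the table above.

```matlab
% Serial sketch: one cell per lab stands in for each worker's copy of A.
nlabs = 4;                        % assumed number of labs
A = cell(1, nlabs);
for lab = 1:nlabs                 % "lab" plays the role of labindex
    A{lab} = magic(3) + lab - 1;  % variant: a different value on each lab
end
A{2}                              % lab 2's copy
```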

Switch from pmode to matlabpool
If you are running MATLAB pmode, exit it:

    P>> exit

MATLAB allows only one parallel environment at a time. pmode may be started with the keyword open or start; matlabpool can only be started with the keyword open. You can also close pmode from the MATLAB window:

    >> pmode close

Now, open matlabpool:

    >> matlabpool open local   % rely on default worker size

matlabpool
matlabpool is the general paradigm for MATLAB parallel computing; it can be used for both interactive and batch jobs.

    >> matlabpool open local   % open 2 workers on the thin client
    >> % serial or parallel applications ...
    >> matlabpool close        % ends matlabpool properly

Parallel Methods: parfor
parfor is a parallel for-loop. The workload is distributed according to the loop index, so the computations for different loop indices must be independent. parfor operates within matlabpool.

    matlabpool open             % use default number of workers (2)
    x = zeros(1,10);            % preallocate the sliced output
    s = 0;
    parfor i=1:10
        x(i) = sin(2*pi*i/10);  % with 2 workers, each computes a slice of about 5
        s = s + i;              % summation; a reduction operation
    end
    matlabpool close

Does parfor work for all kinds of operations? No. For example, a reduction operation ◊ (such as addition) must satisfy the associative rule x ◊ (y ◊ z) = (x ◊ y) ◊ z. The plus (+) and multiply (*) operators satisfy the rule. The subtract (-) and divide (/) operators fail it, so their results are nondeterministic.
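
The associative rule can be checked directly. The snippet below compares x op (y op z) with (x op y) op z for each operator, using small exact integers (x = 8, y = 4, z = 2, chosen arbitrarily). A single counterexample is enough to show that - and / fail the rule; the + and * results merely agree for this triple. With general floating-point values, + and * are associative only up to rounding, so a parallel sum can differ from a serial one in the last few bits.

```matlab
% Compare x op (y op z) with (x op y) op z for four operators.
x = 8; y = 4; z = 2;                       % small exact integers (arbitrary)
ops   = {@plus, @times, @minus, @rdivide};
names = {'+', '*', '-', '/'};
isAssoc = false(1, numel(ops));
for k = 1:numel(ops)
    op = ops{k};
    isAssoc(k) = op(x, op(y, z)) == op(op(x, y), z);
    fprintf('%s is associative for (8, 4, 2): %d\n', names{k}, isAssoc(k));
end
```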

Parallel Methods: parfor (continued)
Try this:

    s = 0;    parfor i=1:10, s = s + i, end, s   % s = 55 (s = 1 + 2 + ... + n = n(n+1)/2, n = 10)
    s = 1000; parfor i=1:5, s = s*i, end, s      % do this a few times
    s = 1000; parfor i=1:5, s = s/i, end, s      % do this a few times

The + and * operators above satisfy the associative rule; the - and / operators fail it, so they may or may not produce the same (correct) result each time.

Some other operational rules for parfor:
1. The loop index must take consecutive integer values.
2. The loop index must not be altered within the loop.
3. The loop count need not be divisible by the number of workers.
4. There is no need to distribute arrays.
5. The solution remains in the MATLAB workspace.
6. parfor does not work inside spmd, and numlabs and labindex cannot be used in the loop.
7. No graphics can be displayed within parfor.
8. More rules apply depending on the operations; consult the PCT documentation.

Integration Example
Integrate the cosine function between 0 and π/2. For simplicity, the integration scheme is the mid-point rule. Several parallel methods will be demonstrated.

    a = 0; b = pi/2;       % range
    m = 8;                 % # of increments
    h = (b-a)/m;           % increment
    p = numlabs;
    n = m/p;               % increments per worker
    ai  = a + (i-1)*n*h;   % start of worker i's range
    aij = ai + (j-1)*h;    % start of increment j on worker i

[Figure: cos(x) from x = a to x = b, divided into increments of width h with the mid-point of each increment marked; the range is split among Workers 1 to 4.]
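
Written out, the scheme above sums the cosine at the mid-points of every increment j on every worker i. The expression uses only the slide's own symbols (a, b, m, h, p, n, a_i, a_ij); the exact value on the second line follows from the antiderivative of cos x.

```latex
\int_a^b \cos x \, dx \approx \sum_{i=1}^{p} \sum_{j=1}^{n} h \cos\left(a_{ij} + \frac{h}{2}\right),
\quad h = \frac{b-a}{m}, \quad n = \frac{m}{p}, \quad a_i = a + (i-1)\,n\,h, \quad a_{ij} = a_i + (j-1)\,h

\int_0^{\pi/2} \cos x \, dx = \sin\frac{\pi}{2} - \sin 0 = 1
```

Each worker's inner sum over j needs only its own a_i, so the p partial sums can be computed independently and added at the end; the exact value 1 is a convenient check for every method that follows.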

Integration Example: Serial Integration

    % serial integration (with for-loop)
    tic
    m = 10000;
    a = 0;              % lower limit of integration
    b = pi/2;           % upper limit of integration
    dx = (b - a)/m;     % increment length
    intSerial = 0;      % initialize intSerial
    for i=1:m
        x = a+(i-0.5)*dx;   % mid-point of increment i
        intSerial = intSerial + cos(x)*dx;
    end
    toc

The first mid-point is x(1) = a + dx/2 and the last is x(m) = b - dx/2.

Integration Example: Serial Integration (Vector Form)

    % serial integration (with vector form)
    tic
    m = 10000;
    a = 0;                   % lower limit of integration
    b = pi/2;                % upper limit of integration
    dx = (b - a)/m;          % increment length
    x = a+dx/2:dx:b-dx/2;    % mid-points of m increments
    intSerial = sum(cos(x)*dx);
    toc

The first mid-point is x(1) = a + dx/2 and the last is x(m) = b - dx/2.
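
Because the exact answer is sin(π/2) - sin(0) = 1, the serial code can be checked for accuracy. In the sketch below (the values of m are illustrative), each doubling of m should cut the error by a factor of about 4, since the mid-point rule's error is proportional to dx^2. The mid-points are built from an integer index rather than a colon range with a non-integer step, which guarantees exactly m points.

```matlab
% Mid-point rule accuracy check against the exact integral (= 1).
a = 0; b = pi/2;
ms  = [100 200 400];                 % illustrative numbers of increments
err = zeros(size(ms));
for k = 1:numel(ms)
    m  = ms(k);
    dx = (b - a)/m;                  % increment length
    x  = a + ((1:m) - 0.5)*dx;       % exactly m mid-points
    err(k) = abs(sum(cos(x)*dx) - 1);
end
ratios = err(1:end-1)./err(2:end)    % each ratio should be close to 4
```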

Integration Example: spmd
This example performs parallel integration in spmd. It uses labindex and numlabs so that all workers run the same program ("sp") on different data ("md").

    matlabpool open local
    tic   % includes the overhead cost of spmd
    spmd
        m = 10000;
        a = 0;
        b = pi/2;
        n = m/numlabs;                    % # of increments per lab
        deltax = (b - a)/numlabs;         % length per lab
        ai = a + (labindex - 1)*deltax;   % local integration range
        bi = a + labindex*deltax;
        dx = deltax/n;                    % increment length for lab
        x = ai+dx/2:dx:bi-dx/2;           % mid-points of n increments per worker
        intSPMD = sum(cos(x)*dx);         % integral sum per worker
        intSPMD = gplus(intSPMD,1);       % global sum over all workers, placed on lab 1
    end   % spmd
    toc
    matlabpool close
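
The decomposition used inside spmd can be tried without the PCT. In the serial sketch below, the loop variable lab stands in for labindex, the fixed value nlabs = 4 (an assumption) stands in for numlabs, and adding up the per-lab partial integrals plays the part of gplus. It is an emulation for checking the index arithmetic, not a parallel program.

```matlab
% Serial emulation of the spmd integration (no PCT required).
m = 10000; a = 0; b = pi/2;
nlabs  = 4;                       % assumed number of labs (numlabs)
n      = m/nlabs;                 % # of increments per lab
deltax = (b - a)/nlabs;           % length of range per lab
dx     = deltax/n;                % increment length (same on every lab)
partial = zeros(1, nlabs);
for lab = 1:nlabs                 % "lab" plays the role of labindex
    ai = a + (lab - 1)*deltax;    % start of this lab's range
    x  = ai + ((1:n) - 0.5)*dx;   % mid-points of the n local increments
    partial(lab) = sum(cos(x)*dx);    % this lab's contribution
end
intEmulated = sum(partial);       % the total that gplus would produce
```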

Integration Example: parfor
This example performs parallel integration with parfor.

    matlabpool open 4
    tic
    m = 10000;
    nworkers = matlabpool('size');   % returns # of workers assigned
    n = m/nworkers;                  % number of increments per worker
    a = 0;
    b = pi/2;
    deltax = (b - a)/nworkers;       % length of range per worker
    dx = deltax/n;                   % increment length
    intParfor1 = 0;
    parfor i=1:nworkers
        ai = a + (i - 1)*deltax;
        bi = a + i*deltax;
        x = ai+dx/2:dx:bi-dx/2;      % mid-points of n increments per worker
        intParfor1 = intParfor1 + sum(cos(x)*dx);
    end
    toc
    matlabpool close

Integration Example: parfor
This example also performs parallel integration with parfor, but the loop runs over all m increments.

    matlabpool open 4
    tic
    m = 10000;
    a = 0;
    b = pi/2;
    dx = (b - a)/m;   % increment length
    intParfor2 = 0;
    parfor i=1:m
        intParfor2 = intParfor2 + cos(a+(i-0.5)*dx)*dx;
    end
    toc
    matlabpool close

Integration Example: drange
drange is similar to parfor but is used within spmd.

    matlabpool open 4
    tic
    m = 10000;
    a = 0;
    b = pi/2;
    spmd
        n = m/numlabs;              % number of increments per lab (worker)
        deltax = (b - a)/numlabs;   % length of range per worker
        for i=drange(1:numlabs)
            ai = a + (i - 1)*deltax;
            bi = a + i*deltax;
            dx = deltax/n;
            x = ai+dx/2:dx:bi-dx/2;   % mid-points of n increments
            intDrange1 = sum(cos(x)*dx);
        end
        intDrange1 = gplus(intDrange1, 1);   % send global sum to lab 1
    end   % spmd
    intDrange1 = intDrange1{1};   % send integral to client
    toc
    matlabpool close

Integration Example: drange
drange is similar to parfor but is used within spmd. Here the loop runs over all m increments.

    matlabpool open 4
    tic
    m = 10000;
    a = 0;
    b = pi/2;
    spmd
        dx = (b - a)/m;   % increment length
        intDrange2 = 0;
        for i=drange(1:m)
            intDrange2 = intDrange2 + cos(a+(i-0.5)*dx)*dx;
        end
        intDrange2 = gplus(intDrange2, 1);   % send global sum to lab 1
    end   % spmd
    intDrange2 = intDrange2{1};   % send integral to client
    toc
    matlabpool close

Integration Example Benchmarks
- Timings (seconds) were obtained on a quad-core Xeon X5570.
- Computation time is linearly proportional to the # of increments.
- serial and serialv are the times for the loop and vector forms, respectively.
- parfor1 and drange1 distribute the work over the workers in blocks of increments; the local work is performed in vector form.
- parfor2 and drange2 distribute the m individual increments over the workers in chunks.
- FORTRAN and C timings are an order of magnitude faster.

[Table: timings in seconds; columns: # increments, serial, serialv, spmd, parfor1, drange1, parfor2, drange2]
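
The difference between the serial and serialv timings can be measured on your own machine. The sketch below (m = 1e5 is an illustrative size) times both serial forms with tic/toc. Timings vary with hardware and MATLAB version, so only the two integrals, which should agree to round-off, are meaningful to compare across machines.

```matlab
% Time the loop and vector forms of the serial mid-point integration.
m  = 1e5;                 % illustrative number of increments
a  = 0; b = pi/2;
dx = (b - a)/m;           % increment length

tic
intLoop = 0;
for i = 1:m                                   % serial: for-loop form
    intLoop = intLoop + cos(a + (i - 0.5)*dx)*dx;
end
tLoop = toc;

tic
x = a + ((1:m) - 0.5)*dx;                     % serialv: all mid-points at once
intVec = sum(cos(x)*dx);
tVec = toc;

fprintf('loop: %.4f s   vector: %.4f s\n', tLoop, tVec);
```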

Array Distributions
The purpose is to distribute data (e.g., an array) among workers, reducing the memory usage and workload on each worker in return for a smaller wall-clock time. For some parallel applications, creating a distributed array is often the only thing you need to do to make the application run in parallel (thanks to function overloading). Some operations distribute data automatically, while others require manual distribution.

How To Distribute Arrays
Utilities to distribute arrays:
- distributed: distributes data from the client; convenient but with restrictions
- codistributed: used in spmd to distribute data on the back end
- Composite: distributes data on the back end; accessible on the client
Methods to distribute data:
- Partitioning a larger array
- Building from smaller arrays
- Creating with a MATLAB constructor function (rand, zeros, ...)

Data Parallel Example: Matrix Multiply

    >> matlabpool open 4
    >> n = 3000; A = rand(n); B = rand(n);
    >> C = A * B;                % run with 4 threads
    >> maxNumCompThreads(1);     % set threads to 1
    >> C1 = A * B;               % run on single thread
    >> a = distributed(A);       % distributes A, B from client
    >> b = distributed(B);       % a, b on workers; accessible from client
    >> c = a * b;                % run on workers; c is distributed
    >> matlabpool close

[Table: wall-clock time, in seconds, for C1 = A * B (1 thread), C = A * B (4 threads), a = distributed(A), b = distributed(B), and c = a * b (4 workers)]

The cost of distributing the matrices is recorded separately, as it is incurred only once over the life of the distributed matrices. The time required to distribute the matrices is not significantly affected by matrix size.

Additional Ways to Distribute Matrices
There are alternative ways to distribute matrices.

    >> matlabpool open 4
    >> n = 3000; A = rand(n); B = rand(n);
    >> spmd
           p = rand(n, codistributor1d(1));   % 2 ways to directly create
           q = codistributed.rand(n);         % a distributed random array
           s = p * q;                         % run on workers; s is distributed
           % distribute a matrix after it is created
           u = codistributed(A, codistributor1d(1));   % by row
           v = codistributed(B, codistributor1d(2));   % by column
           w = u * v;                         % run on workers; w is distributed
       end
    >> matlabpool close

Preferred Way to Distribute Matrices?
For a matrix-matrix multiply, there are 4 combinations for distributing the 2 matrices (by row or by column). While all 4 ways lead to the correct solution, some perform better than others.

    n = 3000; A = rand(n); B = rand(n);
    spmd
        ar = codistributed(A, codistributor1d(1));   % distributed by row
        ac = codistributed(A, codistributor1d(2));   % distributed by column
        br = codistributed(B, codistributor1d(1));   % distributed by row
        bc = codistributed(B, codistributor1d(2));   % distributed by column
        crr = ar * br;  crc = ar * bc;
        ccr = ac * br;  ccc = ac * bc;
    end

[Table: wall-clock times of the four ways to distribute A and B: C (row x row), C (row x col), C (col x row), C (col x col)]

The above holds for both on-node and across-node runs. Across-node array distribution and runs are likely to be slower than on-node ones because of slower communications. Specifically for MATLAB, Katana uses 100-Mbit Ethernet for across-node communications instead of Gigabit InfiniBand.
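
Why the layout matters can be seen by writing one case, ar * bc, in blocks. In the serial sketch below, "worker" k owns row block k of A and column block k of B; every product block C(rows k, cols l) needs row block k of A and column block l of B, so each worker must receive the column blocks owned by the others. The sizes and the even split are illustrative assumptions, not necessarily the partition codistributor1d chooses.

```matlab
% Serial block sketch of a row-distributed A times a column-distributed B.
n = 12; nlabs = 4;                          % small illustrative sizes
A = rand(n); B = rand(n);
edges  = round(linspace(0, n, nlabs + 1));  % block boundaries: 0 3 6 9 12
blocks = cell(1, nlabs);
for k = 1:nlabs
    blocks{k} = edges(k)+1:edges(k+1);      % indices owned by worker k
end
C = zeros(n);
for k = 1:nlabs          % worker k owns row block k of A ...
    for l = 1:nlabs      % ... but needs every column block l of B
        C(blocks{k}, blocks{l}) = A(blocks{k}, :) * B(:, blocks{l});
    end
end
relErr = norm(C - A*B, 'fro')/norm(A*B, 'fro');
```

The other three combinations lead to different communication patterns, which is one reason their wall-clock times differ.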

Linear Algebraic System Example: Ax = b

    matlabpool open 4
    % serial operations
    n = 3000; M = rand(n); x = ones(n,1);
    [A, b] = linearSystem(M, x);
    u = A\b;                 % solves Au = b; u should equal x
    clear A b
    % parallel operations in spmd
    spmd
        m = codistributed(M, codistributor('1d',2));   % by column
        y = codistributed(x, codistributor('1d',1));   % by row
        [A, b] = linearSystem(m, y);
        v = A\b;
    end
    clear A b m y
    % parallel operations from client
    m = distributed(M); y = distributed(x);
    [A, b] = linearSystem(m, y);
    W = A\b;
    matlabpool close

    function [A, b] = linearSystem(M, x)   % save as linearSystem.m
    % Returns A and b of linear system Ax = b
    A = M + M';   % A is real and symmetric
    b = A * x;    % b is the RHS of linear system
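
The comment "u should equal x" can be checked serially on a small problem before moving to n = 3000 and distributed arrays. In this sketch, linearSystem is rewritten as an anonymous function that returns two outputs through deal, so the snippet is self-contained; the size n = 200 is an illustrative assumption. Because M is random, A = M + M' can occasionally be ill-conditioned, which would enlarge the error.

```matlab
% Serial check that A\b recovers x for the linear system A*u = b.
linearSystem = @(M, x) deal(M + M', (M + M')*x);   % returns A and b = A*x
n = 200;                                % small illustrative size
M = rand(n);
x = ones(n, 1);
[A, b] = linearSystem(M, x);            % A is real and symmetric
u = A\b;                                % solves A*u = b
relErr = norm(u - x)/norm(x);           % expected to be near round-off
fprintf('relative error: %.2e\n', relErr);
```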

Task Parallel vs. Data Parallel

    matlabpool open 4
    n = 3000; M = rand(n); x = ones(n,1);
    % Solves 4 cases of Ax=b sequentially, each with 4 workers
    for i=1:4
        m = distributed(M);
        y = distributed(x*i);
        [A, b] = linearSystem(m, y);   % computes with 4 workers
        u = A\b;                       % solves each case with 4 workers
    end
    clear A b m y
    % Solves 4 cases of Ax=b concurrently (with parfor)
    parfor i=1:4
        [A, b] = linearSystem(M, x*i);   % computes on 1 worker
        v = A\b;                         % solves with 1 worker
    end
    % Solves 4 cases of Ax=b concurrently (with drange)
    spmd
        for i=drange(1:4)
            [A, b] = linearSystem(M, x*i);   % computes on 1 worker
            w = A\b;                         % 1 worker
        end
    end   % spmd
    matlabpool close

    function [A, b] = linearSystem(M, x)   % linearSystem.m
    % Returns A and b of linear system Ax = b
    A = M + M';   % A is real and symmetric
    b = A * x;    % b is the RHS of linear system

How Do I Parallelize My Code?
1. Profile the serial code with profile.
2. Identify the section of code, or the function within the code, using the most CPU time.
3. Look for ways to improve that section's performance.
4. See if the section is parallelizable and worth parallelizing.
5. If warranted, research and choose a suitable parallel algorithm and parallel paradigm for the section.
6. Parallelize the section with the chosen parallel algorithm.
7. To improve performance further, work on the next most CPU-time-intensive section. Repeat steps 2-7.
8. Analyze the parallel efficiency to find the sweet spot, i.e., given the implementation and the platforms on which the code is intended to run, the minimum number of workers that gives the fastest turnaround (see the Amdahl's Law slide).

How Well Does the PCT Scale?
- Task-parallel applications generally scale linearly.
- The parallel efficiency of data-parallel applications depends on the individual code and the algorithm used.
- My personal experience is that the runtime of a reasonably well-tuned MATLAB program, running on a single processor or on multiple processors, is at least an order of magnitude longer than that of an equivalent C/FORTRAN code.
- Additionally, the PCT's communication is based on Ethernet. MPI on Katana uses InfiniBand, which is faster than Ethernet. This further disadvantages the PCT if your code is communication-bound.

Speedup Ratio and Parallel Efficiency
S is the ratio of T_1 over T_N, the elapsed times with 1 and N workers, and f is the fraction of T_1 due to code sections that are not parallelizable. Amdahl's Law bounds the speedup:

$$ S = \frac{T_1}{T_N} \le \frac{1}{f + (1 - f)/N} $$

As N grows, the bound approaches 1/f. A code whose parallelizable component comprises 90% of the total computation time (f = 0.1) can therefore at best achieve a 10X speedup with lots of workers, and a code that is 50% parallelizable (f = 0.5) speeds up at most two-fold.

The parallel efficiency is

$$ E = \frac{S}{N} $$

A program that scales linearly (S = N) has a parallel efficiency of 1. A task-parallel program is usually more efficient than a data-parallel program. Parallel codes can sometimes achieve super-linear behavior due to efficient cache usage per worker.
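
The bound can be tabulated for the two cases quoted above. This sketch evaluates the Amdahl expression S(N) = 1/(f + (1 - f)/N) and the efficiency E = S/N; the worker counts in N are illustrative, with 1e6 standing in for "lots of workers".

```matlab
% Amdahl's Law speedup bound and the corresponding parallel efficiency.
amdahl = @(f, N) 1./(f + (1 - f)./N);   % f = non-parallelizable fraction of T1
N   = [1 2 4 8 16 1e6];                 % numbers of workers (1e6 ~ "lots")
S90 = amdahl(0.1, N);                   % 90% parallelizable
S50 = amdahl(0.5, N);                   % 50% parallelizable
E90 = S90./N;                           % parallel efficiency E = S/N
disp([N; S90; S50; E90]')               % columns: N, S90, S50, E90
```

The efficiency column falls quickly as N grows, which is why step 8 of the parallelization recipe looks for the smallest worker count that still gives a worthwhile speedup.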

Example of Speedup Ratio & Parallel Efficiency

Batch Processing On Katana
MATLAB provides various ways to submit and run jobs in the background or in batch. SCV recommends a simple and effective method for all PCT batch processing on Katana. Cut and paste the following into a file, say, mbatch:

    #!/bin/csh -f
    # MATLAB script for running serial or parallel background jobs
    # For MATLAB applications that contain MATLAB Parallel Computing Toolbox
    # parallel operation requests (matlabpool or dfeval), a batch job
    # will be queued and run in batch (using the SGE configuration)
    # Script name: mbatch (you can change it)
    # Usage: katana% mbatch <m-file> <output-file>
    #   <m-file>: name of m-file to be executed; do NOT include .m ($1)
    #   <output-file>: output file name; may include path ($2)
    nohup matlab -nodisplay -r "$1, exit" >! $2 &

Give mbatch the execute attribute, then run it:

    katana% chmod +x mbatch
    katana% mbatch my_mfile myOutput

Do not start MATLAB with -nojvm; the PCT requires Java.

Use 'local' Configuration on Katana
You can use a Katana Cluster node as if it were 'local', with 4 workers, for interactive use:
1. Request a node with 4 processors (you need X-Win32 on your client computer):

    katana% qsh -pe omp 4

2. In the new X window, run MATLAB:

    katana% matlab

3. Request workers in the MATLAB window:

    >> matlabpool open local 4

   or

    >> pmode start local 4

Generally, 'SGE' is the default configuration. If 'local' is not specified, matlabpool would request 4 workers using the 'SGE' configuration, and hence the processors allocated with the qsh interactive batch shell would not be used. This approach needs only a PCT license; no worker licenses are needed.

Communications Among Workers, Client
Communications between workers and the MATLAB client:
- pmode: lab2client, client2lab
- matlabpool: A = a{1}   % from lab 1 to client
Collective communications among workers: gather, gop, gplus, gcat

MPI Point-to-Point Communications
labSend and labReceive provide MPI-like point-to-point communication among workers.

    % Example: each lab sends its lab # to lab 1, and the data are summed on lab 1
    matlabpool open 4
    spmd   % MPI requires spmd
        a = labindex;              % define a on workers
        if labindex == 1           % on worker 1 ...
            % lab 1 is designated to accumulate the sum of a
            s = a;                 % sum s starts with a on worker 1
            for k = 2:numlabs      % loop over remaining workers
                s = s + labReceive(k);   % receive a from worker k, then add to s
            end
        else                       % for all other workers ...
            labSend(a,1);          % send a on workers 2 to 4 to worker 1
        end
    end   % spmd
    indexSum = s{1};               % copy s on lab 1 to client
    matlabpool close

It is illegal for a worker to communicate with itself.

Useful SCV Info
- SCV home page
- Resource Applications
- Help System: bu.service-now.com
- Web-based tutorials (MPI, OpenMP, MATLAB, IDL, Graphics tools)
- HPC consultations by appointment
Kadin Tseng

Please help us do better in the future by participating in a quick survey.