Parallel Computing with MATLAB®
How to Use Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™ on the Discovery Cluster
An EECE5640: High Performance Computing Lecture
Benjamin Drozdenko, MathWorks TA & Graduate Research Assistant
MathWorksHelp@neu.edu
February 17, 2016

MathWorks Teaching Assistant

Need help with MATLAB or Simulink? Then contact the MathWorks Teaching Assistant, Ben Drozdenko, a Ph.D. student and former MathWorks employee.

For personalized, immediate on-campus assistance, please attend my Spring 2016 office hours:
Snell Library, 1st Floor, Room 138B
Mondays: 1:30-5:30 pm
Wednesdays: 9 am-12 pm
Fridays: 12 pm-5 pm

To find the office, enter Snell Library and go down the hallway to the left of the information desk. At the end of the hallway, turn left at the printing station. Office hours are in the smaller study room, the "Bullpen", room 138B.

For general or public MATLAB & Simulink questions, please visit my Blackboard site: MATLAB Help & Mentoring, http://j.mp/neu-matlab-help
For specific or private MATLAB & Simulink questions, please send me an e-mail: MathWorksHelp@neu.edu

Lecture Outline

- Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
- Starting a Parallel Pool using parpool
- Task-Parallel Jobs with PCT
  - Embarrassingly Parallel Tasks using parfor
  - Setting up a Simple Independent Job using createJob
- Data-Parallel Jobs with PCT
  - Single Program, Multiple Data using spmd
  - Distributed Datasets and Operations
  - Passing Messages
  - Setting up a Communicating Job
- Increasing Scale using MDCS on the Discovery Cluster
- GPU Computing using PCT on the Discovery Cluster

MathWorks® Products

MATLAB®
High-level language and interactive environment for numerical computation, visualization, and programming. Language, tools, and built-in math functions to explore multiple approaches and reach a solution faster. Used for a range of applications, including signal processing and communications, image and video processing, control systems, test and measurement, computational finance, and computational biology.

Parallel Computing Toolbox™ (PCT)
Solve computationally intensive and data-intensive problems using multicore processors, GPUs, and computer clusters. High-level constructs (parallel for-loops, special array types, and parallelized numerical algorithms) let you parallelize MATLAB® applications without CUDA or MPI programming.

MATLAB® Distributed Computing Server™ (MDCS)
Run computationally intensive MATLAB® programs and Simulink® models on computer clusters, clouds, and grids. Develop your program or model on a multicore desktop computer using PCT and then scale up to many computers by running it on MDCS. The server includes a built-in cluster job scheduler and provides support for commonly used third-party schedulers (e.g., Platform LSF, Microsoft Windows HPC Server, PBS TORQUE).

Source: MathWorks® Product Page: http://www.mathworks.com/products/

Starting a Parallel Pool with parpool

To get an estimate of the number of computational cores on the local machine:
>> maxNumCompThreads
To open a pool of 4 workers, type at the command prompt:
>> parpool(4)
This command starts 4 new instances of MATLAB, which are available for computations as part of a resource pool. In Windows, type Ctrl-Alt-Del to start Task Manager and select Processes: in total, there are now 5 instances of MATLAB.exe running on the system, 4 as part of the pool plus the original instance.
To query the current pool:
>> p = gcp; p.NumWorkers
A parallel pool also starts automatically when you execute a parallel language construct that runs on a pool, such as parfor or spmd. When finished, close the pool:
>> delete(gcp)
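Putting these commands together, a minimal open-use-close session might look like the following (a sketch; it assumes the default 'local' profile allows at least 4 workers):

% Open a pool of 4 workers
p = parpool(4);
fprintf('Pool has %d workers\n', p.NumWorkers);
% Use the pool for a parallel loop
parfor i = 1:8
    fprintf('Iteration %d ran on a worker\n', i);
end
% Close the pool when done ('nocreate' avoids starting a new one)
delete(gcp('nocreate'));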

Lecture Outline (2)

- Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
- Starting a Parallel Pool using parpool
- Task-Parallel Jobs with PCT
  - Embarrassingly Parallel Tasks using parfor
  - Setting up a Simple Independent Job using createJob
- Data-Parallel Jobs with PCT
  - Single Program, Multiple Data using spmd
  - Distributed Datasets and Operations
  - Passing Messages
  - Setting up a Communicating Job
- Increasing Scale using MDCS on the Discovery Cluster
- GPU Computing using MATLAB on the Discovery Cluster

Task-Parallel Jobs with PCT

- Parallel for loops with the parfor keyword
  - Each worker runs independently
  - No communication between loop iterations
  - Specific rules for usage (see the sketch below)
  - Ideally suited for embarrassingly parallel tasks
- Parallel jobs using the createJob function
  - Assign specific tasks to run on workers
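As an illustration of the "specific rules" above: parfor requires every iteration to be independent, and MATLAB classifies loop variables automatically. A minimal sketch of two sanctioned patterns (the variable names are illustrative):

% Sliced output and a reduction variable, both allowed in parfor
n = 100;
results = zeros(1, n);   % sliced: each iteration writes its own element
total = 0;               % reduction: accumulated across iterations
parfor i = 1:n
    results(i) = i^2;    % independent of all other iterations
    total = total + i;   % reduction pattern recognized by parfor
end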

Example #1: Birthday Paradox

What is the probability that, in a group of 30 randomly selected individuals, at least two of the individuals will share the same birthday? Assuming independent events,

p(bday) = 1 - (30! * C(365,30)) / 365^30 ≈ 70.6%

MATLAB® code for one trial:

function match = birthday(groupSize)
bdays = randi(365, groupSize, 1);
bdays = sort(bdays);
match = any(diff(bdays) == 0);

Code to run many trials sequentially ("brute-force algorithm"):

function prob = runBirthday(numtrials, groupsize)
matches = false(1, numtrials);
for trial = 1:numtrials
    matches(trial) = birthday(groupsize);
end
prob = sum(matches)/numtrials;
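As a sanity check on the Monte Carlo estimate, the closed-form probability can be computed directly in MATLAB; a quick sketch (using prod over the falling factorial to avoid overflowing 365!):

% Exact probability of a shared birthday in a group of 30
groupSize = 30;
pExact = 1 - prod((365 - groupSize + 1:365) / 365);   % ~0.7063
fprintf('Exact probability: %.4f\n', pExact);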

Parallel For Loops with parfor

Use the keyword parfor to turn any for loop with independent iterations into a parallel loop that runs those iterations on different workers. Code to run many trials in parallel:

function prob = pRunBirthday(numtrials, groupsize)
matches = false(1, numtrials);
parfor trial = 1:numtrials
    matches(trial) = birthday(groupsize);
end
prob = sum(matches)/numtrials;

Time Code using tic and toc

With the pool still open, time the parallel version:
>> tic; p = pRunBirthday(1e5,30), toc
Close the pool, and time the sequential version:
>> delete(gcp);
>> tic; p = runBirthday(1e5,30), toc

On my local machine:

Ntrials                 1e5        1e6
Sequential              0.40 sec   4.00 sec
Parallel, Nworkers=4    0.24 sec   1.81 sec
Speedup                 1.67X      2.2X

Setup a Parallel Job using createJob

Observe the behavior of a parallel for loop from the following output:
>> parfor i=1:10, disp(i); end
parfor allows for little control over parallel execution. To assign specific tasks to each worker, create an independent job instead.
Connect to a cluster:
>> cluster = parcluster('local');
Create an independent job:
>> job = createJob(cluster);

Setup a Parallel Job using createJob (cont.)

Create many tasks for the job to handle (you could use a for loop or a while loop for this; see the sketch below):
>> createTask(job, @runBirthday, 1, {1e5,5});
>> createTask(job, @runBirthday, 1, {1e5,10});
>> createTask(job, @runBirthday, 1, {1e5,15});
>> createTask(job, @runBirthday, 1, {1e5,20});
>> createTask(job, @runBirthday, 1, {1e5,25});
>> createTask(job, @runBirthday, 1, {1e5,30});
Submit the job. Optionally, wait for it to finish:
>> submit(job); wait(job, 'finished');
Retrieve the results:
>> results = fetchOutputs(job);
>> results{end,1}
>> r = cell2mat(results); mean(r)
When finished, delete the job and clear the object:
>> delete(job); clear job;
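The six createTask calls above can also be generated with the for loop the slide alludes to; a minimal sketch:

% One task per group size, 5 through 30 in steps of 5
for groupSize = 5:5:30
    createTask(job, @runBirthday, 1, {1e5, groupSize});
end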

Example #2: Gene Matching

function results = pargenematchsol()
searchSeq = repmat('gattaca', 1, 10);
numTasks = 2;
numBases = 7048095;
cluster = parcluster('local');
job = createJob(cluster);
[startValues, endValues] = splitDataset(numBases, numTasks);
% Overlap adjacent chunks so a match spanning a boundary is not missed
offsetLeft = floor(length(searchSeq)/2);
if mod(length(searchSeq),2) == 0
    offsetRight = offsetLeft - 1;
else
    offsetRight = offsetLeft;
end
startValues(2:end) = startValues(2:end) - offsetLeft;
endValues(1:end-1) = endValues(1:end-1) + offsetRight;
for tasknum = 1:numTasks
    createTask(job, @genematch, 2, {searchSeq, 'gene.txt', ...
        startValues(tasknum), endValues(tasknum)});
end

Example #2: Gene Matching (cont.)

submit(job);                     % Submit and wait for results
wait(job, 'finished');
results = fetchOutputs(job);     % Report the results
results = cell2mat(results);
[~,idx] = max(results(:,1));     % Return absolute position
bpm = results(idx,1);
msi = results(idx,2) + startValues(idx) - 1;

function [startValues, endValues] = splitDataset(numTotalElements, numTasks)
numPerTask = repmat(floor(numTotalElements/numTasks), 1, numTasks);
leftover = rem(numTotalElements, numTasks);
numPerTask(1:leftover) = numPerTask(1:leftover) + 1;
endValues = cumsum(numPerTask);
startValues = [1 endValues(1:end-1) + 1];

Example #2: Gene Matching (cont.)

function [bestPctMatch,matchStartIdx] = genematch(searchSeq,file,startIdx,endIdx)
fid = fopen(file, 'rt');
geneSeq = fscanf(fid, '%c');
fclose(fid);
if nargin < 3, startIdx = 1; end
if nargin < 4, endIdx = length(geneSeq); end
[bestPctMatch,matchStartIdx] = findsubstr(geneSeq(startIdx:endIdx), searchSeq);

function [bestPctMatch,matchStartIdx] = findsubstr(baseString,searchString)
bestPctMatch = 0;
matchStartIdx = 0;
for startIdx = 1:(length(baseString)-length(searchString)+1)
    currentSection = baseString(startIdx:startIdx+length(searchString)-1);
    pctMatch = nnz(currentSection==searchString)/length(searchString);
    if pctMatch >= bestPctMatch
        bestPctMatch = pctMatch;
        matchStartIdx = startIdx;
    end
end

Lecture Outline (3)

- Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
- Starting a Parallel Pool using parpool
- Task-Parallel Jobs with PCT
  - Embarrassingly Parallel Tasks using parfor
  - Setting up a Simple Independent Job using createJob
- Data-Parallel Jobs with PCT
  - Single Program, Multiple Data using spmd
  - Distributed Datasets and Operations
  - Passing Messages
  - Setting up a Communicating Job
- Increasing Scale using MDCS on the Discovery Cluster
- GPU Computing using MATLAB on the Discovery Cluster

Data-Parallel Jobs with PCT

- Single Program, Multiple Data with spmd
  - Self-identification using labindex and numlabs
  - Types of arrays: replicated, variant, private
  - Composite data type
- Distributed Datasets and Operations
- Passing Messages
  - Practical Considerations
- Setting up Communicating Jobs using createCommunicatingJob
  - Assign one task to run on all labs
  - A lab can pass messages to other labs

Single Program, Multiple Data: spmd

Inside an spmd block, the same code runs on every lab of the pool:
>> spmd
>>   code
>> end
Lab 1: code   Lab 2: code   Lab 3: code   Lab 4: code

Each lab can identify itself with labindex and the lab count with numlabs:
>> spmd
>>   labindex
>>   numlabs
>> end
Lab 1: 1 4
Lab 2: 2 4
Lab 3: 3 4
Lab 4: 4 4

Types of Arrays on Labs in SPMD

Replicated array (same value on every lab):
>> spmd
>>   x = 5;
>> end
Lab 1: x = 5   Lab 2: x = 5   Lab 3: x = 5   Lab 4: x = 5

Variant array (different value on each lab):
>> spmd
>>   y = rand;
>> end
Lab 1: y = 0.3246   Lab 2: y = 0.2646   Lab 3: y = 0.8847   Lab 4: y = 0.8939

Private array (exists only on some labs):
>> spmd
>>   if (labindex==2)
>>     z = 7;
>>   end
>> end
Lab 1: (no z)   Lab 2: z = 7   Lab 3: (no z)   Lab 4: (no z)

Composite Class & Reductions

From the MATLAB client, all three of these array types show up in the workspace as a Composite data type:
>> class(y)
Use curly braces to extract the contents at any lab index:
>> y{3}
Use a global operation to combine the results from all the labs. This performs a reduction and broadcast, which turns a Composite variable into a replicated array:
>> spmd
>>   ay = gcat(y);       % Global concatenation
>>   sy = gplus(y);      % Global summation
>>   my = gop(@max,y);   % Global maximum
>> end
Or, specify a target lab index to perform a reduction only, turning the result into a private array on that lab:
>> spmd
>>   ay1 = gcat(y,1,1);
>>   sy1 = gplus(y,1);
>>   my1 = gop(@max,y,1);
>> end

Distributed Datasets and Operations

Use the distributed data type from the client:
vars = load('airportdata');
dlat = distributed(vars.lat);
dlong = distributed(vars.long);

Or, use the codistributed data type from the labs:
spmd
    vars = load('airportdata');
    clat = codistributed(vars.lat);
    clong = codistributed(vars.long);
end

Use the gather function to convert to a replicated array:
>> allat = gather(dlat);
Or, specify a lab index to convert to a private array:
>> allat1 = gather(dlat, 1);
Also, use getLocalPart to convert to a variant array:
>> loclat = getLocalPart(dlat);
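Once data is distributed, many built-in operations run on it directly, with the work spread across the pool; a small sketch (the array contents here are illustrative, not from airportdata):

% Create a distributed array from the client and operate on it in parallel
d = distributed.rand(1e6, 1);   % partitioned across the pool's workers
m = mean(d);                    % computed cooperatively by the labs
mLocal = gather(m);             % bring the scalar result to the client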

Example #3: Airport Distances

spmd
    R = 3963.2;                   % Earth radius in miles
    vars = load('airportdata');
    lat = codistributed(vars.lat);
    long = codistributed(vars.long);
    lat = 90 - lat;
    long = 360 + long;
    x = R * sind(lat) .* cosd(long);
    y = R * sind(lat) .* sind(long);
    z = R * cosd(lat);
    coords = [x y z]';
    dotprod = coords' * coords;
    mag = sqrt(sum(coords.^2));
    angles = min(dotprod ./ (mag' * mag), 1);
    dist = R * acos(angles);      % Arc length
end

Passing Messages

To send a variable x to another lab:
>> labSend(x, dest_labindex);
To receive a variable x from another lab:
>> x = labReceive(src_labindex);
To see whether a lab is ready to receive data:
>> isReady = labProbe(dest_labindex);
To broadcast a variable x from one lab to all other labs:
>> x = labBroadcast(src_lab, x);
To synchronize all the labs:
>> labBarrier;
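A small sketch combining these primitives, in which lab 1 broadcasts a matrix and a barrier keeps the labs in step (the data here is illustrative):

spmd
    if labindex == 1
        data = labBroadcast(1, magic(4));   % lab 1 is the source
    else
        data = labBroadcast(1);             % other labs receive
    end
    labBarrier;                             % wait until all labs have it
    localSum = sum(data(:)) + labindex;     % each lab works on its copy
end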

Passing Messages: Practical Considerations

The following exchange deadlocks: lab 1 waits to receive from lab 2, and lab 2 waits to receive from lab 1, so neither lab ever reaches its labSend.

spmd
    switch labindex
        case 1
            x1 = labindex * ones(1, 5);  % Create local data
            x2 = labReceive(2);          % Receive data from peer
            labSend(x1, 2);              % Send data to peer
            y = x2;                      % Return peer's data
        case 2
            x2 = labindex * ones(1, 5);  % Create local data
            x1 = labReceive(1);          % Receive data from peer
            labSend(x2, 1);              % Send data to peer
            y = x1;                      % Return peer's data
    end
end

Use the labSendReceive function instead to exchange data between labs:

spmd
    switch labindex
        case 1
            x1 = labindex * ones(1, 5);     % Create local data
            x2 = labSendReceive(2, 2, x1);  % Exchange data with lab 2
            y = x2;                         % Return peer's data
        case 2
            x2 = labindex * ones(1, 5);     % Create local data
            x1 = labSendReceive(1, 1, x2);  % Exchange data with lab 1
            y = x1;                         % Return peer's data
    end
end

Example #4: Parallel Heat Equation

Compute a matrix representing the temperature at each point of a 2D square plate of side length L and thermal diffusivity c. For example, solve for the temperature on a 3m-by-3m copper plate after 40 seconds have elapsed, using 500 time steps of 80 ms each. The thermal diffusivity of copper is 1.13e-4 m^2/s.
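With the sequential solver shown on the next slide, the stated example corresponds to a call like the following (a sketch; heateq takes the step count k, grid size n, time step Ts, plate length L, and diffusivity c, and the grid size n = 100 is an illustrative choice):

k = 500;       % number of time steps
Ts = 0.08;     % time step in seconds (500 * 0.08 s = 40 s)
L = 3;         % plate side length in meters
c = 1.13e-4;   % thermal diffusivity of copper in m^2/s
n = 100;       % interior grid points per side (illustrative)
U = heateq(k, n, Ts, L, c);
imagesc(U); colorbar;   % visualize the temperature field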

Example #4: Sequential Heat Equation

function U = heateq(k, n, Ts, L, c)
ms = L / n;   % mesh spacing
if Ts > (ms^2/2/c), error('Selected time step is too large.'); end
U = initialTempDistrib(n);
north = 1:n; south = 3:(n + 2); curr = 2:(n + 1);
east = 3:(n + 2); west = 1:n;
for iter = 1:k
    U(curr, curr) = U(curr, curr) + c * Ts/(ms^2) * (U(north, curr) + ...
        U(south, curr) - 4*U(curr, curr) + U(curr, east) + U(curr, west));
end

function U = initialTempDistrib(n)
U = 23*ones(n + 2);
U(1, :) = (1:(n + 2))*700/(n + 2);
U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
U(:, 1) = (1:(n + 2))*700/(n + 2);
U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);

Setting up Communicating Jobs

Connect to a cluster:
>> cluster = parcluster('local');
Create a communicating job:
>> job = createCommunicatingJob(cluster,'Type','SPMD');
Create a repeating task for the job to handle:
>> createTask(job,@parheateqn,1,{1e3,500,0.08,3,1.13e-4});
Set a range for the number of workers needed:
>> set(job,'NumWorkersRange',[3 3]);
Submit the job. Optionally, wait for it to finish:
>> submit(job); wait(job,'finished');
Retrieve the results:
>> results = fetchOutputs(job);
>> U = cell2mat(results');
>> imagesc(U)
When finished, clean up by deleting the job and clearing the object:
>> delete(job); clear job;

Example #4: Parallel Heat Equation

function U = parheateqn(k, n, Ts, L, c)
ms = L / n;
if (Ts > (ms^2/2/c)), error('Selected time step is too large.'); end
Uinit = initialTempDistrib(n);
% Partition the columns of the grid across the labs
parts = codistributor1d.defaultPartition(n+2);
numLocalCols = parts(labindex);
leftColInd = sum(parts(1:labindex - 1)) + 1;
rightColInd = leftColInd + numLocalCols - 1;
U = Uinit(:, leftColInd:rightColInd);
% Add ghost columns to hold the neighbors' boundary data
if (labindex > 1), U = [zeros(n+2, 1) U]; end
if (labindex < numlabs), U = [U zeros(n+2, 1)]; end
if (labindex == 1) || (labindex == numlabs)
    numLocalCols = numLocalCols - 1;
end
rightNeighbor = mod(labindex, numlabs) + 1;
leftNeighbor = mod(labindex - 2, numlabs) + 1;
north = 1:n; south = 3:n + 2; currRow = 2:n + 1;
currCol = 2:numLocalCols + 1; east = 3:numLocalCols + 2; west = 1:numLocalCols;

Example #4: Parallel Heat Equation (cont.)

for iter = 1:k
    % Exchange boundary columns with neighboring labs
    rightBoundary = labSendReceive(leftNeighbor, rightNeighbor, U(:,2));
    leftBoundary = labSendReceive(rightNeighbor, leftNeighbor, U(:,end-1));
    if (labindex > 1), U(:, 1) = leftBoundary; end
    if (labindex < numlabs), U(:, end) = rightBoundary; end
    % Update grid for current iteration
    U(currRow,currCol) = U(currRow,currCol) + ...
        c*Ts/(ms^2)*(U(north,currCol) + U(south,currCol) - ...
        4*U(currRow,currCol) + U(currRow,east) + U(currRow,west));
end
% Combine parts from all labs into a single matrix stored on lab 1
U = gcat(U(currRow, currCol), 2, 1);

function U = initialTempDistrib(n)
U = 23*ones(n + 2);
U(1, :) = (1:(n + 2))*700/(n + 2);
U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
U(:, 1) = (1:(n + 2))*700/(n + 2);
U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);

Summary: Problem Types

                 Interactive         Batch
Task-Parallel    parpool, parfor     createJob, createTask
Data-Parallel    spmd                createCommunicatingJob

Lecture Outline (4)

- Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
- Starting a Parallel Pool using parpool
- Task-Parallel Jobs with PCT
  - Embarrassingly Parallel Tasks using parfor
  - Setting up a Simple Independent Job using createJob
- Data-Parallel Jobs with PCT
  - Single Program, Multiple Data using spmd
  - Distributed Datasets and Operations
  - Passing Messages
  - Setting up a Communicating Job
- Increasing Scale using MDCS on the Discovery Cluster
- GPU Computing using MATLAB on the Discovery Cluster

Increasing Scale using Multiple Systems with MDCS

MDCS One-Time Setup on the Discovery Cluster

Load the required modules for MATLAB R2013b:
> module whatis matlab_dce_2013b
Best practice is to add the following lines to your ~/.bashrc file (if they're not already in there):
module load gnu-4.4-compilers
module load fftw-3.3.3
module load platform-mpi
module load oracle_java_1.7u40
module load matlab_dce_2013b
Then, log out and log back in to the Discovery Cluster for the changes to take effect.
Copy the .matlab directory to your home directory:
> cp -R /shared/apps/matlab/matlab-2013b/env_script/.matlab ~/.
Get a compute node on the ht-10g queue interactively:
> bsub -Is -n 2 -q ht-10g /bin/bash
Output is like: <<Starting on compute-0-007>>
Verify that the proper modules have been loaded:
> module list
Run MATLAB with no display to verify that MATLAB is installed correctly:
> matlab -logfile ./output.txt -dmlworker -nodisplay -r "ver;exit"
The terminal output shows MATLAB start, display all its product versions, and then exit. When you're done on the compute node, exit out of it:
> exit

Source: http://nuweb12.neu.edu/rc/?page_id=18#matjobs

Running an MDCS Submit Script on the Discovery Cluster

Create a Platform LSF submit script called "bsub_parfor.bash" with the following content:

#!/bin/bash
#BSUB -L /bin/bash
#BSUB -J BensParforJob.01
#BSUB -q ht-10g
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -n 9
work=/home/drozdenko.b/hpc/matlab_dcs_test
MATLAB_infile=parfor_parallel
cd $work
matlab -logfile ./output.txt -nodisplay -r $MATLAB_infile

Always set -n to one more than the number of MATLAB worker threads your code expects.
Submit the job and check your job and output:
> bsub < bsub_parfor.bash
> bjobs -w
Output is something like: JOBID 36768/USER drozdenko.b/STAT RUN…
> bpeek 582677   (if bjobs shows that your job's status is still running)
> cat output.txt   (once bjobs shows that your job is finished)
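The MATLAB_infile name parfor_parallel refers to a script the slide does not show; a minimal sketch of what such a file might contain (hypothetical contents, reusing the earlier birthday example; a pool of 8 workers plus the client session accounts for the 9 slots requested with -n 9):

% parfor_parallel.m (hypothetical contents)
parpool(8);   % 8 workers + 1 client session = 9 LSF slots
prob = zeros(1, 6);
parfor i = 1:6
    prob(i) = runBirthday(1e5, 5*i);
end
disp(prob);
delete(gcp);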

Running Jobs in Batch Mode from the MATLAB GUI on the Discovery Cluster

Start an interactive session with X11 forwarding on the ht-10g queue:
> bsub -Is -XF -n 1 -q ht-10g /bin/bash
Output is something like: <<Starting on compute-0-006>>
Ensure that the modules needed for MATLAB are loaded:
> module list
Run MATLAB:
> matlab &
Configure cluster profile settings:
>> configCluster('discovery');
>> ClusterInfo.setQueueName('ht-10g')
>> ClusterInfo.setProcsPerNode(16)
Submit a batch SPMD job using the batch function:
>> j = batch('spmd_parallel','matlabpool',16);
Wait for the job to finish. Check the diary. Fetch the outputs:
>> j.State
>> j.wait
>> j.diary
>> out = j.fetchOutputs{:}
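As with parfor_parallel above, spmd_parallel is a script the slide does not show; a minimal sketch of what it might contain (hypothetical contents):

% spmd_parallel.m (hypothetical contents)
spmd
    fprintf('Lab %d of %d\n', labindex, numlabs);
    partial = sum((labindex:numlabs:1e6).^2);   % each lab sums a stride
end
total = sum([partial{:}]);   % combine the Composite on the client
disp(total)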

GPU Setup on the Discovery Cluster

Load the required CUDA modules (in addition to the already loaded MATLAB R2013b modules):
> module whatis cuda-5.5
Best practice is to add the following lines to your ~/.bashrc file (if they're not already in there):
module load gnu-4.4-compilers
module load fftw-3.3.3
module load platform-mpi
module load cuda-5.5
Start an interactive session with X11 forwarding on the par-gpu queue:
> bsub -Is -XF -n 1 -q par-gpu-2 /bin/bash
Output: <<Starting on compute-2-160>>
Run MATLAB:
> matlab &
Confirm that you are connected to a GPU device:
>> gpuDeviceCount
The MATLAB command line output should be "ans = 1".
Get GPU device information:
>> d = gpuDevice
The output shows properties including the following:

GPU Property          Value
Name                  Tesla K20m/40m
ComputeCapability     3.5
SupportsDouble        1
MaxThreadsPerBlock    1024
MaxShmemPerBlock      49152
MaxThreadBlockSize    [1024 1024 64]
MaxGridSize           [2.1475e+09 65535 65535]

Sources:
http://nuweb12.neu.edu/rc/?page_id=18#gpujobs
http://www.mathworks.com/help/distcomp/identify-and-select-a-gpu-device.html

Run Built-in Functions on GPU from the MATLAB GUI on the Discovery Cluster

Try running and timing the MATLAB built-in function fft with an array of 10 million random doubles:
n = 1e7;
r = rand(n,1);
tic; rf = fft(r); toc;
Output is like: Elapsed time is 0.151070 seconds.
Next, put the array on the GPU and run the GPU version of the fft function on the same array. When you're done with the gpuArray data, use the gather function to transfer it back to your local workspace:
tic; g = gpuArray(r); gf = fft(g); gg = gather(gf); toc;
Output is like: Elapsed time is 0.124526 seconds.
Note that the GPU version runs slightly faster with an array of this size, even with the host-to-device and device-to-host transfers included in the timed region.
Refer to the MathWorks documentation for the latest list of built-in functions: http://www.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html
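For more reliable GPU timing than tic/toc (GPU calls are asynchronous, so toc can fire before the device finishes), PCT provides gputimeit, which synchronizes the device and times only the computation; a quick sketch (assumes gputimeit is available in your release):

g = gpuArray.rand(1e7, 1);      % data already resident on the device
tGpu = gputimeit(@() fft(g));   % excludes host-device transfer time
fprintf('GPU FFT: %.4f s\n', tGpu);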

Run CUDA PTX Files on GPU from the MATLAB GUI on the Discovery Cluster

Create a CUDA C kernel function. This add2 function adds two double vectors:

__global__ void add2(double *v1, const double *v2)
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}

Next, compile the CUDA C kernel function using nvcc, producing only the .PTX file:
> nvcc -ptx gpufcn.cu
In the MATLAB GUI, create a CUDAKernel object and set its properties:
>> k = parallel.gpu.CUDAKernel('gpufcn.ptx','gpufcn.cu','add2');
>> k.ThreadBlockSize = 128;
Call the feval function to run the CUDA kernel with gpuArray data. (Because the kernel indexes only by threadIdx.x and the grid size defaults to a single block, choose n no larger than the thread block size of 128.)
>> x1 = gpuArray(rand(n,1));
>> x2 = gpuArray(rand(n,1));
>> y = feval(k,x1,x2);
>> yg = gather(y);
Refer to the MathWorks documentation for more detailed instructions: http://www.mathworks.com/help/distcomp/run-cuda-or-ptx-code-on-gpu.html

Cleanup on the Discovery Cluster

Close any interactive MATLAB GUI windows:
>> exit
Check to see if you have any other processes still running on each compute node:
> ps
Ignore ps and bash; for any other listed processes, use the kill command:
> kill 29777
Exit each interactive session on a compute node:
> exit
Check to see if you have any remaining jobs still running or queued:
> bjobs -w
Ignore jobs on a Discovery login node where JOB_NAME is /bin/bash; for all other listed jobs, use the bkill command:
> bkill 506225
When finished, exit each remaining interactive session and exit the Discovery Cluster:
> exit

Conclusion

Use Parallel Computing Toolbox on your local machine to prototype your parallel algorithms: parfor for task-parallel algorithms (comparable to OpenMP) and spmd for data-parallel algorithms (comparable to MPI). Then move your parallel algorithms onto the Discovery Cluster with MDCS to see significant speedup. You can also perform GPU computing on the Discovery Cluster using gpuArrays, built-in functions, CUDA PTX files, and gather.

Questions?