Parallel Computing with MATLAB®
How to Use Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™ on Discovery Cluster
An EECE5640: High Performance Computing lecture
Benjamin Drozdenko, MathWorks TA & Graduate Research Assistant
February 17, 2016

MathWorks Teaching Assistant
Need help with MATLAB or Simulink? Contact the MathWorks Teaching Assistant, Ben Drozdenko, a Ph.D. student and former MathWorks employee.
For personalized, immediate on-campus assistance, please attend my Spring 2016 office hours in Snell Library, 1st floor, room 138B:
  Mondays: 1:30-5:30 pm
  Wednesdays: 9 am-12 pm
  Fridays: 12 pm-5 pm
To find the office, enter Snell Library and go down the hallway to the left of the information desk. At the end of the hallway, turn left at the printing station. Office hours are in the smaller study room, the “Bullpen”, room 138B.
For general or public MATLAB & Simulink questions, please visit my Blackboard site: MATLAB Help & Mentoring.
For specific or private MATLAB & Simulink questions, please send me an

Lecture Outline
  Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
    Starting a Parallel Pool using parpool
  Task-Parallel Jobs with PCT
    Embarrassingly Parallel Tasks using parfor
    Setting up a Simple Independent Job using createJob
  Data-Parallel Jobs with PCT
    Single Program, Multiple Data using spmd
    Distributed Datasets and Operations
    Passing Messages
    Setting up a Communicating Job
  Increasing Scale using MDCS on the Discovery Cluster
  GPU Computing using PCT on the Discovery Cluster
EECE5640 SP2016

MathWorks® Products
MATLAB
  High-level language and interactive environment for numerical computation, visualization, and programming.
  Language, tools, and built-in math functions to explore multiple approaches and reach a solution faster.
  Used for a range of applications, including signal processing and communications, image and video processing, control systems, test and measurement, computational finance, and computational biology.
Parallel Computing Toolbox™ (PCT)
  Solve computationally intensive and data-intensive problems using multicore processors, GPUs, and computer clusters.
  High-level constructs (parallel for-loops, special array types, and parallelized numerical algorithms) let you parallelize MATLAB® applications without CUDA or MPI programming.
MATLAB® Distributed Computing Server™ (MDCS)
  Run computationally intensive MATLAB® programs and Simulink® models on computer clusters, clouds, and grids.
  Develop your program or model on a multicore desktop computer using PCT, then scale up to many computers by running it on MDCS.
  The server includes a built-in cluster job scheduler and supports commonly used third-party schedulers (e.g. Platform LSF, MS Windows HPCS, PBS TORQUE).
Source: MathWorks® Product Page

Starting Parallel Pool with parpool
To get an estimate of the number of computational cores on the local machine:
  >> maxNumCompThreads
To open a pool of 4 workers, type at the command prompt:
  >> parpool(4)
This command starts 4 new instances of MATLAB, which are available for computations as part of a resource pool.
In Windows, press Ctrl-Alt-Del to start Task Manager and select Processes. In total, there are now 5 instances of MATLAB.exe running on the system: 4 as part of the pool plus the original instance.
To query the current pool:
  >> p = gcp; p.NumWorkers
A parallel pool starts automatically when you execute a parallel language construct that runs on a pool, such as parfor or spmd.
When finished, close the pool:
  >> delete(gcp)
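The pool commands above can be combined so that a script reuses an open pool instead of trying to start a second one. This is a minimal sketch, assuming a release where parpool and gcp exist (R2013b or later); the variable name pool is illustrative and the size of 4 simply follows the slide.

```matlab
% Reuse the current pool if one is open; otherwise start 4 workers.
pool = gcp('nocreate');        % returns an empty array when no pool is open
if isempty(pool)
    pool = parpool(4);         % assumes the default profile allows 4 workers
end
fprintf('Pool has %d workers.\n', pool.NumWorkers);
```

Calling gcp without 'nocreate' would start a pool with the default size, which may not be the size you want.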

Task-Parallel Jobs with PCT
Parallel for-loops with the parfor keyword
  Each worker runs independently
  Ideally suited for embarrassingly parallel tasks
  Specific rules for usage
  No communication between loop iterations
Parallel jobs using the createJob function
  Assign specific tasks to run on workers

Example #1: Birthday Paradox
What is the probability that, in a group of 30 randomly selected individuals, at least two share the same birthday?
Assuming independent events,
  $p_{bday} = 1 - \dfrac{30! \cdot \binom{365}{30}}{365^{30}} \approx 70.6\%$
MATLAB® code for one trial:
  function match = birthday(groupSize)
  bdays = randi(365, groupSize, 1);   % random birthdays, one per person
  bdays = sort(bdays);
  match = any(diff(bdays) == 0);      % equal neighbors after sorting mean a shared birthday
Code to run many trials sequentially (“brute-force algorithm”):
  function prob = runBirthday(numtrials, groupsize)
  matches = false(1,numtrials);
  for trial = 1:numtrials
      matches(trial) = birthday(groupsize);
  end
  prob = sum(matches)/numtrials;
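As a cross-check on the Monte Carlo estimate, the closed-form probability can be evaluated directly. This sketch assumes 365 equally likely birthdays and uses the product form of the same expression, (365/365)(364/365)...(336/365), which avoids the large factorials; the variable names are illustrative.

```matlab
groupSize = 30;
% Probability that all birthdays differ: product of (365 - k)/365 for k = 0..29
pNoMatch = prod((365 - (0:groupSize-1)) / 365);
pMatch = 1 - pNoMatch;         % probability of at least one shared birthday
fprintf('P(match) for %d people: %.4f\n', groupSize, pMatch);
```

The result should agree with the 70.6% quoted on the slide, and the simulated value from runBirthday should approach it as the number of trials grows.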

Parallel For Loops with parfor
Use the keyword parfor to turn a for-loop with independent iterations into a parallel loop that runs those iterations on different workers.
Code to run many trials in parallel:
  function prob = pRunBirthday(numtrials, groupsize)
  matches = false(1,numtrials);
  parfor trial = 1:numtrials
      matches(trial) = birthday(groupsize);
  end
  prob = sum(matches)/numtrials;
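The loop above stores one logical per trial in a sliced output array. An alternative, sketched here under the assumption that only the final count is needed, is a parfor reduction variable. The function name pRunBirthdayReduce is hypothetical (not from the lecture), and the trial logic is inlined so the file is self-contained.

```matlab
function prob = pRunBirthdayReduce(numTrials, groupSize)
% Hypothetical variant of pRunBirthday that accumulates a count
% with a parfor reduction instead of filling a logical array.
count = 0;
parfor trial = 1:numTrials
    bdays = sort(randi(365, groupSize, 1));
    % "count = count + ..." is recognized by parfor as a reduction
    count = count + any(diff(bdays) == 0);
end
prob = count / numTrials;
end
```

Saved as pRunBirthdayReduce.m, it would be called as `prob = pRunBirthdayReduce(1e5, 30)`. The reduction avoids returning a numTrials-long array from the workers, which may matter for very large trial counts.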

Time Code using tic and toc
With the pool still open, time the parallel version:
  >> tic; p = pRunBirthday(1e5,30), toc
Close the pool, and time the sequential version:
  >> delete(gcp);
  >> tic; p = runBirthday(1e5,30), toc
On my local machine:
  Ntrials                  1e5        1e6
  Sequential               0.40 sec   4.00 sec
  Parallel, Nworkers=4     0.24 sec   1.81 sec
  Speedup                  1.67X      2.2X
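The speedup row follows from dividing sequential time by parallel time. The sketch below reproduces that arithmetic from the timings quoted on the slide and also computes parallel efficiency (speedup divided by worker count), a derived figure that the slide itself does not report.

```matlab
tSeq = [0.40 4.00];            % sequential times in seconds for 1e5 and 1e6 trials
tPar = [0.24 1.81];            % parallel times in seconds with 4 workers
nWorkers = 4;
speedup = tSeq ./ tPar;        % about 1.67 and 2.21
efficiency = speedup / nWorkers;
fprintf('Speedup: %.2fX, efficiency: %.0f%%\n', [speedup; 100*efficiency]);
```

The larger problem gets closer to the ideal 4X because the fixed cost of distributing work is amortized over more trials.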

Setup a Parallel Job using createJob
Observe the behavior of a parallel for-loop by running the following; the iterations are not displayed in sequential order:
  >> parfor i=1:10, disp(i); end
parfor allows little control over parallel execution. To assign specific tasks to each worker, create an independent job instead.
Connect to a cluster:
  >> cluster = parcluster('local');
Create an independent job:
  >> job = createJob(cluster);

Setup a Parallel Job using createJob (cont.)
Create many tasks for the job to handle (you could use a for loop or a while loop for this).
  >> createTask(job, @runBirthday, 1, {1e5,5});
  >> createTask(job, @runBirthday, 1, {1e5,10});
  >> createTask(job, @runBirthday, 1, {1e5,15});
  >> createTask(job, @runBirthday, 1, {1e5,20});
  >> createTask(job, @runBirthday, 1, {1e5,25});
  >> createTask(job, @runBirthday, 1, {1e5,30});
Submit the job. Optionally, wait for it to finish.
  >> submit(job); wait(job, 'finished');
Retrieve the results.
  >> results = fetchOutputs(job);
  >> results{end,1}
  >> r = cell2mat(results); mean(r)
When finished, delete the job and clear the object.
  >> delete(job); clear job;
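The slide notes that the tasks could be created in a loop. A minimal sketch of that is shown here, assuming each task calls runBirthday from Example #1 (the task function is an assumption based on the argument lists) and that runBirthday.m and birthday.m are on the path of the local workers.

```matlab
cluster = parcluster('local');
job = createJob(cluster);
groupSizes = 5:5:30;                      % one task per group size
for g = groupSizes
    % 1 output argument; inputs are {numtrials, groupsize}
    createTask(job, @runBirthday, 1, {1e5, g});
end
submit(job);
wait(job, 'finished');
results = fetchOutputs(job);              % one row of outputs per task
probs = cell2mat(results);                % column vector of probabilities
delete(job);
clear job;
```

Row k of probs corresponds to groupSizes(k), since fetchOutputs returns outputs in task creation order.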

Example #2: Gene Matching
  function results = pargenematchsol()
  searchSeq = repmat('gattaca', 1, 10);
  numTasks = 2;
  numBases = ;   % number of bases in gene.txt (value not given)
  cluster = parcluster('local');
  job = createJob(cluster);
  [startValues, endValues] = splitDataset(numBases, numTasks);
  % Overlap adjacent segments so that a match spanning a boundary is not missed
  offsetLeft = floor(length(searchSeq)/2);
  if mod(length(searchSeq),2) == 0
      offsetRight = offsetLeft - 1;
  else
      offsetRight = offsetLeft;
  end
  startValues(2:end) = startValues(2:end) - offsetLeft;
  endValues(1:end-1) = endValues(1:end-1) + offsetRight;
  for tasknum = 1:numTasks
      createTask(job, @genematch, 2, {searchSeq, 'gene.txt', ...
          startValues(tasknum), endValues(tasknum)});
  end

Example #2: Gene Matching (cont.)
  % Submit and wait for results
  submit(job);
  wait(job, 'finished');
  % Report the results
  results = fetchOutputs(job);
  results = cell2mat(results);
  % Return absolute position
  [~,idx] = max(results(:,1));
  bpm = results(idx,1);
  msi = results(idx,2)+startValues(idx)-1;

  function [startValues, endValues] = splitDataset(numTotalElements, numTasks)
  numPerTask = repmat(floor(numTotalElements/numTasks), 1, numTasks);
  leftover = rem(numTotalElements, numTasks);
  numPerTask(1:leftover) = numPerTask(1:leftover) + 1;
  endValues = cumsum(numPerTask);
  startValues = [1 endValues(1:end-1) + 1];
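To see how splitDataset divides the index range, here are the same statements traced for a small case. The sizes (10 elements, 3 tasks) are illustrative values chosen for this sketch and do not come from the lecture.

```matlab
numTotalElements = 10;
numTasks = 3;
numPerTask = repmat(floor(numTotalElements/numTasks), 1, numTasks);  % [3 3 3]
leftover = rem(numTotalElements, numTasks);                          % 1
numPerTask(1:leftover) = numPerTask(1:leftover) + 1;                 % [4 3 3]
endValues = cumsum(numPerTask);                                      % [4 7 10]
startValues = [1 endValues(1:end-1) + 1];                            % [1 5 8]
```

The leftover elements go to the first tasks, so segment sizes differ by at most one.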

Example #2: Gene Matching (cont.)
  function [bestPctMatch, matchStartIdx] = genematch(searchSeq, file, startIdx, endIdx)
  % Read the whole gene sequence from the file
  fid = fopen(file, 'rt');
  geneSeq = fscanf(fid, '%c');
  fclose(fid);
  % Default to the full sequence when no range is given
  if nargin < 3, startIdx = 1; end
  if nargin < 4, endIdx = length(geneSeq); end
  [bestPctMatch, matchStartIdx] = findsubstr(geneSeq(startIdx:endIdx), searchSeq);

  function [bestPctMatch, matchStartIdx] = findsubstr(baseString, searchString)
  bestPctMatch = 0;
  matchStartIdx = 0;
  % Slide a window over baseString and score the fraction of matching characters
  for startIdx = 1:(length(baseString)-length(searchString)+1)
      currentSection = baseString(startIdx:startIdx+length(searchString)-1);
      pctMatch = nnz(currentSection == searchString)/length(searchString);
      if pctMatch >= bestPctMatch
          bestPctMatch = pctMatch;
          matchStartIdx = startIdx;
      end
  end

Data-Parallel Jobs with PCT
Single Program, Multiple Data with spmd
  Self-identification using labindex and numlabs
  Types of arrays: replicated, variant, private
  Composite data type
Distributed Datasets and Operations
Passing Messages
  Practical Considerations
Setting up Communicating Jobs using createCommunicatingJob
  Assign one task to run on all labs
  A lab can pass messages to other labs

Single Program, Multiple Data: spmd
The code inside an spmd block runs on every lab in the pool (here, Lab 1 through Lab 4):
  >> spmd
  >>   code
  >> end
Each lab identifies itself with labindex and finds the total number of labs with numlabs:
  >> spmd
  >>   labindex
  >>   numlabs
  >> end

Types of Arrays on Labs in SPMD
Replicated array: the same value on every lab (x = 5 on Lab 1 through Lab 4).
  >> spmd
  >>   x = 5;
  >> end
Variant array: a different value on each lab (each lab holds its own y).
  >> spmd
  >>   y = rand;
  >> end
Private array: defined on only one lab (z = 7 on Lab 2 only).
  >> spmd
  >>   if (labindex==2)
  >>     z = 7;
  >>   end
  >> end

Composite Class & Reductions
From the MATLAB client, all three of these array types show up in the workspace as a Composite data type.
  >> class(y)
Use curly braces to extract the contents at any lab index.
  >> y{3}
Use a global operation to combine the results from all the labs. This performs a reduction and broadcast, which turns a Composite variable into a replicated array.
  >> spmd
  >>   ay = gcat(y);        % Global concatenation
  >>   sy = gplus(y);       % Global summation
  >>   my = gop(@max, y);   % Global maximum
  >> end
Or, specify a lab index to perform a reduction whose result is stored only on that lab, as a private array:
  >> spmd
  >>   ay1 = gcat(y,1,1);
  >>   sy1 = gplus(y,1);
  >>   my1 = gop(@max, y, 1);
  >> end
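A small check of the reduction behavior: if each lab contributes its own labindex, the global sum must equal n(n+1)/2 and the global maximum must equal n for n labs. This sketch assumes a pool is open or can be started by default, and uses the same R2013b-era function names as the lecture; the variable names are illustrative.

```matlab
spmd
    v = labindex;              % variant: a different value on each lab
    total = gplus(v);          % replicated: sum over all labs
    largest = gop(@max, v);    % replicated: maximum over all labs
end
pool = gcp;
n = pool.NumWorkers;
% On the client, total and largest are Composites; any entry holds the result.
fprintf('Sum = %d (expected %d), max = %d\n', total{1}, n*(n+1)/2, largest{1});
```

Because the results are replicated, total{1} and total{n} hold the same value.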

Distributed Datasets and Operations
Use the distributed data type from the client.
  vars = load('airportdata');
  dlat = distributed(vars.lat);
  dlong = distributed(vars.long);
OR: use the codistributed data type from the labs.
  spmd
      vars = load('airportdata');
      clat = codistributed(vars.lat);
      clong = codistributed(vars.long);
  end
Use the gather function to convert to a replicated array.
  >> allat = gather(dlat);
Or, inside an spmd block, specify a lab index to convert to a private array.
  >> allat1 = gather(dlat, 1);
Also inside an spmd block, use getLocalPart to convert to a variant array.
  >> loclat = getLocalPart(dlat);

Example #3: Airport Distances
  spmd
      R = ; % in miles
      vars = load('airportdata');
      lat = codistributed(vars.lat);
      long = codistributed(vars.long);
      % Convert latitude to polar angle and compute Cartesian coordinates
      lat = 90 - lat;
      long = long;
      x = R * sind(lat) .* cosd(long);
      y = R * sind(lat) .* sind(long);
      z = R * cosd(lat);
      coords = [x y z]';
      % Angle between every pair of position vectors
      dotprod = coords' * coords;
      mag = sqrt(sum(coords.^2));
      angles = min(dotprod ./ (mag' * mag), 1);
      dist = R * acos(angles); % Arc length
  end

Passing Messages
To send a variable x to another lab:
  >> labSend(x, dest_labindex);
To receive a variable x from another lab:
  >> x = labReceive(src_labindex);
To test whether data sent from another lab is ready to be received:
  >> isReady = labProbe(src_labindex);
To broadcast a variable x to all other labs:
  >> x = labBroadcast(src_lab,x);
To synchronize all the labs:
  >> labBarrier;

Passing Messages: Practical Considerations
  spmd
      switch labindex
          case 1
              x1 = labindex * ones(1, 5);   % Create local data
              x2 = labReceive(2);           % Receive data from peer
              labSend(x1, 2);               % Send data to peer
              y = x2;                       % Return peer's data
          case 2
              x2 = labindex * ones(1, 5);   % Create local data
              x1 = labReceive(1);           % Receive data from peer
              labSend(x2, 1);               % Send data to peer
              y = x1;                       % Return peer's data
      end
  end
Deadlock! Lab 1 waits to receive from Lab 2 while Lab 2 waits to receive from Lab 1, so neither reaches its labSend.
Use the labSendReceive function instead to exchange data between labs.
  spmd
      switch labindex
          case 1
              x1 = labindex * ones(1, 5);       % Create local data
              x2 = labSendReceive(2, 2, x1);    % Exchange data with lab 2
              y = x2;                           % Return peer's data
          case 2
              x2 = labindex * ones(1, 5);       % Create local data
              x1 = labSendReceive(1, 1, x2);    % Exchange data with lab 1
              y = x1;                           % Return peer's data
      end
  end
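The same exchange pattern generalizes from two labs to a ring. The sketch below, which assumes a pool of at least two workers, has every lab send its index to the right neighbor and receive from the left neighbor in a single labSendReceive call. The variable fromLeft is illustrative; the neighbor formulas are the ones used in the parallel heat equation example later in the lecture.

```matlab
spmd
    rightNeighbor = mod(labindex, numlabs) + 1;      % lab to send to
    leftNeighbor  = mod(labindex - 2, numlabs) + 1;  % lab to receive from
    % Send and receive in one call, so no lab blocks waiting for another
    fromLeft = labSendReceive(rightNeighbor, leftNeighbor, labindex);
end
% On the client, fromLeft is a Composite: entry k holds the index of
% lab k's left neighbor (lab 1 wraps around to the last lab).
pool = gcp;
fprintf('Lab 1 received %d from its left neighbor.\n', fromLeft{1});
```

With separate labSend and labReceive calls, every lab in the ring would need a carefully ordered sequence to avoid the deadlock shown above.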

Example #4: Parallel Heat Equation
Compute a matrix representing the temperature at each point of a 2D square plate with side length L and thermal diffusivity c. For example, solve for the temperature on a 3 m-by-3 m copper plate after 40 seconds have elapsed, using 500 time steps of 80 ms each. The thermal diffusivity of copper is 1.13e-4 m^2/s.

Example #4: Sequential Heat Equation
  function U = heateq(k, n, Ts, L, c)
  ms = L / n;   % mesh spacing
  if Ts > (ms^2/2/c), error('Selected time step is too large.'); end
  U = initialTempDistrib(n);
  % Index ranges for the interior points and their four neighbors
  north = 1:n; south = 3:(n + 2); curr = 2:(n + 1);
  east = 3:(n + 2); west = 1:n;
  for iter = 1:k
      U(curr, curr) = U(curr, curr) + c * Ts/(ms^2) * (U(north, curr) + ...
          U(south, curr) - 4*U(curr, curr) + U(curr, east) + U(curr, west));
  end

  function U = initialTempDistrib(n)
  U = 23*ones(n + 2);
  U(1, :) = (1:(n + 2))*700/(n + 2);
  U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
  U(:, 1) = (1:(n + 2))*700/(n + 2);
  U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
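The time-step guard at the top of heateq can be checked by hand for the copper-plate example. This sketch plugs in the plate size, diffusivity, and time step from the problem statement; the grid size n = 500 is an assumption taken from the task arguments used later in the lecture.

```matlab
L = 3;            % plate side length in meters
n = 500;          % grid points per side (assumed)
c = 1.13e-4;      % thermal diffusivity of copper in m^2/s
Ts = 0.08;        % time step in seconds
ms = L / n;                   % mesh spacing: 0.006 m
TsMax = ms^2 / 2 / c;         % largest step the guard accepts: about 0.159 s
isAccepted = Ts <= TsMax;     % true, so heateq would not raise its error
totalTime = 500 * Ts;         % 500 steps of 80 ms give 40 s
fprintf('ms = %.3f m, TsMax = %.3f s, accepted = %d\n', ms, TsMax, isAccepted);
```

Halving the mesh spacing cuts the allowed time step by a factor of four, so finer grids need many more iterations to cover the same elapsed time.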

Setting up Communicating Jobs
Connect to a cluster.
  >> cluster = parcluster('local');
Create a communicating job.
  >> job = createCommunicatingJob(cluster,'Type','SPMD');
Create a repeating task for the job to handle.
  >> createTask(job, @parheateqn, 1, {1e3,500,0.08,3,1.13e-4});
Set a range for the number of workers needed.
  >> set(job,'NumWorkersRange',[3 3]);
Submit the job. Optionally, wait for it to finish.
  >> submit(job); wait(job,'finished');
Retrieve the results.
  >> results = fetchOutputs(job);
  >> U = cell2mat(results');
  >> imagesc(U)
When finished, clean up. Delete the job and clear the object.
  >> delete(job); clear job;

Example #4: Parallel Heat Equation
  function U = parheateqn(k, n, Ts, L, c)
  ms = L / n;
  if (Ts>(ms^2/2/c)), error('Selected time step is too large.'); end
  Uinit = initialTempDistrib(n);
  % Give each lab a contiguous block of columns
  parts = codistributor1d.defaultPartition(n+2);
  numLocalCols = parts(labindex);
  leftColInd = sum(parts(1:labindex - 1)) + 1;
  rightColInd = leftColInd + numLocalCols - 1;
  U = Uinit(:, leftColInd:rightColInd);
  % Pad with one extra column on each side that borders another lab
  if (labindex > 1), U = [zeros(n+2, 1) U]; end
  if (labindex < numlabs), U = [U zeros(n+2, 1)]; end
  if (labindex == 1) || (labindex == numlabs)
      numLocalCols = numLocalCols - 1;
  end
  rightNeighbor = mod(labindex, numlabs) + 1;
  leftNeighbor = mod(labindex - 2, numlabs) + 1;
  north = 1:n; south = 3:n + 2; currRow = 2:n + 1;
  currCol = 2:numLocalCols + 1;
  east = 3:numLocalCols + 2; west = 1:numLocalCols;

Example #4: Parallel Heat Equation (cont.)
  for iter = 1:k
      % Exchange boundary columns with the neighboring labs
      rightBoundary = labSendReceive(leftNeighbor,rightNeighbor,U(:,2));
      leftBoundary = labSendReceive(rightNeighbor,leftNeighbor,U(:,end-1));
      if (labindex > 1), U(:, 1) = leftBoundary; end
      if (labindex < numlabs), U(:, end) = rightBoundary; end
      % Update grid for current iteration
      U(currRow,currCol) = U(currRow,currCol) + ...
          c*Ts/(ms^2)*(U(north,currCol) + U(south,currCol) - ...
          4*U(currRow,currCol) + U(currRow,east) + U(currRow,west));
  end
  % Combine parts from all labs into a single matrix stored on lab 1
  U = gcat(U(currRow, currCol), 2, 1);

  function U = initialTempDistrib(n)
  U = 23*ones(n + 2);
  U(1, :) = (1:(n + 2))*700/(n + 2);
  U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
  U(:, 1) = (1:(n + 2))*700/(n + 2);
  U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);

Summary: Problem Types
                  Interactive        Batch
  Task-Parallel   parpool, parfor    createJob, createTask
  Data-Parallel   parpool, spmd      createCommunicatingJob, createTask

Increasing Scale using Multiple Systems with MDCS

MDCS One-Time Setup on the Discovery Cluster
Load the required modules for MATLAB R2013b.
  > module whatis matlab_dce_2013b
Best practice is to add the following lines to your ~/.bashrc file (if they are not already there):
  module load gnu-4.4-compilers
  module load fftw
  module load platform-mpi
  module load oracle_java_1.7u40
  module load matlab_dce_2013b
Then log out and log back in to the Discovery Cluster for the changes to take effect.
Copy the .matlab directory to your home directory.
  > cp -R /shared/apps/matlab/matlab-2013b/env_script/.matlab ~/.
Get a compute node on the ht-10g queue interactively.
  > bsub -Is -n 2 -q ht-10g /bin/bash
  Output like: <<Starting on compute-0-007>>
Verify that the proper modules have been loaded.
  > module list
Run MATLAB with no display to verify that MATLAB is installed correctly.
  > matlab -logfile ./output.txt -dmlworker -nodisplay -r "ver;exit"
The terminal output shows MATLAB start, display all its product versions, and then exit.
When you are done on the compute node, exit it.
  > exit

Running an MDCS Submit Script on the Discovery Cluster
Create a Platform LSF submit script called “bsub_parfor.bash” with the following content:
  #!/bin/bash
  #BSUB -L /bin/bash
  #BSUB -J BensParforJob.01
  #BSUB -q ht-10g
  #BSUB -o %J.out
  #BSUB -e %J.err
  #BSUB -n 9
  work=/home/drozdenko.b/hpc/matlab_dcs_test
  MATLAB_infile=parfor_parallel
  cd $work
  matlab -logfile ./output.txt -nodisplay -r $MATLAB_infile
Always set -n to one more than the number of MATLAB workers your code expects.
Submit the job, then check the job and its output.
  > bsub < bsub_parfor.bash
  > bjobs -w
  Output is something like: JOBID 36768 / USER drozdenko.b / STAT RUN…
  > bpeek (if bjobs shows that your job is still running)
  > cat output.txt (once bjobs shows that your job is finished)

Running Jobs in Batch Mode from MATLAB GUI on the Discovery Cluster
Start an interactive session with X11 forwarding on the ht-10g queue.
  > bsub -Is -XF -n 1 -q ht-10g /bin/bash
  Output is something like: <<Starting on compute-0-006>>
Ensure that the modules needed for MATLAB are loaded.
  > module list
Run MATLAB.
  > matlab &
Configure the cluster profile settings.
  >> configCluster('discovery');
  >> ClusterInfo.setQueueName('ht-10g')
  >> ClusterInfo.setProcsPerNode(16)
Submit a batch SPMD job using the batch function.
  >> j = batch('spmd_parallel','matlabpool',16);
Check the job state, wait for the job to finish, check the diary, and fetch the outputs.
  >> j.State
  >> j.wait
  >> j.diary
  >> out = j.fetchOutputs{:}

GPU Setup on Discovery Cluster
Load the required CUDA modules (in addition to the already loaded MATLAB R2013b modules).
  > module whatis cuda-5.5
It is best to add the following lines to your ~/.bashrc file (if they are not already there):
  module load gnu-4.4-compilers
  module load fftw
  module load platform-mpi
  module load cuda-5.5
Start an interactive session with X11 forwarding on the par-gpu queue.
  > bsub -Is -XF -n 1 -q par-gpu-2 /bin/bash
  Output: <<Starting on compute-2-160>>
Run MATLAB.
  > matlab &
Confirm that you are connected to a GPU device.
  >> gpuDeviceCount
  MATLAB command line output should be “ans = 1”.
Get GPU device information:
  >> d = gpuDevice
The output includes the following properties (among others):
  GPU Property          Value
  Name                  Tesla K20m/40m
  ComputeCapability     3.5
  SupportsDouble        1
  MaxThreadsPerBlock    1024
  MaxShmemPerBlock      49152
  MaxThreadBlockSize    [ ]
  MaxGridSize           [2.1475e ]

Run Built-in Functions on GPU from MATLAB GUI on the Discovery Cluster
Try running and timing the MATLAB built-in function fft with an array of 10 million random doubles.
  n = 1e7; r = rand(n,1);
  tic; rf = fft(r); toc;
  Output is like: Elapsed time is seconds.
Next, put the array on the GPU and run the GPU version of fft on the same array. When you are done with the gpuArray data, use the gather() function to transfer it back to your local workspace.
  tic; g=gpuArray(r); gf=fft(g); gg=gather(gf); toc;
  Output is like: Elapsed time is seconds.
Note that the GPU version runs slightly faster with an array of this size.
Refer to the MathWorks documentation for the latest list of built-in functions.
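A single tic/toc measurement mixes transfer time with compute time and can be noisy. As an alternative sketch, timeit and gputimeit (both available from R2013b) run the function several times and, for the GPU, wait for the device to finish before reporting. This assumes a supported GPU is selected; the variable names are illustrative.

```matlab
n = 1e7;
r = rand(n, 1);
tCpu = timeit(@() fft(r));                      % CPU FFT only
g = gpuArray(r);
tGpuCompute = gputimeit(@() fft(g));            % GPU FFT, data already on device
tGpuTotal = gputimeit(@() gather(fft(gpuArray(r))));  % includes both transfers
fprintf('CPU %.3f s, GPU compute %.3f s, GPU with transfers %.3f s\n', ...
    tCpu, tGpuCompute, tGpuTotal);
```

Comparing the last two numbers shows how much of the GPU time goes to moving data rather than computing, which is often the deciding factor for a single operation like this.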

Run CUDA PTX files on GPU from MATLAB GUI on the Discovery Cluster
Create a CUDA C kernel function. This add2 function adds two double vectors.
  __global__ void add2(double *v1, const double *v2)
  {
      int idx = threadIdx.x;
      v1[idx] += v2[idx];
  }
Next, compile the CUDA C kernel function using nvcc, producing only the .ptx file.
  > nvcc -ptx gpufcn.cu
In the MATLAB GUI, create a CUDAKernel object and set its properties.
  >> k = parallel.gpu.CUDAKernel('gpufcn.ptx','gpufcn.cu','add2');
  >> k.ThreadBlockSize = 128;
Call the feval function to run the CUDA kernel with gpuArray data.
  >> x1 = gpuArray(rand(n,1));
  >> x2 = gpuArray(rand(n,1));
  >> y = feval(k,x1,x2);
  >> yg = gather(y);
Refer to the MathWorks documentation for more detailed instructions.

Cleanup on Discovery Cluster
Close any interactive MATLAB GUI windows.
  >> exit
Check whether you have any other processes still running on each compute node.
  > ps
Ignore ps and bash. For other listed processes, use the kill command.
  > kill 29777
Exit each interactive session on a compute node.
  > exit (each compute node)
Check whether you have any remaining jobs still running or queued.
  > bjobs -w
Ignore jobs on a Discovery login node where JOB_NAME is /bin/bash. For all other listed jobs, use the bkill command with the job ID.
  > bkill <JOBID>
When finished, exit each interactive session and exit Discovery.
  > exit (the Discovery cluster)

Conclusion
Use Parallel Computing Toolbox on your local machine to prototype your parallel algorithms.
Use parfor to implement task-parallel algorithms (as with OpenMP) and spmd to implement data-parallel algorithms (as with MPI).
Move your parallel algorithms onto the Discovery Cluster to see significant speedup using MDCS.
You can also perform GPU computing on the Discovery Cluster using gpuArray objects, built-in functions, CUDA PTX files, and gather.
Questions?
