Parallel Computing in Matlab


Parallel Computing in Matlab

PCT: Parallel Computing Toolbox. Offload work from one MATLAB session (the client) to other MATLAB sessions (the workers). Run as many as eight MATLAB workers (R2010b) on your local machine in addition to your MATLAB client session. Recommendation: run no more than one worker per core.

MDCS: MATLAB Distributed Computing Server. Run as many MATLAB workers on a remote cluster of computers as your licensing allows. Run workers on your client machine if you want to run more than eight local workers (R2010b). The scheduler/job manager is dedicated to assigning tasks to the workers.

Installing MDCS

Typical Use Cases

Parallel for-loops: many iterations, or long iterations.
Batch jobs: when working interactively in a MATLAB session, you can offload work to a MATLAB worker session to run as a batch job. The command that submits the job is asynchronous, which means that your client MATLAB session is not blocked, and you can continue your own interactive session while the MATLAB worker is busy evaluating your code. The MATLAB worker can run either on the same machine as the client or, if using MATLAB Distributed Computing Server, on a remote cluster machine.
Large data sets.
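The batch workflow described above can be sketched as follows. This is a minimal, hypothetical example: the script name myScript is assumed, and destroy(job) is the cleanup call of the R2010b-era API (renamed delete(job) in later releases).

```matlab
% Sketch: run an existing script file, myScript.m, as an asynchronous batch job.
job = batch('myScript');   % returns immediately; the client session is not blocked
wait(job);                 % block only when the results are actually needed
load(job, 'A');            % copy variable A from the job's workspace to the client
destroy(job);              % free the job's resources (delete(job) in newer versions)
```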

Parfor: parallel for-loop. It has the same basic concept as "for". The parfor body is executed on the MATLAB client and workers. The necessary data on which parfor operates is sent from the client to the workers, and the results are sent back to the client and pieced together. MATLAB workers evaluate iterations in no particular order, and independently of each other.

Parfor

Serial version:
A = zeros(1024, 1);
for i = 1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A)

Parallelized version:
A = zeros(1024, 1);
matlabpool open local 4
parfor i = 1:1024
    A(i) = sin(i*2*pi/1024);
end
matlabpool close
plot(A)

Timing

Serial:
A = zeros(n, 1);
tic
for i = 1:n
    A(i) = sin(i);
end
toc

Parallel:
A = zeros(n, 1);
matlabpool open local 8
tic
parfor i = 1:n
    A(i) = sin(i);
end
toc

n           for (s)      parfor (s)
10000       0.003158     0.040542
1000000     0.080678     0.221070
100000000   23.161180    14.230125

When to Use Parfor? Each iteration must be independent of the other iterations. Parfor pays off for either lots of iterations of simple calculations, or a small number of long iterations.
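As an illustration (not from the slides) of the independence requirement, a loop like the following cannot be converted to parfor, because iteration i reads the result of iteration i-1:

```matlab
% Loop-carried dependence: each iteration needs the previous one,
% so the iterations cannot run in parallel and parfor rejects this form.
A = zeros(1, 10);
A(1) = 1;
for i = 2:10
    A(i) = 2 * A(i-1);
end
```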

Classification of Variables: loop variable, sliced input variable, sliced output variable, broadcast variable, reduction variable, temporary variable.
Temporary variable: its data is destroyed when the parfor loop ends.
Loop variable: its value is not retained after the parfor loop ends.
Sliced variable: can be operated on in parallel.
Reduction variable: in a parfor-loop, the value of z is never transmitted from client to workers or from worker to worker. Rather, additions of i are done in each worker, with i ranging over the subset of 1:n being performed on that worker. The results are then transmitted back to the client, which adds the workers' partial sums into z. Thus, workers do some of the additions, and the client does the rest.
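The reduction behavior described above can be seen in a minimal sketch (the names z and n follow the slide's description):

```matlab
% z is a reduction variable: each worker forms a partial sum over its
% share of 1:n, and the client adds the workers' partial sums into z.
n = 1000;
z = 0;
parfor i = 1:n
    z = z + i;
end
% z now equals sum(1:n)
```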

More Notes

Serial:
d = 0; i = 0;
for i = 1:4
    b = i;
    d = i*2;
    A(i) = d;
end
% afterwards: A = [2,4,6,8], d = 8, i = 4, b = 4

Parallel:
parfor i = 1:4
    b = i;
    d = i*2;
    A(i) = d;
end
% afterwards: A = [2,4,6,8]; d, i and b are not defined

A(i) is a sliced output variable; d and b are temporary variables; i is the loop variable. A variable can be passed from the client into the workers, but the loop cannot change its value on the client: when the loop finishes, the client's copy is unchanged. Temporary variables defined inside the parfor disappear when the loop ends (for example, if d = 0 is not defined before the parfor, the variable d does not exist afterwards).

More Notes

How to parallelize this loop?
C = 0;
for i = 1:m
    for j = i:n
        C = C + i * j;
    end
end
C is a reduction variable.
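A sketch of the answer (not spelled out on the slide): parallelize the outer loop, and C remains a reduction variable, so MATLAB combines the workers' partial sums automatically. The sizes m and n are assumed for illustration.

```matlab
m = 100; n = 100;      % example sizes (assumed)
C = 0;
parfor i = 1:m
    for j = i:n
        C = C + i * j;  % C is a reduction variable across the parfor
    end
end
```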

Parfor: Estimating an Integral

Parfor: Estimating an Integral

function q = quad_fun(m, n, x1, x2, y1, y2)
q = 0.0;
u = (x2 - x1)/m;
v = (y2 - y1)/n;
for i = 1:m
    x = x1 + u * i;
    for j = 1:n
        y = y1 + v * j;
        fx = x^2 + y^2;
        q = q + u * v * fx;
    end
end

Parfor: Estimating an Integral. Computational complexity: O(m*n). Each iteration is independent of the other iterations, so we can replace "for" with "parfor" for either loop index i or loop index j.

Parfor: Estimating an Integral

Outer loop parallelized:
function q = quad_fun(m, n, x1, x2, y1, y2)
q = 0.0;
u = (x2 - x1)/m;
v = (y2 - y1)/n;
parfor i = 1:m
    x = x1 + u * i;
    for j = 1:n
        y = y1 + v * j;
        fx = x^2 + y^2;
        q = q + u * v * fx;
    end
end

tic
A = quad_fun(m, n, 0, 3, 0, 3);
toc

Runtimes in seconds, for one client plus 0 to 4 workers:
(m, n)             1 + 0    1 + 1    1 + 2    1 + 3    1 + 4
(100, 100)         0.005    0.255    0.087    0.101    0.114
(1000, 1000)       0.035    0.066    0.046    0.045    0.053
(10000, 10000)     3.123    1.626    1.143    0.883
(100000, 100000)   308.282  309.926  157.393  108.819  85.185

Why does (1000, 1000) take less time than (100, 100)? It doesn't, really! How can "1 + 1" take longer than "1 + 0"? (It does, but it's probably not as bad as it looks!) Parallelism doesn't pay until your problem is big enough, and parallelism doesn't pay until you have a decent number of workers.

Parfor: Estimating an Integral

Inner loop parallelized:
function q = quad_fun(m, n, x1, x2, y1, y2)
q = 0.0;
u = (x2 - x1)/m;
v = (y2 - y1)/n;
for i = 1:m
    x = x1 + u * i;
    parfor j = 1:n
        y = y1 + v * j;
        fx = x^2 + y^2;
        q = q + u * v * fx;
    end
end

tic
A = quad_fun(m, n, 0, 3, 0, 3);
toc

Runtimes in seconds, for one client plus 0 to 4 workers:
(m, n)             1 + 0    1 + 1     1 + 2     1 + 3     1 + 4
(100, 100)         0.005    1.754     1.975     2.126     2.612
(1000, 1000)       0.035    13.146    15.286    18.661    22.313
(10000, 10000)     3.123    113.368   139.568   178.425   220.155
(100000, 100000)   308.282

Parallelizing the inner loop is far slower than parallelizing the outer loop: the parfor startup and communication cost is now paid on every one of the m outer iterations.

SPMD: Single Program, Multiple Data. The spmd command is like a very simplified version of MPI. The spmd statement lets you define a block of code to run simultaneously on multiple labs; each lab can have different, unique data for that code. Labs can communicate directly via messages, and they meet at synchronization points. The client program can examine or modify data on any lab.

SPMD Statement

SPMD: MATLAB sets up the requested number of labs, each with a copy of the program. Each lab "knows" it is a lab, and has access to two special functions: numlabs(), the number of labs, and labindex(), a unique identifier between 1 and numlabs().
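A minimal sketch of these two functions in use, with the matlabpool syntax of the MATLAB vintage the slides describe (R2010b era):

```matlab
% Each lab identifies itself inside the spmd block.
matlabpool open local 4
spmd
    fprintf('Lab %d of %d\n', labindex, numlabs);
end
matlabpool close
```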

SPMD

Distributed Arrays: distributed(). You can create a distributed array in the MATLAB client, and its data is stored on the labs of the open MATLAB pool. A distributed array is distributed in one dimension, along the last nonsingleton dimension, and as evenly as possible along that dimension among the labs. As with parfor, you cannot control the details of distribution when creating a distributed array. Although its data is stored in blocks on the different labs, a distributed array still behaves logically as one complete matrix.
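A small sketch of creating a distributed array from the client (the pool size and matrix are assumed for illustration):

```matlab
% Create a distributed array in the client; its data lives on the labs.
matlabpool open local 4
M = rand(8);             % an ordinary array in the client workspace
D = distributed(M);      % split along the last nonsingleton dimension
spmd
    L = getLocalPart(D); % each lab holds only its own block of columns
end
matlabpool close
```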

Distributed Arrays: codistributed(). You can create a codistributed array by executing on the labs themselves, either inside an spmd statement, in pmode, or inside a parallel job. When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions. codistributed() takes a matrix variable that is replicated identically on every lab and distributes it across the labs, saving storage.

Example: Trapezoid

Example: Trapezoid. To simplify things, we assume the interval is [0, 1], and we let each lab define a and b to mean the ends of its subinterval. If we have 4 labs, then lab number 3 will be assigned [½, ¾].

Example: Trapezoid
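The scheme just described can be sketched as follows. This is a minimal illustration, not the slides' own code: the integrand x.^2 and the 100 sample points per lab are assumptions.

```matlab
% Trapezoid rule under spmd: each lab integrates its own subinterval of [0,1].
f = @(x) x.^2;                    % integrand (assumed for illustration)
n = 100;                          % sample points per lab (assumed)
spmd
    a = (labindex - 1) / numlabs; % left end of this lab's subinterval
    b = labindex / numlabs;       % right end of this lab's subinterval
    x = linspace(a, b, n);
    fx = f(x);
    % trapezoid rule on this subinterval
    part = (b - a) / (n - 1) * (sum(fx) - 0.5*fx(1) - 0.5*fx(end));
end
total = sum([part{:}]);           % combine the labs' partial results on the client
```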

Pmode: parallel computing synchronously. pmode lets you work interactively with a parallel job running simultaneously on several labs. Commands you type at the pmode prompt in the Parallel Command Window are executed on all labs at the same time. Each lab executes the commands in its own workspace on its own variables. In pmode, each lab has its own window: you can type commands, see the results on every lab, and enter each lab's workspace. When an spmd block ends, its data and state survive and can be re-entered and reused later; when pmode exits, the job is destroyed and its data is gone, so restarting it is a fresh start. The way the labs remain synchronized is that each lab becomes idle when it completes a command or statement, waiting until all the labs working on this job have completed the same statement. Only when all the labs are idle do they then proceed together to the next pmode command.

pmode: parallel computing synchronously; each lab has a desktop; cannot freely interleave serial and parallel work.
spmd: parallel computing synchronously; no desktop for labs; can freely interleave serial and parallel work.

Pmode

Pmode: labindex() and numlabs() still work. Variables on different labs may share a name, but they are independent of each other.

Pmode: aggregate the array segments into a coherent array.
codist = codistributor1d(2, [2 2 2 2], [3 8])
whole = codistributed.build(segment, codist)
codistributor1d defines a 1-D distribution scheme for a codistributed array; codistributed.build is the constructor that assembles the segments.

Pmode: aggregate the array segments into a coherent array.
whole = whole + 1000
section = getLocalPart(whole)
getLocalPart retrieves the piece of the large matrix that is stored on each lab.

Pmode: aggregate the array segments into a coherent array.
combined = gather(whole)
gather() collects the pieces of a distributed array from the labs and returns the assembled array to the client.

Pmode: how to change the distribution?
distobj = codistributor1d()
I = eye(6, distobj)
getLocalPart(I)
distobj = codistributor1d(1);
I = redistribute(I, distobj)
getLocalPart(I)

GPU Computing

Capabilities:
Transferring data between the MATLAB workspace and the GPU
Evaluating built-in functions on the GPU
Running MATLAB code on the GPU
Creating kernels from PTX files for execution on the GPU
Choosing one of multiple GPU cards to use

Requirements:
NVIDIA CUDA-enabled device with compute capability of 1.3 or greater
NVIDIA CUDA device driver 3.1 or greater
NVIDIA CUDA Toolkit 3.1 (recommended) for compiling PTX files

GPU Computing: transferring data between the workspace and the GPU.
N = 6;
M = magic(N);    % create data in the client workspace
G = gpuArray(M); % transfer it to the GPU
M2 = gather(G);  % transfer it back to the workspace
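Built-in functions that support GPUArray inputs execute on the GPU automatically; a small sketch, using fft as one such function:

```matlab
G = gpuArray(rand(1024, 1)); % data now resides on the GPU
F = fft(G);                  % the built-in fft runs on the GPU; F is a GPUArray
f = gather(F);               % bring the result back to the client workspace
```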

GPU Computing: executing code on the GPU. You can transfer or create data on the GPU, and use the resulting GPUArray as input to the enhanced built-in functions that support it. You can also run your own MATLAB function file on the GPU:
result = arrayfun(@myFunction, arg1, arg2);
If either arg1 or arg2 is a GPUArray, the function executes on the GPU and returns a GPUArray. If neither input argument is a GPUArray, arrayfun executes on the CPU. Only element-wise operations are supported; arrayfun applies a function to each element of an array and is not itself specific to the GPU.
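A sketch of this pattern with an assumed element-wise function (an anonymous handle is used here for brevity; it assumes a PCT release that accepts anonymous functions in GPU arrayfun, whereas the slides' @myFunction would be a function file):

```matlab
myFun = @(a, b) a.*b + 1;          % element-wise operations only
g1 = gpuArray(rand(1000, 1));
g2 = gpuArray(rand(1000, 1));
r  = arrayfun(myFun, g1, g2);      % inputs are GPUArrays, so this runs on the GPU
rCpu = gather(r);                  % collect the result back on the CPU
```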

Review
What are the typical use cases of parallel MATLAB?
When should you use parfor?
What is the difference between a worker (parfor) and a lab (spmd)?
What is the difference between spmd and pmode?
How do you build a distributed array?
How do you use a GPU for MATLAB parallel computing?