MATLAB HPCS Extensions


1 MATLAB HPCS Extensions
Presented by: David Padua University of Illinois at Urbana-Champaign

2 Senior Team Members
Gheorghe Almasi - IBM Research
Calin Cascaval - IBM Research
Basilio Fraguela - University of Illinois
Jose Moreira - IBM Research
David Padua - University of Illinois

3 Objectives
To develop MATLAB extensions for accessing, prototyping, and implementing scalable parallel algorithms. As a result, to give programmers of high-end machines access to all the powerful features of MATLAB: array operations and kernels, an interactive interface, and rendering.

4 Goals of the Design
A minimal extension: a natural extension to MATLAB that is easy to use.
Extensions for direct control of parallelism and communication, on top of the ability to access parallel library routines; it does not seem possible to encapsulate all the important parallelism in library routines.
Extensions that provide the necessary information and can be automatically and effectively analyzed for compilation and translation.

5 Uses of the Language Extension
Parallel library users. Parallel library developers. Input to a type-inference compiler. No existing MATLAB extension had the necessary characteristics.

6 Approach In our approach, the programmer interacts with a copy of MATLAB running on a workstation. The workstation controls parallel computation on servers.

7 Approach (Cont.) All conventional MATLAB operations are executed on the workstation. The parallel server operates on a new class of MATLAB objects: hierarchically tiled arrays (HTAs). HTAs are implemented as a MATLAB toolbox. This enables implementation as a language extension and simplifies porting to future versions of MATLAB.
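To make the toolbox idea concrete, here is a minimal sketch (using modern classdef syntax and invented names; it illustrates operator overloading, not the actual toolbox code): a new class holds the tiles and overloads MATLAB operators so that ordinary-looking expressions can be redirected to the parallel servers.

% Minimal sketch of the toolbox idea; class and property names are invented.
classdef htaSketch
    properties
        tiles   % cell array of tiles (locally), or handles to tiles on servers
    end
    methods
        function h = htaSketch(t)
            h.tiles = t;
        end
        function c = plus(a, b)
            % Overloaded +: applied tile by tile. A real implementation would
            % ship each tile-level addition to the server that owns the tile.
            c = htaSketch(cellfun(@plus, a.tiles, b.tiles, 'UniformOutput', false));
        end
    end
end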

8 Hierarchically Tiled Arrays
Array tiles are a powerful mechanism to enhance locality in sequential computations and to represent data distribution across parallel systems. Our main extension to MATLAB is arrays that are hierarchically tiled. Several levels of tiling are useful to distribute data across parallel machines with a hierarchical organization and to simultaneously represent both data distribution and memory layout. For example, a two-level hierarchy of tiles can be used to represent: the data distribution on a parallel system and the memory layout within each component.
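As an illustration only (the constructor name and argument conventions below are assumptions, not the documented interface), a 6 x 6 matrix might be split into a 2 x 2 arrangement of 3 x 3 tiles mapped onto a 2 x 2 processor mesh, with a second level of tiling describing the layout inside each of those tiles:

% Hypothetical sketch; hta(...) and its arguments are assumed for illustration.
M = reshape(1:36, 6, 6);            % ordinary MATLAB matrix
H = hta(M, {[1 4], [1 4]}, [2 2]);  % outer tiling: rows/cols split at 1 and 4,
                                    % tiles distributed on a 2 x 2 mesh
% A second tiling level inside each 3 x 3 tile would describe memory layout.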

9 Hierarchically Tiled Arrays (Cont.)
Computation and communication are represented as array operations on HTAs. Using array operations for communication and computation raises the level of abstraction and, at the same time, facilitates optimization.
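For example (a sketch in the spirit of the code on the following slides, assuming A, B, and C are HTAs distributed over the same mesh):

C{1,2} = A{2,1};            % tile assignment: if the two tiles live on different
                            % processors, this assignment is communication
C{:,:} = A{:,:} + B{:,:};   % whole-HTA operation: tile-wise parallel computation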

10 Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional arrays:

for I=1:q:n
   for J=1:q:n
      for K=1:q:n
         for i=I:I+q-1
            for j=J:J+q-1
               for k=K:K+q-1
                  C(i,j) = C(i,j) + A(i,k)*B(k,j);
               end
            end
         end
      end
   end
end

11 Using HTAs for Locality Enhancement
Tiled matrix multiplication using HTAs. Here C{i,j}, A{i,k}, and B{k,j} represent submatrices, and the * operator is MATLAB matrix multiplication:

for i=1:m
   for j=1:m
      for k=1:m
         C{i,j} = C{i,j} + A{i,k}*B{k,j};
      end
   end
end

12 Using HTAs to Represent Data Distribution and Parallelism
Cannon’s Algorithm. Initial skewed alignment of the A and B tiles on a 4 x 4 mesh; each node holds a matching A{i,k}, B{k,j} pair:
A{1,1} B{1,1}   A{1,2} B{2,2}   A{1,3} B{3,3}   A{1,4} B{4,4}
A{2,2} B{2,1}   A{2,3} B{3,2}   A{2,4} B{4,3}   A{2,1} B{1,4}
A{3,3} B{3,1}   A{3,4} B{4,2}   A{3,1} B{1,3}   A{3,2} B{2,4}
A{4,4} B{4,1}   A{4,1} B{1,2}   A{4,2} B{2,3}   A{4,3} B{3,4}
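In index form (a derived restatement of the layout above, not text from the slide), node (i,j) of the n x n mesh initially holds A{i,k} and B{k,j} with k = mod(i+j-2, n)+1:

% Derived check of the skewed alignment shown above (n = 4 mesh).
n = 4; i = 3; j = 2;
k = mod(i + j - 2, n) + 1;   % k = 4, matching the A{3,4} B{4,2} entry above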

13 Initial tile alignment, repeated from the previous slide: every node multiplies its matching A{i,k} and B{k,j} tiles.

14 Tile layout after the A tiles have been circularly shifted left by one position within each row:
A{1,2} B{1,1}   A{1,3} B{2,2}   A{1,4} B{3,3}   A{1,1} B{4,4}
A{2,3} B{2,1}   A{2,4} B{3,2}   A{2,1} B{4,3}   A{2,2} B{1,4}
A{3,4} B{3,1}   A{3,1} B{4,2}   A{3,2} B{1,3}   A{3,3} B{2,4}
A{4,1} B{4,1}   A{4,2} B{1,2}   A{4,3} B{2,3}   A{4,4} B{3,4}

15 Tile layout after the B tiles have also been circularly shifted up by one position within each column; each node again holds a matching A{i,k}, B{k,j} pair, ready for the next multiply step.

16 Cannon’s Algorithm in MATLAB with HPCS Extensions
C{1:n,1:n} = zeros(p,p);              % communication
for k=1:n
   C{:,:} = C{:,:} + A{:,:}*B{:,:};   % computation
   forall i=1:n
      A{i,1:n} = A{i,[2:n, 1]};       % communication
      B{1:n,i} = B{[2:n, 1],i};       % communication
   end
end

17 Cannon’s Algorithm in C + MPI
for (km = 0; km < m; km++) {
   char *chn = "T";
   dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz, b, lclMxSz, 1.0, c, lclMxSz);
   MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow, ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
   MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol, COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
   MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
   MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
   MPI_Wait(&requestrow, &status);
   aptr = a; a = abuf; abuf = aptr;
   MPI_Wait(&requestcol, &status);
   bptr = b; b = bbuf; bbuf = bptr;
}

18 Speedups on a four-processor IBM SP-2

19 Speedups on a nine-processor IBM SP-2

20 The SUMMA Algorithm (1 of 6)
Now use the outer-product method (n² parallelism). Interchanging the loop headers of the tiled HTA loop (slide 11) produces:

for k=1:n
   for i=1:n
      for j=1:n
         C{i,j} = C{i,j} + A{i,k}*B{k,j};
      end
   end
end

To obtain n² parallelism, the inner two loops should take the form of a block operation:

for k=1:n
   C{:,:} = C{:,:} + A{:,k} ⊗ B{k,:};
end

where the operator ⊗ represents the outer-product operation.
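For reference (ordinary MATLAB, not part of the slide), the outer-product update for step k of a plain matrix multiplication is the rank-1 update below; the block form above applies the same idea tile by tile across the mesh:

% Rank-1 (outer-product) update in ordinary MATLAB for step k:
C = C + A(:,k) * B(k,:);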

21 The SUMMA Algorithm (2 of 6)
Switch orientation: broadcast a column of A and a row of B to all nodes and compute the "next" term of every dot product at once. For example, the column [a11; a21] of A and the row [b11, b12] of B produce the partial products a11*b11, a11*b12, a21*b11, a21*b12, which update the corresponding elements of C.

22 The SUMMA Algorithm (3 of 6)
c{1:n,1:n} = zeros(p,p);                        % communication
for i=1:n
   t1{:,:} = spread(a(:,i), dim=2, ncopies=N);  % communication
   t2{:,:} = spread(b(i,:), dim=1, ncopies=N);  % communication
   c{:,:}  = c{:,:} + t1{:,:}*t2{:,:};          % computation
end
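The spread notation above is Fortran-style pseudocode for replication. As a clarifying aside (not from the slide), the same replication in ordinary MATLAB is:

% spread(a(:,i), dim=2, ncopies=N): N side-by-side copies of column i of a.
t1_flat = repmat(a(:,i), 1, N);
% spread(b(i,:), dim=1, ncopies=N): N stacked copies of row i of b.
t2_flat = repmat(b(i,:), N, 1);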

23 The SUMMA Algorithm (4 of 6)

24 The SUMMA Algorithm (5 of 6)

25 The SUMMA Algorithm (6 of 6)

26 Matrix Multiplication with Columnwise Block Striped Matrix
for i=0:p-1
   block(i) = [ i*n/p : (i+1)*n/p-1 ];
   M{i} = A(:,block(i));
   v{i} = b(block(i));
end
for i=0:p-1
   forall j=0:p-1
      c{j}{i} = M{j}(block(i),:)*v{j}(:);
   end
end
forall i=0:p-1; j=0:p-1
   d{i}{j} = c{j}{i};
end
forall i=0:p-1
   b{i} = sum(d{i}{:});
end
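As a sequential reference (not from the slide; assumes n is divisible by p and uses 1-based MATLAB indexing), the columnwise block-striped product that the code above distributes is simply:

% Sequential reference for the columnwise block-striped product A*b.
c_ref = zeros(n,1);
for i = 0:p-1
    blk = i*n/p+1 : (i+1)*n/p;          % columns owned by processor i
    c_ref = c_ref + A(:,blk) * b(blk);  % that processor's partial product
end
% c_ref now equals A*b.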

27 Flattening
Elements of an HTA are referenced using a tile index for each level in the hierarchy, followed by an array index. Each tile index tuple is enclosed in braces ({}) and the array index in parentheses. In the matrix multiplication code, C{i,j}(3,4) would refer to element (3,4) of submatrix {i,j}. Alternatively, the tiled array can be accessed as a flat array, as shown in the next slide. This feature is useful when an algorithm needs a global view of the array, and also when transforming sequential code into parallel form.
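As an illustration (assuming each tile of C is p x p and that flattened indexing addresses the underlying global array, as described above), the hierarchical and flattened views name the same element:

% Hierarchical view: tile {2,1}, element (3,4) inside that tile.
x = C{2,1}(3,4);
% Flattened view of the same element (each tile assumed to be p x p).
y = C((2-1)*p + 3, (1-1)*p + 4);
% x and y refer to the same underlying value.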

28 Two Ways of Referencing the Elements of an 8 x 8 Array.

29 Status We have completed the implementation of practically all of our initial language extensions (for IBM SP-2 and Linux Clusters). Following the toolbox approach has been a challenge, but we have been able to overcome all obstacles.

30 Conclusions We have developed parallel extensions to MATLAB.
It is possible to write highly readable parallel code for both dense and sparse computations with these extensions. The HTA objects and operations have been implemented as a MATLAB toolbox, which is what made it possible to provide them as a language extension.

