1
Programming for Parallelism and Locality with Hierarchically Tiled Arrays
Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua, Christoph von Praun* University of Illinois, Urbana-Champaign, *IBM T. J. Watson Research Center, +Universidade de Coruna
2
Motivation: the Memory Hierarchy
The memory hierarchy determines performance and is increasingly complex: registers – cache (L1..L3) – memory – distributed memory.
There is a gap between programs and their mapping onto the memory hierarchy: the programmer must understand how data flows from registers to cache to memory to distributed memory.
Given the importance of the memory hierarchy, programming constructs that allow the control of locality are necessary.
Some algorithms are better described in terms of tiles.
3
Contributions
Design of a programming paradigm that facilitates control of locality (in sequential and parallel programs) and communication (in parallel programs).
Implementations in MATLAB, C++ and X10.
Implementation of the NAS benchmarks in our paradigm.
Evaluation of productivity and performance.
4
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
5
Hierarchically Tiled Arrays (HTA)
Tiled array: an array partitioned into tiles.
Hierarchically tiled array: a tiled array whose tiles are in turn HTAs.
6
Hierarchically Tiled Arrays
An example with three levels of tiling: 2 × 2 tiles at the outer (distributed) level, then 4 × 4 tiles, then 2 × 2 tiles (locality levels).
7
Hierarchically Tiled Arrays
The same example with sizes: 2 × 2 tiles of 64 × 64 elements each (distributed level); 4 × 4 tiles of 32 × 32 elements; 2 × 2 tiles of 8 × 8 elements (locality levels).
8
Hierarchically Tiled Arrays
The same example with its mapping: the 2 × 2 tiles of 64 × 64 elements map to distinct modules of a cluster; the 4 × 4 tiles of 32 × 32 elements map to the L1 cache; the 2 × 2 tiles of 8 × 8 elements map to registers.
9
Why tiles as first-class objects?
Many algorithms are formulated in terms of tiles, e.g. linear algebra algorithms (in LAPACK, ATLAS).
What is special about HTAs in particular? HTA provides a convenient way for the programmer to express these kinds of algorithms, and its recursive nature makes it easy to fit any memory hierarchy.
10
HTA Construction
hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])
Example: ha = hta(a, {[1,4],[1,3]}, [2,2]); partitions the matrix a into 2 × 2 tiles and maps them onto a 2 × 2 mesh of processors P1–P4.
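A minimal sketch in the slides' MATLAB-extension notation; the 6 × 6 source matrix and the reading of the partition vectors as tile start indices are assumptions made for illustration:
  a  = rand(6, 6);                     % an ordinary MATLAB matrix (size assumed)
  ha = hta(a, {[1,4], [1,3]}, [2,2]);  % rows split at 1 and 4, columns at 1 and 3,
                                       % giving 2 x 2 tiles mapped onto a 2 x 2 processor mesh
  ha{2, 1}                             % the tile holding rows 4:6 and columns 1:2 under this reading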
11
HTA Addressing
12
HTA Addressing
Tiles are addressed hierarchically: h{1,1:2}, h{2,1}.
13
HTA Addressing
Elements can also be addressed in the flattened view, h(3,4), in addition to the hierarchical tile addressing h{1,1:2}, h{2,1}.
14
HTA Addressing
Hierarchical (h{1,1:2}, h{2,1}), flattened (h(3,4)), and hybrid (h{2,2}(1,2)) addressing can be combined; the hybrid form selects an element within a tile.
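A short sketch of the three addressing styles on a 2 × 2-tiled HTA h; the concrete positions are illustrative assumptions:
  h{1, 1:2}       % hierarchical: the first row of tiles
  h{2, 1}         % hierarchical: the tile in tile-row 2, tile-column 1
  h(3, 4)         % flattened: element (3,4) of the underlying array, ignoring the tiling
  h{2, 2}(1, 2)   % hybrid: element (1,2) inside tile {2,2}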
15
F90 Conformability
Two operands conform when
  size(rhs) == 1
or
  dim(lhs) == dim(rhs) .and. size(lhs) == size(rhs)
16
HTA Conformability
All conformable HTAs can be operated on using the primitive operations (add, subtract, etc.).
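A hedged sketch of what conformability permits, in the slides' notation; the operand names and tilings are assumptions:
  c = h + 1.0;    % a scalar conforms with any HTA (the size(rhs) == 1 rule)
  d = h + t;      % two identically tiled HTAs conform, so the addition is applied tile by tile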
17
HTA Assignment
h{:,:} = t{:,:} copies every tile of t into h; h{1,:} = t{2,:} copies the second row of t's tiles into the first row of h's tiles (':' is a triplet that selects all indices along a dimension).
When source and destination tiles are mapped to different processors, the assignment implies implicit communication.
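A small sketch of these assignments on distributed HTAs (slides' notation); constructing t to mirror h is an assumption:
  h = hta(a, {[1,4], [1,3]}, [2,2]);   % h distributed over a 2 x 2 processor mesh
  t = hta(b, {[1,4], [1,3]}, [2,2]);   % t tiled and distributed the same way
  h{:, :} = t{:, :};                   % tile-by-tile copy; each tile pair is co-located
  h{1, :} = t{2, :};                   % a row of tiles moves between processors (communication)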
18
Data parallel functions in F90
19
Map
r = map(@sin, h)
Map applies the function argument (here @sin) recursively to the tiles of the HTA.
20
Map r = map(@sin, h) (animation frame: @sin is applied recursively within each tile)
21
Map r = map(@sin, h) (animation frame: @sin reaches the elements of the leaf tiles)
22
Reduce
r = reduce(@max, h)
Reduce applies the operator (here @max) recursively over the tile hierarchy.
23
Reduce r = reduce(@max, h) (animation frame: partial results are combined across tiles)
24
Reduce r = reduce(@max, h) (animation frame: the reduction completes across the hierarchy)
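A combined sketch in the slides' notation; composing the two operations this way is an illustrative assumption:
  r = map(@sin, h);      % apply sin to every element, recursing through the tiles
  m = reduce(@max, r);   % combine the results with max, recursing through the tiles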
25
Higher-level operations
repmat(h, [1, 3]) replicates the tile arrangement; circshift(h, [0, -1]) shifts the tiles circularly along the second dimension; transpose(h) transposes the tile arrangement. On distributed HTAs these operations involve communication.
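A brief usage sketch; the variable names are illustrative assumptions:
  hr = repmat(h, [1, 3]);       % three copies of h's tiles along the second dimension
  hs = circshift(h, [0, -1]);   % each tile receives its right neighbor's tile (circularly);
                                % on a distributed HTA this implies communication
  ht = transpose(h);            % transpose of the tile arrangement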
26
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
27
Blocked Recursive Matrix Multiplication
Blocked-recursive implementation. A{i,k}, B{k,j} and C{i,j} are sub-matrices of A, B and C.
function C = matmul(A, B, C)
  if (level(A) == 0)
    C = matmul_leaf(A, B, C);
  else
    for i = 1:size(A, 1)
      for k = 1:size(A, 2)
        for j = 1:size(B, 2)
          C{i,j} = matmul(A{i,k}, B{k,j}, C{i,j});
        end
      end
    end
  end
end
28
Matrix Multiplication
The recursion walks down the tile hierarchy (the outer level is distributed across processors; the inner levels exploit locality):
matmul(A, B, C)
matmul(A{i,k}, B{k,j}, C{i,j})
matmul(AA{i,k}, BB{k,j}, CC{i,j})
matmul_leaf(AAA{i,k}, BBB{k,j}, CCC{i,j})
29
Matrix Multiplication
The same recursion, matmul(A, B, C) down to matmul_leaf(AAA{i,k}, BBB{k,j}, CCC{i,j}), with the leaf-level computation:
for i = 1:size(A, 1)
  for k = 1:size(A, 2)
    for j = 1:size(B, 2)
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
30
SUMMA Matrix Multiplication
Outer-product method:
for k=1:n
  for i=1:n
    for j=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j);
    end
  end
end
The k loop can be rewritten as a sum of outer products:
for k=1:n
  C(:,:) = C(:,:) + A(:,k) * B(k,:);
end
31
SUMMA Matrix Multiplication
Here C{i,j}, A{i,k} and B{k,j} represent submatrices, and the * operator performs tile-by-tile matrix multiplication.
function C = summa(A, B, C)
  for k=1:m
    T1 = repmat(A{:,k}, 1, m);   % broadcast the k-th column of tiles of A
    T2 = repmat(B{k,:}, m, 1);   % broadcast the k-th row of tiles of B
    C = C + T1 * T2;
  end
end
32
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
33
Implementation
MATLAB extension with HTA: global view & single-threaded; the front-end is pure MATLAB; MPI calls are made through the MEX interface; MATLAB library routines run at the leaf level.
X10 extension: emulation only (no experimental results).
C++ extension with HTA (in progress): communication using MPI (& UPC).
34
Evaluation
Implemented the full set of NAS benchmarks (except SP), in pure MATLAB and in MATLAB + HTA.
Performance metrics: running time & relative speedup. Productivity metric: source lines of code.
Compared with the hand-optimized F77+MPI version; here tiles are used only for parallelism.
Matrix-matrix multiplication using HTAs in C++, where tiles are used for locality.
35
Illustrations
36
MG
3D stencil convolution: primitive HTA operations and assignments implement the communication (restrict and interpolate operators).
37
MG
Communication (boundary exchange between adjacent tiles):
  r{1,:}(2,:) = r{2,:}(1,:);
Computation (stencil update; n = 3 in the illustration):
  r{:,:}(2:n-1, 2:n-1) = 0.5 * u{:,:}(2:n-1, 2:n-1)
      * ( u{:,:}(1:n-2, 2:n-1) + u{:,:}(3:n, 2:n-1)
        + u{:,:}(2:n-1, 1:n-2) + u{:,:}(2:n-1, 3:n) );
38
CG
Sparse matrix-vector multiplication (A*p):
2D decomposition of A into tiles (A00 … A33), distributed across processors
replication and 2D decomposition of the transposed vector p (p0T … p3T)
sum reduction across columns
39
CG
Step 1: A = hta(MX, {partition_A}, [M N]);                       % tile MX and distribute it over the M x N processor mesh
Step 2: p = repmat(hta(vector', {partition_p}, [1 N]), [M, 1]);  % tile the transposed vector and replicate it over the M rows of processors
40
CG (continuing)
Step 3: p = map( … , p);   % per-tile operation on p; the function argument is missing in the extracted slide (the figure shows the tiles changing from transposed p0T…p3T to p0…p3)
41
CG (continuing)
Step 4: R = reduce('sum', A * p, 2, true);   % tile-by-tile product A*p followed by a sum reduction across columns
42
FT
3D Fourier transform; corner turn using dpermute; the Fourier transform is applied to each tile.
A 2D illustration:
h = ft(h, 1);
h = dpermute(h, [2 1]);
h = ft(h, 1);
43
IS
Bucket sort of integers; bucket exchange using assignments.
Steps: distribute keys, local bucket sort, aggregate, bucket exchange, local sort.
44
IS
Bucket exchange using assignments (h and r are distributed at the top level):
for i = 1:n
  for j = 1:n
    r{i}{j} = h{j}{i};
  end
end
45
Experimental Results
NAS benchmarks (MATLAB); results in the paper:
CG & FT: acceptable performance and speedup
MG, LU, BT: poor performance, but fair speedup
C++ results yet to be obtained.
46
Matrix Multiplication (C++ implementation)
47
Lines of Code (chart comparing source lines of code for EP, CG, MG, FT and LU)
48
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
49
Conclusion
HTA is the first attempt to make tiles part of a language.
Treating tiling as a syntactic entity helps improve productivity.
Extending the vector operations to tiles makes programs more readable.
HTA provides a convenient way to express algorithms with a tiled formulation.
Future work: C++ library + NAS benchmarks; scalability experiments; evolving the language for locality.
50
THE END
51
Backup slides
52
/* Conjugate Gradient benchmark in C++ */
void conj_grad(SparseHTA A, HTA x, Matrix z) {
  z = 0.0; q = 0.0;
  r = x;
  p = r;
  rho = HTA::dotproduct(r, r);
  for (int i = 0; i < CGITMAX; i++) {
    qt = 0;
    pt = p.ltranspose();
    SparseHTA::lmatmul(A, pt, &qt);   /* compute A*x */
    qt = qt.reduce(plus<double>(), 0.0, 1, 0x2);
    q = qt.transpose();
    alpha = rho / HTA::dotproduct(p, q);
    z = z + alpha * p;
    r = r - alpha * q;
    rho0 = rho;
    rho = Matrix::dotproduct(r, r);
    beta = rho / rho0;
    p = r + beta * p;
  }
  /* compute the norm */
  qt = 0;
  HTA zt = z.ltranspose();
  SparseHTA::lmatmul(A, zt, &qt);
  r = qt.transpose();
  HTA f = x - r;
  f = f * f;
  norm = f.reduce(plus<double>(), 0.0);
  rnorm = sqrt(norm);
}
53
Related Work
Languages: ZPL, CAF, UPC, HPF, X10, Chapel (parallel MATLABs excluded).
Classification: global vs. segmented memory; single vs. multiple threads of execution; language vs. library.
54
Classification
Table placing each approach along the three axes (global view, single threaded, library): CAF, Titanium, UPC, X10, HPF, ZPL, POOMA, HTA, G.Arrays, POET, MPI/PVM.
55
Uni-processor performance challenges
MATLAB: limited conversion to vector format; interpretation costs due to loops; copy-on-write during assignments; poor data abstraction with higher dimensionality.
C++: run-time costs (e.g. virtual functions); extensible, polymorphic design; syntax constraints leading to proxies.
56
Multi-processor performance challenges
Eliminating temporaries in expression evaluation, e.g. stencil convolutions in MG.
Scalable communication operations, e.g. reduction & transpose in CG, permute in FT, cumsum & bucket exchange in IS.
Many-to-one and complex mappings, e.g. BT.
Overlapping computation & communication, e.g. LU.