1
Programming for Parallelism and Locality with Hierarchically Tiled Arrays
Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua, Christoph von Praun* University of Illinois, Urbana-Champaign, *IBM T. J. Watson Research Center, +Universidade de Coruna
2
Motivation: the Memory Hierarchy
The memory hierarchy determines performance and is increasingly complex: registers – cache (L1..L3) – memory – distributed memory.
There is a gap between programs and their mapping onto the memory hierarchy: the programmer must understand how data flows from registers to cache to memory to distributed memory.
Given the importance of the memory hierarchy, programming constructs that allow the control of locality are necessary.
Some algorithms are better described in terms of tiles.
3
Contributions
Design of a programming paradigm that facilitates control of locality (in sequential and parallel programs) and communication (in parallel programs).
Implementations in MATLAB, C++ and X10.
Implementation of the NAS benchmarks in our paradigm.
Evaluation of productivity and performance.
4
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
5
Hierarchically Tiled Arrays (HTA)
Tiled array: an array partitioned into tiles.
Hierarchically tiled array: a tiled array whose tiles are in turn HTAs.
6
Hierarchically Tiled Arrays
An example with three levels of tiling: 2 × 2 tiles at the outer (distributed) level, then 4 × 4 tiles, then 2 × 2 tiles (locality levels).
7
Hierarchically Tiled Arrays
The same example with sizes: 2 × 2 tiles of 64 × 64 elements each (distributed level); 4 × 4 tiles of 32 × 32 elements; 2 × 2 tiles of 8 × 8 elements (locality levels).
8
Hierarchically Tiled Arrays
The same example with its mapping: the 2 × 2 tiles of 64 × 64 elements map to distinct modules of a cluster; the 4 × 4 tiles of 32 × 32 elements map to the L1 cache; the 2 × 2 tiles of 8 × 8 elements map to registers.
9
Why tiles as first-class objects?
Many algorithms are formulated in terms of tiles, e.g. linear algebra algorithms (in LAPACK, ATLAS).
What is special about HTAs in particular? HTA provides a convenient way for the programmer to express these kinds of algorithms, and its recursive nature makes it easy to fit any memory hierarchy.
10
HTA Construction
hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])
Example: ha = hta(a, {[1,4],[1,3]}, [2,2]); partitions the matrix a into 2 × 2 tiles and maps them onto a 2 × 2 mesh of processors P1–P4.
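A minimal sketch in the slides' MATLAB-extension notation; the 6 × 6 source matrix and the reading of the partition vectors as tile start indices are assumptions made for illustration:
  a  = rand(6, 6);                     % an ordinary MATLAB matrix (size assumed)
  ha = hta(a, {[1,4], [1,3]}, [2,2]);  % rows split at 1 and 4, columns at 1 and 3,
                                       % giving 2 x 2 tiles mapped onto a 2 x 2 processor mesh
  ha{2, 1}                             % the tile holding rows 4:6 and columns 1:2 under this reading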
11
HTA Addressing
12
HTA Addressing
Tiles are addressed hierarchically: h{1,1:2}, h{2,1}.
13
HTA Addressing
Elements can also be addressed in the flattened view, h(3,4), in addition to the hierarchical tile addressing h{1,1:2}, h{2,1}.
14
HTA Addressing
Hierarchical (h{1,1:2}, h{2,1}), flattened (h(3,4)), and hybrid (h{2,2}(1,2)) addressing can be combined; the hybrid form selects an element within a tile.
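A short sketch of the three addressing styles on a 2 × 2-tiled HTA h; the concrete positions are illustrative assumptions:
  h{1, 1:2}       % hierarchical: the first row of tiles
  h{2, 1}         % hierarchical: the tile in tile-row 2, tile-column 1
  h(3, 4)         % flattened: element (3,4) of the underlying array, ignoring the tiling
  h{2, 2}(1, 2)   % hybrid: element (1,2) inside tile {2,2}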
15
F90 Conformability
Two operands conform when
  size(rhs) == 1
or
  dim(lhs) == dim(rhs) .and. size(lhs) == size(rhs)
16
HTA Conformability
All conformable HTAs can be operated on using the primitive operations (add, subtract, etc.).
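A hedged sketch of what conformability permits, in the slides' notation; the operand names and tilings are assumptions:
  c = h + 1.0;    % a scalar conforms with any HTA (the size(rhs) == 1 rule)
  d = h + t;      % two identically tiled HTAs conform, so the addition is applied tile by tile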
17
HTA Assignment
h{:,:} = t{:,:} copies every tile of t into h; h{1,:} = t{2,:} copies the second row of t's tiles into the first row of h's tiles (':' is a triplet that selects all indices along a dimension).
When source and destination tiles are mapped to different processors, the assignment implies implicit communication.
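A small sketch of these assignments on distributed HTAs (slides' notation); constructing t to mirror h is an assumption:
  h = hta(a, {[1,4], [1,3]}, [2,2]);   % h distributed over a 2 x 2 processor mesh
  t = hta(b, {[1,4], [1,3]}, [2,2]);   % t tiled and distributed the same way
  h{:, :} = t{:, :};                   % tile-by-tile copy; each tile pair is co-located
  h{1, :} = t{2, :};                   % a row of tiles moves between processors (communication)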
18
Data parallel functions in F90
19
Map
r = map(@sin, h)
Map applies the function argument (here @sin) recursively to the tiles of the HTA.
20
Map r = map(@sin, h) (animation frame: @sin is applied recursively within each tile)
21
Map r = map(@sin, h) (animation frame: @sin reaches the elements of the leaf tiles)
22
Reduce
r = reduce(@max, h)
Reduce applies the operator (here @max) recursively over the tile hierarchy.
23
Reduce r = reduce(@max, h) (animation frame: partial results are combined across tiles)
24
Reduce r = reduce(@max, h) (animation frame: the reduction completes across the hierarchy)
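A combined sketch in the slides' notation; composing the two operations this way is an illustrative assumption:
  r = map(@sin, h);      % apply sin to every element, recursing through the tiles
  m = reduce(@max, r);   % combine the results with max, recursing through the tiles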
25
Higher-level operations
repmat(h, [1, 3]) replicates the tile arrangement; circshift(h, [0, -1]) shifts the tiles circularly along the second dimension; transpose(h) transposes the tile arrangement. On distributed HTAs these operations involve communication.
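A brief usage sketch; the variable names are illustrative assumptions:
  hr = repmat(h, [1, 3]);       % three copies of h's tiles along the second dimension
  hs = circshift(h, [0, -1]);   % each tile receives its right neighbor's tile (circularly);
                                % on a distributed HTA this implies communication
  ht = transpose(h);            % transpose of the tile arrangement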
26
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
27
Blocked Recursive Matrix Multiplication
Blocked-recursive implementation. A{i,k}, B{k,j} and C{i,j} are sub-matrices of A, B and C.
function C = matmul(A, B, C)
  if (level(A) == 0)
    C = matmul_leaf(A, B, C);
  else
    for i = 1:size(A, 1)
      for k = 1:size(A, 2)
        for j = 1:size(B, 2)
          C{i,j} = matmul(A{i,k}, B{k,j}, C{i,j});
        end
      end
    end
  end
end
28
Matrix Multiplication
The recursion walks down the tile hierarchy (the outer level is distributed across processors; the inner levels exploit locality):
matmul(A, B, C)
matmul(A{i,k}, B{k,j}, C{i,j})
matmul(AA{i,k}, BB{k,j}, CC{i,j})
matmul_leaf(AAA{i,k}, BBB{k,j}, CCC{i,j})
29
Matrix Multiplication
The same recursion, matmul(A, B, C) down to matmul_leaf(AAA{i,k}, BBB{k,j}, CCC{i,j}), with the leaf-level computation:
for i = 1:size(A, 1)
  for k = 1:size(A, 2)
    for j = 1:size(B, 2)
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
30
SUMMA Matrix Multiplication
Outer-product method:
for k=1:n
  for i=1:n
    for j=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j);
    end
  end
end
The k loop can be rewritten as a sum of outer products:
for k=1:n
  C(:,:) = C(:,:) + A(:,k) * B(k,:);
end
31
SUMMA Matrix Multiplication
Here C{i,j}, A{i,k} and B{k,j} represent submatrices, and the * operator performs tile-by-tile matrix multiplication.
function C = summa(A, B, C)
  for k=1:m
    T1 = repmat(A{:,k}, 1, m);   % broadcast the k-th column of tiles of A
    T2 = repmat(B{k,:}, m, 1);   % broadcast the k-th row of tiles of B
    C = C + T1 * T2;
  end
end
32
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
33
Implementation
MATLAB extension with HTA: global view & single-threaded; the front-end is pure MATLAB; MPI calls are made through the MEX interface; MATLAB library routines run at the leaf level.
X10 extension: emulation only (no experimental results).
C++ extension with HTA (in progress): communication using MPI (& UPC).
34
Evaluation
Implemented the full set of NAS benchmarks (except SP), in pure MATLAB and in MATLAB + HTA.
Performance metrics: running time & relative speedup. Productivity metric: source lines of code.
Compared with the hand-optimized F77+MPI version; here tiles are used only for parallelism.
Matrix-matrix multiplication using HTAs in C++, where tiles are used for locality.
35
Illustrations
36
MG
3D stencil convolution: primitive HTA operations and assignments implement the communication (restrict and interpolate operators).
37
MG
Communication (boundary exchange between adjacent tiles):
  r{1,:}(2,:) = r{2,:}(1,:);
Computation (stencil update; n = 3 in the illustration):
  r{:,:}(2:n-1, 2:n-1) = 0.5 * u{:,:}(2:n-1, 2:n-1)
      * ( u{:,:}(1:n-2, 2:n-1) + u{:,:}(3:n, 2:n-1)
        + u{:,:}(2:n-1, 1:n-2) + u{:,:}(2:n-1, 3:n) );
38
CG
Sparse matrix-vector multiplication (A*p):
2D decomposition of A into tiles (A00 … A33), distributed across processors
replication and 2D decomposition of the transposed vector p (p0T … p3T)
sum reduction across columns
39
CG
Step 1: A = hta(MX, {partition_A}, [M N]);                       % tile MX and distribute it over the M x N processor mesh
Step 2: p = repmat(hta(vector', {partition_p}, [1 N]), [M, 1]);  % tile the transposed vector and replicate it over the M rows of processors
40
CG (continuing)
Step 3: p = map( … , p);   % per-tile operation on p; the function argument is missing in the extracted slide (the figure shows the tiles changing from transposed p0T…p3T to p0…p3)
41
CG (continuing)
Step 4: R = reduce('sum', A * p, 2, true);   % tile-by-tile product A*p followed by a sum reduction across columns
42
FT
3D Fourier transform; corner turn using dpermute; the Fourier transform is applied to each tile.
A 2D illustration:
h = ft(h, 1);
h = dpermute(h, [2 1]);
h = ft(h, 1);
43
IS
Bucket sort of integers; bucket exchange using assignments.
Steps: distribute keys, local bucket sort, aggregate, bucket exchange, local sort.
44
IS
Bucket exchange using assignments (h and r are distributed at the top level):
for i = 1:n
  for j = 1:n
    r{i}{j} = h{j}{i};
  end
end
45
Experimental Results
NAS benchmarks (MATLAB); results in the paper:
CG & FT: acceptable performance and speedup
MG, LU, BT: poor performance, but fair speedup
C++ results yet to be obtained.
46
Matrix Multiplication (C++ implementation)
47
Lines of Code (chart comparing source lines of code for EP, CG, MG, FT and LU)
48
Outline of the Talk
Motivation of our research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & future work
49
Conclusion
HTA is the first attempt to make tiles part of a language.
Treating tiling as a syntactic entity helps improve productivity.
Extending the vector operations to tiles makes programs more readable.
HTA provides a convenient way to express algorithms with a tiled formulation.
Future work: C++ library + NAS benchmarks; scalability experiments; evolving the language for locality.
50
THE END
51
Backup slides
52
/* Conjugate Gradient benchmark in C++ */
void conj_grad(SparseHTA A, HTA x, Matrix z) {
  z = 0.0; q = 0.0;
  r = x;
  p = r;
  rho = HTA::dotproduct(r, r);
  for (int i = 0; i < CGITMAX; i++) {
    qt = 0;
    pt = p.ltranspose();
    SparseHTA::lmatmul(A, pt, &qt);   /* compute A*x */
    qt = qt.reduce(plus<double>(), 0.0, 1, 0x2);
    q = qt.transpose();
    alpha = rho / HTA::dotproduct(p, q);
    z = z + alpha * p;
    r = r - alpha * q;
    rho0 = rho;
    rho = Matrix::dotproduct(r, r);
    beta = rho / rho0;
    p = r + beta * p;
  }
  /* compute the norm */
  qt = 0;
  HTA zt = z.ltranspose();
  SparseHTA::lmatmul(A, zt, &qt);
  r = qt.transpose();
  HTA f = x - r;
  f = f * f;
  norm = f.reduce(plus<double>(), 0.0);
  rnorm = sqrt(norm);
}
53
Related Work
Languages: ZPL, CAF, UPC, HPF, X10, Chapel (parallel MATLABs excluded).
Classification: global vs. segmented memory; single vs. multiple threads of execution; language vs. library.
54
Classification
Table placing each approach along the three axes (global view, single threaded, library): CAF, Titanium, UPC, X10, HPF, ZPL, POOMA, HTA, G.Arrays, POET, MPI/PVM.
55
Uni-processor performance challenges
MATLAB: limited conversion to vector format; interpretation costs due to loops; copy-on-write during assignments; poor data abstraction with higher dimensionality.
C++: run-time costs (e.g. virtual functions); extensible, polymorphic design; syntax constraints leading to proxies.
56
Multi-processor performance challenges
Eliminating temporaries in expression evaluation, e.g. stencil convolutions in MG.
Scalable communication operations, e.g. reduction & transpose in CG, permute in FT, cumsum & bucket exchange in IS.
Many-to-one and complex mappings, e.g. BT.
Overlapping computation & communication, e.g. LU.