Published by Felix Lawrence. Modified over 9 years ago.
1
Applying Data Copy To Improve Memory Performance of General Array Computations
Qing Yi
University of Texas at San Antonio
2
Improving Memory Performance
  [Diagram: two processors, each with registers, cache and memory, connected through shared memory or a network connection]
  Locality optimizations
  Parallelization
  Communication optimizations
3
Compiler Optimizations For Locality
  Computation optimizations
    Loop blocking, fusion, unroll-and-jam, interchange, unrolling
    Rearrange computations for better spatial and temporal locality
  Data-layout optimizations
    Static layout transformation
      A single layout for the entire application
      No additional overhead, but a tradeoff between different layout choices
    Dynamic layout transformation
      A different layout for each computation phase
      Flexible but could be expensive
  Combining computation and data transformations
    Static layout transformation
      Transform layout first, then computation
    Dynamic layout transformation
      Transform computation first, then dynamically re-arrange layout
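Loop blocking, the first computation optimization listed above, can be illustrated with a minimal sketch. The transpose kernel, the function name `transpose_blocked` and the tile-size parameter `bs` are assumptions chosen for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: blocked transpose of an n-by-n column-major matrix.
 * The (i, j) iteration space is visited one bs-by-bs tile at a time, so
 * the elements of `in` and `out` touched inside a tile stay close together. */
void transpose_blocked(size_t n, size_t bs, const double *in, double *out)
{
    assert(bs > 0 && in != NULL && out != NULL);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t jj = 0; jj < n; jj += bs)
            /* Inner loops cover one tile; the bounds clip partial tiles. */
            for (size_t i = ii; i < ii + bs && i < n; ++i)
                for (size_t j = jj; j < jj + bs && j < n; ++j)
                    out[j + i * n] = in[i + j * n];
}
```

A suitable tile size depends on the cache being targeted and would normally be tuned per machine.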
4
Array Copying: Related Work
  Dynamic layout transformation for arrays
    Copy arrays into local buffers before computation
    Copy modified local buffers back to array
  Previous work
    Lam, Rothberg and Wolf; Temam, Granston and Jalby
      Copy arrays after loop blocking
      Assume arrays are accessed through affine expressions
      Array copy always safe
    Static data layout transformations
      A single layout throughout the application: always safe
    Optimizing irregular applications
      Data access patterns not known until runtime
      Dynamic layout transformation through libraries
    Scalar replacement
      Equivalent to copying single array elements into scalars
      Carr and Kennedy: applied to inner loops
  Question: how expensive is array copy? How difficult is it?
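The remark that scalar replacement is equivalent to copying a single array element into a scalar can be seen in a small sketch. The kernel below (a column-major matrix-vector accumulation) and both function names are assumptions for illustration, not taken from the slides or from Carr and Kennedy's work. The rewrite is only valid if `y` does not overlap `A` or `x`.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: y[j] is read and written on every inner iteration. */
void matvec_plain(size_t m, size_t n, const double *A,
                  const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[j] = y[j] + A[i + j * m] * x[i];
}

/* Scalar replacement: y[j] is copied into the scalar y_j before the
 * inner loop and copied back once afterwards. Assumes y does not
 * alias A or x. */
void matvec_scalar(size_t m, size_t n, const double *A,
                   const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < n; ++j) {
        double y_j = y[j];              /* copy in  */
        for (size_t i = 0; i < m; ++i)
            y_j = y_j + A[i + j * m] * x[i];
        y[j] = y_j;                     /* copy out */
    }
}
```

The scalar acts as a one-element buffer, which is why the slides treat scalar replacement as a special case of array copy.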
5
What is new
  Array copy for arbitrary loop computations
    Stand-alone optimization independent of blocking
    Works for computations with non-affine array access
  General array copy algorithm
    Works on arbitrarily shaped loops
    Automatically inserts copy operations to ensure safety
    Heuristics to reduce buffer size and copy cost
    Performs scalar replacement when specialized
  Applications
    Improve cache and register locality
    Both when combined with blocking and without blocking
  Future work
    Communication and parallelization
    Empirical tuning of optimizations
    Interface that allows different application heuristics
6
Apply Array Copy: Matrix Multiplication
  Step 1: build dependence graph
    True, output, anti and input deps between array accesses
    Is each dependence precise?

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

  [Dependence graph nodes: C[i+j*m], A[i+k*m], B[k+j*l]]
7
Apply Array Copy: Matrix Multiplication
  Steps 2-3: separate array references connected by imprecise deps
    Impose an order on all array references
      C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
    Remove all back edges
    Apply typed fusion

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

  [Dependence graph nodes: C[i+j*m], A[i+k*m], B[k+j*l]]
8
Array Copy with Imprecise Dependences
  Array references connected by imprecise deps
    Cannot precisely determine a mapping between subscripts
    Sometimes may refer to the same location, sometimes not
    Not safe to copy into a single buffer
    Apply typed-fusion algorithm to separate them

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

  [Dependence graph nodes: C[f(i,j,m)], C[i+j*m], A[i+k*m], B[k+j*l]; edge label: Imprecise]
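Why references joined by an imprecise dependence cannot share one buffer can be shown with a reduced, one-dimensional sketch. Everything here is an assumption made for illustration: the index array `idx` stands in for the unknown subscript function `f`, and both function names are hypothetical. The sketch only demonstrates the hazard; it does not reproduce the typed-fusion separation the slides describe.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics: reads C[i], writes C[idx[i]]. Whether the two
 * subscripts name the same element is only known at run time. */
void update_direct(size_t n, double *C, const size_t *idx)
{
    assert(C != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        C[idx[i]] = C[i] + 1.0;
}

/* Unsafe variant: all reads are served from a snapshot taken before the
 * loop, so a write through C[idx[i]] is never seen by a later read of
 * the same element. `buf` must hold at least n doubles. */
void update_naive_copy(size_t n, double *C, const size_t *idx, double *buf)
{
    assert(C != NULL && idx != NULL && buf != NULL);
    for (size_t i = 0; i < n; ++i)
        buf[i] = C[i];
    for (size_t i = 0; i < n; ++i)
        C[idx[i]] = buf[i] + 1.0;
}
```

With `idx = {1, 2, 3, 0}` each iteration writes the element that the next iteration reads, so the two functions produce different arrays from the same input.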
9
Profitability of Applying Array Copy
  Each buffer should be copied at most twice (splitting groups)
  Determine outermost location to perform copy
    No interference with other groups
  Enforce size limit on the buffer
    Constant size => scalar replacement
  Ensure reuse of copied elements
    Lower the copy position if no extra reuse is gained

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

  [Diagram: loops j, k and i annotated with the locations to copy A, C and B]
10
Array Copy Result: Matrix Multiplication
  Dimensionality of buffer enforced by command-line options
  Can be applied to arbitrarily shaped loop structures
  Can be applied independently or after blocking

    Copy A[0:m,0:l] to A_buf;
    for (j=0; j<n; ++j) {
      Copy C[0:m, j*m] to C_buf;
      for (k=0; k<l; ++k) {
        Copy B[k+j*l] to B_buf;
        for (i=0; i<m; ++i)
          C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m]*B_buf;
      }
      Copy C_buf[0:m] back to C[0:m,j*m];
    }
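The pseudocode above can be written out as runnable C. This is a hand-written sketch, not the output of the compiler described in the slides. The function names `dgemm_ref` and `dgemm_copy`, the use of `malloc` and `memcpy`, and the error return are assumptions. The arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Original loop nest from the slides (column-major storage). */
void dgemm_ref(size_t m, size_t n, size_t l, double alpha,
               const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t k = 0; k < l; ++k)
            for (size_t i = 0; i < m; ++i)
                C[i + j * m] = C[i + j * m] + alpha * A[i + k * m] * B[k + j * l];
}

/* Same computation with the copies from the slide made explicit.
 * Returns 0 on success, -1 if a buffer cannot be allocated. */
int dgemm_copy(size_t m, size_t n, size_t l, double alpha,
               const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    if (m == 0 || n == 0 || l == 0)
        return 0;                            /* nothing to compute */

    double *A_buf = malloc(m * l * sizeof *A_buf);
    double *C_buf = malloc(m * sizeof *C_buf);
    if (A_buf == NULL || C_buf == NULL) {
        free(A_buf);
        free(C_buf);
        return -1;
    }

    memcpy(A_buf, A, m * l * sizeof *A_buf); /* copy A once, outside all loops */
    for (size_t j = 0; j < n; ++j) {
        memcpy(C_buf, C + j * m, m * sizeof *C_buf);   /* column j of C */
        for (size_t k = 0; k < l; ++k) {
            double B_buf = B[k + j * l];     /* one element: scalar replacement */
            for (size_t i = 0; i < m; ++i)
                C_buf[i] = C_buf[i] + alpha * A_buf[i + k * m] * B_buf;
        }
        memcpy(C + j * m, C_buf, m * sizeof *C_buf);   /* copy column back */
    }

    free(A_buf);
    free(C_buf);
    return 0;
}
```

Each buffer sits at a different loop level: `A_buf` is filled once, `C_buf` once per `j` iteration, and `B_buf` once per `k` iteration, matching the copy locations marked on the preceding slide.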
11
Experiments
  Implementation
    Loop transformation framework by Yi, Kennedy and Adve
    ROSE compiler infrastructure at LLNL (Quinlan et al.)
  Benchmarks
    dgemm (matrix multiplication, LAPACK)
    dgetrf (LU factorization with partial pivoting, LAPACK)
    tomcatv (mesh generation with Thompson solver, SPEC95)
  Machines
    A Dell PC with two 2.2GHz Intel Xeon processors
      512KB cache on each processor and 2GB memory
    An SGI workstation with a 195MHz R10000 processor
      32KB 2-way L1, 1MB 4-way L2, and 256MB memory
    A single 8-way P655+ node on an IBM terascale machine
      32KB 2-way L1, 0.75MB 4-way L2, 16MB 8-way L3, and 16GB memory
12
DGEMM on Dell PC
13
DGEMM on SGI Workstation
14
DGEMM on IBM
15
Summary
  When should we apply scalar replacement?
    Profitable unless register pressure is too high
    3-12% improvement observed
    No overhead
  When should we apply array copy?
    When regular conflict misses occur
    When prefetching of array elements is needed
    10-40% improvement observed
    Overhead is 0.5-8% when not beneficial
  Optimizations not beneficial on the IBM node
    Neither blocking nor 2-dim copying is profitable
    Integer operation overhead too high
    Will be further investigated