1
What does it take to produce near-peak Matrix-Matrix Multiply
Kamen Yotov, Department of Computer Science, Cornell University. CS612 Lecture, 02/08/05
2
Approaches to fast MMM
Traditional approach:
  Hand-optimized code: e.g. most native BLAS
  Problem: costly and tedious to develop
Alternatives:
  Restructuring compilers
    General purpose: generate code from high-level specifications
    Use architectural models to determine optimization parameters
    Performance of the optimized code is not satisfactory
  Library generators
    Problem-specific: e.g. ATLAS for BLAS, FFTW for FFT
    Use empirical optimization to determine optimization parameters
    Believed to produce optimal code
3
Outline
  ATLAS
    Hardware parameters
    Transformations
    Optimization parameters
  Modeling
  Experimental Results
4
Architecture
5
Architecture
6
Measure Machine Parameters
Empirical optimization
  L1Size: similar to the Saavedra benchmark (a sketch of such a probe follows below)
  NR, FMA, Lh
  Not exact; used to bound the search space
Model-driven optimization
  Needs exact values
  Uses X-Ray
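X-Ray itself is not shown here. As a rough, assumption-laden illustration of a Saavedra-style measurement (not X-Ray's code), the probe below times a random pointer chase over working sets of increasing size; the last size before the first large jump in latency estimates L1Size.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Hypothetical Saavedra-style probe (illustration only): time a random
     pointer-chasing walk over a working set of `bytes` bytes. The average
     latency jumps once the working set stops fitting in the L1 cache. */
  static double ns_per_access(size_t bytes)
  {
      size_t n = bytes / sizeof(void *);
      void **buf = malloc(n * sizeof(void *));
      size_t *idx = malloc(n * sizeof(size_t));
      size_t i, iters = 20u * 1000 * 1000;

      /* Fisher-Yates shuffle of the visit order, then link the elements into
         one cycle so the chase touches every word and defeats prefetching. */
      for (i = 0; i < n; i++) idx[i] = i;
      for (i = n - 1; i > 0; i--) {
          size_t j = (size_t)rand() % (i + 1);
          size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
      }
      for (i = 0; i < n; i++)
          buf[idx[i]] = &buf[idx[(i + 1) % n]];

      void **p = &buf[idx[0]];
      clock_t t0 = clock();
      for (i = 0; i < iters; i++)
          p = (void **)*p;
      clock_t t1 = clock();

      volatile void *sink = p; (void)sink;   /* keep the walk from being optimized away */
      free(idx);
      free(buf);
      return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters;
  }

  int main(void)
  {
      /* Sweep working-set sizes; print ns per access for each. */
      for (size_t kb = 4; kb <= 512; kb *= 2)
          printf("%4zu KB: %6.2f ns/access\n", kb, ns_per_access(kb * 1024));
      return 0;
  }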
7
Architecture
8
Code Generation: Compiler Perspective
ATLAS is not a compiler, but the Code Generator's output is equivalent to applying:
  Cache-level blocking (tiling): NB
  Register-level blocking: MU, NU, KU
  Scheduling + software pipelining: FMA, LS, FF, IF, NF
  Versioning
  Back-end compiler optimizations
9
Cache-level blocking (tiling)
Tiling in ATLAS:
  Only square tiles (NB x NB x NB)
  The working set of a tile fits in L1
  Tiles are usually copied to contiguous storage
  Special "clean-up" code is generated for the boundaries
Mini-MMM (the full tiled MMM around it is sketched below):
  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j];
NB: optimization parameter
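For context, a compilable sketch (not ATLAS's generated code) of the MMM built from NB x NB x NB tiles, with the mini-MMM as the innermost three loops. It assumes row-major N x N matrices and N a multiple of NB; ATLAS additionally copies tiles to contiguous buffers and generates clean-up code for boundary tiles, both omitted here.

  /* Tiled MMM: loop over tiles, run a mini-MMM per tile triple. */
  void mmm_tiled(int N, int NB, const double *A, const double *B, double *C)
  {
      for (int j = 0; j < N; j += NB)
          for (int i = 0; i < N; i += NB)
              for (int k = 0; k < N; k += NB)
                  /* mini-MMM on the (i,k) tile of A, (k,j) tile of B, (i,j) tile of C */
                  for (int jj = j; jj < j + NB; jj++)
                      for (int ii = i; ii < i + NB; ii++)
                          for (int kk = k; kk < k + NB; kk++)
                              C[ii * N + jj] += A[ii * N + kk] * B[kk * N + jj];
  }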
10
Register-level blocking
Register blocking is to the mini-MMM what cache blocking is to the MMM.
  Very important for performance: the register file is the highest level of the memory hierarchy.
  The register file is a form of cache:
    Fully associative
    Unit line size
    Optimal replacement
  The register file is under software control:
    It is not "transparent" to software like normal caches.
    Compilers allocate variables to registers, perform spills, etc.
    Compilers seldom allocate array elements to registers.
  To block for the register file, one needs to:
    Unroll the corresponding loops
    Scalarize array elements into variables
11
Register-level blocking (cont.)
Micro-MMM:
  A: MU x 1
  B: 1 x NU
  C: MU x NU
  Needs MU*NU + MU + NU registers
Unroll the loops by MU, NU, and KU.
Mini-MMM with the Micro-MMM inside (a concrete scalarized instance is sketched below):
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU)
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++)
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A's and B's and add to C's
      store C[i..i+MU-1, j..j+NU-1]
Special clean-up code is required if NB is not a multiple of MU, NU, KU.
MU, NU, KU: optimization parameters
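As an illustration (hypothetical, not ATLAS's emitted code), here is the mini-MMM after unrolling and scalarizing with MU = NU = 2 and KU = 1, using a fixed NB and C-style 2D tiles; the register tile of C lives in 8 scalars (MU*NU + MU + NU).

  /* Register-blocked mini-MMM with MU = NU = 2, KU = 1. The 2x2 tile of C is
     loaded into scalars, updated across the k loop, and stored once at the
     end. Assumes NB is a multiple of 2. */
  #define NB 60
  void mini_mmm_2x2(const double A[NB][NB], const double B[NB][NB], double C[NB][NB])
  {
      for (int j = 0; j < NB; j += 2)
          for (int i = 0; i < NB; i += 2) {
              /* load the 2x2 register tile of C */
              double c00 = C[i][j],     c01 = C[i][j + 1];
              double c10 = C[i + 1][j], c11 = C[i + 1][j + 1];
              for (int k = 0; k < NB; k++) {
                  /* load a 2x1 sliver of A and a 1x2 sliver of B */
                  double a0 = A[i][k], a1 = A[i + 1][k];
                  double b0 = B[k][j], b1 = B[k][j + 1];
                  /* MU*NU = 4 multiply-adds per iteration */
                  c00 += a0 * b0;  c01 += a0 * b1;
                  c10 += a1 * b0;  c11 += a1 * b1;
              }
              /* store the register tile back */
              C[i][j] = c00;        C[i][j + 1] = c01;
              C[i + 1][j] = c10;    C[i + 1][j + 1] = c11;
          }
  }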
12
Scheduling
  FMA present? If not, schedule the computation using LS: the MU*NU multiplies M1, M2, ..., M_(MU*NU) and the MU*NU adds A1, A2, ..., A_(MU*NU) are interleaved, with the adds skewed relative to the multiplies according to LS (the slide's diagram used Latency = 2).
  Then schedule the memory operations using IF, NF, FF: the MU + NU loads L1, L2, ..., L_(MU+NU) are split into IFetch loads and NFetch loads.
  LS, IF, NF, FF: optimization parameters
  (A schematic of one scheduled micro-MMM iteration is sketched below.)
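A schematic of the idea (an illustration, not the ATLAS scheduler's actual output): the 2x2 micro-MMM from above rewritten for a machine without FMA, with each add issued two multiply slots after the multiply that produces its operand, as in the diagram's Latency = 2 case. The four loads at the top of the iteration stand in for the IFetch/NFetch load groups.

  /* Hypothetical scheduled variant of the 2x2 micro-MMM inner loop (no FMA),
     written as if LS = 2: each add consumes a product computed two multiply
     slots earlier, hiding a 2-cycle multiply latency. */
  #define NB 60
  void mini_mmm_2x2_scheduled(const double A[NB][NB], const double B[NB][NB], double C[NB][NB])
  {
      for (int j = 0; j < NB; j += 2)
          for (int i = 0; i < NB; i += 2) {
              double c00 = C[i][j],     c01 = C[i][j + 1];
              double c10 = C[i + 1][j], c11 = C[i + 1][j + 1];
              for (int k = 0; k < NB; k++) {
                  /* memory operations: the MU + NU = 4 loads for this iteration */
                  double a0 = A[i][k], a1 = A[i + 1][k];
                  double b0 = B[k][j], b1 = B[k][j + 1];
                  /* computation: multiplies M1..M4 interleaved with adds A1..A4 */
                  double m0 = a0 * b0;             /* M1 */
                  double m1 = a0 * b1;             /* M2 */
                  double m2 = a1 * b0; c00 += m0;  /* M3, A1 */
                  double m3 = a1 * b1; c01 += m1;  /* M4, A2 */
                  c10 += m2;                       /* A3 */
                  c11 += m3;                       /* A4 */
              }
              C[i][j] = c00;        C[i][j + 1] = c01;
              C[i + 1][j] = c10;    C[i + 1][j + 1] = c11;
          }
  }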
13
Code Generation: Optimization Parameters
NB: constrained by the size of the L1 cache
MU, NU: constrained by NR
KU: constrained by the size of the instruction cache
FF, IF, NF: constrained by the number of outstanding loads
FMA, LS: related to hardware parameters
Sensitive vs. insensitive parameters
Similar parameters are used by compilers
14
Architecture
15
ATLAS Search Strategy
A multi-dimensional optimization problem:
  Independent parameters: NB, MU, NU, KU, ...
  Dependent variable: MFLOPS
  The function from parameters to the variable is given only implicitly, but it can be evaluated repeatedly.
One optimization strategy: orthogonal line search
  Optimize along one dimension at a time.
  Use reference values for parameters not yet optimized.
  Not guaranteed to find the optimal point, but it may come close.
Specifying an orthogonal line search requires (a small sketch follows below):
  The order in which the dimensions are optimized
  Reference values for the un-optimized dimensions at any step
  The interval in which the line search is done for each dimension
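A minimal sketch of orthogonal line search (illustration only, not ATLAS's search driver; the toy objective, parameter ranges, and names are hypothetical stand-ins for "generate the mini-MMM and time it"):

  #include <stdio.h>

  #define NDIMS 3

  typedef double (*measure_fn)(const int params[NDIMS]);   /* returns e.g. MFLOPS */

  /* Orthogonal line search: optimize one parameter at a time, holding the
     others at their current (reference) values, and keep the best value
     found before moving on to the next dimension. */
  static void orthogonal_line_search(measure_fn measure, int params[NDIMS],
                                     const int lo[NDIMS], const int hi[NDIMS],
                                     const int step[NDIMS])
  {
      for (int d = 0; d < NDIMS; d++) {            /* dimensions in a fixed order */
          int best_val = params[d];
          double best_perf = measure(params);
          for (int v = lo[d]; v <= hi[d]; v += step[d]) {
              params[d] = v;
              double perf = measure(params);
              if (perf > best_perf) { best_perf = perf; best_val = v; }
          }
          params[d] = best_val;                    /* fix this dimension at its best value */
      }
  }

  /* Toy objective standing in for "run the generated mini-MMM and time it". */
  static double toy_mflops(const int p[NDIMS])
  {
      double x = p[0] - 40, y = p[1] - 4, z = p[2] - 4;
      return 1000.0 - x * x - 10.0 * y * y - 10.0 * z * z;
  }

  int main(void)
  {
      int params[NDIMS] = { 16, 1, 1 };   /* reference values for (NB, MU, NU) */
      const int lo[NDIMS] = { 16, 1, 1 }, hi[NDIMS] = { 80, 8, 8 }, step[NDIMS] = { 4, 1, 1 };
      orthogonal_line_search(toy_mflops, params, lo, hi, step);
      printf("best: NB=%d MU=%d NU=%d\n", params[0], params[1], params[2]);
      return 0;
  }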
16
Search for the best NB
Search in the range where NB² ≤ L1Size.
  In this search, use simple estimates for the other parameters.
  e.g. for KU: test each NB candidate with
    Full K unrolling (KU = NB)
    No K unrolling (KU = 1)
17
Search for other parameters
MU and NU (a sketch of enumerating the candidates follows below)
  Constraints: 1 ≤ MU, NU ≤ NB and MU*NU + MU + NU ≤ NR
  Use the best NB found in the previous step.
KU: KU = 1, 4, 8, ..., NB/2, and NB
LS: 1 ≤ LS ≤ 6
FF, IF, and NF
  FF = 0, 1
  2 ≤ IF ≤ MU + NU
  1 ≤ NF ≤ MU + NU - IF
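A small illustration (hypothetical, not ATLAS code) of enumerating the (MU, NU) candidates allowed by the register-file constraint; the NB and NR values are examples only:

  #include <stdio.h>

  /* Print every register-tile shape (MU, NU) that satisfies the search
     constraints 1 <= MU, NU <= NB and MU*NU + MU + NU <= NR. */
  int main(void)
  {
      const int NB = 60;   /* best NB from the previous search step (example value) */
      const int NR = 32;   /* number of floating-point registers (example value) */
      for (int mu = 1; mu <= NB; mu++)
          for (int nu = 1; nu <= NB; nu++)
              if (mu * nu + mu + nu <= NR)
                  printf("MU=%d NU=%d uses %d registers\n", mu, nu, mu * nu + mu + nu);
      return 0;
  }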
18
Architecture
19
Models for Optimization Parameters
NB: hierarchy of models (follows)
MU and NU: MU*NU + MU + NU + LS ≤ NR
KU: maximize, subject to the L1 instruction cache
LS: LS = (Lh * |ALU| + 1) / 2
FF, IF, NF: FF = 0, IF = 2, NF = 2
FMA: hardware parameter
(A toy computation of these estimates is sketched below.)
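A toy sketch of turning measured hardware parameters into estimates (my illustration of the equations above, not the actual model-driven code): the NB formula uses the simplest capacity model from the next slides, and the (MU, NU) choice simply maximizes the tile size under the register constraint, whereas the real model's tie-breaking may differ. The example values are assumptions.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      /* Hardware parameters (as measured, e.g., by X-Ray); example values only. */
      const int C1   = 32 * 1024 / 8;  /* L1 data cache capacity, in 8-byte words */
      const int NR   = 32;             /* number of floating-point registers */
      const int Lh   = 4;              /* latency of the FP multiply */
      const int nALU = 2;              /* number of FP units (|ALU|) */
      const int FMA  = 0;              /* 1 if a fused multiply-add exists */

      /* Simplest capacity model from the NB slides: 3 * NB^2 <= C1. */
      int NB = (int)floor(sqrt(C1 / 3.0));

      /* Skew between a multiply and its dependent add. */
      int LS = (Lh * nALU + 1) / 2;

      /* Largest register tile (by MU*NU) with MU*NU + MU + NU + LS <= NR. */
      int MU = 1, NU = 1;
      for (int mu = 1; mu <= NR; mu++)
          for (int nu = mu; nu <= NR; nu++)
              if (mu * nu + mu + nu + LS <= NR && mu * nu > MU * NU) { MU = mu; NU = nu; }

      printf("NB=%d MU=%d NU=%d LS=%d FF=0 IF=2 NF=2 FMA=%d\n", NB, MU, NU, LS, FMA);
      return 0;
  }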
20
Models for NB: No capacity / conflict misses
Tiles are copied to contiguous memory.
Simplest model: 3 * NB² ≤ L1Size
21
Models for NB: No capacity misses / some conflicts
MMM:
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
Cache model:
  Fully associative
  Line size: 1 word
  Optimal replacement
Bottom line: NB² + NB + 1 ≤ L1Size
  NB²: one full matrix
  NB: one row / column
  1: one element
22
Models for NB: Extension for Line Size
Spatial locality: the array layout in memory matters.
Bottom line (given as a formula on the slide; a hedged reconstruction follows below):
  This holds for the JIK loop order with column-major layout.
  ATLAS uses this order and layout for the mini-MMM.
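The slide's formula is not in the transcript. As a hedged reconstruction (an assumption, not taken from the slide), the usual refinement of the previous model for an L1 of capacity C_1 words and line size B_1 words counts cache lines instead of words, and reduces to NB² + NB + 1 ≤ C_1 when B_1 = 1:

  \left\lceil \frac{NB^2}{B_1} \right\rceil + \left\lceil \frac{NB}{B_1} \right\rceil + 1 \;\le\; \frac{C_1}{B_1}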
23
Models for NB: Extension for LRU Replacement
Intuition:
  LRU needs more storage than optimal replacement.
  Data needs to "cool down" in the cache before it is replaced.
  This means smaller tiles.
MMM sample:
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
Bottom line: (formula given on the slide)
24
Models for NB: Accounting for intra-level interactions
The mini-MMM is not scalar operations underneath: it is a sequence of micro-MMMs operating on register tiles.
So wherever we talked about rows and columns, we should instead consider vertical and horizontal panels of register tiles.
The model for NB can be refined accordingly (the refined inequality was given on the slide).
25
Models for NB: Summary
Models of increasing complexity:
  3 * NB² ≤ C — the whole working set fits in L1
  NB² + NB + 1 ≤ C — fully associative, optimal replacement, line size of 1 word
  Extension for line size > 1 word
  Extension for LRU replacement
  Multi-level interactions
Let's look at some experiments...
26
Models: Summary
This is the complete model (shown on the slide).
It contains a small refinement for MU and NU on x86.
27
Experimental Results
Ten modern architectures.
The model did well on:
  RISC architectures
  UltraSPARC: did better
The model did not do as well on:
  Itanium 2
  x86 (CISC) architectures
Sometimes there is a substantial gap between ATLAS CGw/S and ATLAS Unleashed.
28
Sensitivity Analysis
  Sensitivity graphs are invaluable for tracking down performance problems.
  Such graphs are useful for figuring out parameters in your homework!
  Example: sensitivity to NB (a sketch of such a sweep follows below)
    Use the ATLAS values for all parameters other than NB.
    Vary NB and see how performance is affected.
    This gives a 2D slice of a high-dimensional performance surface.
  Similar for the other parameters.
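A minimal sketch of such a sweep (illustration only; time_mini_mmm_mflops is a hypothetical hook that would run the mini-MMM generated with the given NB while every other parameter keeps its ATLAS-chosen value):

  #include <stdio.h>

  /* Placeholder for the real measurement: run the mini-MMM generated with
     this NB (all other parameters held at their ATLAS-chosen values) and
     return its performance in MFLOPS. A dummy value keeps the sketch
     self-contained. */
  static double time_mini_mmm_mflops(int NB)
  {
      (void)NB;
      return 0.0;   /* replace with a real timing run */
  }

  int main(void)
  {
      /* Sweep NB and print (NB, MFLOPS) pairs: one 2D slice of the
         high-dimensional performance surface, ready for plotting. */
      for (int NB = 16; NB <= 120; NB += 4)
          printf("%d %.1f\n", NB, time_mini_mmm_mflops(NB));
      return 0;
  }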
29
Example Sensitivity Graphs
30
Complete MMM Results
(Plots for Intel Itanium 2, AMD Opteron 240, Sun UltraSPARC IIIi, and SGI R12K.)