What does it take to produce near-peak Matrix-Matrix Multiply


What does it take to produce near-peak Matrix-Matrix Multiply
Kamen Yotov, Department of Computer Science, Cornell University
CS612 Lecture, 02/08/05

Approaches to fast MMM
- Traditional approach: hand-optimized code, e.g. most native BLAS
  - Problem: costly and tedious to develop
- Alternatives:
  - Restructuring compilers
    - General-purpose: generate code from high-level specifications
    - Use architectural models to determine optimization parameters
    - Performance of the optimized code is not satisfactory
  - Library generators
    - Problem-specific: e.g. ATLAS for BLAS, FFTW for FFT
    - Use empirical optimization to determine optimization parameters
    - Believed to produce optimal code

Outline
- ATLAS
  - Hardware parameters
  - Transformations
  - Optimization parameters
- Modeling
- Experimental Results

Architecture
[Diagram of the ATLAS infrastructure; the talk returns to this figure to highlight each stage]

Measure Machine Parameters
- Empirical optimization:
  - L1Size – measured with a Saavedra-style benchmark (a sketch of such a capacity probe follows)
  - NR, FMA, Lh
  - Values need not be exact; they are only used to bound the search space
- Model-driven optimization:
  - Needs exact values
  - Uses X-Ray: http://iss.cs.cornell.edu/Software/X-Ray.aspx
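A minimal sketch, assuming only standard C, of the kind of Saavedra-style capacity probe meant above: it walks arrays of increasing size with a fixed stride and reports the average access time, which jumps once the working set no longer fits in L1. The real X-Ray tool is far more careful about timers, compiler interference, and noise.

    #include <stdio.h>
    #include <time.h>

    #define STRIDE 16          /* in ints; should exceed the cache line size */
    #define REPS   (1 << 24)   /* total accesses per measurement             */

    static int buf[1 << 22];   /* 16 MB, larger than any L1 cache            */

    int main(void) {
        for (int size = 1 << 10; size <= (1 << 22); size <<= 1) {
            volatile int sink = 0;
            clock_t t0 = clock();
            for (long r = 0, i = 0; r < REPS; r++) {
                sink += buf[i];               /* one load per iteration      */
                i += STRIDE;
                if (i >= size) i = 0;         /* wrap within the working set */
            }
            clock_t t1 = clock();
            printf("%8d ints: %.2f ns/access\n", size,
                   1e9 * (t1 - t0) / CLOCKS_PER_SEC / REPS);
            (void)sink;
        }
        return 0;
    }

The L1 capacity is read off the printed table as the largest size before the first jump in access time.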

Architecture
[ATLAS infrastructure diagram, revisited]

Code Generation: Compiler Perspective
- ATLAS is not a compiler, but its code generator's output is equivalent to:
  - Cache-level blocking (tiling): NB
  - Register-level blocking: MU, NU, KU
  - Scheduling + software pipelining: FMA, LS, FF, IF, NF
  - Versioning
  - Back-end compiler optimizations

Cache-level blocking (tiling)
- Tiling in ATLAS:
  - Only square tiles (NB x NB x NB)
  - The working set of a tile fits in L1
  - Tiles are usually copied to contiguous storage
  - Special "clean-up" code is generated for boundaries
- Mini-MMM (the surrounding tiled loops are sketched below):

    for (int j = 0; j < NB; j++)
      for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
          C[i][j] += A[i][k] * B[k][j];

- NB: optimization parameter
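For context, a minimal sketch of how the full MMM decomposes into mini-MMMs over tiles; the names and fixed sizes are illustrative, not ATLAS's actual generated code, and the copying of tiles into contiguous buffers is omitted.

    /* Illustrative only: full MMM as loops over NB x NB tiles, with the
       mini-MMM from the slide as the inner kernel. Assumes N is a
       multiple of NB; ATLAS generates clean-up code for the general case. */
    #define N  512
    #define NB 64

    static double A[N][N], B[N][N], C[N][N];

    static void mini_mmm(int i0, int j0, int k0) {
        for (int j = j0; j < j0 + NB; j++)
            for (int i = i0; i < i0 + NB; i++)
                for (int k = k0; k < k0 + NB; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    void mmm(void) {
        for (int j0 = 0; j0 < N; j0 += NB)
            for (int i0 = 0; i0 < N; i0 += NB)
                for (int k0 = 0; k0 < N; k0 += NB)
                    mini_mmm(i0, j0, k0);
    }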

Register-level blocking
- Register blocking is to mini-MMM what cache blocking is to MMM
- Very important for performance – the register file is the highest level of the memory hierarchy
- The register file is a form of cache:
  - Fully associative
  - Unit line size
  - Optimal replacement
- But the register file is under software control:
  - It is not "transparent" to software like normal caches
  - Compilers allocate variables to registers, perform spills, etc.
  - Compilers seldom allocate array elements to registers
- To block for the register file, one needs to:
  - Unroll the corresponding loops
  - Scalarize array elements into variables

Register-level blocking (cont.)
- Micro-MMM:
  - A: MU x 1, B: 1 x NU, C: MU x NU
  - Uses MU*NU + MU + NU registers
- Unroll loops by MU, NU, and KU
- Mini-MMM with micro-MMM inside (a scalarized example follows):

    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU) {
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++) {
          load A[i..i+MU-1, k] into registers
          load B[k, j..j+NU-1] into registers
          multiply A's and B's and add to C's
        }
        store C[i..i+MU-1, j..j+NU-1]
      }

- Special clean-up code is required if NB is not a multiple of MU, NU, KU
- MU, NU, KU: optimization parameters
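A concrete illustration of the scalarization, assuming MU = NU = 2 and a leading dimension fixed at 64 for the sketch: the C tile lives in scalar variables the compiler can keep in registers. This is a hand-written sketch, not ATLAS output; KU unrolling and instruction scheduling are omitted.

    /* Micro-MMM for MU = NU = 2: an MU x 1 strip of A times a 1 x NU
       strip of B, accumulated into a 2x2 register tile of C.
       Uses MU*NU + MU + NU = 8 scalars, matching the register count
       formula on the slide. */
    void micro_block(int NB, int i, int j,
                     const double A[][64], const double B[][64],
                     double C[][64]) {
        double c00 = C[i][j],   c01 = C[i][j+1];
        double c10 = C[i+1][j], c11 = C[i+1][j+1];
        for (int k = 0; k < NB; k++) {
            double a0 = A[i][k], a1 = A[i+1][k];   /* MU x 1 */
            double b0 = B[k][j], b1 = B[k][j+1];   /* 1 x NU */
            c00 += a0 * b0; c01 += a0 * b1;
            c10 += a1 * b0; c11 += a1 * b1;
        }
        C[i][j]   = c00; C[i][j+1]   = c01;
        C[i+1][j] = c10; C[i+1][j+1] = c11;
    }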

Scheduling
- Is FMA present?
  - If yes: schedule the computation as fused multiply-adds
  - If no: interleave the MU*NU multiplies (M1, M2, ..., M_MU*NU) with the MU*NU adds (A1, A2, ..., A_MU*NU), skewed by LS to cover latency (a schematic follows)
- Schedule memory operations (the MU+NU loads L1, ..., L_MU+NU) among the computation: IFetch loads first, then groups of NFetch loads, controlled by IF, NF, FF
- LS, IF, NF, FF: optimization parameters
[Slide diagram shows the interleaved schedule of multiplies, adds, and loads for Latency = 2]
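As a rough schematic of the no-FMA case, reusing the 2x2 micro-MMM sketch above: the multiplies are issued as an independent group before the adds that consume them, giving the pipeline room to overlap them with earlier adds. This only illustrates the idea; ATLAS interleaves the two sequences with a skew of LS and injects loads according to IF/NF/FF.

    /* k-loop body of the 2x2 micro-MMM with multiplies separated from
       adds (no FMA): the four products are independent and can fill the
       multiplier pipeline while prior adds retire. */
    double m00 = a0 * b0, m01 = a0 * b1;
    double m10 = a1 * b0, m11 = a1 * b1;
    c00 += m00; c01 += m01;
    c10 += m10; c11 += m11;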

Code Generation: Optimization Parameters
- NB: constrained by the size of the L1 cache
- MU, NU: constrained by NR
- KU: constrained by the size of the instruction cache
- FF, IF, NF: constrained by the number of outstanding loads
- FMA, LS: related to hardware parameters
- Sensitive vs. insensitive parameters
- Similar parameters are used by compilers

Architecture
[ATLAS infrastructure diagram, revisited]

ATLAS Search Strategy
- A multi-dimensional optimization problem:
  - Independent parameters: NB, MU, NU, KU, ...
  - Dependent variable: MFLOPS
  - The function from parameters to performance is given only implicitly, but it can be evaluated repeatedly
- One optimization strategy: orthogonal line search (sketched below)
  - Optimize along one dimension at a time
  - Use reference values for parameters not yet optimized
  - Not guaranteed to find the optimal point, but may come close
- Specifying an orthogonal line search requires:
  - The order in which dimensions are optimized
  - Reference values for un-optimized dimensions at each step
  - The interval over which the line search is done for each dimension
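A minimal sketch of orthogonal line search over the parameter space, assuming a hypothetical measurement routine measure_mflops() (in ATLAS this corresponds to compiling and timing a mini-MMM variant):

    /* Optimize each parameter in turn over its candidate values,
       holding the others at their current (reference) values. */
    #define NPARAMS 3

    extern double measure_mflops(const int params[NPARAMS]); /* assumed */

    void line_search(int params[NPARAMS],
                     const int *cands[NPARAMS], const int ncands[NPARAMS]) {
        for (int d = 0; d < NPARAMS; d++) {      /* one dimension at a time */
            int best = params[d];
            double best_perf = -1.0;
            for (int c = 0; c < ncands[d]; c++) {
                params[d] = cands[d][c];
                double perf = measure_mflops(params);
                if (perf > best_perf) { best_perf = perf; best = params[d]; }
            }
            params[d] = best;                    /* keep the winner */
        }
    }

Because each dimension is optimized only once, against reference values for the rest, the result depends on the chosen order and references; this is why the result is not guaranteed to be a global optimum.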

Search for best NB
- Search in the range NB² ≤ L1Size
- In this search, use simple estimates for the other parameters; e.g. test each NB candidate with two settings of KU:
  - Full K unrolling (KU = NB)
  - No K unrolling (KU = 1)

Search for other parameters
- MU and NU:
  - Constraints: 1 ≤ MU, NU ≤ NB and MU*NU + MU + NU ≤ NR
  - Use the best NB found in the previous step
- KU: KU = 1, 4, 8, ..., NB/2, and NB
- LS: 1 ≤ LS ≤ 6
- FF, IF, and NF:
  - FF = 0, 1
  - 2 ≤ IF ≤ MU + NU
  - 1 ≤ NF ≤ MU + NU - IF

Architecture
[ATLAS infrastructure diagram, revisited]

Models for Optimization Parameters
- NB: hierarchy of models (next slides)
- MU and NU: MU*NU + MU + NU + LS ≤ NR
- KU: maximize, subject to the unrolled loop body fitting in the L1 instruction cache
- LS: LS = (Lh * |ALU| + 1) / 2
- FF = 0, IF = 2, NF = 2
- FMA: hardware parameter
(A sketch computing these values from the hardware parameters follows.)
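A hedged sketch of how these closed-form models turn measured hardware parameters into optimization parameters. The names are illustrative, and the MU/NU choice shown is a simple "maximize the tile area subject to the register constraint" reading of the inequality, not the paper's exact procedure.

    /* Derive optimization parameters from hardware parameters.
       NR = number of registers, Lh = instruction latency, n_alu = #ALUs. */
    typedef struct { int MU, NU, KU, LS, FF, IF, NF; } Params;

    Params model_params(int NR, int Lh, int n_alu) {
        Params p;
        p.LS = (Lh * n_alu + 1) / 2;          /* skew between mult and add  */
        /* Maximize MU*NU subject to MU*NU + MU + NU + LS <= NR: */
        p.MU = p.NU = 1;
        for (int mu = 1; mu <= NR; mu++)
            for (int nu = 1; nu <= NR; nu++)
                if (mu * nu + mu + nu + p.LS <= NR && mu * nu > p.MU * p.NU) {
                    p.MU = mu; p.NU = nu;
                }
        p.KU = 1;   /* then raised until the loop body fills the I-cache    */
        p.FF = 0; p.IF = 2; p.NF = 2;
        return p;
    }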

Models for NB: No capacity / conflict misses
- Tiles are copied to contiguous memory
- Simplest model: 3 × NB² ≤ L1Size

Models for NB: No capacity misses / some conflicts
- MMM:

    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

- Cache model:
  - Fully associative
  - Line size: 1 word
  - Optimal replacement
- Bottom line: NB² + NB + 1 ≤ L1Size
  (one full matrix + one row/column + one element; a worked example follows)
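As a worked example of how much the two constraints differ, assume an 8 KB L1 cache holding 1024 doubles: 3·NB² ≤ 1024 admits NB = 18, while NB² + NB + 1 ≤ 1024 admits NB = 31. A small sketch that computes these bounds for any capacity C (in words):

    /* Largest NB admitted by each model for C words of L1 capacity. */
    int nb_three_tiles(int C) {            /* 3 * NB^2 <= C      */
        int nb = 1;
        while (3 * (nb + 1) * (nb + 1) <= C) nb++;
        return nb;
    }
    int nb_one_tile(int C) {               /* NB^2 + NB + 1 <= C */
        int nb = 1;
        while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= C) nb++;
        return nb;
    }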

Models for NB: Extension for Line Size
- Spatial locality: the array layout in memory matters
- Bottom line: count cache lines instead of words; with line size B (in words) and capacity C, the previous constraint becomes ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B
- This is for the JIK loop order with column-major layout; ATLAS uses this order and layout for the mini-MMM

Models for NB: Extension for LRU Replacement
- Intuition:
  - LRU needs more storage than optimal replacement
  - Data needs to "cool down" in the cache before it is replaced
  - This means smaller tiles
- MMM sample:

    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

- Bottom line: the constraint tightens further, shrinking the admissible NB

Models for NB: Accounting for intra-level interactions
- The mini-MMM is not scalar operations underneath: it is a sequence of micro-MMMs operating on register tiles
- So wherever we spoke of rows and columns, we should instead count vertical and horizontal panels of register tiles
- The model for NB can be refined accordingly, replacing single rows/columns with MU- and NU-wide panels

Models for NB: Summary
- Models of increasing complexity:
  1. 3 × NB² ≤ C — whole working set fits in L1
  2. NB² + NB + 1 ≤ C — fully associative, optimal replacement, line size: 1 word
  3. Line size > 1 word
  4. LRU replacement
  5. Multi-level interactions
- Let's look at some experiments...

Models: Summary
[Slide shows the complete model as a table of formulas]
- It contains a small refinement for MU and NU on x86

Experimental Results
- Ten modern architectures
- The model did well on RISC architectures (on UltraSPARC, the model did better)
- The model did not do as well on Itanium 2 and on x86 CISC architectures
- Sometimes there is a substantial gap between ATLAS CGw/S (code generation with search) and ATLAS Unleashed (hand-written kernels enabled)

Sensitivity Analysis
- Sensitivity graphs are invaluable for tracking down performance problems
- Such graphs are useful for figuring out parameters in your homework!
- Example: sensitivity to NB (a sweep sketch follows)
  - Use ATLAS values for all parameters other than NB
  - Vary NB and see how performance is affected
  - This gives a 2D slice of a high-dimensional performance surface
- Similar for the other parameters
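A sketch of producing one such 2D slice, reusing NPARAMS and the hypothetical measure_mflops() declared in the line-search sketch above (plus stdio.h for printf): hold all parameters at their ATLAS-chosen values and sweep NB alone.

    /* Sweep NB while the other parameters stay fixed, printing
       (NB, MFLOPS) pairs to plot as a sensitivity graph. */
    void nb_sensitivity(int params[NPARAMS], int nb_index) {
        for (int nb = 16; nb <= 128; nb += 4) {
            params[nb_index] = nb;
            printf("%d %.1f\n", nb, measure_mflops(params));
        }
    }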

Example Sensitivity Graphs
[Slide shows sensitivity graphs: performance as one parameter is varied, others held fixed]

Complete MMM Results
[Performance graphs for Intel Itanium 2, AMD Opteron 240, Sun UltraSPARC IIIi, and SGI R12K]