An Experimental Comparison of Empirical and Model-based Optimization
Keshav Pingali, Cornell University
Joint work with: Kamen Yotov 2, Xiaoming Li 1, Gang Ren 1, Michael Cibulskis 1, Gerald DeJong 1, Maria Garzaran 1, David Padua 1, Paul Stodghill 2, Peng Wu 3
(1 UIUC, 2 Cornell University, 3 IBM T.J. Watson)
Context: High-performance libraries
– Traditional approach: hand-optimized code, e.g. BLAS
    Problem: tedious to develop
– Alternatives:
    Restructuring compilers
      General purpose: generate code from high-level specifications
      Use architectural models to determine optimization parameters
      Performance of optimized code is not satisfactory
    Library generators
      Problem-specific: e.g. ATLAS for BLAS, FFTW for FFT
      Use empirical optimization to determine optimization parameters
      Believed to produce optimal code
– Why are library generators beating compilers?
How important is empirical search?
– Model-based optimization and empirical optimization are not in conflict
– Use models to prune the search space (search → intelligent search)
– Use empirical observations to refine models: understand what essential aspects of reality are missing from the model and refine it appropriately
    Multiple models are fine
    Learn from experience
Previous work
– Compared the performance of code generated by a sophisticated compiler (SGI MIPSpro) with code generated by ATLAS, and found that the ATLAS code is better
– Hard to answer why:
    Perhaps ATLAS is effectively doing transformations that compilers do not know about
    Phase-ordering problem: perhaps compilers are doing transformations in the wrong order
    Perhaps parameters to transformations are chosen sub-optimally by the compiler using models
Our Approach
[Figure: the two code-generation pipelines compared in this work.
– Original ATLAS infrastructure: Detect Hardware Parameters (NR, MulAdd, Latency, L1Size) → ATLAS Search Engine (MMSearch) → optimization parameters (NB, MU/NU/KU, xFetch, MulAdd, Latency) → ATLAS MM Code Generator (MMCase) → mini-MMM source → compile, execute, measure MFLOPS, fed back to the search engine.
– Model-Based ATLAS infrastructure: Detect Hardware Parameters (NR, MulAdd, Latency, L1Size, L1 I-cache size) → Model → the same optimization parameters → ATLAS MM Code Generator (MMCase) → mini-MMM source.]
Detecting Machine Parameters
– Micro-benchmarks (a sketch follows)
– L1Size: L1 data cache size; similar to the Hennessy-Patterson book
– NR: number of registers
– MulAdd: fused multiply-add (FMA), i.e. "c += a*b" as one instruction, as opposed to "t = a*b; c += t"
– Latency: latency of FP multiplication
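The slides do not include the micro-benchmark code; below is a minimal sketch of the usual technique for estimating L1Size, in the spirit of the Hennessy-Patterson exercise: walk arrays of increasing size with a fixed stride and watch for the jump in time per access once the working set no longer fits in L1. The stride, size range, and access budget are illustrative assumptions, not ATLAS's actual detector.

    /* l1size_sketch.c -- estimate L1 data cache size from timing (illustrative) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { STRIDE = 4 };                    /* 4 doubles = one 32B line */
        const long TOTAL = 10 * 1000 * 1000;    /* ~10M accesses per array size */
        volatile double sink = 0.0;

        for (size_t bytes = 1024; bytes <= 1024 * 1024; bytes *= 2) {
            size_t n = bytes / sizeof(double);
            double *a = calloc(n, sizeof(double));
            if (!a) return 1;

            long per_pass = (long)(n / STRIDE);
            long reps = TOTAL / per_pass + 1;

            clock_t start = clock();
            for (long r = 0; r < reps; r++)
                for (size_t i = 0; i < n; i += STRIDE)
                    sink += a[i];               /* one access per cache line */
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

            printf("%8zu bytes: %.2f ns/access\n", bytes,
                   1e9 * secs / ((double)reps * (double)per_pass));
            free(a);
        }
        (void)sink;
        return 0;
    }

The array size at which the nanoseconds per access jump is the L1Size estimate; a second jump further out marks L2.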
Code Generation: Compiler View
[Same infrastructure figure as on the "Our Approach" slide.]
BLAS
– Let us focus on BLAS-3. Code for MMM:
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
          C[i][j] += A[i][k]*B[k][j]
– Properties:
    Very good reuse: O(N²) data, O(N³) computation
    Many optimization opportunities
    Few "real" dependencies
– Will run poorly on modern machines as written:
    Poor use of cache and registers
    Poor use of processor pipelines
Optimizations
– Cache-level blocking (tiling): ATLAS blocks only for the L1 cache
– Register-level blocking: registers are the highest level of the memory hierarchy; important to hold array values in registers
– Software pipelining: unroll and schedule operations
– Versioning: dynamically decide which way to compute
– Back-end compiler optimizations: scalar optimizations, instruction scheduling
Cache-level blocking (tiling)
– Tiling in ATLAS:
    Only square tiles (NB×NB×NB)
    Working set of a tile fits in L1
    Tiles are usually copied to contiguous storage
    Special "clean-up" code generated for boundaries
– Mini-MMM:
    for (int j = 0; j < NB; j++)
      for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
          C[i][j] += A[i][k] * B[k][j]
– NB: optimization parameter (a sketch of the surrounding tile loops follows)
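For concreteness, a minimal sketch of the tile loops that such a mini-MMM sits inside, with the clean-up for boundaries folded in via a min() helper; the row-major, in-place tile access is an assumption, and ATLAS additionally copies each tile into contiguous storage before running the mini-MMM.

    #include <stddef.h>

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* C (MxN) += A (MxK) * B (KxN), blocked into NB x NB x NB tiles. */
    void mmm_tiled(size_t M, size_t N, size_t K, size_t NB,
                   const double *A, const double *B, double *C)
    {
        for (size_t jj = 0; jj < N; jj += NB)
            for (size_t ii = 0; ii < M; ii += NB)
                for (size_t kk = 0; kk < K; kk += NB)
                    /* mini-MMM on one tile, written here as plain loops */
                    for (size_t i = ii; i < min_sz(ii + NB, M); i++)
                        for (size_t j = jj; j < min_sz(jj + NB, N); j++)
                            for (size_t k = kk; k < min_sz(kk + NB, K); k++)
                                C[i * N + j] += A[i * K + k] * B[k * N + j];
    }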
Short excursion into tiling
MMM miss ratio
[Figure: L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB, 32B/block, 4-way cache, 8-byte elements.]
IJK version (large cache)
    DO I = 1, N    // row-major storage
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
– Large cache scenario: matrices are small enough to fit into the cache; only cold misses, no capacity misses
– Miss ratio: data size = 3N²; each miss brings in b floating-point numbers
– Miss ratio = (3N²/b) / (4N³) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)
IJK version (small cache)
    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
– Small cache scenario: matrices are large compared to the cache; reuse distance is not O(1) => miss; cold and capacity misses
– Misses:
    C: N²/b misses (good temporal locality)
    A: N³/b misses (good spatial locality)
    B: N³ misses (poor temporal and spatial locality)
– Miss ratio ≈ 0.25·(b+1)/b = 0.3125 (for b = 4); a worked version of this arithmetic follows
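A worked version of the miss-ratio arithmetic above, assuming (as on the large-cache slide) a total of 4N³ memory references: summing the misses for C, A, and B and dividing by the reference count gives

    \[
      \text{miss ratio} \;\approx\;
      \frac{\frac{N^2}{b} + \frac{N^3}{b} + N^3}{4N^3}
      \;\approx\; \frac{\frac{1}{b} + 1}{4}
      \;=\; 0.25\,\frac{b+1}{b}
      \;=\; 0.3125 \quad (b = 4).
    \]

The C term is lower order and disappears in the approximation; the point is that B's lack of locality keeps the ratio high no matter how large b is.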
MMM experiments
[Figure: L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB, 32B/block, 4-way cache, 8-byte elements; tile size marked.]
Register-level blocking
– Micro-MMM: A is MU×1, B is 1×NU, C is MU×NU; needs MU×NU + MU + NU registers
– Unroll loops by MU, NU, and KU
– Mini-MMM with the micro-MMM inside (the k-loop body is replicated KU times):
    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++)
          load A[i..i+MU-1, k] into registers
          load B[k, j..j+NU-1] into registers
          multiply A's and B's and add to C's
        store C[i..i+MU-1, j..j+NU-1]
– MU, NU, KU: optimization parameters (a concrete sketch follows)
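A minimal C sketch of this mini-MMM with the micro-MMM fully scalarized for MU = NU = 2 (illustrative values; the fixed NB, the array types, and leaving KU to the compiler are assumptions, not the ATLAS-generated code).

    #define NB 64   /* assumed tile size; must be a multiple of MU and NU here */

    static void mini_mmm_2x2(const double A[NB][NB], const double B[NB][NB],
                             double C[NB][NB])
    {
        for (int j = 0; j < NB; j += 2) {            /* NU = 2 */
            for (int i = 0; i < NB; i += 2) {        /* MU = 2 */
                /* load the MUxNU block of C into scalars (the register tile) */
                double c00 = C[i][j],     c01 = C[i][j + 1];
                double c10 = C[i + 1][j], c11 = C[i + 1][j + 1];
                for (int k = 0; k < NB; k++) {
                    /* MU values of A and NU values of B feed MU*NU multiply-adds */
                    double a0 = A[i][k], a1 = A[i + 1][k];
                    double b0 = B[k][j], b1 = B[k][j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i][j]     = c00;  C[i][j + 1]     = c01;
                C[i + 1][j] = c10;  C[i + 1][j + 1] = c11;
            }
        }
    }

With MU = NU = 2 the register tile needs 2*2 + 2 + 2 = 8 floating-point registers, matching the MU×NU + MU + NU count above.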
Scheduling
– FMA present? If not, separate each multiply from its dependent add
– Schedule computation using Latency: interleave the MU·NU multiplies M1 … M(MU·NU) and adds A1 … A(MU·NU) so that each add trails its multiply by Latency slots (figure shows Latency = 2)
– Schedule memory operations using IFetch, NFetch, FFetch: the MU+NU loads L1 … L(MU+NU) are injected into the computation as an initial block of IFetch loads followed by blocks of NFetch loads
– Latency, xFetch: optimization parameters (a hand-scheduled sketch follows)
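To make the latency-driven schedule concrete, here is a hand-scheduled body of the 2×2 micro-MMM for a machine without FMA and Latency = 2 (illustrative values, not ATLAS's generated code): each multiply is issued two slots before the add that consumes it, so the adder never waits on the multiplier.

    /* One k-iteration of a 2x2 micro-MMM, multiplies skewed ahead of adds. */
    static void micro_mmm_2x2_scheduled(const double a[2], const double b[2],
                                        double c[2][2])
    {
        double t0, t1, t2, t3;
        t0 = a[0] * b[0];                    /* M1 */
        t1 = a[0] * b[1];                    /* M2 */
        c[0][0] += t0;  t2 = a[1] * b[0];    /* A1 paired with M3 */
        c[0][1] += t1;  t3 = a[1] * b[1];    /* A2 paired with M4 */
        c[1][0] += t2;                       /* A3 */
        c[1][1] += t3;                       /* A4 */
    }

The loads of a[] and b[] would similarly be split into an initial IFetch group followed by NFetch-sized groups interleaved with the arithmetic.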
Comments
– Optimization parameters:
    NB: constrained by the size of the L1 cache
    MU, NU: constrained by NR
    KU: constrained by the size of the I-cache
    xFetch: constrained by the # of OL (outstanding loads)
    MulAdd/Latency: related to hardware parameters
– Similar parameters would be used by compilers
[Figure: MFLOPS vs. parameter value, sketching a sensitive parameter and an insensitive parameter.]
ATLAS Search
[Same infrastructure figure as on the "Our Approach" slide.]
High-level picture
– Multi-dimensional optimization problem:
    Independent parameters: NB, MU, NU, KU, …
    Dependent variable: MFLOPS
    Function from parameters to the variable is given implicitly; can be evaluated repeatedly
– One optimization strategy: orthogonal range search
    Optimize along one dimension at a time, using reference values for parameters not yet optimized
    Not guaranteed to find the optimal point, but might come close (see the sketch below)
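A minimal sketch of orthogonal range search in this setting: sweep one parameter over its interval while the others sit at their current (reference) values, then freeze it at the best value found. The three-parameter setup and the synthetic measure_mflops objective are stand-ins for the real generate-compile-execute-measure loop.

    #include <stdio.h>

    #define NPARAMS 3   /* e.g. NB, MU, NU -- an illustrative subset */

    /* Toy stand-in for "generate mini-MMM, compile, execute, measure MFLOPS". */
    static double measure_mflops(const int p[NPARAMS])
    {
        static const int peak[NPARAMS] = { 64, 4, 4 };   /* synthetic optimum */
        double score = 1000.0;
        for (int d = 0; d < NPARAMS; d++)
            score -= (double)(p[d] - peak[d]) * (p[d] - peak[d]);
        return score;
    }

    static void orthogonal_search(int p[NPARAMS],
                                  const int lo[NPARAMS], const int hi[NPARAMS])
    {
        for (int d = 0; d < NPARAMS; d++) {          /* one dimension at a time */
            int best_v = p[d];
            double best = -1e300;
            for (int v = lo[d]; v <= hi[d]; v++) {
                p[d] = v;
                double m = measure_mflops(p);
                if (m > best) { best = m; best_v = v; }
            }
            p[d] = best_v;                           /* freeze at the best value */
        }
    }

    int main(void)
    {
        int p[NPARAMS]  = { 16, 1, 1 };              /* reference starting values */
        int lo[NPARAMS] = { 16, 1, 1 };
        int hi[NPARAMS] = { 80, 8, 8 };
        orthogonal_search(p, lo, hi);
        printf("best point: NB=%d MU=%d NU=%d\n", p[0], p[1], p[2]);
        return 0;
    }

Because each dimension is optimized once, in a fixed order, the result depends on that order and on the reference values, which is exactly why the search is not guaranteed to find the global optimum.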
Specification of OR (orthogonal range) search
– Order in which dimensions are optimized
– Reference values for un-optimized dimensions at any step
– Interval in which the range search is done for each dimension
Search strategy
1. Find best NB
2. Find best MU and NU
3. Find best KU
4. Find best xFetch
5. Find best Latency (lat)
6. Find the non-copy version tile size (NCNB)
Find best NB
– Search in the following range: 16 ≤ NB ≤ 80, NB² ≤ L1Size
– In this search, use simple estimates for the other parameters
– (e.g.) KU: test each candidate with full K unrolling (KU = NB) and with no K unrolling (KU = 1)
Finding other parameters
– Find best MU, NU: try all MU and NU that satisfy 1 ≤ MU, NU ≤ NB and MU*NU + MU + NU ≤ NR; in this step, use the best NB from the previous step (a sketch of this step follows)
– Find best KU
– Find best Latency: in [1…6]
– Find best xFetch: IFetch in [2, MU+NU], NFetch in [1, MU+NU−IFetch]
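A minimal sketch of the MU/NU step referred to above: enumerate every pair that satisfies the two constraints, measure each, and keep the best. measure_mini_mmm is a hypothetical callback standing in for ATLAS's generate/compile/execute/measure cycle.

    /* NB is fixed from the previous step; NR comes from hardware detection. */
    void search_mu_nu(int NB, int NR, int *best_mu, int *best_nu,
                      double (*measure_mini_mmm)(int NB, int MU, int NU))
    {
        double best = -1.0;
        *best_mu = *best_nu = 1;
        for (int mu = 1; mu <= NB; mu++)
            for (int nu = 1; nu <= NB; nu++) {
                if (mu * nu + mu + nu > NR) continue;   /* register constraint */
                double mflops = measure_mini_mmm(NB, mu, nu);
                if (mflops > best) { best = mflops; *best_mu = mu; *best_nu = nu; }
            }
    }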
Our Models
[Same infrastructure figure as on the "Our Approach" slide.]
Modeling for Optimization Parameters
– NB: hierarchy of models (later)
– MU, NU: maximize the register-tile size subject to the register-file constraint (MU×NU + MU + NU ≤ NR, as on the register-blocking slide)
– KU: constrained by the L1 instruction cache
– Latency, MulAdd: hardware parameters
– xFetch: set to 2
Largest NB for no capacity/conflict misses
– Tiles are copied into contiguous memory
– Condition for cold misses only: 3*NB² ≤ L1Size
[Figure: the NB×NB tiles of A and B.]
Largest NB for no capacity misses
– MMM:
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j]
– Cache model: fully associative, line size 1 word, optimal replacement
– Bottom line: N² + N + 1 < C (one full matrix + one row/column + one element)
Extending the Model
– Line size > 1: spatial locality matters, and so does the array layout in memory
– Bottom line: depending on the loop order, either ⌈N²/B⌉ + ⌈N/B⌉ + 1 ≤ C/B or ⌈N²/B⌉ + N + 1 ≤ C/B, where B is the line size in words and C/B the number of cache lines (the row/column term gets spatial locality in only one of the two orders)
Extending the Model (cont.)
– LRU (not optimal) replacement
– MMM sample:
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j]
– Bottom line: …
Summary: Modeling for Tile Size (NB)
Models of increasing complexity (a sketch of the first two follows):
– 3*NB² ≤ C — whole working set fits in L1
– NB² + NB + 1 ≤ C — fully associative, optimal replacement, line size 1 word
– ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B or ⌈NB²/B⌉ + NB + 1 ≤ C/B (depending on loop order) — line size B > 1 word
– … — LRU replacement
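A minimal sketch of turning the first two models into concrete NB values, given the L1 capacity C in 8-byte elements; the 16KB example matches the Pentium III L1 cache used earlier. The line-size and LRU refinements adjust these bounds further.

    #include <stdio.h>

    static int nb_whole_working_set(int C)   /* largest NB with 3*NB^2 <= C */
    {
        int nb = 1;
        while (3 * (nb + 1) * (nb + 1) <= C) nb++;
        return nb;
    }

    static int nb_one_matrix_one_row(int C)  /* largest NB with NB^2 + NB + 1 <= C */
    {
        int nb = 1;
        while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= C) nb++;
        return nb;
    }

    int main(void)
    {
        int C = 16 * 1024 / 8;   /* 16KB L1, 8-byte elements */
        printf("3*NB^2 <= C    -> NB = %d\n", nb_whole_working_set(C));
        printf("NB^2+NB+1 <= C -> NB = %d\n", nb_one_matrix_one_row(C));
        return 0;
    }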
Comments
– A lot of work in the compiler literature on automatic tile size selection
– Not much is known about how well these algorithms do in practice: few comparisons to BLAS
– Not obvious how one generalizes our models to more complex codes
– Insight needed: how sensitive is performance to tile size?
Experiments
– Architectures:
    SGI R12000, 270MHz
    Sun UltraSPARC III, 900MHz
    Intel Pentium III, 550MHz
– Measure:
    Mini-MMM performance
    Complete MMM performance
    Sensitivity of performance to parameter variations
Installation Time of ATLAS vs. Model
MiniMMM Performance
– SGI: ATLAS 457 MFLOPS, Model 453 MFLOPS, difference 1%
– Sun: ATLAS 1287 MFLOPS, Model 1052 MFLOPS, difference 20%
– Intel: ATLAS 394 MFLOPS, Model 384 MFLOPS, difference 2%
MMM Performance
[Figure: complete MMM performance on SGI, Sun, and Intel, comparing BLAS, ATLAS, MODEL, and F77 (native compiler).]
Optimization Parameter Values

              NB    MU/NU/KU   F/I/N-Fetch   Latency
  ATLAS
    SGI:      64    4/4/64     0/5/1         3
    Sun:      48    5/3/48     0/3/5         5
    Intel:    40    2/1/40     0/3/1         4
  Model
    SGI:      62    4/4/62     1/2/2         6
    Sun:      88    4/4/78     1/2/2         4
    Intel:    42    2/1/42     1/2/2         3
Sensitivity to NB and Latency: Sun
[Figure: MFLOPS vs. tile size (NB) and vs. Latency, with the ATLAS, MODEL, and BEST values marked.]
Sensitivity to NB: SGI
[Figure: MFLOPS vs. NB, with the ATLAS, MODEL, and BEST values marked, annotated with the bounds 3*NB² ≤ C2 and NB² + NB + 1 ≤ C2.]
Sensitivity to NB: Intel
Sensitivity to MU,NU: SGI
Sensitivity to MU,NU: Sun
Sensitivity to MU,NU: Intel
Shape of register tile matters
Sensitivity to KU
Conclusions
– Search is not as important as one might think
– Compilers can achieve near-ATLAS performance if they:
    implement well-known transformations
    use models to choose parameter values
– There is room for improvement in both models and empirical search:
    Both are 20-25% slower than BLAS
    Higher levels of the memory hierarchy cannot be neglected
Future Directions
– Study hand-written BLAS codes to understand the performance gap
– Repeat the study with FFTW/SPIRAL, which use search to choose between algorithms
– Combine models with search:
    Use models to speed up empirical search
    Use empirical studies to enhance models
– Feed insights back into compilers:
    How do we make it easier for compiler writers to implement transformations?
    Use insights to simplify the memory system
Information URL:
Sensitivity to Latency: Intel