An Experimental Comparison of Empirical and Model-based Optimization

Keshav Pingali, Cornell University

Joint work with: Kamen Yotov (2), Xiaoming Li (1), Gang Ren (1), Michael Cibulskis (1), Gerald DeJong (1), Maria Garzaran (1), David Padua (1), Paul Stodghill (2), Peng Wu (3)

(1) UIUC, (2) Cornell University, (3) IBM T.J. Watson

Context: High-performance libraries

Traditional approach:
- Hand-optimized code: (e.g.) BLAS
- Problem: tedious to develop

Alternatives:
- Restructuring compilers
  - General purpose: generate code from high-level specifications
  - Use architectural models to determine optimization parameters
  - Performance of optimized code is not satisfactory
- Library generators
  - Problem-specific: (e.g.) ATLAS for BLAS, FFTW for FFT
  - Use empirical optimization to determine optimization parameters
  - Believed to produce optimal code

Why are library generators beating compilers?

How important is empirical search?

Model-based optimization and empirical optimization are not in conflict:
- Use models to prune the search space (search -> intelligent search)
- Use empirical observations to refine models: understand what essential aspects of reality are missing from the model and refine it appropriately
- Multiple models are fine
- Learn from experience

Previous work

Compared the performance of:
- code generated by a sophisticated compiler like SGI MIPSpro
- code generated by ATLAS
and found that the ATLAS code is better.

Hard to answer why:
- Perhaps ATLAS is effectively doing transformations that compilers do not know about
- Phase-ordering problem: perhaps compilers are doing transformations in the wrong order
- Perhaps parameters to transformations are chosen sub-optimally by the compiler using models

Our Approach

[Diagram: two versions of the ATLAS infrastructure.
- Original ATLAS: Detect Hardware Parameters (NR, MulAdd, Latency, L1Size) -> ATLAS Search Engine (MMSearch) -> (NB, MU, NU, KU, xFetch, MulAdd, Latency) -> ATLAS MM Code Generator (MMCase) -> MiniMMM source -> Compile, Execute, Measure MFLOPS, feeding back into the search.
- Model-based ATLAS: Detect Hardware Parameters (NR, MulAdd, Latency, L1Size, L1I$Size) -> Model -> the same parameters -> ATLAS MM Code Generator (MMCase) -> MiniMMM source. The search engine and the measurement loop are replaced by the model.]

Detecting machine parameters

Micro-benchmarks:
- L1Size: L1 data cache size (similar to the Hennessy-Patterson book)
- NR: number of registers
- MulAdd: fused multiply-add (FMA)? "c += a*b" as opposed to "c += t; t = a*b"
- Latency: latency of FP multiplication
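A minimal sketch of the kind of micro-benchmark used to estimate L1Size (not ATLAS's actual code; the 64-byte stride and the clock()-based timing are illustrative assumptions): walk arrays of increasing size and look for the jump in time per access.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Walk 'size' bytes repeatedly; once 'size' exceeds the L1 data
     cache, the time per access jumps. The 64-byte stride is an
     assumed line size, so each access touches a distinct line. */
  static double ns_per_access(char *buf, long size) {
      long accesses = size / 64, reps = (64L << 20) / size;
      volatile char sink = 0;
      clock_t t0 = clock();
      for (long r = 0; r < reps; r++)
          for (long i = 0; i < size; i += 64)
              sink += buf[i];
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
      (void)sink;
      return 1e9 * secs / ((double)reps * accesses);
  }

  int main(void) {
      for (long size = 1 << 10; size <= 1 << 21; size *= 2) {
          char *buf = malloc(size);
          for (long i = 0; i < size; i++) buf[i] = (char)i;
          printf("%8ld bytes: %6.2f ns/access\n",
                 size, ns_per_access(buf, size));
          free(buf);
      }
      return 0;
  }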

Code Generation: Compiler View

[Same two-pipeline diagram as before, with the ATLAS MM Code Generator (MMCase) highlighted: viewed as a compiler, its inputs are the optimization parameters (NB, MU, NU, KU, xFetch, MulAdd, Latency) and its output is the MiniMMM source.]

BLAS

Let us focus on BLAS-3. Code for MMM:

  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < K; k++)
        C[i][j] += A[i][k]*B[k][j];

Properties:
- Very good reuse: O(N²) data, O(N³) computation
- Many optimization opportunities; few "real" dependencies
- Will run poorly on modern machines
  - Poor use of cache and registers
  - Poor use of processor pipelines

Optimizations

- Cache-level blocking (tiling): ATLAS blocks only for the L1 cache
- Register-level blocking: registers are the highest level of the memory hierarchy; important to hold array values in registers
- Software pipelining: unroll and schedule operations
- Versioning: dynamically decide which way to compute
- Back-end compiler optimizations: scalar optimizations, instruction scheduling

Cache-level blocking (tiling)

Tiling in ATLAS:
- Only square tiles (NB×NB×NB)
- Working set of a tile fits in L1
- Tiles are usually copied to contiguous storage
- Special "clean-up" code generated for boundaries

Mini-MMM:

  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j];

NB is an optimization parameter.
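For reference, a sketch of how the NB×NB×NB tiling wraps around the mini-MMM (assuming N is a multiple of NB; the copy to contiguous storage and the clean-up code for boundaries are omitted):

  /* Cache-tiled MMM: each (jj, ii, kk) iteration runs one mini-MMM
     on an NB x NB tile. Uses C99 variable-length array parameters. */
  void mmm_tiled(int N, int NB, double C[N][N],
                 double A[N][N], double B[N][N]) {
      for (int jj = 0; jj < N; jj += NB)
          for (int ii = 0; ii < N; ii += NB)
              for (int kk = 0; kk < N; kk += NB)
                  for (int j = jj; j < jj + NB; j++)   /* mini-MMM */
                      for (int i = ii; i < ii + NB; i++)
                          for (int k = kk; k < kk + NB; k++)
                              C[i][j] += A[i][k] * B[k][j];
  }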

Short excursion into tiling

MMM miss ratio

[Graph: L1 cache miss ratio for the Intel Pentium III, MMM with N = 1…1300; 16KB 4-way cache, 32B lines, 8-byte elements.]

IJK version (large cache)

  DO I = 1, N    // row-major storage
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Large cache scenario:
- Matrices are small enough to fit into cache
- Only cold misses, no capacity misses
- Miss ratio: data size is 3N², and each miss brings in b floating-point numbers, so there are 3N²/b misses; out of 4N³ memory accesses, the miss ratio is 3N²/(4bN³) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

IJK version (small cache)

  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Small cache scenario:
- Matrices are large compared to the cache; reuse distance is not O(1) => miss
- Cold and capacity misses
- Miss ratio:
  - C: N²/b misses (good temporal locality)
  - A: N³/b misses (good spatial locality)
  - B: N³ misses (poor temporal and spatial locality)
  - Out of 4N³ accesses: miss ratio ≈ (N³/b + N³)/(4N³) = 0.25(b+1)/b = 0.3125 (for b = 4)

MMM experiments

[Same graph as before, with the tile size marked: L1 cache miss ratio for the Intel Pentium III, MMM with N = 1…1300; 16KB 4-way cache, 32B lines, 8-byte elements.]

Register-level blocking

Micro-MMM:
- A: MU×1
- B: 1×NU
- C: MU×NU
- MU·NU + MU + NU registers needed

Unroll loops by MU, NU, and KU.

Mini-MMM with Micro-MMM inside:

  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU) {
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++) {     // body unrolled KU times
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A's and B's and add to C's
      }
      store C[i..i+MU-1, j..j+NU-1]
    }

MU, NU, KU: optimization parameters.
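As a concrete instance, here is what the register-blocked mini-MMM looks like for MU = NU = 2 and KU = 1 (a hedged sketch; ATLAS generates such code rather than writing it by hand):

  /* Register-blocked mini-MMM, MU = NU = 2. c00..c11 hold the
     MUxNU register tile; a0,a1 and b0,b1 are the MU + NU operand
     registers (2*2 + 2 + 2 = 8 FP registers total). */
  void mini_mmm_2x2(int NB, double C[NB][NB],
                    double A[NB][NB], double B[NB][NB]) {
      for (int j = 0; j < NB; j += 2)
          for (int i = 0; i < NB; i += 2) {
              double c00 = C[i][j],   c01 = C[i][j+1];
              double c10 = C[i+1][j], c11 = C[i+1][j+1];
              for (int k = 0; k < NB; k++) {
                  double a0 = A[i][k], a1 = A[i+1][k];
                  double b0 = B[k][j], b1 = B[k][j+1];
                  c00 += a0 * b0;  c01 += a0 * b1;
                  c10 += a1 * b0;  c11 += a1 * b1;
              }
              C[i][j]   = c00;  C[i][j+1]   = c01;
              C[i+1][j] = c10;  C[i+1][j+1] = c11;
          }
  }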

Scheduling

- FMA present?
- Schedule computation using Latency
- Schedule memory operations using IFetch, NFetch, FFetch

Latency and xFetch are optimization parameters.

[Diagram: the unrolled loop body as a sequence of multiplies M1 … M(MU·NU) and adds A1 … A(MU·NU); each add Ai is scheduled Latency slots after its multiply Mi (Latency = 2 shown). The MU+NU loads L1 … L(MU+NU) are interleaved with computation in blocks of IFetch loads followed by NFetch loads.]
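The effect of the Latency parameter is easiest to see on a one-dimensional example. A sketch (illustrative, not ATLAS output) of a dot product scheduled for Latency = 2: each multiply's result is consumed by an add two iterations later, so independent work hides the multiplier's latency.

  #define LATENCY 2

  /* Latency-skewed dot product: at step k we add the product issued
     LATENCY iterations earlier and issue a new multiply. */
  double dot_skewed(int n, const double *a, const double *b) {
      double prod[LATENCY] = {0.0, 0.0};   /* in-flight multiplies */
      double sum = 0.0;
      for (int k = 0; k < n; k++) {
          sum += prod[k % LATENCY];        /* result from 2 steps ago */
          prod[k % LATENCY] = a[k] * b[k]; /* issue a new multiply */
      }
      for (int s = 0; s < LATENCY; s++)    /* drain the pipeline */
          sum += prod[s];
      return sum;
  }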

Comments

Optimization parameters:
- NB: constrained by the size of the L1 cache
- MU, NU: constrained by NR
- KU: constrained by the size of the I-cache
- xFetch: constrained by the number of outstanding loads
- MulAdd/Latency: related to hardware parameters

Similar parameters would be used by compilers.

[Graph: performance (MFlops) as a function of a parameter, contrasting a sensitive parameter with an insensitive one.]

ATLAS Search

[Same two-pipeline diagram as before, with the ATLAS Search Engine (MMSearch) highlighted.]

High-level picture

Multi-dimensional optimization problem:
- Independent parameters: NB, MU, NU, KU, …
- Dependent variable: MFlops
- The function from parameters to MFlops is given implicitly; it can only be evaluated repeatedly

One optimization strategy: orthogonal range search
- Optimize along one dimension at a time, using reference values for parameters not yet optimized
- Not guaranteed to find the optimal point, but may come close
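A minimal sketch of the strategy: each step range-searches one parameter while the others stay pinned. Params, measure(), and search_dim() are simplified stand-ins (not ATLAS's real search state) for the generate-compile-execute-measure harness; the sketches on the next slides reuse them.

  typedef struct { int nb, mu, nu, ku; } Params;

  /* Stand-in for: generate the mini-MMM with these parameters,
     compile it, execute it, and report MFLOPS. */
  extern double measure(Params p);

  /* Scan one dimension over [lo, hi] while the rest of *p is held
     fixed; leave the best value in *field for later steps.
     (*field points into *p, e.g. &p->nb.) */
  static void search_dim(Params *p, int *field, int lo, int hi, int step) {
      int best = lo;
      double best_mf = -1.0;
      for (int v = lo; v <= hi; v += step) {
          *field = v;
          double mf = measure(*p);
          if (mf > best_mf) { best_mf = mf; best = v; }
      }
      *field = best;
  }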

Specification of the orthogonal range search:
- The order in which dimensions are optimized
- Reference values for un-optimized dimensions at each step
- The interval in which the range search is done for each dimension

Search strategy
1. Find best NB
2. Find best MU & NU
3. Find best KU
4. Find best xFetch
5. Find best Latency (lat)
6. Find the non-copy version tile size (NCNB)

Find best NB

Search in the following range:
- 16 <= NB <= 80
- NB² <= L1Size

In this search, use simple estimates for the other parameters. For KU, test each candidate NB with:
- Full K unrolling (KU = NB)
- No K unrolling (KU = 1)
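A sketch of this step, reusing Params and measure() from the earlier sketch (the step size of 4 and the reference register tile MU = NU = 4 are assumptions, not necessarily ATLAS's exact choices):

  /* Scan candidate tile sizes; for each, try full K unrolling
     (KU = NB) and no K unrolling (KU = 1). */
  int find_best_nb(int l1_size_elems) {
      int best_nb = 16;
      double best_mf = -1.0;
      for (int nb = 16; nb <= 80 && nb * nb <= l1_size_elems; nb += 4)
          for (int full = 0; full <= 1; full++) {
              Params p = { nb, 4, 4, full ? nb : 1 };
              double mf = measure(p);
              if (mf > best_mf) { best_mf = mf; best_nb = nb; }
          }
      return best_nb;
  }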

Finding the other parameters

- Find best MU, NU: try all MU and NU that satisfy 1 <= MU, NU <= NB and MU·NU + MU + NU <= NR, using the best NB from the previous step
- Find best KU
- Find best Latency, in [1…6]
- Find best xFetch: IFetch in [2, MU+NU], NFetch in [1, MU+NU-IFetch]
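Unlike the one-dimension-at-a-time steps, MU and NU are searched jointly. A sketch, again reusing Params and measure():

  /* Try every register tile whose footprint MU*NU + MU + NU fits
     in the NR available FP registers. */
  void find_best_mu_nu(Params *p, int NR) {
      int best_mu = 1, best_nu = 1;
      double best_mf = -1.0;
      for (int mu = 1; mu <= p->nb; mu++)
          for (int nu = 1; nu <= p->nb; nu++) {
              if (mu * nu + mu + nu > NR) continue;
              Params q = *p;
              q.mu = mu; q.nu = nu;
              double mf = measure(q);
              if (mf > best_mf) { best_mf = mf; best_mu = mu; best_nu = nu; }
          }
      p->mu = best_mu;
      p->nu = best_nu;
  }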

Our Models

[Same two-pipeline diagram as before, with the Model box highlighted: the model consumes the measured hardware parameters (NR, MulAdd, Latency, L1Size, L1I$Size) and produces NB, MU, NU, KU, xFetch, MulAdd, and Latency directly, with no search.]

Modeling for optimization parameters

Optimization parameters:
- NB: hierarchy of models (next slides)
- MU, NU: largest register tile that fits in the register file (the same constraint as in the search: MU·NU + MU + NU <= NR)
- KU: maximize unrolling subject to the capacity of the L1 instruction cache
- Latency, MulAdd: taken directly from the measured hardware parameters
- xFetch: set to 2

Largest NB for no capacity/conflict misses

Tiles are copied into contiguous memory. Condition for cold misses only:

  3·NB² <= L1Size

(one NB×NB tile each for A, B, and C)

Largest NB for no capacity misses

MMM:

  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];

Cache model:
- Fully associative
- Line size: 1 word
- Optimal replacement

Bottom line: NB² + NB + 1 < C. The cache must hold:
- one full matrix,
- one row/column, and
- one element.

Extending the model

Line size > 1:
- Spatial locality matters
- Array layout in memory matters

Bottom line: depending on the loop order, with line size B, either

  ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 <= C/B    or    ⌈NB²/B⌉ + NB + 1 <= C/B

(the row/column costs ⌈NB/B⌉ lines when it is laid out contiguously in memory, and NB lines when it is not).

Extending the model (cont.)

LRU (not optimal replacement). MMM sample:

  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];

Bottom line: under LRU the cache must be larger than under optimal replacement, because recently touched lines are retained whether or not they will be reused; the working set grows by additional rows/columns, tightening the bound on NB.

Summary: modeling for tile size (NB)

Models of increasing complexity:
- 3·NB² ≤ C — whole working set fits in L1
- NB² + NB + 1 ≤ C — fully associative, optimal replacement, line size 1 word
- ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B or ⌈NB²/B⌉ + NB + 1 ≤ C/B — line size B > 1 word, depending on loop order
- a further tightened bound — LRU replacement
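Put together, these models reduce to closed-form bounds on NB. A sketch computing the largest NB under the first three models, for capacity C in words and line size B in words (the line-size case follows the "one matrix + one row/column + one element" count above, contiguous row/column variant):

  #include <math.h>

  /* 3*NB^2 <= C : whole working set fits in L1. */
  int nb_working_set(int C) { return (int)sqrt(C / 3.0); }

  /* NB^2 + NB + 1 <= C : optimal replacement, line size 1 word. */
  int nb_optimal(int C) {
      int nb = 0;
      while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= C) nb++;
      return nb;
  }

  /* ceil(NB^2/B) + ceil(NB/B) + 1 <= C/B : line size B words,
     contiguous row/column case. Integer ceiling via (x + B - 1)/B. */
  int nb_line_size(int C, int B) {
      int nb = 0;
      while (((nb + 1) * (nb + 1) + B - 1) / B
             + ((nb + 1) + B - 1) / B + 1 <= C / B) nb++;
      return nb;
  }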

Comments

- There is a lot of work in the compiler literature on automatic tile size selection
- Not much is known about how well these algorithms do in practice; few comparisons to BLAS
- It is not obvious how to generalize our models to more complex codes
- Insight needed: how sensitive is performance to tile size?

Experiments

Architectures:
- SGI R12000, 270 MHz
- Sun UltraSPARC III, 900 MHz
- Intel Pentium III, 550 MHz

Measured:
- Mini-MMM performance
- Complete MMM performance
- Sensitivity of performance to parameter variations

Installation Time of ATLAS vs. Model

MiniMMM performance

          ATLAS         Model         Difference
  SGI     457 MFLOPS    453 MFLOPS    1%
  Sun     1287 MFLOPS   1052 MFLOPS   20%
  Intel   394 MFLOPS    384 MFLOPS    2%

MMM performance

[Graphs: complete MMM performance on SGI, Sun, and Intel, comparing hand-tuned BLAS, ATLAS, MODEL, and F77.]

Optimization parameter values

                 NB   MU/NU/KU   F/I/N-Fetch   Latency
  ATLAS  SGI     64   4/4/64     0/5/1         3
         Sun     48   5/3/48     0/3/5         5
         Intel   40   2/1/40     0/3/1         4
  Model  SGI     62   4/4/62     1/2/2         6
         Sun     88   4/4/78     1/2/2         4
         Intel   42   2/1/42     1/2/2         3

Sensitivity to NB and Latency: Sun

[Graphs: performance vs. tile size (NB) and performance vs. Latency on the Sun, with the ATLAS, MODEL, and BEST values marked.]

Sensitivity to NB: SGI

[Graph: performance vs. NB on the SGI, with the ATLAS, MODEL, and BEST values marked; vertical lines show the model bounds 3·NB² ≤ C₂ and NB² + NB + 1 ≤ C₂, i.e., blocking for the L2 cache.]

Sensitivity to NB: Intel

Sensitivity to MU,NU: SGI

Sensitivity to MU,NU: Sun

Sensitivity to MU,NU: Intel

Shape of register tile matters

Sensitivity to KU

Conclusions

- Search is not as important as one might think
- Compilers can achieve near-ATLAS performance if they implement well-known transformations and use models to choose parameter values
- There is room for improvement in both models and empirical search
  - Both are 20-25% slower than hand-tuned BLAS
  - Higher levels of the memory hierarchy cannot be neglected

Future directions

- Study hand-written BLAS codes to understand the performance gap
- Repeat the study with FFTW/SPIRAL, which use search to choose between algorithms
- Combine models with search
  - Use models to speed up empirical search
  - Use empirical studies to enhance models
- Feed insights back into compilers: how do we make it easier for compiler writers to implement transformations?
- Use insights to simplify the memory system

Information URL:

Sensitivity to Latency: Intel