An Experimental Comparison of Empirical and Model-based Optimization Kamen Yotov Cornell University Joint work with: Xiaoming Li 1, Gang Ren 1, Michael Cibulskis 1, Gerald DeJong 1, Maria Garzaran 1, David Padua 1, Keshav Pingali 2, Paul Stodghill 2, Peng Wu 3 1 UIUC, 2 Cornell University, 3 IBM T.J.Watson

Motivation
Autonomic computing
 self-configuration
 self-optimization
 self-x
Question
 How important is empirical search?
Model-based vs. empirical optimization
 Not in conflict!
 Use models to prune the search space → intelligent search
 Use empirical observations to refine models → learn from experience

Optimization: Compilers vs. Library Generators
Traditional compilers
 Model-based, optimizing compiler
 Pros: fast; general-purpose
 Cons: produces sub-optimal performance; architecture-specific
ATLAS / FFTW / Spiral
 Search-based, simple compiler
 Cons: slow; problem-specific
 Pros: believed to get near-optimal performance; adapts to the architecture

Previous Work
Compared the performance of
 sophisticated compilers (like SGI MIPSpro)
 ATLAS
Found that ATLAS code is better
Hard to answer why… Maybe ATLAS:
 has transformations compilers do not know about
 performs the same transformations but in a different order (phase-ordering problem)
 chooses transformation parameters better

Our Approach
Compare two instantiations of the ATLAS infrastructure:
 Original ATLAS infrastructure: detect hardware parameters (NR, MulAdd, Latency, L1Size); the ATLAS search engine (MMSearch) picks the parameters NB, MU/NU/KU, xFetch, MulAdd, and Latency for the ATLAS MM code generator (MMCase) by repeatedly compiling, executing, and measuring the MFLOPS of the generated mini-MMM source
 Model-based ATLAS infrastructure: detect hardware parameters (NR, MulAdd, Latency, L1Size, L1I$Size); a model computes the same code-generator parameters directly, with no search

Detecting Machine Parameters
Micro-benchmarks
 L1Size: L1 data cache size
 Similar to the Hennessy-Patterson book
 NR: number of registers
 Use several FP temporaries repeatedly
 MulAdd: fused multiply-add (FMA)
 “c += a*b” as opposed to “t = a*b; c += t”
 Latency: latency of FP multiplication
 Needed for scheduling multiplies and adds in the absence of FMA

Compiler View
ATLAS code generation (the MM code generator, MMCase)
Focus on MMM (as part of BLAS-3)
 Very good reuse: O(N^2) data, O(N^3) computation
 No “real” dependencies (only input / reuse ones)

Optimizations
Cache-level blocking (tiling)
 ATLAS blocks only for the L1 cache
Register-level blocking
 Registers: highest level of the memory hierarchy
 Important to hold array values in registers
Software pipelining
 Unroll and schedule operations
Versioning
 Dynamically decide which way to compute
Back-end compiler optimizations
 Scalar optimizations
 Instruction scheduling

Cache-level Blocking (Tiling)
Tiling in ATLAS
 Only square tiles (NB x NB x NB)
 Working set of a tile fits in L1
 Tiles are usually copied to contiguous storage
 Special “clean-up” code generated for boundaries
Mini-MMM:
  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j]
NB: optimization parameter
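The full tiled MMM wraps the mini-MMM above in three tile loops. The sketch below is a minimal C version, assuming row-major N x N matrices with N a multiple of NB; it omits the tile copying and the boundary clean-up code that ATLAS generates.

```c
/* Blocked (tiled) MMM sketch: C += A * B for square n x n row-major
 * matrices, with n assumed to be a multiple of the tile size nb.
 * Each innermost triple of loops is one NB x NB x NB mini-MMM. */
void mmm_tiled(int n, int nb, const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < n; jj += nb)
        for (int ii = 0; ii < n; ii += nb)
            for (int kk = 0; kk < n; kk += nb)
                /* mini-MMM on one tile */
                for (int j = jj; j < jj + nb; j++)
                    for (int i = ii; i < ii + nb; i++)
                        for (int k = kk; k < kk + nb; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

The point of the tile loops is that each mini-MMM touches only O(NB^2) data while doing O(NB^3) work, so the working set stays in L1.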

Register-level Blocking
Micro-MMM
 MU x 1 elements of A
 1 x NU elements of B
 MU x NU sub-matrix of C
 MU*NU + MU + NU ≤ NR
Mini-MMM revised:
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU)
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++)
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A’s and B’s and add to C’s
      store C[i..i+MU-1, j..j+NU-1]
Unroll the k loop KU times
MU, NU, KU: optimization parameters
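The revised mini-MMM can be sketched in C for the fixed, illustrative choice MU = NU = 2 (ATLAS generates the unrolled body for whatever MU, NU the search or model picks). The 2x2 tile of C is held in scalar locals, which the back-end compiler keeps in registers; NB is assumed to be a multiple of 2.

```c
/* Register-blocked mini-MMM sketch, MU = NU = 2 (illustrative values).
 * C += A * B for one nb x nb tile, row-major, nb a multiple of 2. */
void mini_mmm_2x2(int nb, const double *A, const double *B, double *C)
{
    for (int j = 0; j < nb; j += 2)
        for (int i = 0; i < nb; i += 2) {
            /* load the 2x2 micro-tile of C into registers */
            double c00 = C[i*nb + j],       c01 = C[i*nb + j + 1];
            double c10 = C[(i+1)*nb + j],   c11 = C[(i+1)*nb + j + 1];
            for (int k = 0; k < nb; k++) {
                /* MU x 1 column of A, 1 x NU row of B */
                double a0 = A[i*nb + k],     a1 = A[(i+1)*nb + k];
                double b0 = B[k*nb + j],     b1 = B[k*nb + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the micro-tile back */
            C[i*nb + j]     = c00;  C[i*nb + j + 1]     = c01;
            C[(i+1)*nb + j] = c10;  C[(i+1)*nb + j + 1] = c11;
        }
}
```

For MU = NU = 2 the register budget is MU*NU + MU + NU = 8 values live in the inner loop, matching the constraint above.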

Scheduling
FMA present?
Schedule computation
 using Latency
Schedule memory operations
 using FFetch, IFetch, NFetch
Mini-MMM revised:
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU)
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k += KU)
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A’s and B’s and add to C’s
        ...  (KU times)
        load A[i..i+MU-1, k+KU-1] into registers
        load B[k+KU-1, j..j+NU-1] into registers
        multiply A’s and B’s and add to C’s
      store C[i..i+MU-1, j..j+NU-1]
Latency, xFetch: optimization parameters
[Figure: schedule of multiplies M1..M(MU*NU) interleaved with adds A1..A(MU*NU) at distance Latency (shown for Latency = 2); loads L1..L(MU+NU) issued as an IFetch group followed by NFetch groups between blocks of computation]

Searching for Optimization Parameters
The ATLAS search engine (MMSearch)
Multi-dimensional search problem
 Optimization parameters are independent variables
 MFLOPS is the dependent variable
 The function is implicit, but can be repeatedly evaluated (compile, execute, measure)

Search Strategy
Orthogonal range search
 Optimize along one dimension at a time, using reference values for not-yet-optimized parameters
 Not guaranteed to find the optimal point
 Inputs:
 Order in which dimensions are optimized
  NB, then MU & NU, KU, xFetch, Latency
 Interval in which the search is done in each dimension
  For NB, the search proceeds in steps of 4
 Reference values for not-yet-optimized dimensions
  Reference values for KU during the NB search are 1 and NB
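The one-dimension-at-a-time strategy can be sketched generically. This is an illustrative skeleton, not the ATLAS code: `eval` stands in for "compile, execute, measure MFLOPS", and `sample_objective` is a hypothetical objective added only so the sketch can be exercised.

```c
/* Orthogonal search sketch: optimize each parameter dimension in turn,
 * holding the remaining dimensions at their current (reference) values.
 * Not guaranteed to find the global optimum. `x` holds the reference
 * values on entry and the chosen parameter values on exit. */
void orthogonal_search(int ndims, const int lo[], const int hi[],
                       const int step[], int x[],
                       double (*eval)(const int *))
{
    for (int d = 0; d < ndims; d++) {         /* one dimension at a time */
        int best = x[d];
        double best_v = -1e300;
        for (int v = lo[d]; v <= hi[d]; v += step[d]) {
            x[d] = v;                         /* vary only dimension d */
            double score = eval(x);
            if (score > best_v) { best_v = score; best = v; }
        }
        x[d] = best;                          /* freeze before next dim */
    }
}

/* Hypothetical objective for demonstration: separable, peak at (3, 5). */
double sample_objective(const int *x)
{
    return -(double)((x[0] - 3) * (x[0] - 3) + (x[1] - 5) * (x[1] - 5));
}
```

Because `sample_objective` is separable, the coordinate-wise search finds its true maximum; on a non-separable objective (like real MFLOPS as a function of NB, MU, NU, ...) it can stop at a suboptimal point, which is exactly the caveat on this slide.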

Modeling for Optimization Parameters
Our modeling engine
Optimization parameters
 NB: hierarchy of models (later)
 MU, NU: maximize the register tile subject to MU*NU + MU + NU ≤ NR
 KU: maximize, subject to the unrolled loop fitting in the L1 instruction cache
 Latency, MulAdd: from hardware parameters
 xFetch: set to 2
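The MU, NU model reduces to a tiny integer optimization. The sketch below enumerates candidates under the register constraint stated on the register-blocking slide (MU*NU + MU + NU ≤ NR); the function name and tie-breaking are assumptions for illustration.

```c
/* Model sketch: choose MU, NU maximizing the micro-tile size MU*NU
 * subject to MU*NU + MU + NU <= NR (registers for the C tile, the
 * MU-column of A, and the NU-row of B). */
void choose_mu_nu(int nr, int *mu_out, int *nu_out)
{
    int best = 0, bmu = 1, bnu = 1;
    for (int mu = 1; mu <= nr; mu++)
        for (int nu = 1; nu <= nr; nu++)
            if (mu * nu + mu + nu <= nr && mu * nu > best) {
                best = mu * nu;
                bmu = mu;
                bnu = nu;
            }
    *mu_out = bmu;
    *nu_out = bnu;
}
```

For NR = 8 this yields the square tile MU = NU = 2 (4 + 2 + 2 = 8); larger register files admit correspondingly larger micro-tiles.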

Modeling for Tile Size (NB)
Models of increasing complexity
 3*NB^2 ≤ C
 Whole working set fits in L1
 NB^2 + NB + 1 ≤ C
 Fully associative, optimal replacement, line size of 1 word
 Refinements for line size > 1 word, and for LRU replacement
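The second model can be evaluated directly: pick the largest NB satisfying the inequality. A minimal sketch, assuming C is the L1 capacity measured in matrix elements (the function name is illustrative):

```c
/* Model sketch: largest tile size NB with NB^2 + NB + 1 <= C,
 * where C is the L1 data cache capacity in matrix elements
 * (e.g. a 32 KB cache holds 4096 doubles). */
int choose_nb(int c)
{
    int nb = 1;
    while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= c)
        nb++;
    return nb;
}
```

For a 32 KB L1 and double-precision data, C = 4096 elements and the model gives NB = 63; the simpler 3*NB^2 ≤ C model would give a much smaller tile (NB = 36), which is one reason the refined models matter.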

Experiments
Architectures:
 SGI R12000, 270 MHz
 Sun UltraSPARC III, 900 MHz
 Intel Pentium III, 550 MHz
Measured:
 Mini-MMM performance
 Complete MMM performance
 Sensitivity to variations in parameters

Installation Time of ATLAS vs. Model

Optimization Parameter Values

           NB    MU/NU/KU    F/I/N-Fetch    Latency
ATLAS
  SGI:     64    4/4/64      0/5/1          3
  Sun:     48    5/3/48      0/3/5          5
  Intel:   40    2/1/40      0/3/1          4
Model
  SGI:     62    4/4/62      1/2/2          6
  Sun:     88    4/4/78      1/2/2          4
  Intel:   42    2/1/42      1/2/2          3

MiniMMM Performance
SGI
 ATLAS: 457 MFLOPS
 Model: 453 MFLOPS
 Difference: 1%
Sun
 ATLAS: 1287 MFLOPS
 Model: 1052 MFLOPS
 Difference: 20%
Intel
 ATLAS: 394 MFLOPS
 Model: 384 MFLOPS
 Difference: 2%

MMM Performance
[Figure: complete MMM performance graphs for SGI, Sun, and Intel]

Sensitivity to NB and Latency on Sun
[Figure: performance sensitivity to Tile Size (NB) and to Latency on Sun; MU & NU, KU, and xFetch sensitivity shown for all architectures]

Sensitivity to NB on SGI
[Figure: performance vs. NB, with the model predictions 3*NB^2 ≤ C and NB^2 + NB + 1 ≤ C marked]

Conclusions
Compilers that
 implement well-known transformations
 use models to choose parameter values
can achieve performance close to ATLAS
There is room for improvement in both models and empirical search
 Both are 20-25% slower than BLAS
 Higher levels of the memory hierarchy cannot be neglected

Future Work
Do similar experiments with FFTW
 FFTW uses search to choose between algorithms, not to find values for parameters
Use models to speed up empirical search in ATLAS
 Effectively combining models with search
Understand what is missing from the models and refine them
 Study hand-written BLAS codes to understand the gap
Feed these insights back into compilers
 Better models
 Some search if needed
How do we make it easier for compiler writers to implement such transformations?