Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign

Libraries and Productivity Libraries help productivity. But not always. –Not all algorithms are implemented. –Not all data structures. In any case, much effort goes into highly-tuned libraries. Automatic generation of libraries would –Reduce the cost of developing libraries –For a fixed cost, enable a wider range of implementations and thus make libraries more usable.

An illustration, based on MATLAB, of the effect of libraries on performance

Compilers vs. Libraries in Sorting [chart; roughly a 2X performance gap]

Compilers versus libraries in DFT

Compilers vs. Libraries in Matrix-Matrix Multiplication (MMM)

Library Generators Automatically generate highly efficient libraries for a class of platforms. No need to manually tune the library to the architectural characteristics of a new machine.

Library Generators (Cont.) Examples: –In linear algebra: ATLAS, PhiPAC –In signal processing: FFTW, SPIRAL Library generators usually handle a fixed set of algorithms. Exception: SPIRAL accepts formulas and rewriting rules as input.

Library Generators (Cont.) At installation time, LGs apply empirical optimization. –That is, they search for the best version among a set of different implementations. –The number of versions is astronomical, so heuristics are needed.

Library Generators (Cont.) LGs must output C code for portability. The uneven quality of compilers => –Need for source-to-source optimizers –Or incorporate into the search space the variations introduced by optimizing compilers.

Library Generators (Cont.) [Pipeline diagram:] Algorithm description → Generator → C function → Source-to-source optimizer → Final C function → Native compiler → Object code → Execution, whose measured performance feeds back into the generator.

Important research issues Reduction of the search space with minimal impact on performance. Adaptation to the input data (not needed for dense linear algebra). More flexible generators: –algorithms –data structures –classes of target machines. Tools to build library generators.

Library generators and compilers LGs are a good yardstick for compilers. Library generators use compilers. Compilers could use library-generator techniques to optimize libraries in context. Search strategies could help design better compilers. –Optimization strategy: the most important open problem in compilers.

Organization of a library generation system [Diagram:] A high-level specification in a domain-specific language (DSL) enters a program generator: a signal-processing formula with its parameterization, a sorting parameterization, or a linear-algebra algorithm in functional-language notation with its parameterization. The generator emits X code with search directives; a backend compiler produces an executable; runs of the executable feed a selection strategy that performs reflective optimization.

Three library generation projects 1. Spiral and the impact of compilers 2. ATLAS and an analytical model 3. Sorting and adapting to the input

Spiral: A code generator for digital signal processing transforms Joint work with: José Moura (CMU), Markus Püschel (CMU), Manuela Veloso (CMU), Jeremy Johnson (Drexel)

SPIRAL The approach: –Mathematical formulation of signal processing algorithms –Automatically generate algorithm versions –A generalization of the well-known FFTW –Use compiler techniques to translate formulas into implementations –Adapt to the target platform by searching for the optimal version

Fast DSP Algorithms As Matrix Factorizations Computing y = F_4 x is carried out as: t_1 = A_4 x (permutation), t_2 = A_3 t_1 (two F_2's), t_3 = A_2 t_2 (diagonal scaling), y = A_1 t_3 (two F_2's). The cost is reduced because A_1, A_2, A_3, and A_4 are structured sparse matrices.
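To make the factored evaluation concrete, here is a minimal C sketch of y = F_4 x computed stage by stage. This is a hand-written illustration, not SPIRAL output: the concrete stages are the standard radix-2 factors, following the pattern t_1 = A_4 x through y = A_1 t_3, with complex values kept as separate real/imaginary arrays.

    /* y = F_4 x in four sparse stages (standard radix-2 factors),
     * computed in place: permutation, two F_2's, twiddle scaling,
     * two F_2's at stride 2. */
    static void fft4(double re[4], double im[4])
    {
        double tr, ti;

        /* Stage A4: bit-reversal permutation (swap elements 1 and 2). */
        tr = re[1]; re[1] = re[2]; re[2] = tr;
        ti = im[1]; im[1] = im[2]; im[2] = ti;

        /* Stage A3: two F_2 butterflies on adjacent pairs. */
        tr = re[0]; ti = im[0];
        re[0] = tr + re[1]; im[0] = ti + im[1];
        re[1] = tr - re[1]; im[1] = ti - im[1];
        tr = re[2]; ti = im[2];
        re[2] = tr + re[3]; im[2] = ti + im[3];
        re[3] = tr - re[3]; im[3] = ti - im[3];

        /* Stage A2: diagonal twiddle scaling; only w = -i is nontrivial. */
        tr = re[3];
        re[3] = im[3];
        im[3] = -tr;

        /* Stage A1: two F_2 butterflies at stride 2. */
        tr = re[0]; ti = im[0];
        re[0] = tr + re[2]; im[0] = ti + im[2];
        re[2] = tr - re[2]; im[2] = ti - im[2];
        tr = re[1]; ti = im[1];
        re[1] = tr + re[3]; im[1] = ti + im[3];
        re[3] = tr - re[3]; im[3] = ti - im[3];
    }

Each stage touches every element a constant number of times, so the factored form costs O(n log n) operations instead of the O(n^2) of a dense matrix-vector product.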

General Tensor Product Formulation Theorem (Cooley-Tukey in tensor form): F_mn = (F_m ⊗ I_n) T^mn_n (I_m ⊗ F_n) L^mn_m, where T^mn_n is a diagonal (twiddle) matrix and L^mn_m is a stride permutation. Example: F_4 = (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2.

Factorization Trees [Diagram: three factorization trees for F_8, applying rules R_1 and R_2 to expand F_8 and F_4 down to F_2 leaves.] Different computation order. Different data access pattern. Different performance.

The SPIRAL System [Diagram:] A DSP transform enters the Formula Generator, which emits an SPL program; the SPL Compiler translates it into C/Fortran programs; Performance Evaluation runs them on the target machine; a Search Engine uses the measurements to steer the Formula Generator, and the best version goes into the DSP library.

SPL Compiler [Phases:] Parsing turns the SPL formula and template definitions into an abstract syntax tree; intermediate code generation, driven by a template table, produces I-code; intermediate code restructuring and optimization rewrite the I-code; target code generation emits Fortran or C.

Optimizations, by stage: Formula Generator: high-level scheduling, loop transformation. SPL Compiler: high-level optimizations (constant folding, copy propagation, common subexpression elimination, dead code elimination). C/Fortran Compiler: low-level optimizations (instruction scheduling, register allocation).

Basic Optimizations (FFT, N=2^5, SPARC, f77 –fast –O5)

Basic Optimizations (FFT, N=2^5, MIPS, f77 –O3)

Basic Optimizations (FFT, N=2^5, PII, g77 –O6 –malign-double)

Overall performance

An analytical model for ATLAS Joint work with Keshav Pingali (Cornell), Gerald DeJong, and María Jesús Garzarán

ATLAS ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee. ATLAS uses empirical search to automatically generate highly-tuned Basic Linear Algebra Subprograms (BLAS) libraries. –It uses search to adapt to the target machine.

ATLAS Infrastructure [Diagram:] Detected hardware parameters (NR, MulAdd, Latency, L1Size) feed the ATLAS Search Engine (MMSearch), which proposes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase); the generated MiniMMM source is compiled, executed, and measured, and the MFLOPS result is fed back to the search engine.

Detecting Machine Parameters Micro-benchmarks: –L1Size: L1 data cache size. Similar to the Hennessy-Patterson book. –NR: number of registers. Use several FP temporaries repeatedly. –MulAdd: fused multiply-add (FMA). “c+=a*b” as opposed to “c+=t; t=a*b”. –Latency: latency of FP multiplication. Needed for scheduling multiplies and adds in the absence of FMA.
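A minimal sketch of the micro-benchmark idea for L1Size (illustrative, not ATLAS's actual probe): time a fixed number of strided sweeps over buffers of growing size; the footprint at which the time per access jumps approximates the L1 data cache size.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Seconds per access when sweeping a buffer of `bytes` bytes at a
     * stride of one 64-byte line. The time jumps once the buffer no
     * longer fits in the L1 data cache. */
    static double time_per_access(size_t bytes)
    {
        size_t n = bytes / sizeof(double);
        size_t accesses = (n + 7) / 8;
        double *buf = malloc(bytes);
        volatile double sink = 0.0;
        clock_t t0, t1;

        for (size_t i = 0; i < n; i++) buf[i] = (double)i;

        t0 = clock();
        for (int rep = 0; rep < 1000; rep++)
            for (size_t i = 0; i < n; i += 8)   /* one access per cache line */
                sink += buf[i];
        t1 = clock();

        free(buf);
        (void)sink;
        return (double)(t1 - t0) / CLOCKS_PER_SEC / (1000.0 * accesses);
    }

    int main(void)
    {
        for (size_t kb = 4; kb <= 256; kb *= 2)
            printf("%3zu KB: %g s/access\n", kb, time_per_access(kb * 1024));
        return 0;
    }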

Compiler View ATLAS Code Generation Focus on MMM (as part of BLAS-3) –Very good reuse: O(N^2) data, O(N^3) computation –No “real” dependencies (only input / reuse ones). [Diagram: the ATLAS infrastructure again, with the MM Code Generator (MMCase) highlighted.]

Adaptations/Optimizations Cache-level blocking (tiling) –ATLAS blocks only for the L1 cache. Register-level blocking –Highest level of the memory hierarchy –Important to hold array values in registers. Software pipelining –Unroll and schedule operations. Versioning –Dynamically decide which way to compute.

Cache-level blocking (tiling) Tiling in ATLAS: –Only square tiles (NBxNBxNB) –The working set of a tile fits in L1 –Tiles are usually copied to contiguous storage –Special “clean-up” code is generated for boundaries. Mini-MMM:

    for (int j = 0; j < NB; j++)
      for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
          C[i][j] += A[i][k] * B[k][j]

NB: optimization parameter
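For reference, a sketch of the cache-level structure around the mini-MMM (illustrative, not ATLAS-generated code: ATLAS copies tiles to contiguous buffers and generates specialized clean-up versions, while this sketch simply clips boundary tiles with MIN):

    /* C = C + A*B for N x N row-major matrices, blocked into NB x NB tiles. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void mmm_tiled(int N, int NB, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < N; jj += NB)
            for (int ii = 0; ii < N; ii += NB)
                for (int kk = 0; kk < N; kk += NB)
                    /* mini-MMM on tile (ii,jj,kk), clipped at the edges */
                    for (int j = jj; j < MIN(jj + NB, N); j++)
                        for (int i = ii; i < MIN(ii + NB, N); i++)
                            for (int k = kk; k < MIN(kk + NB, N); k++)
                                C[i*N + j] += A[i*N + k] * B[k*N + j];
    }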

Register-level blocking Micro-MMM: –MUx1 elements of A –1xNU elements of B –MUxNU sub-matrix of C –MU*NU + MU + NU ≤ NR. Mini-MMM revised:

    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++)
          load A[i..i+MU-1, k] into registers
          load B[k, j..j+NU-1] into registers
          multiply A's and B's and add to C's
        store C[i..i+MU-1, j..j+NU-1]

Unroll the k loop KU times. MU, NU, KU: optimization parameters
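As a concrete illustration (hand-written, not generated code; assumes NB divisible by 2 and row-major storage), the mini-MMM with MU = NU = 2, where the 2x2 block of C lives in scalar temporaries the compiler can keep in registers across the whole k loop:

    /* Mini-MMM with register blocking MU = NU = 2. */
    void mini_mmm_2x2(int NB, const double *A, const double *B, double *C)
    {
        for (int j = 0; j < NB; j += 2)
            for (int i = 0; i < NB; i += 2) {
                double c00 = C[i*NB + j],     c01 = C[i*NB + j + 1];
                double c10 = C[(i+1)*NB + j], c11 = C[(i+1)*NB + j + 1];
                for (int k = 0; k < NB; k++) {
                    double a0 = A[i*NB + k], a1 = A[(i+1)*NB + k]; /* MUx1 of A */
                    double b0 = B[k*NB + j], b1 = B[k*NB + j + 1]; /* 1xNU of B */
                    c00 += a0 * b0; c01 += a0 * b1;
                    c10 += a1 * b0; c11 += a1 * b1;
                }
                C[i*NB + j] = c00;     C[i*NB + j + 1] = c01;
                C[(i+1)*NB + j] = c10; C[(i+1)*NB + j + 1] = c11;
            }
    }

Here MU*NU + MU + NU = 8 values are live in the inner loop, which must not exceed NR.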

Scheduling Is FMA present? Schedule computation using Latency. Schedule memory operations using FFetch, IFetch, NFetch. Mini-MMM revised:

    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k += KU)
          load A[i..i+MU-1, k] into registers
          load B[k, j..j+NU-1] into registers
          multiply A's and B's and add to C's
          ... (repeated KU times) ...
          load A[i..i+MU-1, k+KU-1] into registers
          load B[k+KU-1, j..j+NU-1] into registers
          multiply A's and B's and add to C's
        store C[i..i+MU-1, j..j+NU-1]

Latency, xFetch: optimization parameters. [Diagram: the multiplies M_1 .. M_MU*NU are interleaved with the adds A_1 .. A_MU*NU at a distance of Latency (= 2 in the example), and the loads L_1 .. L_MU+NU are issued in batches: IFetch loads first, then NFetch loads.]

Searching for Optimization Parameters ATLAS Search Engine Multi-dimensional search problem: –Optimization parameters are the independent variables –MFLOPS is the dependent variable –The function is implicit but can be repeatedly evaluated. [Diagram: the ATLAS infrastructure again, with the Search Engine (MMSearch) highlighted.]

Search Strategy Orthogonal Range Search –Optimize along one dimension at a time, using reference values for not-yet-optimized parameters –Not guaranteed to find the optimal point. Inputs: –The order in which dimensions are optimized: NB, MU & NU, KU, xFetch, Latency –The interval in which the search is done in each dimension (for NB, in steps of 4) –Reference values for not-yet-optimized dimensions (the reference values for KU during the NB search are 1 and NB).
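A sketch of this search as code (a hypothetical driver: measure_mflops stands in for ATLAS's generate-compile-execute-measure cycle and is not a real ATLAS entry point):

    #define NPARAMS 4   /* e.g. NB, MU/NU, KU, Latency -- illustrative only */

    /* Benchmark one parameter setting: generate the mini-MMM, compile it,
     * run it, and return MFLOPS. Hypothetical hook. */
    extern double measure_mflops(const int p[NPARAMS]);

    /* Optimize one dimension at a time ("orthogonal" search). lo/hi/step
     * give the interval searched in each dimension; params starts out
     * holding the reference values for not-yet-optimized dimensions. */
    void orthogonal_search(int params[NPARAMS],
                           const int lo[NPARAMS], const int hi[NPARAMS],
                           const int step[NPARAMS])
    {
        for (int d = 0; d < NPARAMS; d++) {
            int best = params[d];
            double best_mflops = -1.0;
            for (int v = lo[d]; v <= hi[d]; v += step[d]) {
                params[d] = v;
                double mflops = measure_mflops(params);
                if (mflops > best_mflops) { best_mflops = mflops; best = v; }
            }
            params[d] = best;   /* freeze this dimension at its best value */
        }
    }

Because each dimension is frozen in turn, the cost is linear in the number of dimensions, but the result can miss optima that require changing two parameters at once.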

Modeling for Optimization Parameters Our Modeling Engine Optimization parameters: –NB: hierarchy of models (later) –MU, NU: chosen subject to the register constraint of the micro-MMM (MU*NU + MU + NU ≤ NR) –KU: maximized subject to the L1 instruction cache –Latency, MulAdd: from hardware parameters –xFetch: set to 2. [Diagram: the ATLAS infrastructure with the search engine replaced by the model, which computes NB, MU, NU, KU, xFetch, MulAdd, and Latency directly from the hardware parameters, now including the L1 instruction cache size.]

Modeling for Tile Size (NB) Models of increasing complexity: –3*NB^2 ≤ C: the whole working set fits in L1 –NB^2 + NB + 1 ≤ C: fully associative cache, optimal replacement, line size of 1 word –Variants of the above for line size > 1 word and for LRU replacement.
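For illustration, assume a hypothetical 32 KB L1 data cache holding C = 4096 doubles. The first model gives 3*NB^2 ≤ 4096, i.e. NB ≤ 36; the second gives NB^2 + NB + 1 ≤ 4096, i.e. NB ≤ 63 (since 63^2 + 63 + 1 = 4033 ≤ 4096 but 64^2 + 64 + 1 = 4161 > 4096). The refinements for line size and replacement policy shift the predicted NB in a similar way.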

Experiments Architectures: –SGI R12000, 270 MHz –Sun UltraSPARC III, 900 MHz –Intel Pentium III, 550 MHz. Measured: –Mini-MMM performance –Complete MMM performance –Sensitivity to variations of the parameters

MiniMMM Performance SGI: –ATLAS: 457 MFLOPS –Model: 453 MFLOPS –Difference: 1%. Sun: –ATLAS: 1287 MFLOPS –Model: 1052 MFLOPS –Difference: 20%. Intel: –ATLAS: 394 MFLOPS –Model: 384 MFLOPS –Difference: 2%.

MMM Performance [Charts: complete MMM performance on SGI, Sun, and Intel, comparing BLAS, compiler-optimized code, ATLAS, and the model.]

Sensitivity to NB and Latency on Sun [Charts: performance as a function of tile size (NB) and of Latency on Sun, with MU & NU, KU, and xFetch fixed; similar plots were examined for all architectures.]

Sensitivity to NB on SGI [Chart: performance as a function of NB, with the model bounds 3*NB^2 ≤ C and NB^2 + NB + 1 ≤ C marked.]

Sorting Joint work with María Jesús Garzarán and Xiaoming Li

ESSL on Power3

ESSL on Power4

Motivation No universally best sorting algorithm. Can we automatically GENERATE and tune sorting algorithms for each platform? The performance of sorting depends not only on the platform but also on the input characteristics.

A first strategy: Algorithm Selection Select the best algorithm from Quicksort, Multiway Merge Sort, and CC-radix. Relevant input characteristics: number of keys, entropy vector.
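A sketch of how the entropy vector can be computed (illustrative, assuming 32-bit keys examined in 8-bit digits): for each digit position, accumulate a histogram and evaluate -sum_v P(v) log2 P(v).

    #include <math.h>

    /* Entropy of each 8-bit digit position of 32-bit keys.
     * entropy[d] is high when digit d is uniformly distributed and near
     * zero when it is nearly constant -- the input feature used to decide
     * whether a radix-based method will pay off. */
    void entropy_vector(const unsigned *keys, int n, double entropy[4])
    {
        for (int d = 0; d < 4; d++) {
            int hist[256] = {0};
            for (int i = 0; i < n; i++)
                hist[(keys[i] >> (8 * d)) & 0xFF]++;

            entropy[d] = 0.0;
            for (int v = 0; v < 256; v++) {
                if (hist[v] > 0) {
                    double p = (double)hist[v] / n;
                    entropy[d] -= p * log2(p);
                }
            }
        }
    }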

Algorithm Selection

A Better Solution We can use different algorithms for different partitions. Build composite sorting algorithms: –Identify primitives from the sorting algorithms –Design a general method to select an appropriate sorting primitive at runtime –Design a mechanism to combine the primitives and the selection methods to generate the composite sorting algorithm

Sorting Primitives Divide-by-Value (DV) –A step in Quicksort –Select one or multiple pivots and sort the input array around these pivots –Parameter: number of pivots. Divide-by-Position (DP) –Divide the input into same-size sub-partitions –Use a heap to merge the multiple sorted sub-partitions –Parameters: size of sub-partitions, fan-out and size of the heap
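A sketch of the simplest divide-by-value instance (one pivot, essentially Quicksort's partition step; multi-pivot variants generalize it):

    /* Divide-by-value with a single pivot: permute keys[0..n-1] so that
     * keys < pivot precede keys >= pivot, and return the split point.
     * The composite sorter then recurses, possibly with different
     * primitives, on each side. */
    int divide_by_value(int *keys, int n, int pivot)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            while (lo <= hi && keys[lo] < pivot) lo++;
            while (lo <= hi && keys[hi] >= pivot) hi--;
            if (lo < hi) {
                int t = keys[lo]; keys[lo] = keys[hi]; keys[hi] = t;
            }
        }
        return lo;   /* keys[0..lo-1] < pivot <= keys[lo..n-1] */
    }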

Sorting Primitives (Cont.) Divide-by-Radix (DR) –Non-comparison-based sorting algorithm –Parameter: radix (r bits)
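A sketch of one divide-by-radix step as a counting-sort pass (illustrative; CC-radix applies such passes recursively, switching strategy once partitions fit in cache):

    #include <stdlib.h>
    #include <string.h>

    /* One divide-by-radix step: scatter keys into 2^r buckets keyed on r
     * bits at bit offset `shift`. Afterwards bucket b occupies
     * keys[start[b] .. start[b+1]-1] and can be handed to a different
     * primitive by the composite sorter. `start` must have 2^r+1 slots. */
    void divide_by_radix(unsigned *keys, int n, int r, int shift, int *start)
    {
        int nbuckets = 1 << r;
        unsigned *tmp = malloc(n * sizeof *tmp);
        int *count = calloc(nbuckets + 1, sizeof *count);

        for (int i = 0; i < n; i++)
            count[((keys[i] >> shift) & (nbuckets - 1)) + 1]++;
        for (int b = 0; b < nbuckets; b++)      /* prefix sums -> offsets */
            count[b + 1] += count[b];
        memcpy(start, count, (nbuckets + 1) * sizeof *count);

        for (int i = 0; i < n; i++)             /* stable scatter pass */
            tmp[count[(keys[i] >> shift) & (nbuckets - 1)]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);

        free(tmp);
        free(count);
    }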

Selection Primitives Branch-by-Size Branch-by-Entropy –Parameter: number of branches, threshold vector of the branches

Leaf Primitives When the size of a partition is small, we stick to one algorithm to sort the partition fully. Two methods are used in the cleanup operation –Quicksort –CC-Radix

Composite Sorting Algorithms Composite sorting algorithms are built with these primitives. Algorithms are represented as trees.
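One plausible encoding of such a tree (hypothetical; the representation actually used in the project may differ in detail):

    /* A composite sorting algorithm as a tree of primitives (illustrative).
     * Inner nodes are divide or selection primitives; leaves finish the
     * partition with Quicksort or CC-radix. */
    enum prim { DIVIDE_BY_VALUE, DIVIDE_BY_POSITION, DIVIDE_BY_RADIX,
                BRANCH_BY_SIZE, BRANCH_BY_ENTROPY,
                LEAF_QUICKSORT, LEAF_CCRADIX };

    struct node {
        enum prim     kind;
        int           param;        /* pivots, radix bits, fan-out, ... */
        double       *thresholds;   /* for selection primitives */
        int           nchildren;
        struct node **children;     /* sub-algorithms per partition/branch */
    };

Executing the sorter is a walk of this tree: divide nodes split the data and recurse on each partition, selection nodes pick one child at runtime from the input characteristics, and leaves sort their partition fully.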

Performance of Classifier Sorting Power3

Power4