Empirical Search and Library Generators


Libraries and Productivity Building libraries is one of the earliest strategies for improving productivity. Libraries are particularly important for performance: high performance is difficult to attain and is not portable.

Compilers vs. Libraries in Sorting [chart: library routines outperform compiler-generated code by roughly 2X]

Compilers vs. Libraries in DFT

Compilers vs. Libraries in Matrix-Matrix Multiplication (MMM)

Libraries and Productivity Libraries are not used as often as is believed: not all algorithms are implemented, nor all data structures. In any case, much effort goes into highly tuned libraries. Automatic generation of libraries would reduce implementation cost and, for a fixed cost, enable a wider range of implementations, thus making libraries more usable.

Library Generators Automatically generate highly efficient libraries for a class of platforms. There is no need to manually tune the library to the architectural characteristics of a new machine.

Library Generators (Cont.) Examples: in linear algebra, ATLAS and PhiPAC; in signal processing, FFTW and SPIRAL. Library generators usually handle a fixed set of algorithms. Exception: SPIRAL accepts formulas and rewriting rules as input.

Library Generators (Cont.) At installation time, LGs conduct an empirical search; that is, they search for the best version among a set of different implementations. The number of versions is astronomical, so heuristics are needed.
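
To make the search concrete, here is a minimal sketch of the timing-driven driver at the heart of any empirical search. This is illustrative C, not code from any actual generator; candidate_t, time_candidate, and pick_best are hypothetical names, and real systems prune the candidate set with heuristics instead of timing everything.

    #include <time.h>

    /* Hypothetical: each candidate is one generated version of a kernel. */
    typedef void (*candidate_t)(double *dst, const double *src, int n);

    static double time_candidate(candidate_t f, double *dst,
                                 const double *src, int n) {
        clock_t start = clock();
        for (int rep = 0; rep < 100; rep++)  /* repeat for a stable timing */
            f(dst, src, n);
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    /* Empirical search: run every version, keep the fastest. */
    static int pick_best(candidate_t *cands, int num,
                         double *dst, const double *src, int n) {
        int best = 0;
        double best_t = time_candidate(cands[0], dst, src, n);
        for (int i = 1; i < num; i++) {
            double t = time_candidate(cands[i], dst, src, n);
            if (t < best_t) { best_t = t; best = i; }
        }
        return best;  /* index of the winning implementation */
    }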

Library Generators (Cont.) LGs must output C code for portability. The uneven quality of compilers creates a need for source-to-source optimizers, or for incorporating into the search space the variations that optimizing compilers would introduce.

Library Generators (Cont.) [flow diagram] Algorithm description → Generator (driven by the search strategy) → C function → Source-to-source optimizer → Final C function → Native compiler → Object code → Execution, with the measured performance fed back to the search strategy.

Important research issues Reduction of the search space with minimal impact on performance. Adaptation to the input data (not needed for dense linear algebra). More flexible generators: algorithms, data structures, classes of target machines. Tools to build library generators.

Library generators and compilers Code generated by LGs is useful as an absolute measuring stick for compilers. Library generators use compilers. Compilers could use library-generator techniques to optimize libraries in context. Search strategies could help design better compilers; the optimization strategy is the most important open problem in compilers.

Organization of a library generation system [diagram] A high-level specification, written in a domain-specific language (DSL), can be a linear algebra algorithm in functional-language notation, a signal processing formula, or a sorting specification. A parameterization stage (parameterization for linear algebra, parameterization for signal processing, or a program generator for sorting) turns it into X code with search directives. Reflective optimization, guided by a selection strategy, and a backend compiler then produce the executable, which is run.

Three library generation projects Spiral and the impact of compilers; ATLAS and an analytical model; sorting and adapting to the input.

Spiral: A code generator for digital signal processing transforms Joint work with: Jose Moura (CMU), Markus Pueschel (CMU), Manuela Veloso (CMU), Jeremy Johnson (Drexel)

SPIRAL The approach: mathematical formulation of signal processing algorithms; automatic generation of multiple formulas using rewriting rules. This is more flexible than the well-known FFTW.

Fast DSP Algorithms as Matrix Factorizations Computing y = F4 x is carried out as: t1 = A4 x (permutation), t2 = A3 t1 (two F2's), t3 = A2 t2 (diagonal scaling), y = A1 t3 (two F2's). The cost is reduced because A1, A2, A3, and A4 are structured sparse matrices.
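
In matrix notation, these four stages are the standard Cooley-Tukey factorization of F4, shown here for orientation (Spiral derives such factorizations automatically):

$$F_4 = A_1 A_2 A_3 A_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2, \qquad F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

Here $A_4 = L^4_2$ is the permutation, $A_3 = I_2 \otimes F_2$ applies two $F_2$'s, $A_2 = T^4_2$ is the diagonal scaling, and $A_1 = F_2 \otimes I_2$ applies two more $F_2$'s.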

General Tensor Product Formulation $$F_{mn} = (F_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes F_n)\, L^{mn}_m$$ where $T^{mn}_n$ is a diagonal matrix (the twiddle factors) and $L^{mn}_m$ is a stride permutation.
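
Applying the rule recursively generates the factorization trees of the next slide; for instance, one standard breakdown of $F_8$ (an illustration, not Spiral output):

$$F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes F_4)\, L^8_2, \qquad F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2.$$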

Factorization Trees [diagram: several breakdown trees for F8, e.g., F8 expanded by rule R1 into F4 and F2 (with F4 expanded in turn) versus F8 expanded by rule R2]. Different trees mean different computation orders, different data access patterns, and different performance.

The SPIRAL System [diagram] A DSP transform enters the Formula Generator, which emits an SPL program; the SPL Compiler translates it into C/FORTRAN programs; performance evaluation on the target machine feeds results back to the Search Engine; and the best implementations populate the DSP library.

SPL Compiler [diagram] Parsing turns the SPL formula and the template definitions into an abstract syntax tree and a template table; intermediate code generation produces I-code; intermediate code restructuring and I-code optimization follow; target code generation emits FORTRAN or C.

Optimizations, by stage. Formula Generator: high-level scheduling, loop transformations. SPL Compiler: high-level optimizations (constant folding, copy propagation, common subexpression elimination, dead code elimination). C/Fortran Compiler: low-level optimizations (instruction scheduling, register allocation).

Basic Optimizations (FFT, N=2^5, SPARC, f77 –fast –O5) [chart]

Basic Optimizations (FFT, N=2^5, MIPS, f77 –O3) [chart]

Basic Optimizations (FFT, N=2^5, PII, g77 –O6 –malign-double) [chart]

Overall performance

SPIRAL Relies on the divide-and-conquer nature of the algorithms it implements. Compiler techniques are of great importance. There is still room for improvement.

An analytical model for ATLAS Joint work with Keshav Pingali (Cornell) and Gerald DeJong (UIUC)

ATLAS ATLAS (Automatically Tuned Linear Algebra Software) was developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee. ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS) libraries; the search adapts the code to the target machine.

ATLAS Infrastructure [diagram] Detect Hardware Parameters (NR, MulAdd, Latency, L1Size) feeds the ATLAS Search Engine (MMSearch), which passes the parameters NB, MU, NU, KU, and xFetch to the ATLAS MM Code Generator (MMCase); the generated MiniMMM source is compiled, executed, and measured, and the MFLOPS result is fed back to the search engine. The goal of this work is to validate that the code ATLAS produces can be viewed as the result of performing well-known compiler transformations on the vanilla version of matrix multiply.

Detecting Machine Parameters Micro-benchmarks, similar to those in the Hennessy-Patterson book. L1Size: the L1 data cache size. NR: the number of registers, detected by using several FP temporaries repeatedly. MulAdd: whether a fused multiply-add (FMA) is available, i.e., whether "c += a*b" executes as one operation as opposed to "t = a*b; c += t". Latency: the latency of FP multiplication, needed for scheduling multiplies and adds in the absence of FMA. Here we say something only about L1Size and NR; the rest is fully explained in the paper.
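
As an illustration of the style of micro-benchmark (in the spirit of the Hennessy-Patterson measurements mentioned above), a sketch for estimating L1Size; the 64-byte line size, the iteration counts, and the use of clock() are assumptions:

    #include <stdio.h>
    #include <time.h>

    /* Walk a buffer of `size` bytes repeatedly; once `size` exceeds the
       L1 data cache, the time per access jumps noticeably. */
    static double secs_per_access(volatile char *buf, long size, long accesses) {
        const long stride = 64;              /* assume a 64-byte cache line */
        long lines = size / stride;
        clock_t start = clock();
        for (long i = 0; i < accesses; i++)
            buf[(i % lines) * stride]++;     /* cycle through the buffer */
        return (double)(clock() - start) / CLOCKS_PER_SEC / (double)accesses;
    }

    int main(void) {
        static char buf[8 * 1024 * 1024];
        for (long size = 4 * 1024; size <= (long)sizeof buf; size *= 2)
            printf("%8ld bytes: %.2e s/access\n", size,
                   secs_per_access(buf, size, 10000000L));
        return 0;   /* the first size showing a jump approximates L1Size */
    }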

Compiler View: ATLAS Code Generation Focus on MMM (as part of BLAS-3): very good reuse (O(N^2) data, O(N^3) computation) leaves much room for transformations.

Adaptations/Optimizations Cache-level blocking (tiling): ATLAS blocks only for the L1 cache. Register-level blocking: the register file is the highest level of the memory hierarchy, and it is important to hold array values in registers. Software pipelining: unroll and schedule operations. ATLAS code can be viewed as the result of applying these well-known compiler transformations to the vanilla version of MMM.

Cache-level blocking (tiling) Tiling in ATLAS: only square tiles (NBxNBxNB); the working set of a tile fits in L1; tiles are usually copied to contiguous storage; special "clean-up" code is generated for the boundaries. Mini-MMM:

    for (int j = 0; j < NB; j++)
        for (int i = 0; i < NB; i++)
            for (int k = 0; k < NB; k++)
                C[i][j] += A[i][k] * B[k][j];

NB is an optimization parameter: ATLAS uses square tiles only, sized so that the working set of the loop nest fits in the L1 data cache.
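
For context, the mini-MMM sits inside the full tiled MMM roughly as follows. This is a sketch only: ATLAS additionally copies tiles to contiguous storage and generates the clean-up code for N not divisible by NB, both omitted here.

    /* Tiled MMM (C99): the three outer loops walk NBxNB tiles; the three
       inner loops are the mini-MMM on one tile. */
    void mmm_tiled(int N, int NB, double C[N][N],
                   double A[N][N], double B[N][N]) {
        for (int jj = 0; jj < N; jj += NB)
            for (int ii = 0; ii < N; ii += NB)
                for (int kk = 0; kk < N; kk += NB)
                    /* mini-MMM on the (ii, jj, kk) tile */
                    for (int j = jj; j < jj + NB; j++)
                        for (int i = ii; i < ii + NB; i++)
                            for (int k = kk; k < kk + NB; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }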

Register-level blocking Micro-MMM: multiply an MUx1 strip of A by a 1xNU strip of B into an MUxNU sub-matrix of C, which requires MU*NU + MU + NU ≤ NR registers. Mini-MMM revised:

    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            load C[i..i+MU-1, j..j+NU-1] into registers
            for (int k = 0; k < NB; k++) {
                load A[i..i+MU-1, k] into registers
                load B[k, j..j+NU-1] into registers
                multiply A's and B's and add to C's
            }
            store C[i..i+MU-1, j..j+NU-1]
        }

MU and NU are the unroll factors of the i and j loops, and the array elements involved are assigned to distinct floating-point registers (MU*NU + MU + NU in total). The k loop is additionally unrolled by a factor of KU, which needs no extra registers. MU, NU, and KU are optimization parameters.
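
For concreteness, here is the micro-MMM fully unrolled for the hypothetical choice MU = NU = 2, written so that the compiler can keep the scalar temporaries in registers:

    /* Mini-MMM with register blocking for MU = NU = 2 (C99). The 2x2 tile
       of C stays in temporaries across the whole k loop:
       MU*NU + MU + NU = 8 registers. */
    void mini_mmm_2x2(int NB, double C[NB][NB],
                      double A[NB][NB], double B[NB][NB]) {
        for (int j = 0; j < NB; j += 2)
            for (int i = 0; i < NB; i += 2) {
                double c00 = C[i][j],   c01 = C[i][j+1];
                double c10 = C[i+1][j], c11 = C[i+1][j+1];
                for (int k = 0; k < NB; k++) {
                    double a0 = A[i][k], a1 = A[i+1][k];  /* MUx1 of A */
                    double b0 = B[k][j], b1 = B[k][j+1];  /* 1xNU of B */
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i][j]   = c00;  C[i][j+1]   = c01;
                C[i+1][j] = c10;  C[i+1][j+1] = c11;
            }
    }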

Scheduling If an FMA instruction is present, multiplies and adds are fused; otherwise the computation is scheduled using Latency, and the memory operations are scheduled using FFetch, IFetch, and NFetch. Mini-MMM revised:

    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            load C[i..i+MU-1, j..j+NU-1] into registers
            for (int k = 0; k < NB; k += KU) {
                load A[i..i+MU-1, k] into registers
                load B[k, j..j+NU-1] into registers
                multiply A's and B's and add to C's
                ...
                load A[i..i+MU-1, k+KU-1] into registers
                load B[k+KU-1, j..j+NU-1] into registers
                multiply A's and B's and add to C's
            }
            store C[i..i+MU-1, j..j+NU-1]
        }

Latency and xFetch are optimization parameters. [Diagram: the MU*NU multiplies M1, M2, ... and the MU*NU adds A1, A2, ... are interleaved at a skew of Latency (e.g., 2), with the MU+NU loads L1, ..., L(MU+NU) scheduled among them; the pattern repeats KU times.]
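
To make the diagram concrete, here is the body of the 2x2 micro-kernel's k loop rescheduled for Latency = 2 (a sketch reusing the temporaries a0, a1, b0, b1, c00..c11 from the register-blocking example above; real ATLAS code also interleaves the loads, governed by xFetch):

    /* Multiplies and adds skewed by Latency = 2: each product is consumed
       two issue slots after it is computed, hiding multiplier latency. */
    double m0 = a0 * b0;   /* M1 */
    double m1 = a0 * b1;   /* M2 */
    c00 += m0;             /* A1: two slots after M1 */
    double m2 = a1 * b0;   /* M3 */
    c01 += m1;             /* A2 */
    double m3 = a1 * b1;   /* M4 */
    c10 += m2;             /* A3 */
    c11 += m3;             /* A4 */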

Searching for Optimization Parameters The ATLAS search engine faces a multi-dimensional search problem: the optimization parameters are the independent variables, MFLOPS is the dependent variable, and the function is implicit but can be repeatedly evaluated. In other words, MFLOPS is optimized as an implicit function of all the optimization parameters discussed so far.

Search Strategy Orthogonal range search: optimize along one dimension at a time, using reference values for the not-yet-optimized parameters. It is not guaranteed to find the optimal point. Its inputs are: the order in which the dimensions are optimized (NB; MU and NU; KU; xFetch; Latency); the interval searched in each dimension (for NB, the interval is bounded by the detected L1 cache size and is swept in steps of 4); and the reference values for the not-yet-optimized dimensions (the reference values for KU during the NB search are 1 and NB).
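
A sketch of the strategy in C; measure is a hypothetical helper standing in for "generate, compile, run, and time a mini-MMM", and the interval arrays stand in for the per-dimension ranges above:

    /* Hypothetical helper: builds a mini-MMM with parameter vector p,
       compiles and runs it, and returns the measured MFLOPS. */
    double measure(const int p[]);

    /* Orthogonal search: sweep one parameter at a time over [lo, hi],
       holding every other parameter at its current reference value. */
    void orthogonal_search(int p[], const int lo[], const int hi[],
                           const int step[], int np) {
        for (int d = 0; d < np; d++) {
            int best_v = p[d];
            double best_m = -1.0;
            for (int v = lo[d]; v <= hi[d]; v += step[d]) {
                p[d] = v;
                double m = measure(p);
                if (m > best_m) { best_m = m; best_v = v; }
            }
            p[d] = best_v;   /* freeze this dimension before moving on */
        }
    }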

Modeling for Optimization Parameters Our modeling engine replaces the search engine: it computes the optimization parameters directly from the detected hardware parameters (NR, MulAdd, Latency, L1Size, and the L1 instruction cache size) and hands them to the ATLAS MM Code Generator (MMCase). NB comes from a hierarchy of models (next slide); MU and NU come from the register-file constraint MU*NU + MU + NU + Latency ≤ NR; KU is maximized subject to the capacity of the L1 instruction cache; Latency and MulAdd come directly from the hardware parameters; and xFetch is set to 2. Here we look in more detail only at the way we model NB; the rest is in the paper.

Modeling for Tile Size (NB) Models of increasing complexity, with C the L1 cache capacity in words: 3*NB^2 ≤ C (the whole working set, all three NB×NB tiles, fits in L1); NB^2 + NB + 1 ≤ C (one full tile, one row of a second, and one element of the third stay in cache, assuming a fully associative cache, optimal replacement, and a line size of 1 word); further refinements of the model account for a line size greater than 1 word and for LRU replacement.
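
Under the second model, the model-predicted NB is simply the largest integer satisfying the inequality, computable in a few lines (a sketch; capacity is the L1 size in 8-byte words):

    /* Largest NB satisfying NB*NB + NB + 1 <= capacity. */
    int model_nb(int capacity) {
        int nb = 1;
        while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= capacity)
            nb++;
        return nb;
    }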

Experiments Architectures: SGI R12000 (270 MHz), Sun UltraSPARC IIIi (1060 MHz), Intel Pentium III (550 MHz). Measured: mini-MMM performance, complete MMM performance, and sensitivity to variations in the parameters.

MMM Performance [charts for SGI and Intel comparing the BLAS, COMPILER, ATLAS, and MODEL versions]

ATLAS Issues The model worked well in our experiments. Models avoid the need for search and enable better optimization, but developing the model is not easy. Can it be automated? It is also not clear how far pure models can go; a hybrid approach is probably best.

Sorting Joint work with Xiaoming Li

ESSL on Power3 [chart] The x axis is the standard deviation of the input: when it is low, the input values concentrate around some value; when it is high, the data are spread out. We want to see how the performance of the algorithms changes with the input characteristics. ESSL is better than the other pure sorting algorithms, and it is very stable. The other two algorithms (Quicksort and CC-radix, the latter from UPC) are optimized versions of Quicksort and CC-radix.

ESSL on Power4

Motivation There is no universally best sorting algorithm. Can we automatically GENERATE and tune sorting algorithms for each platform? The performance of sorting depends not only on the platform but also on the input characteristics, which distinguishes it from routines like MMM or the FFT.

A first strategy: Algorithm Selection Select the best algorithm from Quicksort, multiway merge sort, and CC-radix. The relevant input characteristics are the number of keys and the entropy vector.

Algorithm Selection [chart: the limitation of algorithm selection]

A better solution We can use different algorithms for different partitions: build composite sorting algorithms. Identify primitives from the sorting algorithms; design a general method to select an appropriate sorting primitive at runtime; and design a mechanism to combine the primitives and the selection methods into a composite sorting algorithm.

Sorting Primitives Divide-by-Value (DV): a step in Quicksort; select one or multiple pivots and sort the input array around these pivots; its parameter is the number of pivots. Divide-by-Position (DP): divide the input into same-size sub-partitions and use a heap to merge the multiple sorted sub-partitions; its parameters are the size of the sub-partitions and the fan-out and size of the heap.

Sorting Primitives Divide-by-Radix (DR): a step of non-comparison-based radix sorting, proceeding from the most significant bits (MSB); its parameter is the radix (r bits). Being non-comparison-based makes it run fast on superscalar architectures.

Selection Primitives Branch-by-Size and Branch-by-Entropy; their parameters are the number of branches and the threshold vector of the branches. We use entropy here; for this presentation you can assume the standard deviation is equivalent to the entropy. The details, including the definition of entropy, are in the paper and the backup slides.
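
Since the slide defers the definition, here is the usual form as a sketch: for each digit position, the entropy of the digit-value distribution is one component of the entropy vector. The byte-wide digits and 32-bit keys are assumptions for illustration:

    #include <math.h>

    /* Entropy (in bits) of one byte position across n keys: the higher
       the entropy, the more uniformly the digit values are spread. */
    double digit_entropy(const unsigned *keys, int n, int byte) {
        int count[256] = {0};
        for (int i = 0; i < n; i++)
            count[(keys[i] >> (8 * byte)) & 0xFF]++;
        double H = 0.0;
        for (int v = 0; v < 256; v++)
            if (count[v] > 0) {
                double p = (double)count[v] / n;
                H -= p * log2(p);
            }
        return H;  /* one component of the entropy vector */
    }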

Leaf Primitives When the size of a partition is small, we stick to one algorithm to sort the partition fully. Two methods are used in this clean-up operation: Quicksort and CC-radix. The leaf primitives are thus quicksort and radix sort.

Composite Sorting Algorithms Composite sorting algorithms are built from the previous primitives, both sorting primitives and selection primitives, and we envision a composite algorithm as having the shape of a tree, as sketched below.
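
A minimal sketch of such a tree in C; the enum mirrors the primitives above, but the struct layout is our assumption, not the paper's actual data structure:

    /* A node in a composite sorting algorithm. DV/DP/DR are sorting
       primitives; BRANCH_SIZE/BRANCH_ENTROPY pick a child at run time;
       LEAF_QUICKSORT/LEAF_CCRADIX finish small partitions. */
    typedef enum { DV, DP, DR, BRANCH_SIZE, BRANCH_ENTROPY,
                   LEAF_QUICKSORT, LEAF_CCRADIX } prim_t;

    typedef struct node {
        prim_t kind;
        int    param;            /* #pivots, partition size, radix bits, ... */
        double *thresholds;      /* threshold vector, for selection nodes */
        struct node **children;  /* sub-algorithms, one per partition/branch */
        int    nchildren;
    } node_t;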

Performance of Classifier Sorting: Power3 [charts]

Power4

Sorting Sorting is again divide-and-conquer, but we could not find formulas like Spiral's. Adaptation to the input data is crucial, and we also need to deal with other features of the input data, such as the degree of "sortedness".

Conclusions Much exploratory work is going on today. General principles are emerging, but much remains to be done. This new, exciting area of research should teach us much about program optimization.