Spiral: an empirical search system for program generation and optimization David Padua Department of Computer Science University of Illinois at Urbana-Champaign

2 Program optimization today The optimization phase of a compiler applies a series of transformations to achieve its objectives. The compiler uses the outcome of program analysis to determine which transformations are correctness-preserving. Compiler transformation and analysis techniques are reasonably well understood. Since many compiler optimization problems have “exponential complexity”, heuristics are needed to drive the application of transformations.

3 Optimization drivers Developing driving heuristics is laborious. One reason for this is the lack of methodologies and tools to build optimization drivers. As a result, although there is much in common among compilers, their optimization phases are usually re-implemented from scratch.

4 Optimization drivers (Cont.) A consequence: machines and languages that are not widely popular usually lack good compilers (and so do some popular systems). –DSP, network-processor, and embedded-system programming is often done in assembly language. –Evaluation of new architectural features that require compiler involvement is not always meaningful. –Languages such as APL, MATLAB, LISP, … suffer from chronically low performance. –New languages are difficult to introduce (although compilers are only a part of the problem).

5 A methodology based on the notion of search space Program transformations often have several possible target versions. –Loop unrolling: how many times to unroll. –Loop tiling: the size of the tile. –Loop interchanging: the order of the loop headers. –Register allocation: which values are spilled to memory to make room for new ones. The process of optimization can be seen as a search in the space of possible program versions; a small sketch of such a space follows.
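As a minimal sketch (the transformations and parameter values are hypothetical), each combination of transformation parameters identifies one version of the program, and together the combinations span the search space:

    import itertools

    # hypothetical parameter ranges for two transformations
    unroll_factors = [1, 2, 4, 8]      # how many times to unroll
    tile_sizes = [16, 32, 64, 128]     # size of the tile

    # each (unroll, tile) pair identifies one candidate program version
    versions = list(itertools.product(unroll_factors, tile_sizes))
    print(len(versions), "candidate versions")   # 16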

6 Empirical search / iterative compilation Perhaps the simplest application of the search-space model is empirical search, where several versions are generated and executed on the target machine; the fastest version is selected. T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff. Iterative Compilation in Program Optimization. In Proc. CPC2000, pages 35-44, 2000.
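A minimal driver sketch: the candidate kernels below are hypothetical stand-ins for generated program versions; each is timed on the target machine and the fastest is kept.

    import timeit

    def version_sum(n=1000):
        # candidate 1: generator expression + builtin sum
        return sum(i * i for i in range(n))

    def version_loop(n=1000):
        # candidate 2: explicit accumulation loop
        s = 0
        for i in range(n):
            s += i * i
        return s

    candidates = {"sum": version_sum, "loop": version_loop}
    # empirical search: run every version, select the fastest
    timings = {name: min(timeit.repeat(f, number=1000, repeat=5))
               for name, f in candidates.items()}
    best = min(timings, key=timings.get)
    print("fastest version:", best, timings)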

7 Empirical search and traditional compilers Searching is not a new approach; compilers have applied it in the past, but using architectural prediction models instead of actual runs: –KAP searched for the best loop header order. –SGI’s MIPSpro and IBM PowerPC compilers select the best degree of unrolling.

8 Limitations of empirical search Empirical search is conceptually simple and portable. However, –the search space tends to be too large, especially when several transformations are combined. –It is not clear how to apply this method when program behavior is a function of the input data set. Heuristics and search strategies are needed. Availability of performance “formulas” could help evaluate transformations across input data sets and facilitate the search.

9 Compilers and Library Generators [diagram] A compiler starts from a source program and applies program transformations to an internal representation; a library generator starts from an algorithm and produces the program directly (program generation).

10 Empirical search in program/library generators Examples: –FFTW [M. Frigo, S. Johnson] –Spiral (FFT/signal processing) [J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel), …] –ATLAS (linear algebra) [R. Whaley, A. Petitet, J. Dongarra] –PHiPAC [J. Demmel et al.]

12 SPIRAL The approach: –Mathematical formulation of signal processing algorithms –Automatically generate algorithm versions –A generalization of the well-known FFTW –Use compiler techniques to translate formulas into implementations –Adapt to the target platform by searching for the optimal version

14 Fast DSP Algorithms As Matrix Factorizations Computing $y = F_4 x$ is carried out as: $t_1 = A_4 x$ (permutation), $t_2 = A_3 t_1$ (two $F_2$'s), $t_3 = A_2 t_2$ (diagonal scaling), $y = A_1 t_3$ (two $F_2$'s). The cost is reduced because $A_1$, $A_2$, $A_3$ and $A_4$ are structured sparse matrices.
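The slide does not spell out the $A_i$; the sketch below assumes the standard radix-2 factorization $F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$, so that $A_4$ is the stride permutation $L^4_2$, $A_3 = I_2 \otimes F_2$, $A_2$ is the twiddle diagonal $T^4_2$, and $A_1 = F_2 \otimes I_2$. A quick NumPy check:

    import numpy as np

    F2 = np.array([[1, 1], [1, -1]], dtype=complex)
    A4 = np.eye(4)[[0, 2, 1, 3]]            # stride permutation (even-odd sort)
    A3 = np.kron(np.eye(2), F2)             # two F_2's
    A2 = np.diag([1, 1, 1, -1j])            # diagonal scaling (twiddle factors)
    A1 = np.kron(F2, np.eye(2))             # two F_2's

    # dense DFT matrix F_4 with omega = exp(-2*pi*i/4)
    k = np.arange(4)
    F4 = np.exp(-2j * np.pi * np.outer(k, k) / 4)

    x = np.random.rand(4) + 1j * np.random.rand(4)
    y = A1 @ (A2 @ (A3 @ (A4 @ x)))         # t_1 .. t_3 as on the slide
    assert np.allclose(y, F4 @ x)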

15 Tensor Product Formulation of Cooley-Tukey Theorem: $F_{mn} = (F_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes F_n)\, L^{mn}_m$, where $T^{mn}_n$ is a diagonal matrix (the twiddle factors) and $L^{mn}_m$ is a stride permutation. Example: $F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$.

16 Formulas for Matrix Factorizations R1 (recursive): $F_n \rightarrow (F_{n_1} \otimes I_{n_2})\, T^{n}_{n_2}\, (I_{n_1} \otimes F_{n_2})\, L^{n}_{n_1}$ for $n = n_1 n_2$. R2 (iterative): $F_n \rightarrow \big(\prod_{i=1}^{k} (I_{n_{i-}} \otimes F_{n_i} \otimes I_{n_{i+}})\, D_i\big)\, P_n$, where $n = n_1 \cdots n_k$, $n_{i-} = n_1 \cdots n_{i-1}$, $n_{i+} = n_{i+1} \cdots n_k$, the $D_i$ are diagonal (twiddle) matrices, and $P_n$ is a permutation.

17 Factorization Trees Applying the rules recursively yields factorization trees: for example, $F_8$ can be expanded with R1 into $F_2$ and $F_4$, with $F_4$ in turn expanded with R1 into two $F_2$'s (two distinct trees, depending on which factor is expanded), or $F_8$ can be expanded with R2 directly into three $F_2$'s. Different trees mean: different computation order, different data access pattern, different performance.

18 Walsh-Hadamard Transform
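The WHT is defined by $WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ and $WHT_{2^n} = WHT_2 \otimes WHT_{2^{n-1}}$; its factorizations take the form $WHT_{2^n} = \prod_{i=1}^{k} \big( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_k}} \big)$ with $n = n_1 + \cdots + n_k$. A quick NumPy check of one such factorization for $WHT_8$:

    import numpy as np

    def wht(n):
        # WHT of size 2^n via the Kronecker-product definition
        W2 = np.array([[1, 1], [1, -1]])
        return W2 if n == 1 else np.kron(W2, wht(n - 1))

    I2, I4 = np.eye(2), np.eye(4)
    W2 = wht(1)
    # WHT_8 = (WHT_2 x I_4)(I_2 x WHT_2 x I_2)(I_4 x WHT_2)
    product = np.kron(W2, I4) @ np.kron(np.kron(I2, W2), I2) @ np.kron(I4, W2)
    assert np.allclose(product, wht(3))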

19 Optimal Factorization Trees Depend on the platform and are difficult to predict analytically. They can be found by empirical search: –The search space is very large –Different search algorithms can be used: random, DP (dynamic programming), GA (genetic algorithms), hill-climbing, exhaustive (a DP sketch follows below).
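A dynamic-programming sketch over binary factorization trees. The cost model here is a made-up stand-in (a real system would time generated code), and DP rests on the assumption that the best tree for a size is built from the best trees for its factors:

    from functools import lru_cache

    def node_cost(k):
        # hypothetical cost of combining children at a node of size 2^k;
        # in practice this would come from measurements on the platform
        return k * 2 ** k

    @lru_cache(maxsize=None)
    def best_tree(k):
        # returns (cost, tree) for F_(2^k); the leaf case is F_2 (k == 1)
        if k == 1:
            return (1, "F2")
        options = []
        for kl in range(1, k):          # split 2^k into 2^kl * 2^(k-kl)
            cl, tl = best_tree(kl)
            cr, tr = best_tree(k - kl)
            options.append((cl + cr + node_cost(k), (tl, tr)))
        return min(options, key=lambda o: o[0])

    print(best_tree(5))                 # best tree for F_32 under the toy model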

22 Size of Search Space [table: transform size N vs. number of formulas]
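Counting just the binary factorization trees for $F_{2^k}$ (a lower bound on the full formula space) already shows the combinatorial growth; a short sketch:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def num_trees(k):
        # binary trees: each internal node splits 2^k into 2^j * 2^(k-j)
        if k == 1:
            return 1                  # leaf: F_2
        return sum(num_trees(j) * num_trees(k - j) for j in range(1, k))

    for k in (2, 4, 8, 16):
        print(f"N = 2^{k}: {num_trees(k)} binary trees")

These counts are the Catalan numbers; allowing k-way splits (rule R2) and the other implementation choices makes the space larger still.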


25 More Search Choices Programming: –Loop unrolling –Memory allocation –In-lining Platform choices: –Compiler optimization options

26 The SPIRAL System The system is a feedback loop: a DSP transform is fed to the Formula Generator, which emits an SPL program; the SPL Compiler translates it into C/FORTRAN programs; Performance Evaluation runs these on the target machine; and the Search Engine uses the measured results to steer the Formula Generator. The best versions are collected into the DSP Library.
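A toy sketch of that closed loop, just to show the control flow (every component below is a stand-in: the proposed formulas, the "compilation", and the timings are all fake):

    import random

    def propose_formula(rng):
        # stand-in for the formula generator + search engine
        return rng.choice(["R1(R1(F2,F2),F2)", "R1(F2,R1(F2,F2))", "R2(F2,F2,F2)"])

    def compile_and_time(formula, rng):
        # stand-in for the SPL compiler + performance evaluation
        return rng.uniform(1.0, 2.0)    # pretend measured runtime (seconds)

    rng = random.Random(0)
    best = (float("inf"), None)
    for _ in range(10):                  # search loop driven by feedback
        f = propose_formula(rng)
        t = compile_and_time(f, rng)
        if t < best[0]:
            best = (t, f)
    print("library entry:", best)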

27 Spiral Spiral does the factorization at installation time and generates one library routine for each size. FFTW only generates codelets (input size ≤ 64) and performs the factorization at run time.

28 A Simple SPL Program

    ; This is a simple SPL program       <- comment
    (define A (matrix (1 2) (2 1)))     <- definition
    (define B (diagonal (3 3)))         <- definition
    #subname simple                     <- directive
    (tensor (I 2) (compose A B))        <- formula
    ;; This is an invisible comment

29 Templates

    (template (F n) [n >= 1]         ; pattern (F n), condition [n >= 1]
      ( do i=0,n-1                   ; i-code
          y(i)=0
          do j=0,n-1
            y(i)=y(i)+W(n,i*j)*x(j)
          end
        end ))
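For reference, a direct Python transcription of the template's i-code, assuming (as is standard for the DFT) that $W(n,k) = e^{-2\pi i k/n}$:

    import cmath

    def apply_F(n, x):
        # mirrors the template body: y(i) = sum_j W(n, i*j) * x(j)
        y = [0j] * n
        for i in range(n):
            for j in range(n):
                y[i] += cmath.exp(-2j * cmath.pi * (i * j) / n) * x[j]
        return y

    print(apply_F(2, [1, 2]))   # [(3+0j), (-1+0j)] up to rounding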

30 SPL Compiler Phases: Parsing (the SPL formula and template definitions are parsed into an abstract syntax tree and a template table), Intermediate Code Generation, Intermediate Code Restructuring, and Optimization (all operating on I-Code), followed by Target Code Generation (emitting FORTRAN or C).

31 Intermediate Code Restructuring Loop unrolling –Degree of unrolling can be controlled globally or case by case Scalar function evaluation –Replace scalar functions with constant value or array access Type conversion –Type of input data: real or complex –Type of arithmetic: real or complex –Same SPL formula, different C/Fortran programs
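A sketch of what restructuring does to the $F_n$ template (the output format is hypothetical): full unrolling plus scalar-function evaluation turns the loop into straight-line code with $W(n, i \cdot j)$ folded to constants.

    import cmath

    def gen_unrolled_F(n):
        # emit straight-line statements for y = F_n x, W(n, i*j) pre-evaluated
        lines = []
        for i in range(n):
            terms = []
            for j in range(n):
                w = cmath.exp(-2j * cmath.pi * (i * j) / n)
                terms.append(f"({w.real:+.4f}{w.imag:+.4f}i)*x[{j}]")
            lines.append(f"y[{i}] = " + " + ".join(terms))
        return "\n".join(lines)

    print(gen_unrolled_F(2))
    # y[0] = (+1.0000+0.0000i)*x[0] + (+1.0000+0.0000i)*x[1]
    # y[1] = (+1.0000+0.0000i)*x[0] + (-1.0000-0.0000i)*x[1]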

33 Optimizations, by component: Formula Generator: high-level scheduling. SPL Compiler: loop transformation and high-level optimizations (constant folding, copy propagation, CSE, dead code elimination). C/Fortran Compiler: low-level optimizations (instruction scheduling, register allocation).

34 Basic Optimizations (FFT, N=2^5, SPARC, f77 –fast –O5) [performance chart]

35 Basic Optimizations (FFT, N=2^5, MIPS, f77 –O3) [performance chart]

36 Basic Optimizations (FFT, N=2^5, PII, g77 –O6 –malign-double) [performance chart]

37 Performance Evaluation Evaluates the performance of the code generated by the SPL compiler. Platforms: SPARC, MIPS, PII. Search strategy: dynamic programming.

38 Pseudo MFlops Estimation of the number of FP operations: –FFT (radix-2): $5 n \log_2 n$
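The slide is truncated; reading "pseudo MFLOPS" in the usual way (an assumption) as the estimated operation count divided by measured time:

    import math

    def pseudo_mflops(n, seconds):
        # estimated FP ops for a radix-2 FFT of size n: 5 * n * log2(n)
        return 5 * n * math.log2(n) / (seconds * 1e6)

    print(pseudo_mflops(2 ** 20, 0.05))   # ~2097 pseudo MFLOPS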

39 FFT Performance (N=2^1 to 2^6) [plots: SPARC, MIPS, PII]

40 FFT Performance (N=2^7 to 2^20) [plots: SPARC, MIPS, PII]

41 Important Questions What lessons can be learned from this work? Can this approach be used in other domains?
