Genetic Programming Applied to Compiler Optimization


Genetic Programming Applied to Compiler Optimization
Mark Stephenson, Una-May O'Reilly, Martin C. Martin, and Saman Amarasinghe
Massachusetts Institute of Technology

An Anatomy of a Compiler
A compiler takes a high-level program and, through a pipeline of passes (constant propagation, loop unrolling, instruction scheduling, code generation, ...), produces optimized instructions that can run on a given architecture. In other words, it takes a high-level specification and produces "code" for a target machine. Compiler optimizations are almost never optimal.

System Complexities
Compiler complexity:
- Open Research Compiler: ~3.5 million lines of C/C++ code
- Trimaran's compiler: ~800,000 lines of C code
- Many stages with complicated interactions between them
Not to mention the target architectures:
- Pentium® processor: 3.1 million transistors
- Pentium 4 processor: 55 million transistors
Compiler passes interact with one another; for example, instruction scheduling and register allocation influence each other and are interdependent.

Micro-Architectures Change
- If the target architecture changes, the compiler needs to change.
- The performance of your software depends on the quality of your compiler.

NP-Completeness
- Many compiler optimizations are NP-complete, so compiler writers rely on heuristics.
- In practice, heuristics perform well, but they require a lot of tweaking.
- Heuristics often have a focal point: they rely on a single priority function.

Priority Functions
A heuristic's Achilles' heel:
- A single priority or cost function often dictates the efficacy of a heuristic.
- Priority functions rank the options available to a compiler heuristic.
For example, a list scheduler ranks the instructions that are ready to issue using a priority function; a sketch of one such function follows below.
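As a hypothetical illustration (not taken from the paper), here is a classic list-scheduling priority function that ranks each instruction by its critical-path height, i.e. the longest latency chain from that instruction to the end of the block. The dependence graph and latencies are made up for the example.

```python
def critical_path_priority(instr, successors, latency, memo=None):
    """Priority of `instr`: its own latency plus the longest path through
    its dependent instructions (critical-path height)."""
    if memo is None:
        memo = {}
    if instr in memo:
        return memo[instr]
    best_tail = max(
        (critical_path_priority(s, successors, latency, memo)
         for s in successors.get(instr, [])),
        default=0,
    )
    memo[instr] = latency[instr] + best_tail
    return memo[instr]

# Tiny dependence graph: i0 -> i1 -> i3 and i0 -> i2 -> i3 (hypothetical).
successors = {"i0": ["i1", "i2"], "i1": ["i3"], "i2": ["i3"], "i3": []}
latency = {"i0": 1, "i1": 3, "i2": 1, "i3": 1}

ready = ["i1", "i2"]
# The scheduler issues the ready instruction with the highest priority first.
print(max(ready, key=lambda i: critical_path_priority(i, successors, latency)))  # -> i1
```

The whole scheduling algorithm stays fixed; only this small ranking function decides which legal choice to take, which is exactly the kind of knob GP can tune.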

Qualities of Priority Functions
- Can focus on a small portion of an optimization algorithm: a small change can yield big payoffs.
- Clear specification in terms of input/output.
- Prevalent in compiler heuristics.
- Perfectly matches GP's representation.
Priority functions are a great place to apply GP. We make no changes to the compiler's underlying algorithm, which, among other things, enforces the legality of the optimization.

Further Considerations
Who knows:
- what target architecture the priority function was written for (or in what decade)?
- whether it was adequately tuned by its designer (for the applications we care about)?
- whether it 'knows' about the other optimizations the compiler performs?

An Example Optimization: Hyperblock Scheduling
- Conditional execution is potentially very expensive on a modern architecture.
- Modern processors try to dynamically predict the outcome of each condition.
- This works great for predictable branches, but some conditions can't be predicted.
- When the processor mispredicts, a lot of time is wasted.
An illustrative predictor sketch follows below.
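To make "predictable vs. unpredictable" concrete, here is a textbook two-bit saturating-counter branch predictor. This is standard background, not anything specific to the paper or to Trimaran; the outcome sequences are made up.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

def mispredict_rate(outcomes):
    predictor, misses = TwoBitPredictor(), 0
    for taken in outcomes:
        misses += (predictor.predict() != taken)
        predictor.update(taken)
    return misses / len(outcomes)

print(mispredict_rate([True] * 100))        # highly predictable branch: ~2% misses
print(mispredict_rate([True, False] * 50))  # erratic branch: ~50% misses here
```

Branches like the second one are the ones where removing the branch entirely (via a hyperblock) can pay off.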

Example Optimization: Hyperblock Scheduling
(Figure: machine code for an if (a[1] == 0) ... else ... branch; assume a[1] is 0.)
Modern architectures start executing instructions before they know whether or not they need to! If the branch is mispredicted, the speculatively executed work is thrown away.

Example Optimization: Hyperblock Scheduling
(Figure: machine code for the same if (a[1] == 0) ... else ... branch.)
Solution: simultaneously execute both sides of the condition and simply discard the results of the instructions that weren't supposed to run. The combined sections of code are called a hyperblock; all instructions in a hyperblock are executed. A source-level sketch of the idea follows below.
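A minimal source-level sketch of the idea, assuming made-up data and computations (real hyperblock formation works on predicated machine instructions inside the compiler, not on Python source):

```python
import random

a = [random.randint(0, 1) for _ in range(2)]  # hypothetical data

# Branching version: only one side executes, but the branch may mispredict.
if a[1] == 0:
    x = a[0] + 1
else:
    x = a[0] - 1

# "Hyperblock-like" version: compute both sides unconditionally, then keep
# only the result selected by the predicate and discard the other.
p = (a[1] == 0)               # predicate
x_true = a[0] + 1             # executed regardless of p
x_false = a[0] - 1            # executed regardless of p
x = x_true if p else x_false  # select; the other result is thrown away
print(x)
```

The win is that no branch needs to be predicted; the cost is that both sides consume execution resources every time.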

Example Optimization: Hyperblock Scheduling
The tradeoffs are unclear:
- In some situations, hyperblocks are faster than traditional execution.
- In others, hyperblocks impair performance.
- If a condition is highly predictable, there's probably no reason to form a hyperblock.

Trimaran's Priority Function
Trimaran is a research compiler that we used to collect experimental results; it is a mature system that has been shown to be competitive with the best proprietary compilers. Its hyperblock-formation priority function selects which code segments to merge: the segments with the highest priorities are merged into a hyperblock. The function:
- favors short code segments,
- favors frequently executed code,
- penalizes code with hazards, and
- favors parallel code.
A hypothetical sketch of such a score follows below.
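The sketch below combines the four ingredients named above into one score. It is not Trimaran's actual formula (the original slide shows that as an equation); the weights and the function shape are assumptions made only to show how a single scalar priority can trade these factors off.

```python
def segment_priority(num_ops, exec_ratio, has_hazard, ops_per_cycle):
    """Rank a candidate code segment for inclusion in a hyperblock.
    Hypothetical scoring, not Trimaran's real priority function."""
    size_bonus = 1.0 / max(num_ops, 1)            # favor short segments
    hazard_penalty = 0.25 if has_hazard else 1.0  # penalize hazards
    return exec_ratio * size_bonus * hazard_penalty * ops_per_cycle

# A short, hot, hazard-free, fairly parallel segment scores high...
print(segment_priority(num_ops=8, exec_ratio=0.9, has_hazard=False, ops_per_cycle=2.5))
# ...while a long, cold segment containing a hazard scores low.
print(segment_priority(num_ops=40, exec_ratio=0.1, has_hazard=True, ops_per_cycle=1.2))
```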

Our Approach
- What are the important characteristics of a hyperblock formation priority function? Trimaran uses four characteristics.
- Our approach: extract all the characteristics you can think of and have GP find the priority function.

Hyperblock Formation: GP Terminals
- Maximum ops over segments
- Dependence height
- Number of code segments
- Number of operations
- Does the segment have subroutine calls?
- Number of branches
- Does the segment have unsafe calls?
- Execution ratio
- Does the code have pointer dereferences?
- Average ops executed in the code segment
- Issue width of the processor
- Average predictability of branches in the segment
- Predictability product of branches in the segment
- ...
These are some of the terminals that GP uses, together with a standard set of arithmetic operators and constants. The result of each subexpression is either a boolean or a real value, so the terminals themselves are reals or booleans. A sketch of such an expression follows below.
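A minimal sketch of how a GP individual over these terminals can be represented and evaluated: a nested tuple (an expression tree) evaluated against one candidate segment's feature values. The operator set and feature names here are illustrative assumptions, not the paper's exact definitions.

```python
import operator

OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": lambda a, b: a / b if b != 0 else 0.0,  # protected division
    "gt":  lambda a, b: a > b,
}

def evaluate(expr, features):
    """Evaluate an expression tree against one segment's feature dictionary."""
    if isinstance(expr, str):            # terminal: look up a feature value
        return features[expr]
    if isinstance(expr, (int, float)):   # constant
        return expr
    op, *args = expr
    return OPS[op](*(evaluate(a, features) for a in args))

# Made-up individual: priority = exec_ratio * 10 - dependence_height / num_ops
individual = ("sub", ("mul", "exec_ratio", 10.0),
                     ("div", "dependence_height", "num_ops"))

segment = {"exec_ratio": 0.8, "dependence_height": 4.0, "num_ops": 16.0}
print(evaluate(individual, segment))  # -> 7.75
```

GP then searches the space of such trees; the compiler simply calls the evolved tree wherever it would have called the hand-written priority function.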

General Flow
(Flowchart: create initial population (initial solutions) → evaluation → done? → selection → create variants, looping back to evaluation.)
- A vanilla GP system.
- The randomly generated initial population is seeded with the compiler writer's best guess: one individual is Trimaran's priority function, the other 399 are randomly generated.

General Flow
(Flowchart: create initial population (initial solutions) → evaluation → done? → selection → create variants.)
Evaluation:
- Each expression is evaluated by compiling and running the benchmark(s).
- Fitness is the relative speedup over Trimaran's priority function on the benchmark(s).
- We add parsimony pressure to favor more readable expressions.
- We use Dynamic Subset Selection [Gathercole]: compiling and running a program is time consuming, so we focus on only a subset of the benchmarks at a time.
A compressed sketch of this loop appears below.
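The sketch below shows the control flow of such a loop in runnable form. In the real system an individual is evaluated by compiling and running benchmarks; here the individuals, the speedup measurement, and the variation operators are trivial stand-ins (made up for illustration), and details such as the mutation probability are abstracted away. The generation count, population size, tournament size, and replacement rate come from the settings slide that follows.

```python
import random

def random_expr():                       # stand-in for a random priority expression
    return [random.uniform(-1, 1) for _ in range(4)]

def mutate(ind):                         # stand-in mutation
    child = list(ind)
    child[random.randrange(len(child))] += random.gauss(0, 0.1)
    return child

def crossover(a, b):                     # stand-in crossover
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def measure_speedup(ind, bench):         # stand-in for "compile and run the benchmark"
    return 1.0 + sum(w * f for w, f in zip(ind, bench))

def fitness(ind, benchmarks, parsimony=0.001):
    # Relative speedup over the baseline, minus a small parsimony penalty.
    mean = sum(measure_speedup(ind, b) for b in benchmarks) / len(benchmarks)
    return mean - parsimony * len(ind)

def tournament(pop, fits, k=7):
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fits[i])]

def evolve(benchmarks, pop_size=400, generations=50, replacement=0.22):
    population = [random_expr() for _ in range(pop_size)]  # seeded with the baseline, in the paper
    for _ in range(generations):
        subset = random.sample(benchmarks, k=min(5, len(benchmarks)))  # DSS-style subset
        fits = [fitness(ind, subset) for ind in population]
        keep = sorted(range(pop_size), key=lambda i: fits[i],
                      reverse=True)[:int(pop_size * (1 - replacement))]
        children = [mutate(crossover(tournament(population, fits),
                                     tournament(population, fits)))
                    for _ in range(pop_size - len(keep))]
        population = [population[i] for i in keep] + children
    return max(population, key=lambda ind: fitness(ind, benchmarks))

best = evolve([[0.1, 0.2, -0.3, 0.4], [0.0, 0.5, 0.1, -0.2]])
print(best)
```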

GP Settings
Parameter           Setting
Generations         50
Population Size     400
Tournament Size     7
Replacement Rate    22%
Mutation Rate       5%
DSS Set Size        4, 5, 6
Training Set Size   12

Goal of an Optimizing Compiler
(Figure: one general-purpose compiler takes source files A.c, B.c, C.c, and D.c and produces executables A, B, C, and D.)

A Simpler Problem: Application-Specific Compilers
(Figure: the same source files A.c through D.c, but each executable is produced by a compiler specialized to that one application.)

Hyperblock Results: Application-Specific Compilers
(Bar chart: speedup per benchmark, training input vs. novel input. Benchmarks on the x-axis — mpeg2dec, 129.compress, toast, rawcaudio, rawdaudio, g721encode, g721decode, huff_enc, huff_dec — are drawn from the SPEC95 and MediaBench suites. Average speedup: 1.54 on the training input, 1.23 on a novel input. Two evolved expressions are shown, e.g. (add (div $d20 $d5) (tern $b2 $d0 $d9)) and (add (sub (cmul (gt (cmul $b0 0.8982 $d17)…$d7)) (cmul $b0 0.6183 $d28))).)
A fitness case is a benchmark plus its input. These results train on each benchmark individually; the DSS-trained, general-purpose results follow. A toy computation of this speedup-based fitness appears below.
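For concreteness, fitness on a fitness case is just the runtime of the baseline compiler's output divided by the runtime of the evolved compiler's output. The cycle counts below are made up for illustration; only the benchmark names come from the chart.

```python
# Hypothetical cycle counts (invented for illustration, not measured data).
baseline_cycles = {"mpeg2dec": 1.00e9, "toast": 4.0e8, "rawcaudio": 2.0e8}
evolved_cycles  = {"mpeg2dec": 0.65e9, "toast": 3.2e8, "rawcaudio": 1.1e8}

# Per-benchmark speedup relative to the baseline priority function.
speedups = {b: baseline_cycles[b] / evolved_cycles[b] for b in baseline_cycles}
print(speedups)
print(sum(speedups.values()) / len(speedups))  # average speedup across fitness cases
```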

Hyperblock Results: General-Purpose Compiler
(Bar chart of speedups; figure only.)

Cross Validation: Testing General-Purpose Applicability
(Bar chart of speedups on held-out benchmarks; figure only.)

Hyperblock Solutions: General Purpose
(add (sub (mul exec_ratio_mean 0.8720) 0.9400) (mul 0.4762 (cmul (not has_pointer_deref) (mul 0.6727 num_paths) (mul 1.1609 (add (sub (mul (div num_ops dependence_height) 10.8240) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean 0.9838) (sub 1.1039 num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths)))))))
One highlighted subexpression is an intron that doesn't affect the solution.

GP Hyperblock Solutions: General Purpose
The original slides repeat the evolved expression above, highlighting different subexpressions. The highlighted parts show that the solution:
- favors paths that don't have pointer dereferences;
- favors highly parallel (fat) paths;
- penalizes a path that calls a subroutine that may have side effects.

Future Work
- Apply these techniques to a real machine (Intel Itanium), using the Open Research Compiler; our results so far were collected on a simulator.
- Investigate our solutions thoroughly.

Conclusion
- GP can identify effective priority functions.
- 'Proof of concept': we evolved two well-known priority functions.
- Take a huge compiler, optimize one priority function with GP, and get nice speedups.
- The compiler community is interested (Programming Language Design and Implementation '03).