In Search of Near-Optimal Optimization Phase Orderings


In Search of Near-Optimal Optimization Phase Orderings
Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, Jack W. Davidson
Automatic Tuning of Libraries and Applications, LACSI 2006

Optimization Phase Ordering
Optimizing compilers apply several optimization phases to improve the performance of applications.
Optimization phases interact with each other.
Determining the order in which to apply optimization phases to obtain the best performance has been a long-standing problem in compilers.

Exhaustive Phase Order Evaluation
Determine the performance of all possible orderings of optimization phases.
Exhaustive phase order evaluation involves, for each function:
generating all distinct function instances that can be produced by changing optimization phase orderings (CGO '06)
determining the dynamic performance of each distinct function instance (LCTES '06)

Outline
Experimental framework
Exhaustive phase order space enumeration
Accurately determining dynamic performance
Analyzing the DAG of function instances
Conclusions


Experimental Framework
We used the VPO compilation system:
an established compiler framework, in development since 1988
performance comparable to gcc -O2
VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order.
Experiments use all 15 optimization phases available in VPO.
The target architecture was the StrongARM SA-100 processor.

Benchmarks
Used two programs from each of the six MiBench categories, 244 total functions.

Category   Program       Description
auto       bitcount      test processor bit manipulation abilities
           qsort         sort strings using the quicksort sorting algorithm
network    dijkstra      Dijkstra's shortest path algorithm
           patricia      construct patricia trie for IP traffic
telecomm   fft           fast Fourier transform
           adpcm         compress 16-bit linear PCM samples to 4-bit
consumer   jpeg          image compression / decompression
           tiff2bw       convert color tiff image to b & w image
security   sha           secure hash algorithm
           blowfish      symmetric block cipher with variable length key
office     stringsearch  searches for given words in phrases
           ispell        fast spelling checker

Outline
Experimental framework
Exhaustive phase order space enumeration
Accurately determining dynamic performance
Analyzing the DAG of function instances
Conclusions

Exhaustive Phase Order Enumeration
Exhaustive enumeration is difficult:
compilers typically contain many different optimization phases
optimizations may be successful multiple times for each function / program
On average, we would need to evaluate 15^16 different phase orders per function.
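To put that number in perspective, a quick back-of-the-envelope calculation (taking, as on the slide, 15 phases and sequences of length 16):

```python
# Size of the naive phase order space: 15 phases, sequences of length 16.
phases = 15
seq_len = 16
naive_orders = phases ** seq_len
print(f"{naive_orders:,}")  # about 6.6 * 10**18 attempted orderings per function
```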

Re-stating the Phase Ordering Problem
Many different orderings of optimization phases produce the same code.
Rather than considering all attempted phase sequences, the phase ordering problem can be addressed by enumerating all distinct function instances that can be produced by any combination of optimization phases.

Naive Optimization Phase Order Space
All combinations of optimization phase sequences are attempted.
[Figure: a tree of function instances; from the root at level L0, each of the phases a, b, c, d is applied, and every resulting node at level L1 again branches on all four phases at level L2.]

Eliminating Consecutively Applied Phases
A phase just applied in our compiler cannot be immediately active again.
[Figure: the same tree with every edge that repeats its parent's phase pruned.]

Eliminating Dormant Phases
Get feedback from the compiler indicating whether any transformations were successfully applied in a phase.
[Figure: the tree further pruned by removing edges for dormant phases, i.e., phases that made no transformation.]
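The pruned depth-first enumeration implied by these two slides can be sketched as follows. The `apply_phase` stand-in here is entirely hypothetical (it treats a phase as active only the first time it fires); in VPO the compiler itself reports whether a phase was active.

```python
def apply_phase(instance, phase):
    """Toy stand-in for running one optimization phase: here a phase is
    'active' only if it has not fired on this instance before."""
    if phase in instance:
        return instance, False          # dormant: nothing changed
    return instance + phase, True       # active: new function instance

def enumerate_orders(instance, phases, last=None):
    """Depth-first walk of the phase order space with both prunings:
    skip the phase that was just applied, and do not recurse on
    dormant phases."""
    for phase in phases:
        if phase == last:               # consecutive re-application pruned
            continue
        new_inst, active = apply_phase(instance, phase)
        if not active:                  # dormant phase pruned
            continue
        yield new_inst
        yield from enumerate_orders(new_inst, phases, last=phase)

# With 4 toy phases this enumerates every non-empty ordered subset of
# {a, b, c, d}: 4 + 12 + 24 + 24 = 64 distinct instances.
print(len(set(enumerate_orders("", "abcd"))))  # 64
```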

Detecting Identical Function Instances
Some optimization phases are independent
example: branch chaining & register allocation
Different phase sequences can produce the same code. Starting from:

    r[2] = 1;
    r[3] = r[4] + r[2];

instruction selection yields:

    r[3] = r[4] + 1;

while constant propagation yields:

    r[2] = 1;
    r[3] = r[4] + 1;

and a following dead assignment elimination again yields:

    r[3] = r[4] + 1;
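Such duplicates can be caught cheaply by hashing each instance's instruction text; the following is only a sketch of the idea (the hash function and bookkeeping are illustrative, not necessarily VPO's mechanism):

```python
import hashlib

seen = {}  # digest -> id of the first instance with that code

def record_instance(code, instance_id):
    """Return (is_duplicate, canonical_id) for a function instance,
    merging instances whose instruction text hashes identically."""
    digest = hashlib.sha1(code.encode()).hexdigest()
    if digest in seen:
        return True, seen[digest]
    seen[digest] = instance_id
    return False, instance_id

# The two phase orders from the slide end in identical code:
print(record_instance("r[3]=r[4]+1;", 1))  # (False, 1): first occurrence
print(record_instance("r[3]=r[4]+1;", 2))  # (True, 1): merged with instance 1
```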

Detecting Equivalent Function Instances

Source code:

    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += a[i];

Register allocation before code motion:

    r[10]=0;
    r[12]=HI[a];
    r[12]=r[12]+LO[a];
    r[1]=r[12];
    r[9]=4000+r[12];
    L3
    r[8]=M[r[1]];
    r[10]=r[10]+r[8];
    r[1]=r[1]+4;
    IC=r[1]?r[9];
    PC=IC<0,L3;

Code motion before register allocation:

    r[11]=0;
    r[10]=HI[a];
    r[10]=r[10]+LO[a];
    r[1]=r[10];
    r[9]=4000+r[10];
    L5
    r[8]=M[r[1]];
    r[11]=r[11]+r[8];
    r[1]=r[1]+4;
    IC=r[1]?r[9];
    PC=IC<0,L5;

After mapping registers (both instances):

    r[32]=0;
    r[33]=HI[a];
    r[33]=r[33]+LO[a];
    r[34]=r[33];
    r[35]=4000+r[33];
    L01
    r[36]=M[r[34]];
    r[32]=r[32]+r[36];
    r[34]=r[34]+4;
    IC=r[34]?r[35];
    PC=IC<0,L01;
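The register mapping on the slide can be sketched as a first-use renaming of registers. This simple textual pass is only an illustration: labels such as L3/L5 would need the same treatment, and a real comparison must also account for control flow.

```python
import re

def map_registers(code):
    """Rename registers to a canonical sequence (starting at r[32], as in
    the slide's example) in order of first use, so instances that differ
    only in register assignment compare equal."""
    mapping = {}
    def rename(m):
        reg = m.group(0)
        if reg not in mapping:
            mapping[reg] = f"r[{32 + len(mapping)}]"
        return mapping[reg]
    return re.sub(r"r\[\d+\]", rename, code)

# Abbreviated versions of the slide's two instances:
a = "r[10]=0; r[12]=HI[a]; r[12]=r[12]+LO[a]; r[10]=r[10]+r[12];"
b = "r[11]=0; r[10]=HI[a]; r[10]=r[10]+LO[a]; r[11]=r[11]+r[10];"
print(map_registers(a) == map_registers(b))  # True: equivalent instances
```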

Resulting Search Space
Merging equivalent function instances transforms the tree to a DAG.
[Figure: the pruned tree from the previous slides with equivalent nodes merged, so several phase sequences now lead to the same node.]

Techniques to Make Searches Faster
Kept a copy of the program representation of the unoptimized function instance in memory to avoid repeated disk accesses.
Enumerated the space in a depth-first order, keeping in memory the program representation after each active phase to reduce the number of phases applied.
These techniques reduced search time by at least a factor of 5 to 10.
We were able to completely enumerate 234 of the 244 functions in our benchmark suite, most in a few minutes and the remainder in a few hours.

Search Space Static Statistics
(missing cells in the transcript are marked —)

Function        Insts  Blk   Loop  Instances   Len   CF    Leaves
main(t)         1,275  110   6     2,882,021   29    389   15,164
parse_sw…(j)    1,228  144   1     180,762     20    53    2,057
askmode(i)      942    84    3     232,453     24    108   475
skiptoword(i)   901    —     —     439,994     22    103   2,834
start_in…(j)    795    50    —     8,521       16    45    80
treeinit(i)     666    59    —     8,940       15    —     240
pfx_list…(i)    640    —     2     1,269,638   44    136   4,660
main(f)         624    35    5     2,789,903   33    122   4,214
sha_tran…(h)    541    25    —     548,812     32    98    5,262
initckch(i)     536    48    —     1,075,278   —     —     4,988
main(p)         483    26    —     14,510      —     10    178
pat_insert(p)   469    41    4     1,088,108   —     71    3,021
....
average         234.3  21.7  1.2   174,574.8   16.1  47.4  813.4

Outline
Experimental framework
Exhaustive phase order space enumeration
Accurately determining dynamic performance
Analyzing the DAG of function instances
Conclusions

Finding the Best Dynamic Function Instance
On average, there were 174,575 distinct function instances for each studied function.
Executing all distinct function instances would be too time consuming.
Many embedded development environments use simulation instead of direct execution.
Is it possible to infer the dynamic performance of one function instance from another?

Quickly Obtaining Dynamic Frequency Measures
Two different instances of the same function having identical control-flow graphs will execute each block the same number of times.
Statically estimate the number of cycles required to execute each basic block.
dynamic frequency measure = Σ (static cycles × block frequency)
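The measure is a simple weighted sum over basic blocks; the block data below is made up purely for illustration:

```python
# Dynamic frequency measure from the slide: each basic block's statically
# estimated cycle count, weighted by how often profiling says it executes.
blocks = [
    # (statically estimated cycles, execution frequency)
    (5, 1),       # prologue, runs once
    (4, 1000),    # loop body
    (2, 1000),    # loop test
    (3, 1),       # epilogue
]

dynamic_frequency = sum(cycles * freq for cycles, freq in blocks)
print(dynamic_frequency)  # 5 + 4000 + 2000 + 3 = 6008
```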

Dynamic Frequency Statistics
(missing cells in the transcript are marked —)

                                          % from optimal
Function        Instances   CF    Leaves    Batch   Worst
main(t)         2,882,021   389   15,164    0.0     84.3
parse_sw…(j)    180,762     53    2,057     6.7     64.8
askmode(i)      232,453     108   475       8.4     56.2
skiptoword(i)   439,994     103   2,834     6.1     49.6
start_in…(j)    8,521       45    80        1.7     28.4
treeinit(i)     8,940       22    240       3.4     —
pfx_list…(i)    1,269,638   136   4,660     4.3     78.6
main(f)         2,789,903   122   4,214     7.5     46.1
sha_tran…(h)    548,812     98    5,262     9.6     133.4
initckch(i)     1,075,278   32    4,988     —       108.4
main(p)         14,510      10    178       7.7     13.1
pat_insert(p)   1,088,108   71    3,021     —       126.4
....
average         174,574.8   47.4  813.4     4.8     65.4

Correlation: Dynamic Frequency Measures & Processor Cycles
For leaf function instances:
Pearson's correlation coefficient is 0.96
performing 4.38 executions on average gets within 98% of optimal, and 21 executions gets within 99.6% of optimal
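For reference, Pearson's coefficient measures how linearly the static estimate tracks simulated cycles; a minimal sketch with made-up numbers (the slide's 0.96 is computed over the real leaf instances, not these values):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

freq_estimates = [6008, 5400, 7200, 6500]   # made-up dynamic frequency measures
sim_cycles     = [6100, 5500, 7400, 6450]   # made-up simulated cycle counts
print(round(pearson(freq_estimates, sim_cycles), 2))  # 0.99
```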

Outline
Experimental framework
Exhaustive phase order space enumeration
Accurately determining dynamic performance
Analyzing the DAG of function instances
Conclusions

Analyzing the DAG of Function Instances
We can analyze the DAG of function instances to learn properties of the optimization phase order space.
This can be used to improve:
conventional compilation
searches for effective optimization phase orderings

Outline
Experimental framework
Exhaustive phase order space enumeration
Accurately determining dynamic performance
Analyzing the DAG of function instances
Conclusions

Conclusions
This is the first work to show that the optimization phase order space can often be completely enumerated (at least for the phases in our compiler).
We demonstrated how a near-optimal phase ordering can be obtained in a short period of time.
We used phase interaction information from the enumerated phase order space to achieve a much faster compiler that still generated comparable code.