Computer & Information Sciences - University of Delaware Colloquium / 55 Exhaustive Phase Order Search Space Exploration and Evaluation by Prasad Kulkarni.

Presentation transcript:

Exhaustive Phase Order Search Space Exploration and Evaluation
by Prasad Kulkarni (Florida State University)

Compiler Optimizations
Purpose: improve the efficiency of compiler-generated code
Optimization phases require enabling conditions
  – need specific patterns in the code
  – many also need available registers
Phases interact with each other
Applying optimizations in different orders generates different code

Phase Ordering Problem
Goal: find an ordering of optimization phases that produces the best code among all possible phase orderings
Evaluating each sequence involves compiling, assembling, linking, executing, and verifying results
Best optimization phase ordering depends on
  – source application
  – target platform
  – implementation of optimization phases
Long-standing problem in compiler optimization

Phase Ordering Space
Current compilers incorporate numerous different optimization phases
  – 15 distinct phases in our compiler backend: 15! = 1,307,674,368,000
Phases can enable each other
  – any phase can be active multiple times: 15^15 = 437,893,890,380,859,375
  – cannot restrict sequence length to 15: 15^44 ≈ 5.6 × 10^51
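The sizes quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are made up for this example, and the length of 44 is assumed from the maximum sequence length reported in the results summary later in the talk.

```python
import math

PHASES = 15  # distinct re-orderable phases in the compiler backend

def orderings_without_repetition(k):
    """Each phase applied exactly once: k! orderings."""
    return math.factorial(k)

def sequences_with_repetition(k, length):
    """Any phase may occupy any position: k**length sequences."""
    return k ** length

print(orderings_without_repetition(PHASES))               # 15!
print(sequences_with_repetition(PHASES, 15))              # 15^15
print(f"{sequences_with_repetition(PHASES, 44):.1e}")     # 15^44 (assumed length 44)
```

Even the smallest of these counts is far too large to evaluate one sequence at a time, which is why the talk restates the problem in terms of distinct function instances instead of phase sequences.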

Addressing Phase Ordering
Exhaustive search
  – universally considered intractable
We are now able to exhaustively evaluate the optimization phase order space.

Restating the Phase Ordering Problem
Earlier approach
  – explicitly enumerate all possible optimization phase orderings
Our approach
  – explicitly enumerate all function instances that can be produced by any combination of phases

Outline
Experimental framework
Exhaustive phase order space evaluation
Faster conventional compilation
Conclusions
Summary of my other work
Future research directions

Experimental Framework
We used the VPO compilation system
  – established compiler framework, started development in 1988
  – comparable performance to gcc -O2
VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order
Experiments use all 15 re-orderable optimization phases in VPO
Target architecture was the StrongARM SA-100 processor

VPO Optimization Phases
ID  Optimization Phase
b   branch chaining
c   common subexpression elimination
d   remove unreachable code
g   loop unrolling
h   dead assignment elimination
i   block reordering
j   minimize loop jumps
k   register allocation
l   loop transformations
n   code abstraction
o   evaluation order determination
q   strength reduction
r   reverse branches
s   instruction selection
u   remove useless jumps

Disclaimers
Did not include optimization phases normally associated with compiler front ends
  – no memory hierarchy optimizations
  – no inlining or other interprocedural optimizations
Did not vary how phases are applied
Did not include optimizations that require profile data

Benchmarks
12 MiBench benchmarks; 244 functions
Category   Program        Description
auto       bitcount       test processor bit manipulation abilities
auto       qsort          sort strings using quicksort sorting algorithm
network    dijkstra       Dijkstra’s shortest path algorithm
network    patricia       construct patricia trie for IP traffic
telecomm   fft            fast Fourier transform
telecomm   adpcm          compress 16-bit linear PCM samples to 4-bit
consumer   jpeg           image compression and decompression
consumer   tiff2bw        convert color .tiff image to b&w image
security   sha            secure hash algorithm
security   blowfish       symmetric block cipher with variable length key
office     stringsearch   searches for given words in phrases
office     ispell         fast spelling checker

Terminology
Active phase: an optimization phase that modifies the function representation
Dormant phase: a phase that is unable to find any opportunity to change the function
Function instance: any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler)

Naïve Optimization Phase Order Space
All combinations of optimization phase sequences are attempted
[Figure: search tree over phases a, b, c, d, with levels L0, L1, L2]

Eliminating Consecutively Applied Phases
A phase just applied in our compiler cannot be immediately active again
[Figure: pruned search tree over phases a, b, c, d, with levels L0, L1, L2]

Eliminating Dormant Phases
Get feedback from the compiler indicating whether any transformations were successfully applied in a phase.
[Figure: further pruned search tree over phases a, b, c, d, with levels L0, L1, L2]

Identical Function Instances
Some optimization phases are independent
  – example: branch chaining & register allocation
Different phase sequences can produce the same code

Sequence 1:
  r[2] = 1;
  r[3] = r[4] + r[2];
    ⇒ instruction selection
  r[3] = r[4] + 1;

Sequence 2:
  r[2] = 1;
  r[3] = r[4] + r[2];
    ⇒ constant propagation
  r[2] = 1;
  r[3] = r[4] + 1;
    ⇒ dead assignment elimination
  r[3] = r[4] + 1;

Equivalent Function Instances

Source Code:
  sum = 0;
  for (i = 0; i < 1000; i++)
      sum += a[i];

Register Allocation before Code Motion:
  r[10]=0;
  r[12]=HI[a];
  r[12]=r[12]+LO[a];
  r[1]=r[12];
  r[9]=4000+r[12];
L3
  r[8]=M[r[1]];
  r[10]=r[10]+r[8];
  r[1]=r[1]+4;
  IC=r[1]?r[9];
  PC=IC<0,L3;

Code Motion before Register Allocation:
  r[11]=0;
  r[10]=HI[a];
  r[10]=r[10]+LO[a];
  r[1]=r[10];
  r[9]=4000+r[10];
L5
  r[8]=M[r[1]];
  r[11]=r[11]+r[8];
  r[1]=r[1]+4;
  IC=r[1]?r[9];
  PC=IC<0,L5;

After Mapping Registers:
  r[32]=0;
  r[33]=HI[a];
  r[33]=r[33]+LO[a];
  r[34]=r[33];
  r[35]=4000+r[33];
L01
  r[36]=M[r[34]];
  r[32]=r[32]+r[36];
  r[34]=r[34]+4;
  IC=r[34]?r[35];
  PC=IC<0,L01;
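One way to picture the register mapping on this slide is a renaming pass that numbers registers and labels in order of first appearance. The sketch below is an assumption about how such a mapping could work, not VPO's actual implementation; `canonicalize`, the starting register number 32, and the `L01` label format are chosen only to reproduce the slide's "After Mapping Registers" column.

```python
import re

# Matches a register reference such as r[12], or a label such as L3.
TOKEN = re.compile(r"r\[(\d+)\]|\bL(\d+)\b")

def canonicalize(rtl, first_reg=32):
    """Rename registers and labels in order of first appearance."""
    regs, labels = {}, {}

    def rename(match):
        if match.group(1) is not None:
            new = regs.setdefault(match.group(1), first_reg + len(regs))
            return f"r[{new}]"
        new = labels.setdefault(match.group(2), len(labels) + 1)
        return f"L{new:02d}"

    return TOKEN.sub(rename, rtl)

ra_then_cm = ("r[10]=0; r[12]=HI[a]; r[12]=r[12]+LO[a]; r[1]=r[12]; "
              "r[9]=4000+r[12]; L3 r[8]=M[r[1]]; r[10]=r[10]+r[8]; "
              "r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L3;")
cm_then_ra = ("r[11]=0; r[10]=HI[a]; r[10]=r[10]+LO[a]; r[1]=r[10]; "
              "r[9]=4000+r[10]; L5 r[8]=M[r[1]]; r[11]=r[11]+r[8]; "
              "r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L5;")

# Both orderings collapse to the same canonical text.
print(canonicalize(ra_then_cm) == canonicalize(cm_then_ra))
```

After this renaming, two instances that differ only in register numbers or label names compare equal as plain text, so they can be treated as one node in the search space.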

Efficient Detection of Unique Function Instances
After pruning dormant phases there may be tens or hundreds of thousands of unique instances
Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions
Use a hash table to check whether an identical or equivalent function instance already exists in the DAG

Eliminating Identical/Equivalent Function Instances
Resulting search space is a DAG of function instances
[Figure: DAG of function instances over phases a, b, c, d, with levels L0, L1, L2]
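The pruning steps from the preceding slides (skip the phase just applied, drop dormant phases, merge identical instances by checksum) can be combined into one breadth-first enumeration. The following is a toy sketch under stated assumptions: the "function" is a plain string, the three phases are invented string rewrites that are each idempotent, and CRC collisions are ignored. It does not include the register-remapping step for equivalent instances.

```python
import re
import zlib
from collections import deque

# Hypothetical phases acting on a string that stands in for a function's RTLs.
PHASES = {
    "s": lambda code: code.replace("ab", "c"),
    "h": lambda code: re.sub("c+", "c", code),
    "u": lambda code: code.replace("d", ""),
}

def enumerate_instances(start, phases):
    """Breadth-first enumeration of every distinct instance reachable from start."""
    root = zlib.crc32(start.encode())
    instances = {root: start}      # checksum -> function instance
    edges = {root: {}}             # checksum -> {phase name: child checksum}
    work = deque([(root, None)])   # (node, phase that produced it)
    while work:
        key, last = work.popleft()
        for name, phase in phases.items():
            if name == last:
                continue           # a phase just applied cannot be active again
            new = phase(instances[key])
            if new == instances[key]:
                continue           # dormant phase: nothing to explore
            new_key = zlib.crc32(new.encode())
            edges[key][name] = new_key
            if new_key not in instances:   # identical instance not yet in the DAG
                instances[new_key] = new
                edges[new_key] = {}
                work.append((new_key, name))
    return instances, edges

instances, edges = enumerate_instances("adbabcd", PHASES)
leaves = [instances[k] for k, out in edges.items() if not out]
print(len(instances), "distinct instances; leaves:", leaves)
```

Because nodes are keyed by checksum, different phase sequences that reach the same text share one node, which is what turns the search tree into a DAG.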

Static Enumeration Results
Function              Inst.   Fn_inst
start_input_bmp (j)   1,372   120,…
correct (i)           1,295   1,348,…
main (t)              1,276   2,882,…
parse_switches (j)    1,228   180,…
start_input_gif (j)
start_input_tga (j)   972     63,…
askmode (i)           942     232,…
skiptoword (i)        901     439,…
start_input_ppm (j)   795     8,…
Average (234)
[Fn_inst values are truncated; the Len, CF, and Batch vs. optimal columns are missing]

Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code size [Published in CGO ’06]

Determining Program Performance
Almost 175,000 distinct function instances, on average
  – largest enumerated function has 2,882,021 instances
Too time consuming to execute each distinct function instance
  – assemble → link → execute is more expensive than compilation
Many embedded development environments use simulation
  – simulation is orders of magnitude more expensive than execution
Use data obtained from a few executions to estimate the performance of all remaining function instances

Determining Program Performance (cont...)
Function instances having identical control-flow graphs execute each block the same number of times
Execute the application once for each control-flow structure
Statically estimate the number of cycles required to execute each basic block
$$\text{dynamic frequency measure} = \sum_{\text{blocks}} (\text{static cycles} \times \text{block frequency})$$
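The dynamic frequency measure is a weighted sum over basic blocks. A minimal sketch follows; the block names, cycle counts, and frequencies are made-up values for illustration, not data from the talk.

```python
def dynamic_frequency_measure(static_cycles, block_frequency):
    """Sum of (static cycle estimate * execution count) over all basic blocks."""
    return sum(static_cycles[b] * block_frequency[b] for b in static_cycles)

# Hypothetical function with three basic blocks.
static_cycles = {"entry": 4, "loop": 20, "exit": 2}      # estimated statically
block_frequency = {"entry": 1, "loop": 10, "exit": 1}    # measured in one run

print(dynamic_frequency_measure(static_cycles, block_frequency))
```

Since block frequencies depend only on the control-flow structure, one execution supplies the frequencies for every function instance that shares that structure; only the static cycle estimates need to be recomputed per instance.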

Predicting Relative Performance – I
[Figure: two control-flow graphs annotated with per-block cycle counts; total cycles = 744 and total cycles = 789]

Dynamic Frequency Results
Function              Inst.   Fn_inst
main (t)              1,276   2,882,…
parse_switches (j)    1,228   180,…
askmode (i)           942     232,…
skiptoword (i)        901     439,…
start_input_ppm (j)   795     8,…
pfx_list_chk (i)      640     1,269,…
main (f)              624     2,789,…
sha_transform (h)     541     548,…
main (p)              483     14,…
Average (79)
[Fn_inst values are truncated; the Len, CF, Leaf, and % from optimal (Batch, Worst) columns are missing]

Correlation – Dynamic Frequency Counts vs. Simulator Cycles
Static performance estimation is inaccurate
  – ignores cache-miss and branch-misprediction penalties
Most embedded systems have simpler architectures
  – estimation may be sufficiently accurate
  – simulator cycles are close to executed cycles
We show strong correlation between our measure of performance and simulator cycles

Complete Function Correlation
Example: init_search in stringsearch

Leaf Function Correlation
Leaf function instances are generated when no additional phases can be successfully applied
Leaf instances provide a good sampling
  – they represent the only code that can be generated by an aggressive compiler, like VPO
  – at least one leaf instance represents an optimal phase ordering for over 86% of functions
  – a significant percentage of leaf instances are among the optimal instances

Leaf Function Correlation Statistics
Pearson’s correlation coefficient:
$$P_{corr} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$
Accuracy of our estimate of optimal performance:
$$L_{corr} = \frac{\text{cycle count for best leaf}}{\text{cycle count for leaf with best dynamic frequency count}}$$
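Both statistics are straightforward to compute from per-leaf measurements. The sketch below assumes each leaf is described by a pair (dynamic frequency count, simulator cycles) and that lower is better for both; the function names and sample numbers are illustrative only.

```python
import math

def pcorr(xs, ys):
    """Pearson's correlation coefficient, in the sum form used on the slide."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))

def lcorr(leaves):
    """Best leaf's cycle count divided by the cycle count of the leaf
    that has the best (lowest) dynamic frequency count."""
    best_cycles = min(cycles for _, cycles in leaves)
    predicted_cycles = min(leaves, key=lambda leaf: leaf[0])[1]
    return best_cycles / predicted_cycles

# Hypothetical leaves: (dynamic frequency count, simulator cycles).
leaves = [(100, 1000), (90, 950), (95, 940)]
print(pcorr([f for f, _ in leaves], [c for _, c in leaves]))
print(lcorr(leaves))
```

An Lcorr of 1 means the leaf picked by the static estimate is also the fastest leaf under simulation; values below 1 measure how far the estimate's pick falls short.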

Leaf Function Correlation Statistics (cont…)
[Table: columns Function, Pcorr, Lcorr 0% (Ratio, Leaves), Lcorr 1% (Ratio, Leaves); rows AR_btbl...(b), BW_btbl...(b), bit_count.(b), bit_shifter(b), bitcount(b), main(b), ntbl_bitcnt(b), ntbl_bit…(b), dequeue(d), dijkstra(d), …, average; numeric values missing]

Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles [Published in LCTES ’06]

Phase Enabling Interaction
b enables a along the path a-b-a
[Figure: DAG of function instances with edges labeled by phases a, b, c, d]

Phase Enabling Probabilities
[Table: one row per phase (b, c, d, g, h, i, j, k, l, n, o, q, r, s, u); columns are the start state St followed by phases b through u; only one entry, 1.00 in row d, survives]

Phase Disabling Interaction
b disables a along the path b-c-d
[Figure: DAG of function instances with edges labeled by phases a, b, c, d]

Disabling Probabilities
[Table: one row and one column per phase (b, c, d, g, h, i, j, k, l, n, o, q, r, s, u); numeric values missing]

Faster Conventional Compiler
Modified VPO to use enabling and disabling phase probabilities to decrease compilation time

# p[i] - current probability of phase i being active
# e[i][j] - probability of phase j enabling phase i
# d[i][j] - probability of phase j disabling phase i
# e[i][st] - probability of phase i being active at the start state
For each phase i do
    p[i] = e[i][st];
While (any p[i] > 0) do
    Select j as the current phase with highest probability of being active
    Apply phase j
    If phase j was active then
        For each phase i, where i != j do
            p[i] += ((1-p[i]) * e[i][j]) - (p[i] * d[i][j])
    p[j] = 0
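The pseudocode on this slide can be turned into a small runnable sketch. Everything concrete here is an assumption made for illustration: the three toy phases are string rewrites, and the enabling table `E` and disabling table `D` hold invented probabilities rather than values measured from VPO. The sketch also assumes `p[j] = 0` executes after every application of phase `j`, active or not, since otherwise a dormant phase would be selected forever.

```python
def probabilistic_compile(code, phases, e, d):
    """Apply phases in order of their estimated probability of being active.

    e[i][j]: probability that phase j enables phase i ("st" is the start state).
    d[i][j]: probability that phase j disables phase i.
    """
    p = {i: e[i]["st"] for i in phases}
    attempted = []
    while any(prob > 0 for prob in p.values()):
        j = max(p, key=p.get)              # most likely active phase
        attempted.append(j)
        new_code = phases[j](code)
        if new_code != code:               # phase j was active
            code = new_code
            for i in phases:
                if i != j:
                    enable = e[i].get(j, 0.0)
                    disable = d[i].get(j, 0.0)
                    p[i] += (1 - p[i]) * enable - p[i] * disable
        p[j] = 0.0                         # j cannot be active again right away
    return code, attempted

# Hypothetical phases and probability tables.
PHASES = {
    "s": lambda c: c.replace("ab", "c"),
    "h": lambda c: c.replace("cc", "c"),
    "u": lambda c: c.replace("d", ""),
}
E = {
    "s": {"st": 1.0, "u": 0.5},
    "h": {"st": 0.2, "s": 0.8, "u": 0.1},
    "u": {"st": 0.6},
}
D = {"s": {}, "h": {}, "u": {}}

result, attempted = probabilistic_compile("adbabcd", PHASES, E, D)
print(result, attempted)
```

In this toy run every attempted phase turns out to be active, which is the point of the technique: phases that are unlikely to find work are never tried, so fewer phases are attempted than in a fixed batch sequence.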

Probabilistic Compilation Results
[Table: columns Function, Old Compilation (Attempted, Active), Prob. Compilation (Attempted, Active), Prob. / Old (Time, Size, Speed); rows start_inp...(j), parse_swi...(j), start_inp...(j), start_inp...(j), start_inp...(j), fft_float(f), main(f), sha_trans...(h), read_scan...(j), LZWReadByte(j), main(j), dijkstra(d), average; numeric values missing, several Speed entries are N/A]

Conclusions
Phase ordering problem
  – long-standing problem in compiler optimization
  – exhaustive evaluation always considered infeasible
Exhaustively evaluated the phase order space
  – re-interpretation of the problem
  – novel application of search algorithms
  – fast pruning techniques
  – accurate prediction of relative performance
Analyzed properties of the phase order space to speed up conventional compilation
Published in CGO ’06, LCTES ’06; submitted to TOPLAS

Challenges
Exhaustive phase order search is a severe stress test for the compiler
  – isolate analysis required and invalidated by each phase
  – produce correct code for all phase orderings
  – eliminate all memory leaks
Search algorithm needs to be highly efficient
  – used CRCs and hashes for function comparisons
  – stored intermediate function instances to reduce disk access
  – maintained logs to restart search after crash

VISTA
Provides an interactive code improvement paradigm
  – view low-level program representation
  – apply existing phases and manual changes in any order
  – browse and undo previous changes
  – automatically obtain performance information
  – automatically search for effective phase sequences
Useful as a research tool as well as a teaching tool
  – employed in three universities
Published in LCTES ’03, TECS ’06

VISTA – Main Window
[Figure: screenshot of the VISTA main window]

Faster Genetic Algorithm Searches
Improving performance of genetic algorithms
  – avoid redundant executions of the application
      over 87% of executions were avoided
      reduced search time by 62%
  – modify search to obtain comparable results in fewer generations
      reduced GA generations by 59%
      reduced search time by 35%
Published in PLDI ’04, TACO ’05

Heuristic Search Algorithms
Analyzing the phase order space to improve heuristic algorithms
  – detailed performance and cost comparison of different heuristic algorithms
  – demonstrated the importance and difficulty of selecting the correct sequence length
  – illustrated the importance of leaf function instances
  – proposed modifications to existing algorithms, and new search algorithms
Will be published in CGO ’07

Dynamic Compilation
Explored asynchronous dynamic compilation in a virtual machine
  – demonstrated shortcomings of the current popular compilation strategy
  – described the importance of minimum compiler utilization
  – discussed new compilation strategies
  – explored the changes needed to current compilation strategies to exploit free cycles
Submitted to VEE ’07

Compiler Technology Challenges
[Diagram: High Level Language, compiler, Machine Architecture]

Support for parallelism
  – traditional languages
  – express parallelism
  – dynamic scheduling
Virtual machines
  – dynamic code generation and optimization
Push compilation decisions further down

Multi-core
  – heterogeneous cores
No great solution
  – performance monitoring
  – software-controlled reconfiguration
Can no longer do it alone

Iterative Compilation & Machine Learning
Improved scope for iterative compilation & machine learning
  – proliferation of new architectures
      automate tuning compiler heuristics
  – tuning important libraries
  – using performance monitors
      dynamic JIT compilers
How to use machine learning to optimize and schedule more efficiently?

Dynamic Compilation
Virtual machines likely to grow in importance
  – productivity, portability, interoperability, isolation...
Challenges
  – when, what, how to parallelize
  – using hardware performance monitors
  – using static analyses to aid dynamic compilation
  – debugging tools for correctness and performance debugging

Heterogeneous Multi-core Architectures
Can provide the best performance, cost, power balance
Challenges
  – schedule tasks, allocate resources
  – dynamic core-specific optimization
  – automatic data layout to prevent conflicts

Questions?

Results – Code Size Summary
Exhaustively enumerated 234 out of 244 functions
Sequence length
  – Maximum: 44; Average: [value missing]
Distinct function instances
  – Maximum: 2,882,021; Average: 89,947
Distinct control-flows
  – Maximum: 920; Average: 36.2
Code size improvement over default sequence
  – Maximum: 63%; Average: 6.46%

Results – Performance Summary
Exhaustively evaluated 79 out of 88 functions
Sequence length
  – Maximum: 44; Average: 16.1
Distinct function instances
  – Maximum: 2,882,021; Average: 174,574.8
Distinct control-flows
  – Maximum: 920; Average: 47.4
Performance improvement over default sequence
  – Maximum: 15%; Average: 4.8%

Leaf vs. Non-Leaf Performance
[Figure: plot not preserved in the transcript]

Phase Order Space Evaluation – Summary
[Flowchart boxes: generate next optimization sequence; last phase active? (Y/N); identical function instance? (Y/N); equivalent function instance? (Y/N); add node to DAG; seen control-flow structure? (Y/N); simulate application; calculate function performance]

Weighted Function Instance DAG
Each node is weighted by the number of paths to a leaf node
[Figure: DAG with edges labeled by phases a, b, c, d and nodes annotated [abc], [bc], [c], [ab], [d], [a]]
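Counting paths to a leaf is a standard memoized traversal of a DAG. The sketch below assumes a leaf has weight 1 and that the DAG is given as a mapping from node to children; the node names and the example graph are hypothetical and not taken from the slide's figure.

```python
from functools import lru_cache

def path_weights(edges):
    """Return {node: number of distinct paths from node to any leaf}.

    edges maps each node to a list of its children; a node with no
    children is a leaf and gets weight 1 (an assumed convention).
    """
    @lru_cache(maxsize=None)
    def weight(node):
        children = edges.get(node, [])
        if not children:
            return 1
        return sum(weight(child) for child in children)

    return {node: weight(node) for node in edges}

# Hypothetical DAG: two phase orders from the root converge on node "z".
dag = {
    "root": ["x", "y"],
    "x": ["z"],
    "y": ["z", "leaf2"],
    "z": ["leaf1"],
    "leaf1": [],
    "leaf2": [],
}
print(path_weights(dag))
```

Memoization matters here because shared nodes would otherwise be revisited once per incoming path, and the enumerated DAGs can hold millions of nodes.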

Predicting Relative Performance – II
[Figure: two control-flow graphs annotated with per-block cycle counts; the first has total cycles = 170, the second has unknown block counts and total cycles = ??]

Case when No Leaf is Optimal
[Figure: not preserved in the transcript]