Presentation transcript: "Classic Optimizing Compilers — IBM’s Fortran H Compiler"

1 Classic Optimizing Compilers: IBM’s Fortran H Compiler
Comp 512, Rice University, Spring 2011
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

2 Classic Compilers
Compiler design has been fixed since 1960:
- Front End, Middle End, & Back End
- A series of filter-style passes (the number of passes varies)
- A fixed order for the passes
[Diagram: Front End → Middle End → Back End]

3 Classic Compilers
1957: The FORTRAN Automatic Coding System
- Six passes in a fixed order
- Generated good code
- Assumed unlimited index registers
- Code motion out of loops, with ifs and gotos
- Did flow analysis & register allocation
[Diagram: Front End → Middle End (Index Optimization, Code Merge & bookkeeping) → Back End (Flow Analysis, Register Allocation, Final Assembly)]

4 Classic Compilers
1999: The SUIF Compiler System (in the NCI)
- Academic research system (Stanford)
- 3 front ends (Fortran 77, C & C++, Java), 3 back ends (C/Fortran, Alpha, x86)
- 18 passes, configurable order
- Two-level IR (High SUIF, Low SUIF)
- Intended as research infrastructure
Middle-end passes include: data dependence analysis, scalar & array privatization, reduction recognition, pointer analysis, affine loop transformations, blocking; capturing object definitions, virtual function call elimination, garbage collection; SSA construction, dead code elimination, partial redundancy elimination, constant propagation, global value numbering, strength reduction, reassociation; instruction scheduling, register allocation.

5 Classic Compilers
2000: The SGI Pro64 Compiler, now "Open64"
- Open-source optimizing compiler for IA-64
- 3 front ends (Fortran, C & C++, Java), 1 back end
- Five-level IR, with gradual lowering of the abstraction level
Phases (Front End → Interprocedural Analysis & Optimization → Loop Nest Optimization → Global Optimization → Code Generation):
- Interprocedural: classic analysis; inlining (user & library code); cloning (constants & locality); dead function elimination; dead variable elimination
- Loop Nest Optimization: dependence analysis; parallelization; loop transformations (fission, fusion, interchange, peeling, tiling, unroll & jam); array privatization
- Global Optimization: SSA-based analysis & optimization; constant propagation, PRE, OSR+LFTR, DVNT, DCE (also used by other phases)
- Code Generation: if-conversion & predication; code motion; scheduling (including software pipelining); allocation; peephole optimization

6 Classic Compilers
Even a 2000 JIT fits the mold, albeit with fewer passes:
- Front-end tasks are handled elsewhere
- Few (if any) optimizations
- Avoids expensive analysis
- Emphasis on generating native code
- Compilation must be profitable
[Diagram: Java environment → bytecode → Middle End → Back End → native code]

7 Classic Compilers
Most optimizing compilers fit this basic framework. What’s the difference between them?
- More boxes, better boxes, different boxes
- Picking the right boxes in the right order
To understand the issues:
- Must study compilers, for big-picture issues
- Must study boxes, for detail issues
We will do both: look at some of the great compilers of yesteryear.
[Diagram: Front End → Middle End → Back End]

8 Fortran H Enhanced (the "new" compiler)
"Improved Optimization of FORTRAN Object Programs," R.G. Scarborough & H.G. Kolsky
Started with a good compiler — Fortran H Extended:
- Fortran H was one of the first commercial compilers to perform systematic analysis (both control flow & data flow)
- Extended for System 370 features
- Subsequently served as a model for parts of VS Fortran
The authors had commercial concerns, not "great compiler" concerns:
- Compilation speed
- Bit-by-bit equality of results
- Numerical methods must remain fixed
The paper describes improvements in the -O2 optimization path.

9 Fortran H Extended (the "old" compiler)
Some of its quality comes from choosing the right code shape. The translation to quads performs careful local optimization:
- Replace an integer multiply by 2^k with a shift
- Expand exponentiation by a known integer constant
- Perform minor algebraic simplification on the fly (handling multiple negations, local constant folding)
This is a classic example of "code shape." Bill Wulf popularized the term (and probably coined it). It refers to the choice of specific code sequences; "shape" often encodes heuristics to handle complex issues.
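A minimal sketch of these local rewrites, expressed as C source transformations rather than the compiler’s actual quads (the function names are illustrative, not from the paper):

    #include <stdio.h>

    /* Sketch only: the kinds of local rewrites Fortran H Extended
       made while translating to quads, shown in C. */

    int scale16(int n) {
        /* n * 16 rewritten as a shift, since 16 = 2^4 */
        return n << 4;
    }

    double pow5(double x) {
        /* x**5 expanded in line for a known integer exponent:
           x^5 = (x^2)^2 * x -- three multiplies, no library call */
        double x2 = x * x;
        double x4 = x2 * x2;
        return x4 * x;
    }

    int main(void) {
        printf("%d %g\n", scale16(3), pow5(2.0));  /* prints: 48 32 */
        return 0;
    }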

10 Code Shape
My favorite example: x + y + z. Addition is commutative & associative for integers, so the expression has three three-address shapes:
- x + y → t1; t1 + z → t2
- x + z → t1; t1 + y → t2
- y + z → t1; t1 + x → t2
What if x is 2 and z is 3? What if y + z is evaluated earlier? The "best" shape for x + y + z depends on contextual knowledge, and there may be several conflicting options.
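A small illustration (mine, not the slide’s) of why the choice matters when x is 2 and z is 3: only the shape that pairs x with z lets local constant folding fire.

    /* With x = 2 and z = 3 known at compile time, the shape
       (x + z) + y folds to 5 + y; the shape (x + y) + z leaves two
       runtime additions because the constants never meet. */
    int sum_shape_xz(int y) {
        return 5 + y;      /* (x + z) + y, after folding x + z = 5 */
    }

    int sum_shape_xy(int y) {
        int t1 = 2 + y;    /* x + y -> t1 */
        return t1 + 3;     /* t1 + z -> t2: no folding possible */
    }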

11 Fortran H Extended (old)
Some of the improvement in Fortran H comes from choosing the right code shape for their target & their optimizations:
- Shape simplifies the analysis & optimization
- Shape encodes heuristics to handle complex issues
The rest came from systematic application of a few optimizations (see the sketch after this list):
- Common subexpression elimination
- Code motion
- Strength reduction
- Register allocation
- Branch optimization
Not many optimizations, by modern standards (e.g., SUIF, Open64).
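A schematic before/after (not from the paper) showing three of those optimizations — CSE, code motion, and strength reduction — cooperating on one small loop:

    /* Schematic before/after, not from the paper. */

    /* Before: i*n is recomputed on every iteration, and each access
       rebuilds the full address expression a + i*n + j. */
    double row_sum_before(const double *a, int n, int i) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a[i * n + j];   /* i*n is loop-invariant here */
        return s;
    }

    /* After: code motion hoists i*n out of the loop, and strength
       reduction turns the indexed access into a marching pointer. */
    double row_sum_after(const double *a, int n, int i) {
        const double *p   = a + (long)i * n;  /* hoisted, computed once */
        const double *end = p + n;
        double s = 0.0;
        for (; p < end; p++)
            s += *p;             /* the pointer is the induction variable */
        return s;
    }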

12 Summary
This compiler fits the classic model:
- Focused on a single loop at a time for optimization
- Worked from the innermost loop to the outermost loop
- The compiler was just 27,415 lines of Fortran + 16,721 lines of assembly
[Diagram: Front End (Scan & Parse — parser written in Fortran 66!) → Middle End (Build CFG & Dominators, CSE, Code Motion, OSR) → Back End (Register Allocation, Final Assembly)]

13 Fortran H Enhanced (new)
This work began as a study of customer applications, which found many loops that could be better.
- The project aimed to produce hand-coded quality
- The project had clear, well-defined standards & goals
- The project had a clear, well-defined stopping point
- Fortran H Extended was already an effective compiler
[Table: aggregate operations for a plasma physics code, in millions — a huge decrease in overhead ops (a 78% reduction, then another 35%), little decrease in useful ops]

14 Fortran H Enhanced (new)
How did they improve it? The work focused on four areas:
- Reassociation of subscript expressions
- Rejuvenating strength reduction
- Improving register allocation
- Engineering issues
Note: this is not a long list!

15 Reassociation of Subscript Expressions
Don’t generate the standard address polynomial. (For those of you educated from EaC, a history lesson is needed.) Prior to this paper (& much later in the texts), the conventional wisdom was to generate the following code. For a 2-d array A declared as A(low1:high1, low2:high2), the reference A(i1,i2) generates the polynomial
    A0 + ((i2 - low2) × (high1 - low1 + 1) + i1 - low1) × w
where (high1 - low1 + 1) is the length of dimension 1 (precompute it), and the order is column-major because the paper is about FORTRAN. This form of the polynomial minimizes total ops:
- Good for operation count; bad for common subexpression elimination, strength reduction, instruction scheduling, …
- With A(i+1,j) and A(i+1,j+1), the difference is bound into the expression before the common piece can be exposed

16 Reassociation of Subscript Expressions (still column-major order)
Now imagine a typical "stencil" computation:
    a(i,j) = (a(i-1,j) + a(i,j) + a(i+1,j) + a(i,j-1) + a(i,j+1)) / 5
Surrounding loops (on i, then j) move the stencil over the entire array, adjusting the value of the central element. Typical stencils include 5, 7, or 11 points.

17 Reassociation of Subscript Expressions (still column-major order)
Consider the subexpressions found (or hidden) inside that stencil: each of the five references expands to its own copy of the address polynomial, burying the pieces they share.
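Here is a sketch of the reassociated address arithmetic for that stencil (C, with zero-based bounds assumed for brevity; len1 is the length of dimension 1, and the layout is column-major as in FORTRAN). The three column bases are the common pieces the classic per-reference polynomial hides:

    /* Sketch: reassociated addressing for the 5-point stencil.
       Zero-based bounds assumed for brevity; len1 = length of
       dimension 1; storage is column-major, as in FORTRAN. */
    void smooth(double *a, int len1, int n1, int n2) {
        for (int j = 1; j < n2 - 1; j++) {
            double *col  = a + (long)j * len1;  /* base of column j   */
            double *west = col - len1;          /* base of column j-1 */
            double *east = col + len1;          /* base of column j+1 */
            for (int i = 1; i < n1 - 1; i++)
                /* all five references share three column bases;
                   only the i offset differs among them */
                col[i] = (col[i-1] + col[i] + col[i+1]
                          + west[i] + east[i]) / 5.0;
        }
    }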

18 Reassociation of Subscript Expressions
Don’t generate the standard address polynomial — forget the classic form. Instead:
- Break the polynomial into six parts
- Separate the parts that fall naturally into outer loops
- Compute everything possible at compile time
This makes the tree for address expressions broad, not deep: group together operands that vary at the same loop level. (The tradeoff is driven by CSE versus strength reduction — read p. 665ff carefully.)
The point (a novel improvement at the time):
- Pick the right shape for the code (expose the opportunity)
- Let the other optimizations do the work
Sources of improvement:
- Fewer operations execute
- Decreased sensitivity to the number of dimensions

19 Reassociation of Subscript Expressions
Distribution creates different expressions:
    w + y × (x + z)  →  w + y×x + y×z
More operations, but they may move to different places. Consider A(i,j), where A is declared A(0:n,0:m):
- Standard polynomial: @A + (i × m + j) × w
- Alternative: @A + i × m × w + j × w
Does this help? The i part and the j part vary in different loops. The standard polynomial pins j in the loop where i varies; the alternative frees each part to move to the loop level where it is invariant. This can produce significant reductions in operation count, but the general problem is quite complex.
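A hedged sketch of the payoff in a loop nest (C, following the slide’s simplified polynomial with an n-by-m array of stride m; the names are illustrative):

    /* Sketch following the slide's simplified polynomial: with the
       distributed form @A + i*m*w + j*w, the i part is invariant in
       the inner j loop, so code motion can hoist it. */
    double sum_all(const double *A, int n, int m) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            const double *row = A + (long)i * m;  /* i*m*w computed once per i */
            for (int j = 0; j < m; j++)
                s += row[j];                      /* only j*w varies here */
        }
        return s;
    }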

20 Reduction of Strength
Many cases had been disabled in maintenance — almost all the subtraction cases had been turned off. They fixed the bugs and re-enabled the corresponding cases, catching "almost all" the eligible cases.
Extensions:
- Iterate the transformations, to avoid ordering problems (e.g., (i+j)*4) and to catch secondary effects
- Capitalize on user-coded reductions (shape)
- Eliminate duplicate induction variables (reassociation)
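A minimal before/after for operator strength reduction (illustrative C, not the compiler’s output): the multiply i*4 in the subscript becomes an additive induction variable.

    /* Illustrative before/after for operator strength reduction. */

    long sum_stride4_before(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i * 4];    /* one multiply per iteration */
        return s;
    }

    long sum_stride4_after(const int *a, int n) {
        long s = 0;
        int k = 0;            /* induction variable: k == i*4 */
        for (int i = 0; i < n; i++) {
            s += a[k];
            k += 4;           /* multiply reduced to an add */
        }
        return s;
    }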

21 Register Allocation
The original allocator:
- Divided the register set into local & global pools
- Used different mechanisms for each pool
Problems:
- Bad interactions between local & global allocation
- Unused registers dedicated to the procedure linkage
- Unused registers dedicated to the global pool
- Extra (unneeded) initializations
Remember the 360: a two-address machine with destructive operations.

22 Register Allocation
The new allocator:
- Remaps to avoid local/global duplication
- Scavenges unused registers for local use
- Removes dead initializations
- Adds section-oriented branch optimizations
Plus:
- A change in the local spill heuristic
- Can allocate all four floating-point registers
- Biases register choice by selection in inner loops
- Better spill-cost estimates
- Better branch-on-index selection
All of these symptoms arise from not having a global register allocator, such as a graph-coloring allocator.

23 Engineering Issues
Increased the name space:
- Was 127 slots (80 for variables & constants, 47 for the compiler)
- Increased to 991 slots
- Constants no longer need slots
- "Very large" routines need < 700 slots (remember the inlining study?)
Common subexpression elimination (CSE):
- Removed the limit on the backward search for CSEs
- Taught CSE to avoid some substitutions that cause spills
- Extended constant handling to negative values

24 Results
They stopped working on the optimizer when hand-coding no longer improved the inner loops.
- Produced a significant change in the ratio of flops to instructions
I consider this to be the classic Fortran optimizing compiler.
[Table: aggregate operations for a plasma physics code, in millions]

25 Results
Final points:
- Performance numbers vary from model to model
- The compiler ran faster, too!
It relies on:
- A handful of carefully targeted optimizations
- Generating the right IR in the first place (code shape)
Next class: a general discussion of optimization — terms, scopes, goals, methods — with simple examples. No particular assigned reading; you might look over local value numbering in Chapter 8 of EaC or in the Briggs, Cooper, & Simpson paper.

