Terminology, Principles, and Concerns, III With examples from DOM (Ch 9) and DVNT (Ch 10) Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.


Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Comp 512, Spring 2011.

Last Lecture

Extended basic blocks
Superlocal value numbering
- Treat each path as a single basic block
- Use a scoped hash table & SSA names to make it efficient

COMP 512, Rice University
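The scoped hash table mentioned above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the class and function names, and the tuple encoding of operations, are this sketch's own. Redundant expressions are rewritten as copies of the name that first computed the value, and discarding a block's scope on exit automatically undoes its entries.

```python
class ScopedTable:
    """A chain of dictionaries: one scope per block along the current EBB path."""
    def __init__(self, parent=None):
        self.parent = parent
        self.table = {}

    def lookup(self, key):
        scope = self
        while scope is not None:
            if key in scope.table:
                return scope.table[key]
            scope = scope.parent
        return None

    def insert(self, key, value):
        self.table[key] = value


def value_number_block(ops, table, counter):
    """Value-number one block of (dest, op, arg1, arg2) tuples.

    `counter` is a one-element list so the next value number
    survives across blocks on the same path.
    """
    out = []
    for dest, op, a1, a2 in ops:
        v1 = table.lookup(a1)
        v2 = table.lookup(a2)
        v1 = a1 if v1 is None else v1
        v2 = a2 if v2 is None else v2
        if op in ('+', '*') and str(v2) < str(v1):   # canonicalize commutative ops
            v1, v2 = v2, v1
        key = (op, v1, v2)
        hit = table.lookup(key)
        if hit is not None:                          # redundant: reuse prior result
            vn, prior = hit
            table.insert(dest, vn)
            out.append((dest, 'copy', prior, None))
        else:
            vn = counter[0]; counter[0] += 1
            table.insert(dest, vn)
            table.insert(key, (vn, dest))
            out.append((dest, op, a1, a2))
    return out
```

Because each successor block gets a child scope, entries made in C never pollute A's table; leaving C simply drops the child dictionary.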

This Lecture

Dominator trees
- Computing dominator information
- Global data-flow analysis

Dominator-based value numbering
- Enhance the superlocal value numbering algorithm so that it can cover more blocks

Optimizing a loop nest
- Finding loop nests
- Loop unrolling as an initial transformation

Superlocal Value Numbering (the example is in SSA form)

  A: m0 ← a + b        B: p0 ← c + d        C: q0 ← a + b
     n0 ← a + b           r0 ← c + d           r1 ← c + d

  D: e0 ← b + 18       E: e1 ← a + 17
     s0 ← a + b           t0 ← c + d
     u0 ← e + f           u1 ← e + f

  F: e3 ← φ(e0, e1)    G: r2 ← φ(r0, r1)
     u2 ← φ(u0, u1)       y0 ← a + b
     v0 ← a + b           z0 ← c + d
     w0 ← c + d
     x0 ← e + f

(Edges: A → B, A → C, C → D, C → E, D → F, E → F, B → G, F → G.)

With all the bells & whistles
- Finds more redundancy
- Pays little additional cost
- Still does nothing for F & G

Superlocal techniques
- Some local methods extend cleanly to superlocal scopes
- VN does not back up
- If C adds an expression to A's table, it's a problem

What About Larger Scopes? (same example CFG as the previous slide)

We have not helped with F or G
- Multiple predecessors
- Must decide what facts hold in F and in G
  - For G, combine the tables from B & F?
  - Merging state is expensive
  - Fall back on what's known

Dominators

Definitions
- x dominates y if and only if every path from the entry of the control-flow graph to the node for y includes x
- By definition, x dominates x
- We associate a DOM set with each node; |DOM(x)| ≥ 1

Immediate dominators
- For any node x other than the entry, there must be a y in DOM(x), other than x itself, that is closest to x
- We call this y the immediate dominator of x
- As a matter of notation, we write this as IDOM(x)

Dominators

Dominators have many uses in analysis & transformation
- Finding loops
- Building SSA form
- Making code motion decisions

We'll look at how to compute dominators later.

Dominator sets and dominator tree for the example CFG:

  Block:  A    B     C     D       E       F       G
  DOM:    A    A,B   A,C   A,C,D   A,C,E   A,C,F   A,G

  Dominator tree: A is the root, with children B, C, and G; C has children D, E, and F.

Back to the discussion of value numbering over larger scopes...

Original idea: R.T. Prosser, "Applications of Boolean matrices to the analysis of flow diagrams," Proceedings of the Eastern Joint Computer Conference, Spartan Books, New York, 1959.

What About Larger Scopes? (continued)

We have not helped with F or G
- Multiple predecessors
- Must decide what facts hold in F and in G
  - For G, combine the tables from B & F?
  - Merging state is expensive
  - Fall back on what's known

We can use the table from IDOM(x) to start x
- Use C's table for F and A's table for G
- Imposes a dominator-based application order

This leads to the Dominator-based Value Numbering Technique (DVNT).

Dominator Value Numbering

The DVNT algorithm
- Use the superlocal algorithm on extended basic blocks
  - Retain the use of scoped hash tables & the SSA name space
- Start each node with the table from its IDOM
  - DVNT generalizes the superlocal algorithm
- No values flow along back edges (i.e., around loops)
- Constant folding and algebraic identities work as before
- The larger scope leads to (potentially) better results
  - LVN + SVN + a good start for EBBs missed by SVN
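The table-scoping discipline at the heart of DVNT is a walk over the dominator tree: each block's table starts from its IDOM's table, and a block's own entries are visible only in blocks it dominates. The sketch below shows just that discipline (it omits the value numbering itself, which the previous sketch covered); the function and variable names are this sketch's own.

```python
def dvnt_walk(block, children, local_facts, inherited=()):
    """Visit blocks in dominator-tree order.

    `children` maps a block to the blocks it immediately dominates;
    `local_facts` maps a block to the table entries it creates.
    Each block's visible table = its IDOM's table + its own entries.
    """
    visible = dict(inherited)
    visible.update(local_facts.get(block, {}))
    tables = {block: dict(visible)}
    for child in children.get(block, []):
        tables.update(dvnt_walk(child, children, local_facts,
                                tuple(visible.items())))
    return tables
```

On the running example's dominator tree, F starts with everything A and C computed, while G starts with only A's facts, since C does not dominate G.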

Dominator Value Numbering (the example CFG again, after DVNT)

DVNT advantages
- Finds more redundancy
- Little additional cost
- Retains the online character

DVNT shortcomings
- Misses some opportunities
- No loop-carried CSEs or constants

Computing Dominators

A critical first step in SSA construction and in DVNT.

A node n dominates m iff n is on every path from n0 to m
- Every node dominates itself
- n's immediate dominator is its closest dominator, IDOM(n) †

  DOM(n0) = { n0 }
  DOM(n)  = { n } ∪ ( ∩ over p in preds(n) of DOM(p) )

Computing DOM
- These simultaneous set equations define a simple problem in data-flow analysis
- The equations have a unique fixed-point solution
- An iterative fixed-point algorithm will solve them quickly

† IDOM(n) ≠ n, unless n is n0, by convention. Initially, DOM(n) = N (the set of all nodes) for all n ≠ n0.

Round-Robin Iterative Algorithm

  DOM(b0) ← { b0 }
  for i ← 1 to N
    DOM(bi) ← { all nodes in the graph }

  change ← true
  while (change)
    change ← false
    for i ← 1 to N
      TEMP ← { bi } ∪ ( ∩ over x in preds(bi) of DOM(x) )
      if DOM(bi) ≠ TEMP then
        change ← true
        DOM(bi) ← TEMP

Termination
- The algorithm makes sweeps over the nodes
- It halts when some sweep produces no change
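The round-robin algorithm transcribes almost directly into Python. This is a sketch, with a graph encoding of this example's own choosing: `blocks` is an ordered list of node names, `preds` maps each node to its predecessors, and `entry` is the (predecessor-free) entry node.

```python
def compute_dominators(blocks, preds, entry):
    """Round-robin iterative solver for the DOM equations."""
    dom = {b: set(blocks) for b in blocks}   # initially DOM(b) = all nodes
    dom[entry] = {entry}
    changed = True
    while changed:                           # sweep until a fixed point
        changed = False
        for b in blocks:
            if b == entry:
                continue
            temp = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if temp != dom[b]:
                dom[b] = temp
                changed = True
    return dom
```

On a diamond-shaped graph, the join node is dominated only by itself, the fork node, and the entry, exactly as the equations predict.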

Example

[Figure: a flow graph with blocks B0 through B7, followed by tables tracing the progress and the final results of the iterative solution for DOM.]

Example

[Figure: the dominator tree for the same flow graph, B0 through B7, with the progress and results of the iterative solution for DOM.]

There are asymptotically faster algorithms. With the right data structures, the iterative algorithm can be made extremely fast. See Cooper, Harvey, & Kennedy (on the web site), or the algorithm in Chapter 9 of EaC.

Aside on Data-Flow Analysis

The iterative DOM calculation is an example of data-flow analysis
- Data-flow analysis is a collection of techniques for compile-time reasoning about the run-time flow of values
- Data-flow analysis almost always operates on a graph
  - Problems are trivial in a basic block
  - Global problems use the control-flow graph (or a derivative)
  - Interprocedural problems use the call graph (or a derivative)
- Data-flow problems are formulated as simultaneous equations
  - Sets attached to nodes and edges
  - One solution technique is the iterative algorithm
- The desired result is usually the meet-over-all-paths (MOP) solution
  - "What is true on every path from the entry node?"
  - "Can this event happen on any path from the entry?"
  - Related to safety

Aside on Data-Flow Analysis

Why did the iterative algorithm work?

Termination
- The DOM sets are initialized to the (finite) set of nodes
- The DOM sets shrink monotonically
- The algorithm reaches a fixed point where they stop changing

Correctness
- We can prove that the fixed-point solution is also the MOP
- That proof is beyond today's lecture, but we'll revisit it

Efficiency
- The round-robin algorithm is not particularly efficient
- The order in which we visit the nodes is important for efficient solutions

Regional Optimization: Improving Loops

Compilers have always focused on loops
- Higher execution counts inside loops than outside
- Repeated, related operations
- Much of the real work takes place in loops (linear algebra)

Several effects to attack in a loop or loop nest
- Overhead
  - Decrease the control-structure cost per iteration
- Locality
  - Spatial locality ⇒ use of co-resident data
  - Temporal locality ⇒ reuse of the same data item
- Parallelism
  - Move loops with independent iterations to the outer position
  - Inner positions for vector hardware & SSE

Regional Optimization: Improving Loops

Loop unrolling (the oldest trick in the book)
- To reduce overhead, replicate the loop body

    do i = 1 to 100 by 1
      a(i) = a(i) + b(i)
    end

becomes (unroll by 4)

    do i = 1 to 100 by 4
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
    end

Sources of improvement
- Less overhead per useful operation
- Longer basic blocks for local optimization
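The rewrite above can be transcribed into Python so that its behavior-preservation is easy to check. Python itself gains nothing from unrolling; this sketch only demonstrates that the two forms compute the same result, under the same assumption the slide makes (the trip count is a multiple of the unroll factor).

```python
def sum_rolled(a, b):
    """The original loop: one add, one test, one branch per element."""
    for i in range(len(a)):
        a[i] = a[i] + b[i]
    return a

def sum_unrolled4(a, b):
    """Unrolled by 4: the same adds with a quarter of the loop overhead.
    Assumes len(a) is a multiple of 4, as in the slide's example."""
    for i in range(0, len(a), 4):
        a[i]     = a[i]     + b[i]
        a[i + 1] = a[i + 1] + b[i + 1]
        a[i + 2] = a[i + 2] + b[i + 2]
        a[i + 3] = a[i + 3] + b[i + 3]
    return a
```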

Doesn’t mess up spatial locality on either y or m (column-major order) Regional Optimization: Improving Loops With loop nest, may unroll inner loop COMP 512, Rice University19 do 60 j = 1, n2 do 50 i = 1 to n1 y(i) = y(i) + x(j) * m(i,j) 50 continue 60 continue Critical inner loop from dmxpy in Linpack Doesn’t mess up reuse on x(j) do 60 j = 1, n2 nextra = mod(n1,4) if (nextra.ge. 0) then do 49 i = 1, nextra, 1 y(i) = y(i) + x(j) * m(i,j) 49 continue do 50 i = nextra+1, n1, 4 y(i) = y(i) + x(j) * m(i,j) y(i+1) = y(i+1) + x(j) * m(i+1,j) y(i+2) = y(i+2) + x(j) * m(i+2,j) y(i+3) = y(i+3) + x(j) * m(i+3,j) 50 continue 60 continue

Regional Optimization: Improving Loops

With a loop nest, we may instead unroll the outer loop
- The trick is to unroll the outer loop and fuse the resulting inner loops
  - Loop fusion combines the bodies of two similar loops

A naive attempt on the dmxpy loop:

    do 60 j = 1, n2
      nextra = mod(n1,4)
      if (nextra .ge. 1) then
        do 49 i = 1, nextra, 1
          y(i) = y(i) + x(j) * m(i,j)
 49     continue
      do 50 i = nextra+1, n1, 4
        y(i) = y(i) + x(j)   * m(i,j)
        y(i) = y(i) + x(j+1) * m(i,j+1)
        y(i) = y(i) + x(j+2) * m(i,j+2)
        y(i) = y(i) + x(j+3) * m(i,j+3)
 50   continue
 60 continue

This is clearly wrong: j still steps by 1, so each column of m is used four times and x(j+3) runs past the end of x, while i now steps by 4 with a body that touches only y(i), skipping rows.

Regional Optimization: Improving Loops

Unrolling the outer loop by 4 and fusing the resulting inner loops:

    do 60 j = 1, n2, 4
      do 50 i = 1, n1
        y(i) = y(i) + x(j)   * m(i,j)   + x(j+1) * m(i,j+1)
     &              + x(j+2) * m(i,j+2) + x(j+3) * m(i,j+3)
 50   continue
 60 continue

(A clean-up loop handles the leftover columns when n2 is not a multiple of 4.)

- Saves on loads & stores of y(i)
- Spatial reuse in x and m

The author of Linpack, after much testing, chose outer-loop unrolling.
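The outer-unroll-and-fuse version can likewise be checked against the original nest in Python. This is a sketch, not the Linpack code: it uses a row-major list-of-lists for m and, like the slide's fused loop, assumes n2 is a multiple of 4.

```python
def dmxpy_original(y, x, m, n1, n2):
    """The original nest: y(i) += x(j) * m(i,j)."""
    for j in range(n2):
        for i in range(n1):
            y[i] += x[j] * m[i][j]
    return y

def dmxpy_outer_unrolled(y, x, m, n1, n2):
    """Outer loop unrolled by 4, with the four inner-loop bodies fused.
    Assumes n2 is a multiple of 4 (the real code has a clean-up loop).
    The fused body loads and stores y[i] once instead of four times."""
    for j in range(0, n2, 4):
        for i in range(n1):
            y[i] += (x[j]     * m[i][j]
                     + x[j + 1] * m[i][j + 1]
                     + x[j + 2] * m[i][j + 2]
                     + x[j + 3] * m[i][j + 3])
    return y
```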

Regional Optimization: Improving Loops

Other effects of loop unrolling
- Increases the number of independent operations inside the loop
  - May be good for scheduling multiple functional units
- Moves consecutive accesses into the same iteration
  - The scheduler may move them together (locality in the big loop)
- May make cross-iteration redundancies obvious
  - Exposes the address expressions in the example to LVN
- May increase demand for registers
  - Spills can overcome any benefits
- Can unroll to eliminate copies at the end of a loop
  - An often-rediscovered result from Ken Kennedy's thesis
- Can change other optimizations
  - Weights in spill code (Das Gupta's example)

Regional Optimization: Improving Loops

Many other loop transformations appear in the literature
- We will have a lecture devoted to them later in the course
- See also COMP 515 and the Allen-Kennedy book

Next class: Examples of Global Optimization