Terminology, Principles, and Concerns, III
With examples from DOM (Ch 9) and DVNT (Ch 10)
Comp 512, Rice University, Spring 2011

Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Last Lecture
Extended basic blocks and superlocal value numbering:
- Treat each path through an extended basic block as a single basic block
- Use a scoped hash table and SSA names to make it efficient
This Lecture
- Dominator trees: computing dominator information; global data-flow analysis
- Dominator-based value numbering: enhancing the superlocal value numbering algorithm so that it can cover more blocks
- Optimizing a loop nest: finding loop nests; loop unrolling as an initial transformation
Superlocal Value Numbering
The running example CFG, in SSA form (assignments and φ-functions reconstructed from the slide's figure):
  A: m0 ← a + b; n0 ← a + b
  B: p0 ← c + d; r0 ← c + d
  C: q0 ← a + b; r1 ← c + d
  D: e0 ← b + 18; s0 ← a + b; u0 ← e + f
  E: e1 ← a + 17; t0 ← c + d; u1 ← e + f
  F: e3 ← φ(e0, e1); u2 ← φ(u0, u1); v0 ← a + b; w0 ← c + d; x0 ← e + f
  G: r2 ← φ(r0, r1); y0 ← a + b; z0 ← c + d
With all the bells and whistles, superlocal value numbering finds more redundancy and pays little additional cost, but it still does nothing for F and G. Some local methods extend cleanly to superlocal scopes, but value numbering does not back up: if processing C were to add facts to A's table, that would be a problem.
What About Larger Scopes?
We have not helped with F or G. Each has multiple predecessors, so we must decide which facts hold on entry to F and to G. For G, should we combine the tables from B and F? Merging state is expensive, so instead we fall back on what is known to hold. (See the example CFG above.)
Dominators
Definitions:
- x dominates y if and only if every path from the entry of the control-flow graph to the node for y includes x
- By definition, x dominates x
- We associate a DOM set with each node; |DOM(x)| ≥ 1
Immediate dominators:
- For any node x, there must be a y in DOM(x) closest to x
- We call this y the immediate dominator of x and, as a matter of notation, write it IDOM(x)
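The definition can be checked directly on a small example. The Python sketch below computes DOM sets by brute-force path enumeration; the A-G edge set is an assumption reconstructed from the slides' figure (A→B, A→C, B→G, C→D, C→E, D→F, E→F, F→G), and the enumeration only terminates because this example graph is acyclic.

```python
# Hypothetical edge set reconstructed from the slides' A-G example CFG.
succ = {
    "A": ["B", "C"], "B": ["G"], "C": ["D", "E"],
    "D": ["F"], "E": ["F"], "F": ["G"], "G": [],
}

def all_paths(graph, src, dst, path=None):
    """Enumerate every path src -> dst (works here because the graph is acyclic)."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for s in graph[src]:
        yield from all_paths(graph, s, dst, path)

def dom_by_definition(graph, entry):
    """DOM(n) = the nodes that appear on *every* entry -> n path."""
    dom = {}
    for n in graph:
        paths = [set(p) for p in all_paths(graph, entry, n)]
        dom[n] = set.intersection(*paths)
    return dom

dom = dom_by_definition(succ, "A")
# e.g. every path to F passes through A and C, so dom["F"] == {"A", "C", "F"}
```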
Dominators
Dominators have many uses in analysis and transformation:
- Finding loops
- Building SSA form
- Making code-motion decisions
We will look at how to compute dominators later. For the example CFG, the dominator tree has A as the root with children B, C, and G, and C with children D, E, and F.
Original idea: R.T. Prosser, "Applications of Boolean matrices to the analysis of flow diagrams," Proceedings of the Eastern Joint Computer Conference, Spartan Books, New York, pages 133-138, 1959.
Back to the discussion of value numbering over larger scopes...
What About Larger Scopes?
We have not helped with F or G: multiple predecessors mean we must decide which facts hold in F and in G. Combining B and F for G would require merging state, which is expensive; instead we fall back on what is known. In particular, we can use the table from IDOM(x) to start x: use C's table for F and A's table for G. This imposes a dominator-based application order, and it leads to the Dominator-based Value Numbering Technique (DVNT).
Dominator Value Numbering
The DVNT algorithm:
- Use the superlocal algorithm on extended basic blocks
- Retain the scoped hash tables and the SSA name space
- Start each node with the table from its IDOM
DVNT generalizes the superlocal algorithm:
- No values flow along back edges (i.e., around loops)
- Constant folding and algebraic identities work as before
- The larger scope leads to (potentially) better results: LVN + SVN + a good starting table for EBBs missed by SVN
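The traversal DVNT performs can be sketched as a preorder walk of the dominator tree, where each block begins with a copy of its IDOM's table. The sketch below uses a hypothetical tuple IR and deliberately omits pieces the slide mentions (φ-function handling, constant folding, algebraic identities); it is an illustration of the scoping discipline, not the full algorithm.

```python
# Hypothetical IR: each block is a list of (dest, op, left, right)
# tuples in SSA form; `children` encodes the dominator tree.
class DVNT:
    def __init__(self, blocks, children):
        self.blocks = blocks
        self.children = children
        self.replaced = {}                 # SSA name -> earlier equivalent name

    def visit(self, b, table):
        scope = dict(table)                # start from IDOM's table
        for dest, op, l, r in self.blocks[b]:
            l = self.replaced.get(l, l)    # canonicalize operand names
            r = self.replaced.get(r, r)
            key = (op, l, r)
            if key in scope:               # value already available: redundant
                self.replaced[dest] = scope[key]
            else:
                scope[key] = dest
        for c in self.children.get(b, []):
            self.visit(c, scope)           # each child starts with this table

# Tiny example: A dominates B and C, and both recompute a + b.
blocks = {"A": [("m", "+", "a", "b")],
          "B": [("p", "+", "a", "b")],
          "C": [("q", "+", "a", "b")]}
d = DVNT(blocks, {"A": ["B", "C"]})
d.visit("A", {})
# p and q are both found redundant with m
```

Because each child copies its parent's table, facts discovered in B never leak into the sibling C, matching the scoped-hash-table discipline of the superlocal algorithm.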
Dominator Value Numbering
(Running DVNT on the example CFG removes the redundant computations in F and G that the superlocal algorithm missed.)
DVNT advantages: it finds more redundancy at little additional cost, and it retains the online character of value numbering.
DVNT shortcomings: it misses some opportunities — no loop-carried common subexpressions or constants.
Computing Dominators
Computing dominators is a critical first step in SSA construction and in DVNT.
- A node n dominates m iff n is on every path from n0 to m
- Every node dominates itself
- n's immediate dominator is its closest dominator, IDOM(n)†
The defining equations:
  DOM(n0) = { n0 }
  DOM(n) = { n } ∪ ( ⋂_{p ∈ preds(n)} DOM(p) )
Initially, DOM(n) = N for all n ≠ n0, where N is the set of all nodes.
Computing DOM: these simultaneous set equations define a simple problem in data-flow analysis. The equations have a unique fixed-point solution, and an iterative fixed-point algorithm will solve them quickly.
† IDOM(n) ≠ n, unless n is n0, by convention.
Round-Robin Iterative Algorithm

  DOM(b0) ← { b0 }
  for i ← 1 to N
      DOM(bi) ← { all nodes in graph }
  change ← true
  while (change)
      change ← false
      for i ← 1 to N
          TEMP ← { bi } ∪ ( ⋂_{x ∈ preds(bi)} DOM(x) )
          if DOM(bi) ≠ TEMP then
              change ← true
              DOM(bi) ← TEMP

Termination: the algorithm makes sweeps over the nodes and halts when some sweep produces no change.
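A direct Python transcription of the round-robin algorithm, under the assumption that the CFG is given as a successor map; the A-G edge set is an assumption reconstructed from the earlier figure.

```python
def iterative_dom(succ, entry):
    """Round-robin iterative DOM computation (a sketch; `succ` maps
    each node to its list of successors)."""
    nodes = list(succ)
    preds = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            preds[s].append(n)
    dom = {n: set(nodes) for n in nodes}   # DOM(bi) = all nodes ...
    dom[entry] = {entry}                   # ... except DOM(b0) = { b0 }
    change = True
    while change:                          # sweep until nothing changes
        change = False
        for n in nodes:
            if n == entry:
                continue
            temp = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if temp != dom[n]:
                dom[n] = temp
                change = True
    return dom

# The A-G example CFG from the earlier slides (edges reconstructed from
# the figure: A->B, A->C, B->G, C->D, C->E, D->F, E->F, F->G).
succ = {"A": ["B", "C"], "B": ["G"], "C": ["D", "E"],
        "D": ["F"], "E": ["F"], "F": ["G"], "G": []}
dom = iterative_dom(succ, "A")
```

The sets shrink monotonically from N toward the fixed point, which is why termination is guaranteed.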
Example
A flow graph with blocks B0 through B7. (The slide shows the progress and the final results of the iterative solution for DOM on this graph.)
Example
The dominance tree for B0 through B7, built from those DOM sets. There are asymptotically faster algorithms, but with the right data structures the iterative algorithm can be made extremely fast. See Cooper, Harvey, & Kennedy (on the web site), or the algorithm in Chapter 9 of EaC.
Aside on Data-Flow Analysis
The iterative DOM calculation is an example of data-flow analysis, a collection of techniques for compile-time reasoning about the run-time flow of values.
- Data-flow analysis almost always operates on a graph: problems are trivial in a basic block; global problems use the control-flow graph (or a derivative); interprocedural problems use the call graph (or a derivative)
- Data-flow problems are formulated as simultaneous equations, with sets attached to nodes and edges; one solution technique is the iterative algorithm
- The desired result is usually the meet-over-all-paths (MOP) solution: "What is true on every path from the entry node?" "Can this event happen on any path from the entry?" (These questions relate to safety.)
Aside on Data-Flow Analysis
Why did the iterative algorithm work?
- Termination: the DOM sets are initialized to the (finite) set of all nodes and shrink monotonically, so the algorithm reaches a fixed point where they stop changing
- Correctness: we can prove that the fixed-point solution is also the MOP solution; that proof is beyond today's lecture, but we will revisit it
- Efficiency: the round-robin algorithm is not particularly efficient; the order in which we visit nodes matters for efficient solutions
Regional Optimization: Improving Loops
Compilers have always focused on loops:
- Higher execution counts inside loops than outside
- Repeated, related operations
- Much of the real work takes place in loops (e.g., linear algebra)
Several effects to attack in a loop or loop nest:
- Overhead: decrease control-structure cost per iteration
- Locality: spatial locality ⇒ use of co-resident data; temporal locality ⇒ reuse of the same data item
- Parallelism: move loops with independent iterations to an outer position; inner positions suit vector hardware and SSE
Regional Optimization: Improving Loops
Loop unrolling (the oldest trick in the book): to reduce overhead, replicate the loop body.

  do i = 1 to 100 by 1
      a(i) = a(i) + b(i)
  end

becomes (unroll by 4)

  do i = 1 to 100 by 4
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
  end

Sources of improvement:
- Less overhead per useful operation
- Longer basic blocks for local optimization
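The transformation preserves semantics whenever the trip count is a multiple of the unroll factor, as it is here (100 = 4 × 25). A small Python check of that equivalence (the array contents are made up for illustration):

```python
# Semantics check of unroll-by-4; n = 100 is divisible by 4, so no
# remainder loop is needed, matching the slide's example.
n = 100
a = list(range(n))            # illustrative data
b = [2 * i for i in range(n)]

rolled = a[:]
for i in range(n):            # original loop: one update per iteration
    rolled[i] = rolled[i] + b[i]

unrolled = a[:]
for i in range(0, n, 4):      # unrolled loop: four updates per iteration
    unrolled[i]     += b[i]
    unrolled[i + 1] += b[i + 1]
    unrolled[i + 2] += b[i + 2]
    unrolled[i + 3] += b[i + 3]
```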
Doesn’t mess up spatial locality on either y or m (column-major order) Regional Optimization: Improving Loops With loop nest, may unroll inner loop COMP 512, Rice University19 do 60 j = 1, n2 do 50 i = 1 to n1 y(i) = y(i) + x(j) * m(i,j) 50 continue 60 continue Critical inner loop from dmxpy in Linpack Doesn’t mess up reuse on x(j) do 60 j = 1, n2 nextra = mod(n1,4) if (nextra.ge. 0) then do 49 i = 1, nextra, 1 y(i) = y(i) + x(j) * m(i,j) 49 continue do 50 i = nextra+1, n1, 4 y(i) = y(i) + x(j) * m(i,j) y(i+1) = y(i+1) + x(j) * m(i+1,j) y(i+2) = y(i+2) + x(j) * m(i+2,j) y(i+3) = y(i+3) + x(j) * m(i+3,j) 50 continue 60 continue
Regional Optimization: Improving Loops
With a loop nest, we may instead unroll the outer loop. The trick is to unroll the outer loop and fuse the resulting inner loops; loop fusion combines the bodies of two similar loops. Starting again from the critical inner loop of dmxpy in Linpack, a naive attempt:

  do 60 j = 1, n2
     nextra = mod(n1,4)
     if (nextra .ge. 1) then
        do 49 i = 1, nextra, 1
           y(i) = y(i) + x(j) * m(i,j)
  49      continue
     end if
     do 50 i = nextra+1, n1, 4
        y(i) = y(i) + x(j)   * m(i,j)
        y(i) = y(i) + x(j+1) * m(i,j+1)
        y(i) = y(i) + x(j+2) * m(i,j+2)
        y(i) = y(i) + x(j+3) * m(i,j+3)
  50   continue
  60 continue

This is clearly wrong: the outer loop still steps j by 1, so the columns beyond the first are accumulated multiple times, and the inner loop now skips three of every four rows of y.
Regional Optimization: Improving Loops
Unrolling the outer loop by 4 and fusing the inner loops correctly, with the remainder taken on the outer trip count:

  nextra = mod(n2,4)
  if (nextra .ge. 1) then
     do 59 j = 1, nextra
        do 49 i = 1, n1
           y(i) = y(i) + x(j) * m(i,j)
  49      continue
  59   continue
  end if
  do 60 j = nextra+1, n2, 4
     do 50 i = 1, n1
        y(i) = y(i) + x(j)   * m(i,j)   + x(j+1) * m(i,j+1)
 &                  + x(j+2) * m(i,j+2) + x(j+3) * m(i,j+3)
  50   continue
  60 continue

We save on loads and stores of y(i), and we gain spatial reuse in x and m. The author of Linpack, after much testing, chose outer-loop unrolling.
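The fused version can likewise be checked for equivalence. A Python sketch (0-based indices; taking the remainder on the outer index n2 is an assumption about how the leftover columns are handled, and the test data is made up):

```python
def dmxpy_outer_unrolled(n1, n2, y, x, m):
    """Outer loop unrolled by 4 and fused: one pass over y handles four
    columns of m; leftover columns (n2 mod 4) are done one at a time."""
    nextra = n2 % 4
    for j in range(nextra):                # leftover columns
        for i in range(n1):
            y[i] += x[j] * m[i][j]
    for j in range(nextra, n2, 4):         # four columns per sweep of y
        for i in range(n1):
            y[i] += (x[j] * m[i][j] + x[j + 1] * m[i][j + 1]
                     + x[j + 2] * m[i][j + 2] + x[j + 3] * m[i][j + 3])

# n2 = 6 exercises the remainder handling, since 6 mod 4 = 2.
n1, n2 = 5, 6
x = [float(j + 1) for j in range(n2)]
m = [[float(i * j + 1) for j in range(n2)] for i in range(n1)]
y_fused = [0.0] * n1
dmxpy_outer_unrolled(n1, n2, y_fused, x, m)
y_naive = [0.0] * n1
for j in range(n2):
    for i in range(n1):
        y_naive[i] += x[j] * m[i][j]
```

Note how the fused body loads and stores y(i) once per four columns, which is the saving the slide attributes to outer-loop unrolling.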
Regional Optimization: Improving Loops
Other effects of loop unrolling:
- Increases the number of independent operations inside the loop, which may be good for scheduling multiple functional units
- Moves consecutive accesses into the same iteration, so the scheduler may place them together (locality within the bigger loop body)
- May make cross-iteration redundancies obvious; exposes the address expressions in the example to LVN
- May increase demand for registers; spills can overcome any benefits
- Can unroll to eliminate copies at the end of a loop (an often-rediscovered result from Ken Kennedy's thesis)
- Can change other optimizations, e.g., the weights in spill code (Das Gupta's example)
Regional Optimization: Improving Loops
Many other loop transformations appear in the literature; we will devote a lecture to them later in the course. See also COMP 515 and the Allen-Kennedy book.
Next class: examples of global optimization.