Advanced Compiler Techniques LIU Xianhua School of EECS, Peking University Loops.


“Advanced Compiler Techniques” — Content
 Concepts: dominators, depth-first ordering, back edges, graph depth, reducibility
 Natural loops
 Efficiency of iterative algorithms
 Dependences & loop transformations

Loops are Important!
 Loops dominate program execution time, so they need special treatment during optimization.
 Loops also affect the running time of program analyses; e.g., a dataflow problem can be solved in a single pass if a program has no loops.

Dominators
 Node d dominates node n if every path from the entry to n goes through d; written d dom n.
 Quick observations:
 Every node dominates itself.
 The entry dominates every node.
 Common cases:
 The test of a while loop dominates all blocks in the loop body.
 The test of an if-then-else dominates all blocks in either branch.

Dominator Tree
 Immediate dominance: d idom n iff d dom n, d ≠ n, and there is no m (other than d and n) such that d dom m and m dom n.
 Immediate dominance relationships form a tree.

Finding Dominators
 A dataflow analysis problem: for each node, find all of its dominators.
 Direction: forward
 Confluence: set intersection
 Boundary: OUT[Entry] = {Entry}
 Initialization: OUT[B] = all nodes
 Equations:
  OUT[B] = IN[B] ∪ {B}
  IN[B] = ∩ OUT[p], over all predecessors p of B
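The equations above can be sketched as an iterative solver. This is a minimal illustration, not production compiler code; the flow graph (a simple diamond) is made up for the example.

```python
def dominators(succ, entry):
    """Iteratively solve OUT[B] = IN[B] U {B}, with
    IN[B] = intersection of OUT[p] over all predecessors p of B."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    # Boundary: OUT[entry] = {entry}; initialization: OUT[B] = all nodes.
    out = {n: set(nodes) for n in nodes}
    out[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            ins = set(nodes)
            for p in pred[n]:
                ins &= out[p]          # confluence: set intersection
            if ins | {n} != out[n]:
                out[n] = ins | {n}
                changed = True
    return out

# Diamond: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
dom = dominators(cfg, 1)
print(dom[4])   # {1, 4}: neither branch of the diamond dominates the join
```

Note that the join node 4 is dominated only by itself and the entry, since control can reach it through either branch.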

Example: Dominators
(figure: a flow graph whose nodes are annotated with their dominator sets, e.g. {1}, {1,2}, {1,2,3}, {1,4}, {1,5})

Depth-First Search
 Start at the entry.
 If you can follow an edge to an unvisited node, do so.
 If not, backtrack to your parent (the node from which you were visited).

Depth-First Spanning Tree
 Root = entry.
 Tree edges are the edges along which we first visit the node at the head.

Depth-First Node Order
 The reverse of the order in which a DFS retreats from the nodes.
 Alternatively, the reverse of a postorder traversal of the tree.
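Since DF order is just reverse postorder, it falls directly out of a DFS: record each node at the moment the search retreats from it, then reverse the list. A small sketch on the same hypothetical diamond graph:

```python
def df_order(succ, entry):
    """Depth-first (reverse postorder) numbering of a flow graph."""
    visited, post = set(), []

    def dfs(n):
        visited.add(n)
        for s in succ.get(n, []):
            if s not in visited:
                dfs(s)
        post.append(n)        # appended exactly when DFS retreats from n

    dfs(entry)
    return list(reversed(post))

cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(df_order(cfg, 1))   # [1, 3, 2, 4]
```

The entry always comes first and the join node last; the order of 2 and 3 depends on which successor the DFS explores first, and any such order is a valid DF order.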

Four Kinds of Edges
1. Tree edges.
2. Advancing edges (node to proper descendant).
3. Retreating edges (node to ancestor, including edges to self).
4. Cross edges (between two nodes, neither of which is an ancestor of the other).

A Little Magic
 Of these edges, only retreating edges go from high to low in DF order. Sketch of proof: you must retreat from the head of a tree edge before you can retreat from its tail.
 Also surprising: all cross edges go right to left in the DFST, assuming we add children of any node from the left.

Example: Non-Tree Edges
(figure: a DFST with its non-tree edges labeled retreating, forward, and cross)

Back Edges
 An edge is a back edge if its head dominates its tail.
 Theorem: every back edge is a retreating edge in every DFST of every flow graph. The converse is almost always true, but not always.
 Proof sketch: for a back edge, the head is reached before the tail in any DFST; the search must reach the tail before retreating from the head, so the tail is a descendant of the head.

Example: Back Edges
(figure: the example flow graph with dominator sets {1}, {1,2}, {1,2,3}, {1,4}, {1,5} and its back edges highlighted)

Reducible Flow Graphs
 A flow graph is reducible if every retreating edge in any DFST for that flow graph is a back edge.
 Testing reducibility: remove all back edges from the flow graph and check that the result is acyclic.
 Hint why it works: every cycle must include some retreating edge in any DFST; in particular, the edge that enters the first node of the cycle that is visited.
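The reducibility test above can be sketched directly: compute dominators, drop every edge whose head dominates its tail, and check that what remains is acyclic. The two example graphs (a simple while-loop shape and a two-headed cycle) are made up for illustration.

```python
def dominators(succ, entry):
    """Iterative dominator computation, same scheme as earlier slides."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    out = {n: set(nodes) for n in nodes}
    out[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            ins = set(nodes)
            for p in pred[n]:
                ins &= out[p]
            if ins | {n} != out[n]:
                out[n] = ins | {n}
                changed = True
    return out

def is_reducible(succ, entry):
    """Remove back edges (head dominates tail) and test acyclicity."""
    dom = dominators(succ, entry)
    # Edge n -> s is a back edge iff s dominates n; keep the rest.
    fwd = {n: [s for s in ss if s not in dom[n]] for n, ss in succ.items()}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in succ}

    def acyclic(n):            # DFS cycle detection on the remaining edges
        color[n] = GRAY
        for s in fwd[n]:
            if color[s] == GRAY or (color[s] == WHITE and not acyclic(s)):
                return False
        color[n] = BLACK
        return True

    return all(color[n] != WHITE or acyclic(n) for n in succ)

loop_graph = {1: [2], 2: [3], 3: [2, 4], 4: []}    # 3 -> 2 is a back edge
two_headed = {1: [2, 3], 2: [3], 3: [2]}           # cycle with two entries
print(is_reducible(loop_graph, 1))   # True
print(is_reducible(two_headed, 1))   # False: no head dominates its tail
```

In the second graph neither edge of the 2 ↔ 3 cycle is a back edge, so nothing is removed and the cycle survives, exactly the nonreducible situation described on the later slide.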

DFST on a Cycle
(figure: depth-first search reaches one node of the cycle first; the search must reach the remaining nodes before leaving the cycle, so the edge back into that first-visited node is a retreating edge)

Why Reducibility?
 Folk theorem: all flow graphs in practice are reducible.
 Fact: if you use only while-loops, for-loops, repeat-loops, if-then(-else), break, and continue, then your flow graph is reducible.

Example: Remove Back Edges
(figure: the example flow graph with its back edges removed; the remaining graph is acyclic)

Example: Nonreducible Graph
(figure: the three-node graph A → B, A → C, B ↔ C)
 In any DFST, one of the edges between B and C will be a retreating edge.
 But no head dominates its tail, so deleting back edges leaves the cycle.

Why Care About Back/Retreating Edges?
1. Proper ordering of nodes during an iterative algorithm assures that the number of passes is limited by the number of “nested” back edges.
2. The depth of nested loops upper-bounds the number of nested back edges.

DF Order and Retreating Edges
 Suppose that for a reaching-definitions (RD) analysis, we visit nodes during each iteration in DF order.
 The fact that a definition d reaches a block will propagate in one pass along any increasing sequence of blocks.
 When d arrives at the tail of a retreating edge, it is too late to propagate d from OUT to IN: the IN at the head has already been computed for that round.

Example: DF Order
(figure: definition d is gen’d by node 2; the first pass propagates d along increasing paths, and the second pass carries it across the retreating edge)

Depth of a Flow Graph
 The depth of a flow graph with a given DFST and DF order is the greatest number of retreating edges along any acyclic path.
 For RD, if we use DF order to visit nodes, we converge in depth + 2 passes:
 depth + 1 passes to follow that number of increasing segments;
 1 more pass to realize we converged.
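The depth + 2 bound can be observed on a small example. Below is a minimal round-robin reaching-definitions solver; the four-block CFG, its gen sets, and the definition names d1, d3 are all hypothetical. The graph has a single retreating edge 3 → 2, so its depth is 1 and the solver should settle within 1 + 2 = 3 passes.

```python
def reaching_defs(succ, order, gen, kill):
    """Round-robin reaching definitions, visiting blocks in DF order.
    Returns the OUT sets and the number of passes until convergence."""
    pred = {n: [] for n in succ}
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)
    out = {n: set() for n in succ}
    passes = 0
    changed = True
    while changed:
        passes += 1
        changed = False
        for n in order:                       # DF (reverse postorder) order
            ins = set().union(*(out[p] for p in pred[n])) if pred[n] else set()
            new = gen[n] | (ins - kill[n])
            if new != out[n]:
                out[n] = new
                changed = True
    return out, passes

# Loop: 1 -> 2 -> 3 -> 2, and 3 -> 4; the back edge 3 -> 2 gives depth 1.
succ = {1: [2], 2: [3], 3: [2, 4], 4: []}
gen  = {1: {"d1"}, 2: set(), 3: {"d3"}, 4: set()}
kill = {n: set() for n in succ}
out, passes = reaching_defs(succ, [1, 2, 3, 4], gen, kill)
print(passes)   # 3, i.e. within the depth + 2 bound
```

Pass 1 pushes d3 forward to block 4, pass 2 carries it back across the retreating edge into block 2, and pass 3 merely confirms the fixpoint.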

Example: Depth = 2
 Acyclic path: 4→7 (increasing), then retreating to 3; 3→10→17 (increasing), then retreating to 6; 6→18→20 (increasing).
 The three increasing segments correspond to passes 1, 2, and 3.

Similarly...
 Available expressions (AE) also converges in depth + 2 passes: unavailability propagates along retreat-free node sequences in one pass.
 So does live variables (LV) if we use the reverse of DF order: a use propagates backward along paths that do not use a retreating edge in one pass.

In General...
 The depth + 2 bound works for any monotone framework, as long as information only needs to propagate along acyclic paths.
 Example: if a definition reaches a point, it does so along an acyclic path.

However...
 Constant propagation does not have this property.

L: a = b
   b = c
   c = 1
   goto L

Why Depth+2 is Good
 Normal control-flow constructs produce reducible flow graphs with the number of back edges at most the nesting depth of loops.
 Nesting depth tends to be small: a study by Knuth found the average depth of typical flow graphs to be small.

Example: Nested Loops
 3 nested while-loops: depth = 3
 3 nested repeat-loops: depth = 1

Natural Loops
 A natural loop is defined by:
 A single entry point, called the header; the header dominates all nodes in the loop.
 A back edge that enters the loop header; otherwise it is not possible for the flow of control to return to the header directly from the "loop", i.e., there really is no loop.

Find Natural Loops
 The natural loop of a back edge a → b is {b} plus the set of nodes that can reach a without going through b.
 Equivalently: remove b from the flow graph, then find all nodes from which a is reachable (plus a itself).
 Theorem: two natural loops are either disjoint, identical, or nested.
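The construction above is a backward reachability search from the tail that never crosses the header. A minimal sketch on a hypothetical flow graph:

```python
def natural_loop(succ, tail, head):
    """Natural loop of back edge tail -> head: {head} plus every node
    that can reach tail without going through head."""
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, set()).add(n)
    loop = {head, tail}
    stack = [tail]               # head is never pushed, so the search
    while stack:                 # never walks through it
        n = stack.pop()
        for p in pred.get(n, ()):
            if p not in loop:
                loop.add(p)
                stack.append(p)
    return loop

# 1 -> 2 -> {3, 4}, 3 -> 5 -> 2: back edge 5 -> 2
succ = {1: [2], 2: [3, 4], 3: [5], 4: [], 5: [2]}
print(natural_loop(succ, 5, 2))   # {2, 3, 5}
```

Node 1 stays outside the loop because every path from it to the tail passes through the header, and node 4 stays outside because it cannot reach the tail at all.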

Example: Natural Loops
(figure: the example flow graph with the natural loop of back edge 3 → 2 and the natural loop of back edge 5 → 1 outlined)

Relationship Between Loops
 If two loops do not have the same header, they are either disjoint, or one is entirely contained in (nested within) the other. The innermost loop is one that contains no other loop.
 If two loops share the same header, it is hard to tell which is the inner loop, so we combine them and treat them as one loop.

Basic Parallelism
 Examples:
FOR i = 1 TO 100
  a[i] = b[i] + c[i]
FOR i = 11 TO 20
  a[i] = a[i-1] + 3
FOR i = 11 TO 20
  a[i] = a[i-10] + 3
 Does there exist a data dependence edge between two different iterations?
 A data dependence edge is loop-carried if it crosses iteration boundaries.
 DoAll loops: loops without loop-carried dependences.

Data Dependence of Variables
 True dependence: a = ... followed by ... = a
 Anti-dependence: ... = a followed by a = ...
 Output dependence: a = ... followed by a = ...
 Input dependence: ... = a followed by ... = a

Affine Array Accesses
 Common patterns of data accesses (i, j, k are loop indexes): A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1], A[i, j], A[i-1, j+1]
 Array indexes are affine expressions of the surrounding loop indexes:
 Loop indexes: i_n, i_{n-1}, ..., i_1
 Integer constants: c_n, c_{n-1}, ..., c_0
 Array index: c_n i_n + c_{n-1} i_{n-1} + ... + c_1 i_1 + c_0
 Affine expression: a linear expression plus a constant term (c_0)

Formulating Data Dependence Analysis

FOR i := 2 TO 5 DO
  A[i-2] = A[i] + 1;

 Between read access A[i] and write access A[i-2] there is a dependence if there exist two iterations i_r and i_w within the loop bounds such that iteration i_r reads and iteration i_w writes the same array element:
 ∃ integers i_w, i_r: 2 ≤ i_w, i_r ≤ 5 and i_r = i_w - 2
 Between write access A[i-2] and write access A[i-2] there is a dependence if:
 ∃ integers i_w, i_v: 2 ≤ i_w, i_v ≤ 5 and i_w - 2 = i_v - 2
 To rule out the case where the same instance depends on itself, add the constraint i_w ≠ i_v.
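Because the loop bounds here are tiny, the existential formulation above can be checked by brute-force enumeration, which makes the flow dependence concrete:

```python
# Enumerate iterations of: FOR i := 2 TO 5 DO A[i-2] = A[i] + 1
# A pair (iw, ir) is a dependence witness when the element written in
# iteration iw (index iw - 2) equals the element read in iteration ir.
pairs = [(iw, ir)
         for iw in range(2, 6)      # 2 <= iw <= 5
         for ir in range(2, 6)      # 2 <= ir <= 5
         if iw - 2 == ir]           # same array element
print(pairs)   # [(4, 2), (5, 3)]
```

So iteration 4 writes A[2], the element read by iteration 2, and iteration 5 writes A[3], the element read by iteration 3: the dependence constraints have solutions, hence a dependence exists. Real dependence analyzers of course solve the integer constraints symbolically rather than by enumeration.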

Memory Disambiguation
 Undecidable at compile time:
read(n)
FOR i = ...
  a[i] = a[n]

Domain of Data Dependence Analysis
 Only use loop bounds and array indexes that are affine functions of loop variables:
for i = 1 to n
  for j = 2*i to 100
    a[i + 2*j + 3][4*i + 2*j][i*i] = ...
    ... = a[1][2*i + 1][j]
 Assume a data dependence between the read and write operations if there exist:
 ∃ integers i_r, j_r, i_w, j_w:
 1 ≤ i_w, i_r ≤ n; 2*i_w ≤ j_w ≤ 100; 2*i_r ≤ j_r ≤ 100
 i_w + 2*j_w + 3 = 1
 4*i_w + 2*j_w = 2*i_r + 1
 Equate each dimension of the array accesses; ignore non-affine dimensions (here, the third one).
 No solution ⇒ no data dependence. Solution ⇒ there may be a dependence.

Iteration Space
 An abstraction for loops.
 An iteration is represented as coordinates in the iteration space.
for i = 0, 5
  for j = 0, 3
    a[i, j] = 3
(figure: the rectangular iteration space in the i-j plane)

Iteration Space
 An abstraction for loops.
for i = 0, 5
  for j = i, 3
    a[i, j] = 0
(figure: the triangular iteration space in the i-j plane)

Iteration Space
 An abstraction for loops.
for i = 0, 5
  for j = i, 7
    a[i, j] = 0
(figure: the trapezoidal iteration space in the i-j plane)

Affine Access
(figure: matrix form of affine array accesses)

Affine Transform
(figure: an affine transform mapping the i-j iteration space to the u-v iteration space)

Loop Transformation
for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
end_for

for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
end_for

Old Iteration Space
for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
end_for

New Iteration Space
for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
end_for

Old Array Accesses
for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
end_for

New Array Accesses
for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
end_for

Interchange Loops?
for i = 2, 1000
  for j = 1, 1000
    A[i, j] = A[i-1, j+1] + 3
end_for

for u = 1, 1000
  for v = 2, 1000
    A[v, u] = A[v-1, u+1] + 3
end_for

 e.g., dependence vector d_old = (1, -1)

Interchange Loops?  A transformation is legal, if the new dependence is lexicographically positive, i. e. the leading non - zero in the dependence vector is positive.  Distance vector (1,-1) = (4,2)- (3,3)  Loop interchange is not legal if there exists dependence (+, -) 52 “Advanced Compiler Techniques”

GCD Test
 Is there any dependence?
 Solve a linear Diophantine equation: 2*i_w = 2*i_r + 1

for i = 1, 100
  a[2*i] = ...
  ... = a[2*i + 1] + 3

GCD  The greatest common divisor ( GCD ) of integers a 1, a 2, …, a n, denoted gcd ( a 1, a 2, …, a n ), is the largest integer that evenly divides all these integers.  Theorem : The linear Diophantine equation has an integer solution x 1, x 2, …, x n iff gcd ( a 1, a 2, …, a n ) divides c 54 “Advanced Compiler Techniques”

Examples
 Example 1: 2*x_1 - 2*x_2 = 1. gcd(2, -2) = 2, which does not divide 1: no solutions.
 Example 2: gcd(24, 36, 54) = 6; any equation 24*x_1 + 36*x_2 + 54*x_3 = c with 6 dividing c has many solutions.
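The GCD test itself is a one-liner over the theorem above. A minimal sketch, applied to the two examples (the right-hand side 30 in the second call is an arbitrary multiple of 6 chosen for illustration):

```python
from math import gcd
from functools import reduce

def has_integer_solution(coeffs, c):
    """GCD test: a1*x1 + ... + an*xn = c has an integer solution
    iff gcd(a1, ..., an) divides c."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

# Example 1, from the GCD-test slide: 2*iw - 2*ir = 1.
print(has_integer_solution([2, -2], 1))        # False: gcd 2 ∤ 1, no dependence
# Example 2: 24*x1 + 36*x2 + 54*x3 = 30.
print(has_integer_solution([24, 36, 54], 30))  # True: gcd 6 | 30
```

So the loop writing a[2*i] and reading a[2*i+1] carries no dependence: evens and odds never collide. Note the test is only a necessary condition for independence in general; a gcd that divides c says a dependence *may* exist, since the solutions might still fall outside the loop bounds.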

Loop Fusion
for i = 1, 1000
  A[i] = B[i] + 3
end_for
for j = 1, 1000
  C[j] = A[j] + 5
end_for

for i = 1, 1000
  A[i] = B[i] + 3
  C[i] = A[i] + 5
end_for

 Better reuse between the write of A[i] and the read of A[i].
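Fusion is legal here because the second loop's read of A[j] sees exactly the value the first loop wrote in the same iteration. A quick functional check of that equivalence, on a hypothetical size of 10 instead of 1000:

```python
n = 10
B = list(range(n))            # arbitrary input data for the check

# Separate loops (before fusion)
A1 = [B[i] + 3 for i in range(n)]
C1 = [A1[j] + 5 for j in range(n)]

# Fused loop (after fusion)
A2, C2 = [0] * n, [0] * n
for i in range(n):
    A2[i] = B[i] + 3
    C2[i] = A2[i] + 5         # reads A2[i] while it is still hot

print((A1, C1) == (A2, C2))   # True: fusion preserves the results
```

The payoff the slide points at, reusing A[i] while it is still in cache or a register, is a locality effect that this functional check does not measure.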

Loop Distribution
for i = 1, 1000
  A[i] = A[i-1] + 3
  C[i] = B[i] + 5
end_for

for i = 1, 1000
  A[i] = A[i-1] + 3
end_for
for i = 1, 1000
  C[i] = B[i] + 5
end_for

 The 2nd loop is parallel.

Register Blocking
for j = 1, 2*m
  for i = 1, 2*n
    A[i, j] = A[i-1, j] + A[i-1, j-1]
end_for

for j = 1, 2*m, 2
  for i = 1, 2*n, 2
    A[i, j]     = A[i-1, j]   + A[i-1, j-1]
    A[i, j+1]   = A[i-1, j+1] + A[i-1, j]
    A[i+1, j]   = A[i, j]     + A[i, j-1]
    A[i+1, j+1] = A[i, j+1]   + A[i, j]
end_for

 Better reuse between the writes and reads of A[i, j].

Virtual Register Allocation
for j = 1, 2*M, 2
  for i = 1, 2*N, 2
    r1 = A[i-1, j]
    r2 = r1 + A[i-1, j-1]
    A[i, j] = r2
    r3 = A[i-1, j+1] + r1
    A[i, j+1] = r3
    A[i+1, j] = r2 + A[i, j-1]
    A[i+1, j+1] = r3 + r2
end_for

 Memory operations reduced to register loads/stores: 8MN loads down to 4MN loads.

Scalar Replacement
for i = 2, N+1
  ... = A[i-1] + 1
  A[i] = ...
end_for

t1 = A[1]
for i = 2, N+1
  ... = t1 + 1
  t1 = ...
  A[i] = t1
end_for

 Eliminates loads and stores for array references.

Unroll-and-Jam
for j = 1, 2*M
  for i = 1, N
    A[i, j] = A[i-1, j] + A[i-1, j-1]
end_for

for j = 1, 2*M, 2
  for i = 1, N
    A[i, j]   = A[i-1, j]   + A[i-1, j-1]
    A[i, j+1] = A[i-1, j+1] + A[i-1, j]
end_for

 Exposes more opportunity for scalar replacement.

Large Arrays
for i = 1, 1000
  for j = 1, 1000
    A[i, j] = A[i, j] + B[j, i]
end_for

 Suppose arrays A and B have row-major layout.
 B has poor cache locality, and loop interchange will not help.

Loop Blocking
for v = 1, 1000, 20
  for u = 1, 1000, 20
    for j = v, v+19
      for i = u, u+19
        A[i, j] = A[i, j] + B[j, i]
end_for

 Access to small blocks of the arrays has good cache locality.
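Blocking only reorders the iterations, so the result must match the unblocked loop. A sketch of that check in Python, on hypothetical small sizes (N = 8 with 4 × 4 blocks instead of 1000 with 20 × 20); the cache-locality benefit itself is not visible in a functional check like this:

```python
N, BLK = 8, 4
A = [[i * N + j for j in range(N)] for i in range(N)]        # made-up data
B = [[i * N + j + 1 for j in range(N)] for i in range(N)]

# Unblocked reference: A[i, j] += B[j, i]
plain = [[A[i][j] + B[j][i] for j in range(N)] for i in range(N)]

# Blocked traversal, mirroring the four-deep loop nest above
blocked = [row[:] for row in A]
for v in range(0, N, BLK):
    for u in range(0, N, BLK):
        for j in range(v, v + BLK):
            for i in range(u, u + BLK):
                blocked[i][j] = A[i][j] + B[j][i]

print(blocked == plain)   # True: blocking reorders iterations, not results
```

Each update here is independent (it reads only A[i][j] and B[j][i], which no other iteration writes), which is exactly why the traversal order can be tiled freely.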

Loop Unrolling for ILP
for i = 1, 10
  a[i] = b[i]
  *p = ...
end_for

for i = 1, 10, 2
  a[i] = b[i]
  *p = ...
  a[i+1] = b[i+1]
  *p = ...
end_for

 Larger scheduling regions; fewer dynamic branches; increased code size.

Next Time
 Homework: 9.6.2, 9.6.4
 Static Single Assignment (SSA)
 Readings: Cytron '91, Chow '97