Basic Block Scheduling
Utilize parallelism at the instruction level (ILP).
Time spent in loop execution dominates total execution time.
Software pipelining is a technique that reforms the loop so as to achieve overlapped execution of iterations.

Process Overview
Parallelize a single operation or the whole loop? More parallelism is achievable if we consider the entire loop.
Construct instructions that contain operations from different iterations of the initial loop.
Construct a flat schedule, and repeat it over time, taking into account resource and dependence constraints.

Techniques
Software pipelining restructures loops in order to achieve overlapping of various iterations in time.
Although this optimization does not create massive amounts of parallelism, it is desirable.
There exist two main methods for software pipelining: kernel recognition and modulo scheduling.

Modulo Scheduling
We will focus on the modulo scheduling technique (it is incorporated in commercial compilers).
We try to select a schedule for one loop iteration and then repeat that schedule.
No unrolling is applied.

Terminology (Dependences)
To make a legal schedule, it is important to know which operations must follow other operations.
A conflict exists if two operations cannot execute at the same time, but it does not matter which one executes first (resource/hardware constraints).
A dependence exists between two operations if interchanging their order changes the result (data/control dependences).

Terminology (Data Dependence Graph)
Represent operations as nodes and dependences between operations as directed arcs.
Loop-carried arcs show relationships between operations of different iterations (and may turn the DDG into a cyclic graph).
Loop-independent arcs represent a must-follow relationship among operations of the same iteration.
Assign each arc a weight in the form of a (dif, min) dependence pair: the dif value indicates the number of iterations the dependence spans, and the min value is the time that must elapse between consecutive executions of the dependent operations.
The value min/dif is called the slope of the schedule.
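To make the (dif, min) arc notation concrete, here is a minimal Python sketch of a DDG; the class and names are illustrative, not from the original slides.

from collections import defaultdict

class DDG:
    def __init__(self):
        # arcs[a] holds triples (b, dif, min): an arc a -> b meaning that
        # operation b, dif iterations later, must start at least min
        # cycles after operation a.
        self.arcs = defaultdict(list)

    def add_arc(self, a, b, dif, min_time):
        # dif == 0: loop-independent arc (same iteration)
        # dif  > 0: loop-carried arc (spans dif iterations)
        self.arcs[a].append((b, dif, min_time))

# O2 depends on O1 within the same iteration (dif=0, min=1), and O1
# depends on itself one iteration back (dif=1, min=1), which makes the
# DDG cyclic; the slope of the self-arc is min/dif = 1.
g = DDG()
g.add_arc("O1", "O2", dif=0, min_time=1)
g.add_arc("O1", "O1", dif=1, min_time=1)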

Terminology (Resource Reservation Table)
Construct a resource reservation table, which records for each time slot which resources an operation occupies.
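As an illustration, a reservation table can be modeled as a resource-by-time-slot grid in which an operation may only be placed where its resource is still free; this is a sketch with an assumed layout, not the slides' exact table.

def make_reservation_table(n_resources, n_slots):
    # One row per resource, one column per time slot.
    return [[False] * n_slots for _ in range(n_resources)]

def reserve(table, resource, slot):
    # Placing two operations on the same resource in the same slot is
    # exactly the resource/hardware conflict defined earlier.
    if table[resource][slot]:
        raise ValueError("resource conflict")
    table[resource][slot] = True

table = make_reservation_table(n_resources=2, n_slots=4)
reserve(table, resource=0, slot=0)  # e.g. the adder in cycle 0
reserve(table, resource=1, slot=0)  # e.g. the multiplier in cycle 0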

Terminology (Loop Types)
Doall: a loop in which iterations can proceed in parallel. These types of loops lead to massive parallelism and are easy to schedule.
Doacross: a loop in which synchronization is needed between operations of various iterations.

Doall Loop Example
dif=0: no loop-carried dependences.
min=1: loop-independent dependences.
Construct a valid flat schedule; then repeat it.

Doacross Loop Example (dif=1)
dif=1 for Operation 1 (loop-carried dependences exist).
min=1 for the loop-independent dependences.
Construct a valid flat schedule; then repeat it. However, repetition is not as easy: we must take into account that dif=1 for O1, so each iteration should start with a delay of one slot.
A legal schedule has been achieved.

Doacross Loop Example (dif=2)
dif=2 for Operation 1 (loop-carried dependences exist).
min=1 for the loop-independent dependences.
Every second iteration should now start one slot after the previous one. This is because dif=2: the dependence is deeper (that is, less restrictive).

Comparison
In our first example, where dif=1 and min=1, the kernel is found in the 4th time slot. Instructions before and after the kernel are defined as the prelude and postlude of the schedule, respectively.
In the second example, the loop-carried dependence is between iterations that are two apart. This is a less restrictive constraint, so iterations are overlapped more.

Main Idea
Let's combine all these concepts (data dependence graph, resource reservation tables, schedule, loop types, arcs, flat schedule) in some simple examples.
Don't forget that the main idea behind software pipelining (including modulo scheduling) is that the body of a loop can be reformed so that one loop iteration starts before previous iterations have finished.

Another Loop Example
O1 is always scheduled in the first time step. Thus the distance between O1 and the rest of the operations increases in successive iterations, and a cyclic pattern (such as those achieved in the other examples) never forms.

Initiation Interval
So far we have described the first step of the modulo scheduling procedure: analysis of the DDG of a loop to identify all kinds of dependences.
The second step is to identify the minimum number of instructions required between initiating execution of successive loop iterations.
Specifically, the delay between iterations of the new loop is called the Initiation Interval (II):
a) Resource Constrained II
b) Dependence Constrained II

Resource Constrained II
Resource usage imposes a lower bound on the initiation interval (II_res). For each resource, compute the schedule length necessary to accommodate all uses of that resource.
Given a DDG and 4 available resources, we calculate the maximum usage for every resource.
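A minimal sketch of this bound, assuming each use occupies its resource for a single cycle; the function name is illustrative.

import math

def ii_res(uses_per_resource, units_per_resource):
    # For each resource, the schedule must be long enough to fit all of
    # its uses; the most heavily used resource sets the lower bound.
    return max(
        math.ceil(uses / units_per_resource[r])
        for r, uses in uses_per_resource.items()
    )

# With one unit of each of 4 resources and resource 2 used 4 times per
# iteration (as in the example that follows), II_res = 4.
print(ii_res({1: 2, 2: 4, 3: 1, 4: 1}, {1: 1, 2: 1, 3: 1, 4: 1}))  # 4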

Example
Resource 2 is required 4 times, so an operation using it can only be executed 4 cycles after its previous execution.
Suppose that the flat schedule is as shown; we repeat it with a delay of 4 time slots.

Methods for computing II_dep
Modulo scheduling is all about calculating the lower bound for the initiation interval.
We will present two techniques to compute the dependence constrained II (the calculation of II_res is straightforward):
1) Shortest Path Algorithm
2) Iterative Shortest Path

1) Shortest Path Algorithm
This method uses the transitive closure of a graph, which is a reachability relationship.
Let θ be a cyclic path from a node to itself, min_θ be the sum of the min times on the arcs that constitute the cycle, and dif_θ be the sum of the dif values on those arcs.
The time between executions of a node and itself depends on II: the time elapsed between the execution of a node and another copy dif_θ iterations away is II * dif_θ.
The maximum min_θ/dif_θ over all cycles is II_dep.

Let's see the effect of II on cyclic times in this figure.
II must be large enough so that II * dif_θ >= min_θ.
Repeating the flat schedule.

Calculate II_dep
II * dif_θ >= min_θ  ⇒  0 >= min_θ - II * dif_θ  ⇒  0 >= min_θ - II_dep * dif_θ
The smallest value satisfying this for every cycle θ is II_dep = max_θ ⌈min_θ / dif_θ⌉.
Therefore, we select II = max(II_dep, II_res).
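A hedged sketch of this computation: enumerate the cycles of a small DDG by depth-first search and take the maximum ⌈min_θ/dif_θ⌉. The arc format matches the earlier DDG sketch, and the helper name ii_dep_from_cycles is illustrative.

import math

def ii_dep_from_cycles(arcs):
    # arcs[a] = [(b, dif, min), ...]; every cycle must satisfy
    # II * dif_theta >= min_theta, so II_dep = max ceil(min/dif).
    best = 1

    def dfs(start, node, path_dif, path_min, visited):
        nonlocal best
        for b, dif, min_t in arcs.get(node, []):
            if b == start:  # closed a cycle theta back to the start node
                total_dif = path_dif + dif
                total_min = path_min + min_t
                if total_dif > 0:  # a cycle with dif=0 is unschedulable
                    best = max(best, math.ceil(total_min / total_dif))
            elif b not in visited:
                dfs(start, b, path_dif + dif, path_min + min_t,
                    visited | {b})

    for n in list(arcs):
        dfs(n, n, 0, 0, {n})
    return best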

Shortest Path Algorithm Example
Transitive closure of the graph.

2) Iterative Shortest Path
Simplify the previous method by recomputing the transitive closure of the graph for each possible II.
Use the notion of a distance M_ab between two nodes.

Distance M_a,b
In the flat schedule (the relative scheduling of each operation of the original iteration, similar to list scheduling), the distance between two nodes a and b joined by an arc whose weight is (dif, min) is given by: M_a,b = min - II * dif.
We want to compute the minimum distance by which two nodes must be separated, but this information depends on the initiation interval.

Effect of II on node precedence

Procedure to find II
Construct a matrix M where each entry M_i,j represents the min time between two directly connected nodes i and j. This computation gives the earliest time node j can be placed with respect to node i in the flat schedule.
Estimate that II=2.

Procedure to find II
The next step is to compute the matrix M^2, which represents the minimum time difference between nodes along paths of length two.
Continue by calculating matrix M^3 and so on. Finally, we compute Γ(M) as follows.
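A minimal sketch of the whole procedure, with illustrative names: build M for a candidate II, close it with a Floyd-Warshall-style max-plus pass (standing in for the M, M^2, M^3, ... products), and accept the II once every diagonal entry of Γ(M) is non-positive.

NEG = float("-inf")

def closure_ok(arcs, n, ii):
    # M[a][b] = min - II*dif for each arc a -> b; NEG means "no path".
    M = [[NEG] * n for _ in range(n)]
    for a in range(n):
        for b, dif, min_t in arcs.get(a, []):
            M[a][b] = max(M[a][b], min_t - ii * dif)
    # Max-plus closure: largest required separation over paths of all
    # lengths (the role played by Gamma(M) on the slides).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if M[i][k] != NEG and M[k][j] != NEG:
                    M[i][j] = max(M[i][j], M[i][k] + M[k][j])
    # Legal II: no node is required to start after itself.
    return all(M[i][i] <= 0 for i in range(n))

def find_ii(arcs, n, max_ii=64):
    for ii in range(1, max_ii + 1):  # iteratively try larger II values
        if closure_ok(arcs, n, ii):
            return ii
    raise ValueError("no feasible II found (dif=0 cycle?)")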

Example (II=1)

Example (II=2)

Example (II=3)

Final Result
Γ(M) represents the maximum distance between each pair of nodes, considering paths of all lengths.
A legal II will produce a closure matrix in which the entries on the main diagonal are non-positive.
Positive values on the diagonal are an indication of a too-small initiation interval; negative values on the diagonal indicate an adequate estimate of II.

Plus or Minus
A drawback of this method is that we must estimate II before we are able to construct the matrix M.
However, this technique allows us to tell whether the estimate for II is large enough, or whether we need to iteratively try larger values of II.

Why use the term “modulo” in the first place?
Initially, we have the flat schedule F, consisting of locations F_1, F_2, ....
The kernel K is formed by overlapping copies of F offset by II.
Modulo scheduling results when all operations from locations in the flat schedule that have the same value modulo II are executed simultaneously.

Operations from (F_i : i mod II = 1) execute together.
Operations from (F_i : i mod II = 0) execute together.
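A small sketch of this grouping, using the slide's 1-based F_i numbering (names illustrative):

def kernel_rows(flat_schedule, ii):
    # flat_schedule[i] holds the operations of slot F_{i+1}; slots whose
    # index has the same value modulo II execute simultaneously.
    rows = [[] for _ in range(ii)]
    for i, ops in enumerate(flat_schedule, start=1):
        rows[i % ii].extend(ops)
    return rows

flat = [["O1"], ["O2"], ["O3"], ["O4"]]  # F1..F4, illustrative
for r, ops in enumerate(kernel_rows(flat, ii=2)):
    print(f"slots with i mod 2 = {r}: {ops}")
# slots with i mod 2 = 0: ['O2', 'O4']
# slots with i mod 2 = 1: ['O1', 'O3']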

Example 1
DO I=1, 100
  a[I] = b[I-1] + 5;
  b[I] = a[I] * I;    // mul -> 2 clocks
  c[I] = a[I-1] * b[I];
  d[I] = c[I];
ENDDO
Graph of example 1

Example 1: PRODUCE THE FLAT SCHEDULE
S1 and S2 are strongly connected. Which one should be placed earlier in the flat schedule?
S3 and S4 should be placed after S1 and S2 (S3 should precede S4).
Eliminate all loop-carried dependences; the loop-independent arcs determine the sequence of nodes in the flat schedule.
The flat schedule is therefore: S1, S2, S3, S4.

Example 1: COMPUTE II
Using the “Shortest Path Algorithm” method:
Find the strongly connected components.
Calculate the transitive closure of the graph.

Transitive closure, as (dif, min) pairs:
            Dest 1   Dest 2
Source 1    (1,3)    (0,1)
Source 2    (1,2)    (1,3)

II = max(3/1, 3/1) = 3.
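As a cross-check, feeding this closure to the hypothetical ii_dep_from_cycles helper sketched earlier reproduces the slide's result:

closure = {
    1: [(1, 1, 3), (2, 0, 1)],  # 1 -> 1 (1,3), 1 -> 2 (0,1)
    2: [(1, 1, 2), (2, 1, 3)],  # 2 -> 1 (1,2), 2 -> 2 (1,3)
}
print(ii_dep_from_cycles(closure))  # 3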

Example 1: Execution schedule
Each blue box represents the kernel of the pipeline.
Worst case scenario: E=0 when II equals the length of the flat schedule, i.e. no overlapping between adjacent iterations.

EXAMPLE 2
DO I=1, 100
  S1: a[I] = b[I-1] + 5;
  S2: b[I] = a[I] * I;
  S3: c[I] = a[I-1] * b[I];
  S4: d[I] = c[I] + e[I-2];
  S5: e[I] = d[I] * f[I-1];
  S6: f[I] = d[I] * 4;
ENDDO

EXAMPLE 2: PRODUCE THE FLAT SCHEDULE
Eliminate all loop-carried dependences; the loop-independent arcs determine the sequence of nodes in the flat schedule.
The flat schedule is therefore: S1, S2, S3, S4, S5, S6.

EXAMPLE 2: COMPUTE II using the “Shortest Path Algorithm” method
Find the strongly connected components. The initiation interval for the first component is II=3.
Calculate the transitive closure of the second component:

            Dest 4        Dest 5        Dest 6
Source 4    (2,3),(2,5)   (0,1),(0,3)   (0,1)
Source 5    (2,2)         (2,3),(2,5)   (2,3)
Source 6    (2,4)         (0,2)         (2,5)

II = max(3/2, 5/2) = 2.5
II_total = max(2.5, 3) = 3

EXAMPLE 2
Execution schedule of example 2; the kernel is shown in the blue box.

EXAMPLE 3
Nodes S1 and S2 comprise a strongly connected component; II is therefore determined by the cycle between them.

EXAMPLE 3
Produce the flat schedule: eliminate all loop-carried dependences.
There is no loop-independent arc connecting all nodes, so in this case the flat schedule cannot be produced just by following the loop-independent arcs.
We need a global method to generate the flat schedule; the method described so far does not always work.
Introduce “modulo scheduling via hierarchical reduction”.

Modulo Scheduling via Hierarchical Reduction
Modify the DDG so as to schedule the strongly connected components of the graph first.
The strongly connected components of a graph can be found using Tarjan's algorithm (see the sketch below).
Afterwards, schedule the resulting acyclic DDG.
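A compact sketch of Tarjan's algorithm over an adjacency-dict DDG (illustrative, not the slides' implementation); components come out in reverse topological order.

def tarjan_scc(succ):
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in succ.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in succ:
        if v not in index:
            strongconnect(v)
    return sccs

# Example-3-like shape: S1 and S2 form a cycle, S3 hangs off it.
print(tarjan_scc({"S1": ["S2"], "S2": ["S1", "S3"], "S3": []}))
# [['S3'], ['S2', 'S1']]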

Modulo Scheduling via Hierarchical Reduction
Compute the upper and lower bounds between which each node can be placed in the flat schedule, using the equations below (equations to initialize the bounds, and equations to update them).
It is an iterative method: we begin with II=1 and try to find a legal schedule. If that is not possible, II is incremented until all nodes are placed at legal positions in the flat schedule.
cost_II(v, u) stands for the cost (measured by the dif and min values) for node v to reach node u. We therefore need the cost matrix for the strongly connected nodes (i.e. the transitive closure).
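The equation images from the original slides are not reproduced here; the following is a reconstruction of the standard hierarchical-reduction bound updates (after Lam), consistent (up to clamping at slot 0) with the bound values in the example that follows. When a node v is placed at slot σ(v), every still-unscheduled node u in the component is updated:

\[ \operatorname{cost}_{II}(u,v) \;=\; \max_{(dif,\,min)\,\in\,M_{u,v}} \bigl(min - II \cdot dif\bigr) \]
\[ \mathrm{low}(u) \;\leftarrow\; \max\bigl(\mathrm{low}(u),\ \sigma(v) + \operatorname{cost}_{II}(v,u)\bigr), \qquad \mathrm{up}(u) \;\leftarrow\; \min\bigl(\mathrm{up}(u),\ \sigma(v) - \operatorname{cost}_{II}(u,v)\bigr) \]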

EXAMPLE 4
DO I=1,100
  S1: a(i) = c(i-1) + d(i-3);
  S2: b(i) = a(i) * 5;
  S3: c(i) = b(i-2) * d(i-1);
  S4: d(i) = c(i) + i;
  S5: e(i) = d(i);
  S6: f(i) = d(i-1) * i;
  S7: g(i) = f(i-1);
ENDDO
Find the strongly connected components.
Compute the transitive closure, as (dif, min) pairs:

            Dest 1        Dest 2        Dest 3              Dest 4
Source 1    (3,5),(5,6)   (0,1)         (2,3)               (2,5)
Source 2    (3,4),(5,5)   (3,5),(5,6)   (2,2)               (2,4)
Source 3    (1,2),(3,3)   (1,3),(3,4)   (3,5),(1,3),(5,6)   (0,2)
Source 4    (3,1),(2,3)   (3,2),(2,4)   (1,1),(5,4)         (1,3),(5,6)

EXAMPLE 4
Compute the simplified transitive closure by keeping, in each entry of the table above, the (dif, min) pair that gives the maximum distance, where cost_II(u, v) is computed from the (dif, min) values on the path u→v.
Then initialize the upper and lower bounds for the nodes in the strongly connected component, where σ(v) is the time slot in F where a scheduled node v has been placed.

EXAMPLE 4 (Initialize nodes)
Initialize the upper and lower bounds for the nodes in the strongly connected component, using the simplified transitive closure.

EXAMPLE 4 ( Schedule the first node ) S1:[ -1,∞ ] S2:[ 1,∞ ] S3:[ -2,∞ ] S4:[ 2,∞ ]  Node S3 has the lowest low bound so it is scheduled first. It is placed in time slot 0 (t0).  Afterwards, we need to update low and upper bounds for the rest nodes

EXAMPLE 4 (Update nodes S1, S2, S4)
S1: [0, 3]   S2: [1, 4]   S4: [2, 2]
Node S4 has the lowest upper bound, so it is scheduled next and placed in the position indicated by its lower bound (i.e. t2).
Then we need to update nodes S1 and S2.

EXAMPLE 4 (Update nodes S1, S2)
S1: [0, 3]   S2: [1, 4]
Node S1 has the lowest upper bound, so it is scheduled next and placed in the position indicated by its lower bound (i.e. t0).
Finally, node S2 is placed in time slot 1 (t1).

EXAMPLE 4 (Condensed Graph)
After scheduling the strongly connected component, the new (condensed) graph is shown below.
The acyclic graph is then easy to schedule using the equation below, where σ(a) is the position in the flat schedule of the node on which node n depends; the min and dif values are the sums of the mins and difs along the arcs that connect the corresponding nodes.
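The equation image is not reproduced here; a reconstruction consistent with the description above places each node n of the condensed acyclic graph at the earliest slot that respects its incoming arcs:

\[ \sigma(n) \;=\; \max_{a \to n} \bigl( \sigma(a) + min_{a \to n} - II \cdot dif_{a \to n} \bigr) \]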

EXAMPLE 4 (Execution Schedule)
The execution schedule is then obtained by repeating the flat schedule every II cycles.