EECS 583 – Class 10: ILP Optimization and Intro. to Code Generation
University of Michigan, October 8, 2012

Reading Material + Announcements

- Places to find papers on compilers
  » Conferences: PLDI, CGO, VEE, CC, MICRO, ASPLOS
  » Journals: ACM Transactions on Architecture and Code Optimization, Software Practice & Experience, IEEE Transactions on Computers, ACM Transactions on Programming Languages and Systems
- Today's class
  » "Machine Description Driven Compilers for EPIC Processors", B. Rau, V. Kathail, and S. Aditya, HP Technical Report HPL (long paper but informative)
- Next class
  » "The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors," P. Chang et al., IEEE Transactions on Computers, 1995

Course Project – Time to Start Thinking About This

- Mission statement: design and implement something "interesting" in a compiler
  » LLVM preferred, but others are fine
  » Groups of 2-3 people (1 or 4 is possible in some cases)
  » Extend an existing research paper or go out on your own
- Topic areas
  » Automatic parallelization/SIMDization
  » Optimizing for GPUs
  » Targeting VLIW processors
  » Memory system optimization
  » Reliability
  » Energy
  » For the adventurous:
    - Dynamic optimization
    - Streaming applications
    - Binary optimization

Course Projects – Timetable

- Now
  » Start thinking about potential topics, identify group members
- Oct 22-26: Informal project proposals
  » No class that week
  » James and I will meet with each group; slot signups next week
  » Ideas/proposal discussed at the meeting
  » Written proposal (a paragraph or two plus some references) due Monday, Oct 29
- Late Nov: Project checkpoint
  » Update on your progress and what is left to do
- Dec 12-18: Project demos
  » Each group gets a 30-minute slot
  » Presentation/demo/whatever you like
  » Turn in a short report on your project

Sample Project Ideas (in random order)

- Memory system
  » Cache profiler for LLVM IR – miss rates, stride determination
  » Data cache prefetching
  » Data layout for improved cache behavior
- Optimizing for GPUs
  » Dumb OpenCL/CUDA → smart OpenCL/CUDA – selection of threads/blocks and managing on-chip memory
  » Reducing uncoalesced memory accesses – measurement of uncoalesced accesses, code restructuring to reduce them
  » Matlab → CUDA/OpenCL
  » StreamIt → CUDA/OpenCL (single or multiple GPUs)
- Reliability
  » AVF profiling, reducing vulnerability to soft errors
  » Coarse-grain code duplication for soft error protection

More Project Ideas

- Parallelization/SIMDization
  » CUDA/OpenCL/Matlab/OpenMP to x86+SSE or ARM+NEON
  » DOALL loop parallelization, dependence-breaking transformations
  » Stream parallelism analysis tool – a profiling tool to identify the maximal program regions that exhibit stream (or DOALL) parallelism
  » Access-execute program decomposition – decompose a program into memory-access and compute threads
- Dynamic optimization (Dynamo, Dalvik VM)
  » Run-time DOALL loop parallelization
  » Run-time program analysis for security (taint tracking)
  » Run-time classic optimizations (if they don't have them already!)
- Profiling tools
  » Distributed control flow/energy profiling (profiling + aggregation)

And More Project Ideas

- VLIW
  » Hyperblock formation in LLVM
  » Acyclic BB scheduler for LLVM
  » Modulo BB scheduler for LLVM
- Binary optimizer
  » ARM binary to LLVM IR, de-register allocation
  » x86 binary (clean) to LLVM IR
- Energy
  » Minimizing instruction bit flips
- Misc
  » Program distillation – create a subset program with equivalent memory/branch behavior
  » Code sharing analysis for mobile applications

From Last Time: Class Problem Solution

Optimize this code by applying:
1. Dead code elimination
2. Forward copy propagation
3. CSE

Before:
  r4 = r1
  r6 = r15
  r2 = r3 * r4
  r8 = r2 + r5
  r9 = r3
  r7 = load(r2)
  if (r2 > r8)
  r5 = r9 * r4
  r11 = r2
  r12 = load(r11)
  if (r12 != 0)
  r3 = load(r2)
  r10 = r3 / r6
  r11 = r8
  store (r11, r7)
  store (r12, r3)

After:
  r2 = r3 * r1
  r8 = r2 + r5
  r7 = load(r2)
  if (r2 > r8)
  if (r7 != 0)
  r3 = r7
  store (r8, r7)
  store (r12, r3)

Induction Variable Strength Reduction

- Create basic induction variables from derived induction variables
- Induction variables
  » BIV (i++): 0, 1, 2, 3, 4, ...
  » DIV (j = i * 4): 0, 4, 8, 12, 16, ...
  » A DIV can be converted into a BIV that is incremented by 4
- Issues
  » Initial and increment values
  » Where to place the increments

Example (r4 is the BIV):
  r5 = r4 - 3
  r4 = r4 + 1
  r7 = r4 * r9
  r6 = r4 << 2

Induction Variable Strength Reduction (2)

- Rules
  » X is a *, <<, +, or - operation
  » src1(X) is a basic induction variable
  » src2(X) is invariant
  » No other ops modify dest(X)
  » dest(X) != src(X) for all srcs
  » dest(X) is a register
- Transformation (a source-level C sketch follows)
  » Insert the following into the preheader:
      new_reg = RHS(X)
  » If opcode(X) is not add/sub, insert at the bottom of the preheader:
      new_inc = inc(src1(X)) opcode(X) src2(X)
  » else:
      new_inc = inc(src1(X))
  » Insert the following at each update of src1(X):
      new_reg += new_inc
  » Change X to: dest(X) = new_reg

Example (r4 is the BIV):
  r5 = r4 - 3
  r4 = r4 + 1
  r7 = r4 * r9
  r6 = r4 << 2
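To make the transformation concrete, here is a minimal C sketch (hypothetical function and names): the derived induction variable j = i * 4 is rewritten as a basic induction variable stepped by 4.

  /* Before: the derived IV j = i * 4 costs a multiply every iteration. */
  int f_before(const int a[], int n) {
      int sum = 0;
      for (int i = 0; i < n; i++) {
          int j = i * 4;      /* DIV: 0, 4, 8, 12, ... */
          sum += a[j];        /* assumes a[] has at least 4*n elements */
      }
      return sum;
  }

  /* After: j is now a BIV. Preheader: new_reg = 0 (RHS of X at entry) and
     new_inc = inc(i) * 4 = 4; each update of i also does new_reg += new_inc. */
  int f_after(const int a[], int n) {
      int sum = 0;
      for (int i = 0, j = 0; i < n; i++, j += 4) {
          sum += a[j];
      }
      return sum;
  }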

Homework Problem

Optimize this loop by applying induction variable strength reduction.

Preheader:
  r1 = 0
  r2 = 0
Loop:
  r5 = r5 + 1
  r11 = r5 * 2
  r10 = r
  r12 = load(r10+0)
  r9 = r1 << 1
  r4 = r
  r3 = load(r4+4)
  r3 = r3 + 1
  store(r4+0, r3)
  r7 = r3 << 2
  r6 = load(r7+0)
  r13 = r2 - 1
  r1 = r1 + 1
  r2 = r2 + 1

Liveout: r13, r12, r6, r10

ILP Optimization

- Traditional optimizations
  » Redundancy elimination
  » Reducing operation count
- ILP (instruction-level parallelism) optimizations
  » Increase the amount of parallelism and the ability to overlap operations
  » Operation count is secondary; often trade extra instructions for parallelism (while avoiding code explosion)
- ILP is increased by breaking dependences (a small C illustration follows)
  » True or flow dependence = read after write
  » False dependences (anti/output) = write after read, write after write
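As a tiny illustration, a hypothetical C fragment in which renaming alone breaks a false (anti) dependence:

  /* Anti dependence: the write of x must stay after the earlier read. */
  int anti_demo(int x, int z) {
      int y = x + 1;    /* read of x */
      x = z * 2;        /* write after read: sequentialized */
      return y + x;
  }

  /* Renamed: x1 is a fresh name, so the two operations are independent. */
  int anti_demo_renamed(int x, int z) {
      int y  = x + 1;   /* read of x */
      int x1 = z * 2;   /* no longer ordered after the read */
      return y + x1;
  }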

Global Register Renaming

- The straight-line code strategy does not work: a single use may have multiple reaching defs
- Web = a collection of defs/uses which have possible value flow between them
  » Identify webs:
    - Take a def, add all its uses
    - Take all uses, add all reaching defs
    - Take all defs, add all uses
    - Repeat until a stable solution is reached
  » Each web is renamed if its name is the same as another web's

[Slide figure: a CFG whose blocks define and use x and y; the defs/uses of each variable partition into separate webs. A C-level sketch follows.]
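A hypothetical C-level view of what the webs buy us: the name x below carries two unrelated webs, so the second web can safely receive its own register.

  int web_demo(int a, int b) {
      int x = a + 1;    /* web 1: this def and the use on the next line */
      int r = x * 2;
      x = b - 1;        /* web 2: an unrelated reuse of the name x */
      return r + x;     /* renaming web 2 (say, to x1) breaks the false
                           dependence between the two webs */
  }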

Back Substitution

- Generation of expressions by compiler front ends is very sequential
  » Account for operator precedence
  » Apply left-to-right within the same precedence
- Back substitution
  » Creates larger expressions: iteratively substitute the RHS expression for the LHS variable
  » Note – the result may correspond to multiple source statements
  » Enables subsequent optimizations
- Optimization
  » Re-compute the expression in a more favorable manner

Source: y = a + b + c - d + e - f;

  r9 = r1 + r2
  r10 = r9 + r3
  r11 = r10 - r4
  r12 = r11 + r5
  r13 = r12 - r6

Subs r12: r13 = r11 + r5 - r6
Subs r11: r13 = r10 - r4 + r5 - r6
Subs r10: r13 = r9 + r3 - r4 + r5 - r6
Subs r9:  r13 = r1 + r2 + r3 - r4 + r5 - r6

Tree Height Reduction

- Re-compute the expression as a balanced binary tree (a C sketch follows)
  » Obey precedence rules
  » Essentially re-parenthesize
  » Combine literals if possible
- Effects
  » Height for n terms reduced from n-1 (assuming unit latency) to ceil(log2(n))
  » Number of operations remains constant
  » Cost: temporary registers are "live" longer
  » Watch out: always OK for integer arithmetic, but for floating point it may not be!!

After back substitution: r13 = r1 + r2 + r3 - r4 + r5 - r6

Balanced tree: (r1 + r2) + (r3 - r4) + (r5 - r6)

Final code:
  t1 = r1 + r2
  t2 = r3 - r4
  t3 = r5 - r6
  t4 = t1 + t2
  r13 = t4 + t3
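In C terms (a minimal sketch mirroring the slide's final code), the balanced form makes the three leaf operations mutually independent:

  /* Sequential form: height 5; every op waits on the previous one. */
  int thr_seq(int a, int b, int c, int d, int e, int f) {
      return a + b + c - d + e - f;        /* ((((a+b)+c)-d)+e)-f */
  }

  /* Balanced form: height 3; t1, t2, t3 can all issue together.
     Safe for integers; for float/double this reassociation could
     change rounding, as the slide warns. */
  int thr_balanced(int a, int b, int c, int d, int e, int f) {
      int t1 = a + b;
      int t2 = c - d;
      int t3 = e - f;
      int t4 = t1 + t2;
      return t4 + t3;
  }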

Class Problem

Assume latencies: + = 1, * = 3.

Operand arrival times:
  r1: 0, r2: 0, r3: 0, r4: 1, r5: 2, r6: 0

  r10 = r1 * r2
  r11 = r10 + r3
  r12 = r11 + r4
  r13 = r12 - r5
  r14 = r13 + r6

1. Back substitute
2. Re-express in tree-height-reduced form
3. Account for latency and arrival times

Optimizing Unrolled Loops

Unroll = replicate the loop body n-1 times, hoping to enable overlap of operation execution from different iterations. As unrolled below, that overlap is not possible! Every iteration reuses r1, r3, r5 and updates r2, r4, r6. (A source-level C sketch follows.)

Original loop:
loop:
  r1 = load(r2)
  r3 = load(r4)
  r5 = r1 * r3
  r6 = r6 + r5
  r2 = r2 + 4
  r4 = r4 + 4
  if (r4 < 400) goto loop

Unrolled 3 times:
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter2:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter3:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  if (r4 < 400) goto loop
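At the source level this loop is a dot product over 100 4-byte elements (hence r4 < 400); a hypothetical C sketch of the unroll-by-3:

  /* Unrolled by 3; assumes 3 divides n (remainder loop omitted). */
  int dot3(const int *a, const int *b, int n) {
      int sum = 0;
      for (int i = 0; i < n; i += 3) {
          sum += a[i]   * b[i];     /* iteration 1 */
          sum += a[i+1] * b[i+1];   /* iteration 2 */
          sum += a[i+2] * b[i+2];   /* iteration 3 */
      }
      return sum;
  }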

Register Renaming on Unrolled Loop

Before renaming:
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter2:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter3:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  if (r4 < 400) goto loop

After renaming:
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter2:
    r11 = load(r2)
    r13 = load(r4)
    r15 = r11 * r13
    r6 = r6 + r15
    r2 = r2 + 4
    r4 = r4 + 4
  iter3:
    r21 = load(r2)
    r23 = load(r4)
    r25 = r21 * r23
    r6 = r6 + r25
    r2 = r2 + 4
    r4 = r4 + 4
  if (r4 < 400) goto loop

Register Renaming is Not Enough!

- Still not much overlap is possible
- Problems
  » r2, r4, and r6 sequentialize the iterations
  » We need to rename these as well
- Two specialized renaming optimizations
  » Accumulator variable expansion (r6)
  » Induction variable expansion (r2, r4)

(The renamed loop from the previous slide is repeated on the slide.)

Accumulator Variable Expansion

- Accumulator variable
  » x = x + y or x = x - y
  » where y is loop-variant!!
- Create n-1 temporary accumulators
- Each iteration targets a different accumulator
- Sum up the accumulator variables at the end
- May not be safe for floating-point values
(A C sketch follows.)

r16 = r26 = 0
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 4
    r4 = r4 + 4
  iter2:
    r11 = load(r2)
    r13 = load(r4)
    r15 = r11 * r13
    r16 = r16 + r15
    r2 = r2 + 4
    r4 = r4 + 4
  iter3:
    r21 = load(r2)
    r23 = load(r4)
    r25 = r21 * r23
    r26 = r26 + r25
    r2 = r2 + 4
    r4 = r4 + 4
  if (r4 < 400) goto loop
r6 = r6 + r16 + r26
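A hypothetical C rendering of accumulator expansion on the unrolled dot product; only the accumulator is expanded here (the pointer increments stay sequential, as on the slide), and the partial sums are combined once at loop exit:

  /* Assumes 3 divides n; the reassociation may be unsafe for floats. */
  int dot3_acc(const int *a, const int *b, int n) {
      int s0 = 0, s1 = 0, s2 = 0;   /* n-1 = 2 extra accumulators */
      const int *end = a + n;
      while (a < end) {
          s0 += *a++ * *b++;        /* iteration 1 targets s0 */
          s1 += *a++ * *b++;        /* iteration 2 targets s1 */
          s2 += *a++ * *b++;        /* iteration 3 targets s2 */
      }
      return s0 + s1 + s2;          /* sum the accumulators at the end */
  }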

Induction Variable Expansion

- Induction variable
  » x = x + y or x = x - y
  » where y is loop-invariant!!
- Create n-1 additional induction variables
- Each iteration uses and modifies a different induction variable
- Initialize the induction variables to init, init+step, init+2*step, etc.
- The step is increased to n * the original step
- Now the iterations are completely independent!!
(A C sketch follows.)

Preheader (step increased to 3 * 4 = 12):
r16 = r26 = 0
r12 = r2 + 4, r22 = r2 + 8
r14 = r4 + 4, r24 = r4 + 8
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
    r2 = r2 + 12
    r4 = r4 + 12
  iter2:
    r11 = load(r12)
    r13 = load(r14)
    r15 = r11 * r13
    r16 = r16 + r15
    r12 = r12 + 12
    r14 = r14 + 12
  iter3:
    r21 = load(r22)
    r23 = load(r24)
    r25 = r21 * r23
    r26 = r26 + r25
    r22 = r22 + 12
    r24 = r24 + 12
  if (r4 < 400) goto loop
r6 = r6 + r16 + r26
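The same transformation sketched in C (hypothetical names): each unrolled copy walks its own pointer, initialized one step apart and stepped by n * the original step:

  int dot3_ive(const int *a, const int *b, int n) {
      int s0 = 0, s1 = 0, s2 = 0;
      const int *a0 = a, *a1 = a + 1, *a2 = a + 2;  /* init, init+step, init+2*step */
      const int *b0 = b, *b1 = b + 1, *b2 = b + 2;
      for (int i = 0; i < n; i += 3) {              /* assumes 3 divides n */
          s0 += *a0 * *b0;
          s1 += *a1 * *b1;
          s2 += *a2 * *b2;
          a0 += 3; a1 += 3; a2 += 3;                /* step = 3 * original step */
          b0 += 3; b1 += 3; b2 += 3;
      }
      return s0 + s1 + s2;
  }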

Better Induction Variable Expansion

- With base+displacement addressing, we often don't need additional induction variables
  » Just change the offsets in each iteration to reflect the step
  » Change the final increments to n * the original step
(A C sketch follows.)

r16 = r26 = 0
loop:
  iter1:
    r1 = load(r2)
    r3 = load(r4)
    r5 = r1 * r3
    r6 = r6 + r5
  iter2:
    r11 = load(r2+4)
    r13 = load(r4+4)
    r15 = r11 * r13
    r16 = r16 + r15
  iter3:
    r21 = load(r2+8)
    r23 = load(r4+8)
    r25 = r21 * r23
    r26 = r26 + r25
  r2 = r2 + 12
  r4 = r4 + 12
  if (r4 < 400) goto loop
r6 = r6 + r16 + r26
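In C, array indexing already maps to base+displacement addressing, so a single induction variable with constant displacements expresses the same thing (hypothetical sketch):

  int dot3_bd(const int *a, const int *b, int n) {
      int s0 = 0, s1 = 0, s2 = 0;
      for (int i = 0; i < n; i += 3) {   /* one IV, final step = 3 * original */
          s0 += a[i]   * b[i];           /* displacement 0 */
          s1 += a[i+1] * b[i+1];         /* displacement +1 element (+4 bytes) */
          s2 += a[i+2] * b[i+2];         /* displacement +2 elements (+8 bytes) */
      }
      return s0 + s1 + s2;               /* assumes 3 divides n */
  }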

Homework Problem

Original loop:
loop:
  r1 = load(r2)
  r5 = r6 + 3
  r6 = r5 + r1
  r2 = r2 + 4
  if (r2 < 400) goto loop

Unrolled 3 times:
loop:
  r1 = load(r2)
  r5 = r6 + 3
  r6 = r5 + r1
  r2 = r2 + 4
  r1 = load(r2)
  r5 = r6 + 3
  r6 = r5 + r1
  r2 = r2 + 4
  r1 = load(r2)
  r5 = r6 + 3
  r6 = r5 + r1
  r2 = r2 + 4
  if (r2 < 400) goto loop

Optimize the unrolled loop by applying:
1. Renaming
2. Tree height reduction
3. Induction/accumulator variable expansion

New Topic – Code Generation

- Map optimized "machine-independent" assembly to final assembly code
- Input code has been through:
  » Classical optimizations
  » ILP optimizations
  » Region formation (superblocks, hyperblocks) and if-conversion (if appropriate)
- Virtual → physical binding: 2 big steps
  » 1. Scheduling
    - Determine when every operation executes
    - Create MultiOps
  » 2. Register allocation
    - Map virtual → physical registers
    - Spill to memory if necessary

What Do We Need to Schedule Operations?

- Information about the processor (a C sketch of such a machine description follows)
  » Number of resources
  » Which resources are used by each operation
  » Operation latencies
  » Operand encoding limitations
  » For example:
    - 2 issue slots, 1 memory port, 1 adder/multiplier
    - load = 2 cycles, add = 1 cycle, mpy = 3 cycles; all fully pipelined
    - Each operand can be a register or a 6-bit signed literal
- Ordering constraints among operations
  » What order defines correct program execution?
  » Need a precedence graph – flow, anti, and output dependences
    - What about memory deps? Control deps? Delay slots?
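A minimal C sketch of what a machine description might look like; every name and number here is hypothetical, chosen to match the example on the slide:

  enum opcode   { OP_ADD, OP_MPY, OP_LOAD, OP_STORE, N_OPCODES };
  enum resource { RES_SLOT0, RES_SLOT1, RES_MEM_PORT, RES_ALU, N_RESOURCES };

  typedef struct {
      int      latency[N_OPCODES];   /* cycles until the result is ready */
      unsigned res_used[N_OPCODES];  /* bitmask of resources each op occupies */
      int      literal_bits;         /* operand encoding limit (6-bit signed) */
  } MachineDesc;

  static const MachineDesc mdes = {
      .latency  = { [OP_ADD] = 1, [OP_MPY] = 3, [OP_LOAD] = 2, [OP_STORE] = 1 },
      .res_used = { [OP_ADD]   = 1u << RES_ALU,
                    [OP_MPY]   = 1u << RES_ALU,
                    [OP_LOAD]  = 1u << RES_MEM_PORT,
                    [OP_STORE] = 1u << RES_MEM_PORT },
      .literal_bits = 6,
  };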

How Do We Schedule?

- When is it legal to schedule an instruction?
  » Correct execution is maintained
  » Resources are not oversubscribed
- Given multiple operations that can be scheduled, how do you pick the best one?
  » How do you know it is the best one? What about a good guess? Does it matter – can you just pick one at random?
  » Are decisions final, or is this an iterative process?
- How do we keep track of which resources are busy/free?
  » Need a reservation table: a matrix of resources x time (a C sketch follows)

Example code:
  r1 = load(r10)
  r2 = load(r11)
  r3 = r1 + 4
  r4 = r1 - r12
  r5 = r2 + r4
  r6 = r5 + r3
  r7 = load(r13)
  r8 = r7 * 23
  store (r8, r6)
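A reservation table can be as simple as a bitmask per cycle; a hypothetical sketch building on the MachineDesc above:

  #include <stdbool.h>

  #define MAX_CYCLES 64

  /* rt[c] is the bitmask of resources already reserved in cycle c. */
  typedef struct { unsigned rt[MAX_CYCLES]; } ReservationTable;

  /* An op fits at 'cycle' iff none of the resources it needs are taken. */
  static bool can_place(const ReservationTable *t, int cycle, unsigned res_mask) {
      return cycle < MAX_CYCLES && (t->rt[cycle] & res_mask) == 0;
  }

  static void place(ReservationTable *t, int cycle, unsigned res_mask) {
      t->rt[cycle] |= res_mask;   /* mark those resources busy in that cycle */
  }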

More Stuff to Worry About

- Model more resources
  » Register ports, output busses
  » Non-pipelined resources
- Dependent memory operations
- Multiple clusters
  » Cluster = a group of FUs connected to a set of register files such that an FU in the cluster has immediate access to any value produced within the cluster
  » Multicluster = a processor with 2 or more clusters, often interconnected by several low-bandwidth busses
    - Bottom line = non-uniform access latency to operands
- The scheduler has to be fast
  » Scheduling is an NP-complete problem
  » So we need a heuristic strategy
- What is better to do first: scheduling or register allocation?

Schedule Before or After Register Allocation?

Virtual registers:
  r1 = load(r10)
  r2 = load(r11)
  r3 = r1 + 4
  r4 = r1 - r12
  r5 = r2 + r4
  r6 = r5 + r3
  r7 = load(r13)
  r8 = r7 * 23
  store (r8, r6)

Physical registers:
  R1 = load(R1)
  R2 = load(R2)
  R5 = R1 + 4
  R1 = R1 - R3
  R2 = R2 + R1
  R2 = R2 + R5
  R5 = load(R4)
  R5 = R5 * 23
  store (R5, R2)

Too many artificial ordering constraints if we schedule after allocation!!!! But we need to schedule after allocation to bind the spill code. Solution: do both! Prepass schedule, then register allocation, then postpass schedule.

Data Dependences

- If 2 operations access the same register, they are dependent
- However, only keep dependences to the most recent producer/consumer, as other edges are redundant
- Types of data dependences:

Flow:
  r1 = r2 + r3
  r4 = r1 * 6

Output:
  r1 = r2 + r3
  r1 = r4 * 6

Anti:
  r1 = r2 + r3
  r2 = r5 * 6

More Dependences

- Memory dependences
  » Similar to register dependences, but through memory
  » Memory dependences may be certain or maybe
- Control dependences
  » We discussed these earlier
  » A branch determines whether an operation is executed or not
  » An operation must execute after/before a branch
  » Note: control flow (C0) is not a dependence

Mem-flow:
  store (r1, r2)
  r3 = load(r1)

Mem-anti:
  r2 = load(r1)
  store (r1, r3)

Mem-output:
  store (r1, r2)
  store (r1, r3)

Control (C1):
  if (r1 != 0)
  r2 = load(r1)

Dependence Graph

- Represent dependences between operations in a block via a DAG
  » Nodes = operations
  » Edges = dependences
- A single-pass traversal is all that is required to insert the dependences (a C sketch follows)
- Example:

  1: r1 = load(r2)
  2: r2 = r1 + r4
  3: store (r4, r2)
  4: p1 = cmpp (r2 < 0)
  5: branch if p1 to BB3
  6: store (r1, r2)
  BB3:
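A minimal sketch of the single-pass edge insertion for register dependences (hypothetical data structures; memory and control edges are added analogously, and a real builder keeps every reader since the last write rather than just the most recent one):

  #include <stdio.h>

  #define N_REGS   64
  #define MAX_SRCS 2

  typedef struct { int dest; int src[MAX_SRCS]; } Op;   /* -1 = unused slot */

  static void add_edge(int from, int to, const char *kind) {
      printf("op%d -> op%d (%s)\n", from, to, kind);    /* record the edge */
  }

  void build_dag(const Op *ops, int n) {
      int last_writer[N_REGS], last_reader[N_REGS];
      for (int r = 0; r < N_REGS; r++) last_writer[r] = last_reader[r] = -1;

      for (int i = 0; i < n; i++) {
          for (int s = 0; s < MAX_SRCS; s++) {          /* handle uses first */
              int r = ops[i].src[s];
              if (r < 0) continue;
              if (last_writer[r] >= 0) add_edge(last_writer[r], i, "flow");
              last_reader[r] = i;
          }
          int d = ops[i].dest;                          /* then the def */
          if (d >= 0) {
              if (last_reader[d] >= 0 && last_reader[d] != i)
                  add_edge(last_reader[d], i, "anti");
              if (last_writer[d] >= 0) add_edge(last_writer[d], i, "output");
              last_writer[d] = i;
          }
      }
  }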

Dependence Edge Latencies

- Edge latency = the minimum number of cycles necessary between initiation of the predecessor and initiation of the successor in order to satisfy the dependence
- Register flow dependence, a → b
  » latency = latest_write(a) – earliest_read(b)   (earliest_read is typically 0)
- Register anti dependence, a → b
  » latency = latest_read(a) – earliest_write(b) + 1   (latest_read is typically equal to earliest_write, so anti deps are 1 cycle)
- Register output dependence, a → b
  » latency = latest_write(a) – earliest_write(b) + 1   (earliest_write is typically equal to latest_write, so output deps are 1 cycle)
- Negative latency
  » Possible: it means the successor can start before the predecessor
  » We will only deal with latency >= 0, so MAX any latency with 0

Dependence Edge Latencies (2)

- Memory dependences, a → b (all types: flow, anti, output)
  » latency = latest_serialization_latency(a) – earliest_serialization_latency(b) + 1   (generally this is 1)
- Control dependences (the rules are collected in the C sketch below)
  » branch → b
    - Op b cannot issue until the prior branch has completed
    - latency = branch_latency
  » a → branch
    - Op a must be issued before the branch completes
    - latency = 1 – branch_latency (can be negative)
    - Conservative: latency = MAX(0, 1 – branch_latency)
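Collected as code, the latency rules are a handful of one-liners; a hypothetical sketch in which the *_read/*_write operand times would come from the machine description:

  static int max0(int x) { return x > 0 ? x : 0; }

  /* Register dependences between predecessor a and successor b. */
  int flow_latency(int latest_write_a, int earliest_read_b) {
      return max0(latest_write_a - earliest_read_b);       /* usually op latency */
  }
  int anti_latency(int latest_read_a, int earliest_write_b) {
      return max0(latest_read_a - earliest_write_b + 1);   /* usually 1 */
  }
  int output_latency(int latest_write_a, int earliest_write_b) {
      return max0(latest_write_a - earliest_write_b + 1);  /* usually 1 */
  }

  /* Control dependences. */
  int branch_to_op_latency(int branch_latency) {
      return branch_latency;            /* b waits for the branch to complete */
  }
  int op_to_branch_latency(int branch_latency) {
      return max0(1 - branch_latency);  /* conservative clamp of 1 - latency */
  }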

Class Problem

  r1 = load(r2)
  r2 = r2 + 1
  store (r8, r2)
  r3 = load(r2)
  r4 = r1 * r3
  r5 = r5 + r4
  r2 = r6 + 4
  store (r2, r5)

Machine model latencies:
  add: 1
  mpy: 3
  load: 2, sync 1
  store: 1, sync 1

1. Draw the dependence graph
2. Label the edges with type and latencies