EECS 583 – Class 10 ILP Optimization and Intro. to Code Generation University of Michigan October 8, 2012.

1 EECS 583 – Class 10 ILP Optimization and Intro. to Code Generation University of Michigan October 8, 2012

2 Reading Material + Announcements
- Places to find papers on compilers
  - Conferences: PLDI, CGO, VEE, CC, MICRO, ASPLOS
  - Journals: ACM Transactions on Architecture and Code Optimization, Software: Practice and Experience, IEEE Transactions on Computers, ACM Transactions on Programming Languages and Systems
- Today's class
  - "Machine Description Driven Compilers for EPIC Processors", B. Rau, V. Kathail, and S. Aditya, HP Technical Report HPL-98-40, 1998 (long paper but informative)
- Next class
  - "The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors," P. Chang et al., IEEE Transactions on Computers, 1995, pp. 353-370.

3 Course Project – Time to Start Thinking About This
- Mission statement: design and implement something "interesting" in a compiler
  - LLVM preferred, but others are fine
  - Groups of 2-3 people (1 or 4 people is possible in some cases)
  - Extend an existing research paper or go out on your own
- Topic areas
  - Automatic parallelization/SIMDization
  - Optimizing for GPUs
  - Targeting VLIW processors
  - Memory system optimization
  - Reliability
  - Energy
  - For the adventurous: dynamic optimization, streaming applications, binary optimization

4 Course Projects – Timetable
- Now
  - Start thinking about potential topics, identify group members
- Oct 22-26: informal project proposals
  - No class that week
  - James and I will meet with each group; slot signups next week
  - Ideas/proposal discussed at the meeting
  - Written proposal (a paragraph or two plus some references) due Monday, Oct 29
- Late Nov: project checkpoint
  - Update on your progress and what is left to do
- Dec 12-18: project demos
  - Each group gets a 30-minute slot
  - Presentation/demo/whatever you like
  - Turn in a short report on your project

5 Sample Project Ideas (in random order)
- Memory system
  - Cache profiler for LLVM IR – miss rates, stride determination
  - Data cache prefetching
  - Data layout for improved cache behavior
- Optimizing for GPUs
  - Dumb OpenCL/CUDA -> smart OpenCL/CUDA – selection of threads/blocks and managing on-chip memory
  - Reducing uncoalesced memory accesses – measurement of uncoalesced accesses, code restructuring to reduce them
  - Matlab -> CUDA/OpenCL
  - StreamIt -> CUDA/OpenCL (single or multiple GPUs)
- Reliability
  - AVF profiling, reducing vulnerability to soft errors
  - Coarse-grain code duplication for soft error protection

6 More Project Ideas
- Parallelization/SIMDization
  - CUDA/OpenCL/Matlab/OpenMP to x86+SSE or ARM+NEON
  - DOALL loop parallelization, dependence-breaking transformations
  - Stream parallelism analysis tool – profiling tool to identify maximal program regions that exhibit stream (or DOALL) parallelism
  - Access-execute program decomposition – decompose a program into memory-access/compute threads
- Dynamic optimization (Dynamo, Dalvik VM)
  - Run-time DOALL loop parallelization
  - Run-time program analysis for security (taint tracking)
  - Run-time classic optimization (if they don't have it already!)
- Profiling tools
  - Distributed control flow/energy profiling (profiling + aggregation)

7 And More Project Ideas
- VLIW
  - Hyperblock formation in LLVM
  - Acyclic BB scheduler for LLVM
  - Modulo BB scheduler for LLVM
- Binary optimizer
  - ARM binary to LLVM IR, de-register allocation
  - x86 binary (clean) to LLVM IR
- Energy
  - Minimizing instruction bit flips
- Misc
  - Program distillation – create a subset program with equivalent memory/branch behavior
  - Code-sharing analysis for mobile applications

8 From Last Time: Class Problem Solution
Optimize this by applying: 1. dead code elimination, 2. forward copy propagation, 3. CSE

Original:
r4 = r1
r6 = r15
r2 = r3 * r4
r8 = r2 + r5
r9 = r3
r7 = load(r2)
if (r2 > r8)
r5 = r9 * r4
r11 = r2
r12 = load(r11)
if (r12 != 0)
r3 = load(r2)
r10 = r3 / r6
r11 = r8
store (r11, r7)
store (r12, r3)

Optimized:
r2 = r3 * r1
r8 = r2 + r5
r7 = load(r2)
if (r2 > r8)
if (r7 != 0)
r3 = r7
store (r8, r7)
store (r12, r3)

9 Induction Variable Strength Reduction
- Create basic induction variables from derived induction variables
- Induction variables
  - BIV (i++) -> 0, 1, 2, 3, 4, ...
  - DIV (j = i * 4) -> 0, 4, 8, 12, 16, ...
  - The DIV can be converted into a BIV that is incremented by 4
- Issues
  - Initial and increment values
  - Where to place the increments

Example loop body:
r5 = r4 - 3
r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2

10 Induction Variable Strength Reduction (2)
- Rules (for a candidate operation X)
  - X is a *, <<, + or – operation
  - src1(X) is a basic induction variable
  - src2(X) is invariant
  - No other ops modify dest(X)
  - dest(X) != src(X) for all srcs
  - dest(X) is a register
- Transformation
  - Insert the following into the preheader: new_reg = RHS(X)
  - If opcode(X) is not add/sub, insert at the bottom of the preheader: new_inc = inc(src1(X)) opcode(X) src2(X)
  - else: new_inc = inc(src1(X))
  - Insert the following at each update of src1(X): new_reg += new_inc
  - Change X to: dest(X) = new_reg

Example loop body:
r5 = r4 - 3
r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2
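The transformation above can be sketched at the source level. This is a minimal Python illustration (not the slides' register-level algorithm): the derived induction variable j = i * 4 is replaced by a new basic induction variable that is incremented by 4 each iteration, eliminating the per-iteration multiply.

```python
# Strength reduction of a derived induction variable, sketched in Python.
def div_original(n):
    out = []
    for i in range(n):
        j = i * 4          # derived induction variable: a multiply every iteration
        out.append(j)
    return out

def div_reduced(n):
    out = []
    j = 0                  # new_reg initialized in the preheader (0 * 4)
    for i in range(n):
        out.append(j)
        j += 4             # new_reg += new_inc at each update of i
    return out
```

Both loops produce the same sequence 0, 4, 8, ..., but the reduced version replaces the multiply with an add.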

11 Homework Problem
Optimize this by applying induction variable strength reduction.

Preheader:
r1 = 0
r2 = 0

Loop body:
r5 = r5 + 1
r11 = r5 * 2
r10 = r11 + 2
r12 = load (r10+0)
r9 = r1 << 1
r4 = r9 - 10
r3 = load(r4+4)
r3 = r3 + 1
store(r4+0, r3)
r7 = r3 << 2
r6 = load(r7+0)
r13 = r2 - 1
r1 = r1 + 1
r2 = r2 + 1

Liveout: r13, r12, r6, r10

12 ILP Optimization
- Traditional optimizations
  - Redundancy elimination
  - Reducing operation count
- ILP (instruction-level parallelism) optimizations
  - Increase the amount of parallelism and the ability to overlap operations
  - Operation count is secondary; we often trade parallelism for extra instructions (while avoiding code explosion)
- ILP is increased by breaking dependences
  - True or flow dependence = read-after-write
  - False dependences = anti (write-after-read) and output (write-after-write)

13 Global Register Renaming
- The straight-line code strategy does not work: a single use may have multiple reaching defs
- Web = collection of defs/uses which have possible value flow between them
- Identifying webs
  - Take a def, add all its uses
  - Take all those uses, add all their reaching defs
  - Take all those defs, add all their uses
  - Repeat until a stable solution is reached
- Each web is renamed if its name is the same as another web's
(The slide's example CFG, showing separate webs for x and y, did not survive the transcript.)
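The fixed-point grouping above is equivalent to computing connected components over def-use edges. A minimal sketch, assuming a reaching-definitions analysis has already produced (def, use) pairs for one variable name:

```python
# Web construction via union-find over def-use chains (a sketch; the
# def/use ids are hypothetical, standing in for operations in a CFG).
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_webs(du_pairs):
    """du_pairs: iterable of (def_id, use_id) edges; returns a list of webs."""
    uf = UnionFind()
    for d, u in du_pairs:
        uf.union(('def', d), ('use', u))
    webs = {}
    for node in list(uf.parent):
        webs.setdefault(uf.find(node), set()).add(node)
    return list(webs.values())

# Defs 1 and 2 both reach use 10, so they fall into one web (and must share
# a name); the unrelated pair (3, 11) forms a second web that can be renamed.
webs = build_webs([(1, 10), (2, 10), (3, 11)])
```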

14 Back Substitution
- Generation of expressions by compiler frontends is very sequential
  - Accounts for operator precedence
  - Applies left-to-right within the same precedence level
- Back substitution
  - Creates larger expressions by iteratively substituting the RHS expression for the LHS variable
  - Note – the result may correspond to multiple source statements
  - Enables subsequent optimizations
- Optimization: re-compute the expression in a more favorable manner

Source: y = a + b + c – d + e – f;

Generated code:
r9 = r1 + r2
r10 = r9 + r3
r11 = r10 - r4
r12 = r11 + r5
r13 = r12 – r6

Substitution steps:
Subs r12: r13 = r11 + r5 – r6
Subs r11: r13 = r10 – r4 + r5 – r6
Subs r10: r13 = r9 + r3 – r4 + r5 – r6
Subs r9: r13 = r1 + r2 + r3 – r4 + r5 – r6
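The iterative substitution can be sketched as string rewriting over single-def temporaries. This is a toy illustration only: it assumes each register is defined once and that the expression uses only left-associative + and – (as in the slide's example), so no parenthesization is needed.

```python
# Back substitution as iterative inlining of single-def temporaries.
# The register names mirror the slide's example.
defs = [
    ("r9",  "r1 + r2"),
    ("r10", "r9 + r3"),
    ("r11", "r10 - r4"),
    ("r12", "r11 + r5"),
    ("r13", "r12 - r6"),
]

def back_substitute(defs):
    env = {}
    for dst, rhs in defs:
        # Inline every previously defined temporary into this RHS.
        for name, expr in env.items():
            rhs = rhs.replace(name, expr)
        env[dst] = rhs
    return env

env = back_substitute(defs)
# env["r13"] is the fully substituted expression over r1..r6.
```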

15 Tree Height Reduction
- Re-compute the expression as a balanced binary tree
  - Obey precedence rules
  - Essentially re-parenthesize
  - Combine literals if possible
- Effects
  - Height reduced (n terms): n-1 (assuming unit latency) -> ceil(log2(n))
  - Number of operations remains constant
  - Cost: temporary registers are "live" longer
  - Watch out: always OK for integer arithmetic, but floating point may not be!

Original:
r9 = r1 + r2
r10 = r9 + r3
r11 = r10 - r4
r12 = r11 + r5
r13 = r12 – r6

After back substitution:
r13 = r1 + r2 + r3 – r4 + r5 – r6

Final code (balanced tree over (r1 + r2), (r3 – r4), (r5 – r6)):
t1 = r1 + r2
t2 = r3 – r4
t3 = r5 – r6
t4 = t1 + t2
r13 = t4 + t3
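The height claim can be checked with a small sketch that pairwise-combines terms level by level (signs are folded into the terms, so every combine is an add):

```python
import math

# Tree-height reduction sketch: combine n terms pairwise instead of
# sequentially. With unit-latency adds, the critical path drops from
# n-1 to ceil(log2(n)) levels.
def reduce_tree(terms):
    """Pairwise-combine terms; returns (expression string, tree height)."""
    level = [(t, 0) for t in terms]          # (expression, height) pairs
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (a, ha), (b, hb) = level[i], level[i + 1]
            nxt.append((f"({a} + {b})", max(ha, hb) + 1))
        if len(level) % 2:                   # odd term carries up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# The slide's 6-term expression, with subtractions folded into the terms.
expr, height = reduce_tree(["r1", "r2", "r3", "-r4", "r5", "-r6"])
```

Six terms give height 3 = ceil(log2(6)) instead of the 5 sequential adds of the original chain.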

16 Class Problem
Assume latencies: + = 1, * = 3.
Operand arrival times: r1 = 0, r2 = 0, r3 = 0, r4 = 1, r5 = 2, r6 = 0.

r10 = r1 * r2
r11 = r10 + r3
r12 = r11 + r4
r13 = r12 – r5
r14 = r13 + r6

1. Back substitute
2. Re-express in tree-height-reduced form
3. Account for latency and arrival times

17 Optimizing Unrolled Loops
Unroll = replicate the loop body n-1 times, hoping to enable overlap of operation execution from different iterations.

Original loop:
loop:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

Unrolled 3 times (iter1, iter2, iter3 are three identical copies of the body, with the branch only after iter3):
loop:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

Overlap across the unrolled iterations? Not possible yet!

18 Register Renaming on Unrolled Loop
Before renaming, all three unrolled iterations reuse r1, r3, and r5. After renaming, iter2 and iter3 get fresh registers:

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter2:
r11 = load(r2)
r13 = load(r4)
r15 = r11 * r13
r6 = r6 + r15
r2 = r2 + 4
r4 = r4 + 4
iter3:
r21 = load(r2)
r23 = load(r4)
r25 = r21 * r23
r6 = r6 + r25
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

19 Register Renaming is Not Enough!
- Still not much overlap is possible in the renamed unrolled loop
- Problems
  - r2, r4, and r6 sequentialize the iterations
  - These need to be renamed as well
- 2 specialized renaming optimizations
  - Accumulator variable expansion (r6)
  - Induction variable expansion (r2, r4)

20 Accumulator Variable Expansion
- Accumulator variable: x = x + y or x = x – y, where y is loop variant!!
- Create n-1 temporary accumulators
- Each iteration targets a different accumulator
- Sum up the accumulator variables at the end
- May not be safe for floating-point values

Preheader:
r16 = r26 = 0

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter2:
r11 = load(r2)
r13 = load(r4)
r15 = r11 * r13
r16 = r16 + r15
r2 = r2 + 4
r4 = r4 + 4
iter3:
r21 = load(r2)
r23 = load(r4)
r25 = r21 * r23
r26 = r26 + r25
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

After the loop:
r6 = r6 + r16 + r26
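The transformation can be sketched at the source level on a dot-product-style loop (the shape of the slide's example), unrolled 3x. This sketch assumes a trip count divisible by the unroll factor; for integers the result is identical, while for floating point the reassociation may change the result.

```python
# Accumulator variable expansion: each unrolled iteration targets its own
# accumulator, and the partial sums are combined after the loop.
def dot_sequential(a, b):
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]          # one accumulator serializes the adds
    return acc

def dot_expanded(a, b):
    assert len(a) % 3 == 0          # assumed: trip count divisible by 3
    acc0 = acc1 = acc2 = 0          # playing the roles of r6, r16, r26
    for i in range(0, len(a), 3):
        acc0 += a[i] * b[i]         # the three adds are now independent
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
    return acc0 + acc1 + acc2       # sum the accumulators at loop exit
```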

21 Induction Variable Expansion
- Induction variable: x = x + y or x = x – y, where y is loop invariant!!
- Create n-1 additional induction variables
- Each iteration uses and modifies a different induction variable
- Initialize the induction variables to init, init+step, init+2*step, etc.
- The step is increased to n * the original step
- Now the iterations are completely independent!!

Preheader:
r16 = r26 = 0
r12 = r2 + 4, r22 = r2 + 8
r14 = r4 + 4, r24 = r4 + 8

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 12
r4 = r4 + 12
iter2:
r11 = load(r12)
r13 = load(r14)
r15 = r11 * r13
r16 = r16 + r15
r12 = r12 + 12
r14 = r14 + 12
iter3:
r21 = load(r22)
r23 = load(r24)
r25 = r21 * r23
r26 = r26 + r25
r22 = r22 + 12
r24 = r24 + 12
if (r4 < 400) goto loop

After the loop:
r6 = r6 + r16 + r26
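The index arithmetic can be sketched in isolation. Each unrolled iteration gets its own copy of the induction variable, initialized to init, init+step, init+2*step, with the per-copy step scaled by the unroll factor (3 here), so the copies never touch each other's state:

```python
# Induction variable expansion sketch (assumes n divisible by the unroll
# factor; i0/i1/i2 play the roles of r2, r12, r22 in the slide's example).
def indices_original(init, step, n):
    out, i = [], init
    for _ in range(n):
        out.append(i)
        i += step
    return out

def indices_expanded(init, step, n):
    assert n % 3 == 0
    i0, i1, i2 = init, init + step, init + 2 * step
    out = []
    for _ in range(n // 3):
        out.extend([i0, i1, i2])
        i0 += 3 * step              # step scaled to n * original step
        i1 += 3 * step
        i2 += 3 * step
    return out
```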

22 Better Induction Variable Expansion
- With base+displacement addressing, we often don't need additional induction variables
  - Just change the offsets in each iteration to reflect the step
  - Change the final increments to n * the original step

Preheader:
r16 = r26 = 0

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
iter2:
r11 = load(r2+4)
r13 = load(r4+4)
r15 = r11 * r13
r16 = r16 + r15
iter3:
r21 = load(r2+8)
r23 = load(r4+8)
r25 = r21 * r23
r26 = r26 + r25
r2 = r2 + 12
r4 = r4 + 12
if (r4 < 400) goto loop

After the loop:
r6 = r6 + r16 + r26

23 Homework Problem
Optimize the unrolled loop by applying renaming, tree height reduction, and induction/accumulator expansion.

Original loop:
loop:
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
if (r2 < 400) goto loop

Unrolled 3 times (three identical copies of the body, branch after the third):
loop:
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
if (r2 < 400) goto loop

24 New Topic – Code Generation
- Map optimized "machine-independent" assembly to final assembly code
- Input code has had
  - Classical optimizations
  - ILP optimizations
  - Region formation (superblocks, hyperblocks) and if-conversion applied (if appropriate)
- Virtual -> physical binding: 2 big steps
  - 1. Scheduling: determine when every operation executes; create MultiOps
  - 2. Register allocation: map virtual -> physical registers; spill to memory if necessary

25 What Do We Need to Schedule Operations?
- Information about the processor
  - Number of resources
  - Which resources are used by each operation
  - Operation latencies
  - Operand encoding limitations
  - For example: 2 issue slots, 1 memory port, 1 adder/multiplier; load = 2 cycles, add = 1 cycle, mpy = 3 cycles, all fully pipelined; each operand can be a register or a 6-bit signed literal
- Ordering constraints amongst operations
  - What order defines correct program execution?
  - Need a precedence graph – flow, anti, and output deps
  - What about memory deps? Control deps? Delay slots?
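The example processor description above can be written down as a small machine model. A minimal sketch, assuming the slide's parameters (the resource names and table layout here are my own, not the course infrastructure's):

```python
# A sketch of the machine description implied by the slide's example:
# 2 issue slots, 1 memory port, 1 adder/multiplier, fully pipelined.
MACHINE = {
    # opcode: (resources occupied at issue, result latency in cycles)
    "load":  (("slot", "mem"), 2),
    "store": (("slot", "mem"), 1),
    "add":   (("slot", "alu"), 1),
    "mpy":   (("slot", "alu"), 3),
}

def literal_fits(value, bits=6):
    """Operand-encoding limit from the slide: a 6-bit signed literal."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
```

A scheduler would consult such a table for both resource checks and latency edges.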

26 How Do We Schedule?
- When is it legal to schedule an instruction?
  - Correct execution is maintained
  - Resources are not oversubscribed
- Given multiple operations that can be scheduled, how do you pick the best one?
  - How do you know it is the best one? What about a good guess? Does it matter – just pick one at random?
  - Are decisions final, or is this an iterative process?
- How do we keep track of resources that are busy/free?
  - Need a reservation table: a matrix (resources x time)

Example code to schedule:
r1 = load(r10)
r2 = load(r11)
r3 = r1 + 4
r4 = r1 – r12
r5 = r2 + r4
r6 = r5 + r3
r7 = load(r13)
r8 = r7 * 23
store (r8, r6)
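The reservation table named above can be sketched directly: a resources-by-time structure that answers "are this operation's resources free at cycle c?". The resource names are assumptions matching the two-issue-slot example.

```python
# Reservation-table sketch: tracks which cycles each resource is busy,
# supporting the legality check "resources not oversubscribed".
class ReservationTable:
    def __init__(self, resources):
        self.busy = {r: set() for r in resources}   # resource -> busy cycles
    def fits(self, needed, cycle):
        return all(cycle not in self.busy[r] for r in needed)
    def reserve(self, needed, cycle):
        assert self.fits(needed, cycle)
        for r in needed:
            self.busy[r].add(cycle)

rt = ReservationTable(["slot0", "slot1", "mem", "alu"])
rt.reserve(["slot0", "mem"], 0)       # a load issues in cycle 0
ok = rt.fits(["slot1", "mem"], 0)     # a second load would need the same port
```

With a single memory port, the second load does not fit in cycle 0 even though slot1 is free.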

27 More Stuff to Worry About
- Model more resources
  - Register ports, output busses
  - Non-pipelined resources
- Dependent memory operations
- Multiple clusters
  - Cluster = group of FUs connected to a set of register files such that an FU in a cluster has immediate access to any value produced within the cluster
  - Multicluster = processor with 2 or more clusters, often interconnected by several low-bandwidth busses
  - Bottom line: non-uniform access latency to operands
- The scheduler has to be fast
  - Scheduling is an NP-complete problem, so we need a heuristic strategy
- What is better to do first: scheduling or register allocation?

28 Schedule Before or After Register Allocation?

Virtual registers:
r1 = load(r10)
r2 = load(r11)
r3 = r1 + 4
r4 = r1 – r12
r5 = r2 + r4
r6 = r5 + r3
r7 = load(r13)
r8 = r7 * 23
store (r8, r6)

Physical registers:
R1 = load(R1)
R2 = load(R2)
R5 = R1 + 4
R1 = R1 – R3
R2 = R2 + R1
R2 = R2 + R5
R5 = load(R4)
R5 = R5 * 23
store (R5, R2)

- Too many artificial ordering constraints if we schedule after allocation!
- But we need to schedule after allocation to bind the spill code
- Solution: do both – prepass schedule, register allocation, postpass schedule

29 Data Dependences
- If 2 operations access the same register, they are dependent
- However, only keep dependences to the most recent producer/consumer; other edges are redundant
- Types of data dependences:

Flow:
r1 = r2 + r3
r4 = r1 * 6

Output:
r1 = r2 + r3
r1 = r4 * 6

Anti:
r1 = r2 + r3
r2 = r5 * 6

30 More Dependences
- Memory dependences
  - Similar to register dependences, but through memory
  - Memory dependences may be certain or "maybe"
- Control dependences
  - We discussed this earlier
  - A branch determines whether an operation is executed or not
  - An operation must execute after/before a branch
  - Note: control flow (C0) is not a dependence

Mem-flow:
store (r1, r2)
r3 = load(r1)

Mem-output:
store (r1, r2)
store (r1, r3)

Mem-anti:
r2 = load(r1)
store (r1, r3)

Control (C1):
if (r1 != 0)
r2 = load(r1)

31 Dependence Graph
- Represent dependences between operations in a block via a DAG
  - Nodes = operations
  - Edges = dependences
- A single-pass traversal is required to insert the dependences
- Example:
1: r1 = load(r2)
2: r2 = r1 + r4
3: store (r4, r2)
4: p1 = cmpp (r2 < 0)
5: branch if p1 to BB3
6: store (r1, r2)
(The slide's DAG drawing over nodes 1-6 did not survive the transcript.)
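The single-pass construction can be sketched for register dependences by tracking, per register, the most recent writer and the readers since that write (memory and control edges would be layered on similarly):

```python
# Single-pass dependence-graph construction for a basic block
# (register dependences only; a sketch, not the course infrastructure).
def build_dag(ops):
    """ops: list of (op_id, defs, uses). Returns a set of (pred, succ, kind)."""
    edges = set()
    last_def, last_uses = {}, {}
    for oid, defs, uses in ops:
        for r in uses:
            if r in last_def:                      # flow: def -> use
                edges.add((last_def[r], oid, "flow"))
        for r in defs:
            for u in last_uses.get(r, []):         # anti: use -> redef
                if u != oid:
                    edges.add((u, oid, "anti"))
            if r in last_def:                      # output: def -> redef
                edges.add((last_def[r], oid, "output"))
        for r in defs:                             # only the most recent
            last_def[r] = oid                      # producer is kept, so
            last_uses[r] = []                      # older edges stay redundant
        for r in uses:
            last_uses.setdefault(r, []).append(oid)
    return edges

# Ops 1 and 2 from the slide: 1: r1 = load(r2);  2: r2 = r1 + r4
edges = build_dag([(1, ["r1"], ["r2"]), (2, ["r2"], ["r1", "r4"])])
```

Op 2 reads r1 written by op 1 (flow) and overwrites r2 read by op 1 (anti).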

32 Dependence Edge Latencies
- Edge latency = minimum number of cycles necessary between initiation of the predecessor and the successor in order to satisfy the dependence
- Register flow dependence, a -> b
  - latency = Latest_write(a) – Earliest_read(b) (earliest_read is typically 0)
- Register anti dependence, a -> b
  - latency = Latest_read(a) – Earliest_write(b) + 1 (latest_read is typically equal to earliest_write, so anti deps are 1 cycle)
- Register output dependence, a -> b
  - latency = Latest_write(a) – Earliest_write(b) + 1 (earliest_write is typically equal to latest_write, so output deps are 1 cycle)
- Negative latency
  - Possible; it means the successor can start before the predecessor
  - We will only deal with latency >= 0, so MAX any latency with 0
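The three register-latency formulas above, with the MAX-with-0 clamp, translate directly into code (the sample operand times in the comments are illustrative assumptions):

```python
# Register dependence edge latencies, per the formulas above, clamped to >= 0.
def flow_latency(latest_write_pred, earliest_read_succ=0):
    return max(0, latest_write_pred - earliest_read_succ)

def anti_latency(latest_read_pred, earliest_write_succ):
    return max(0, latest_read_pred - earliest_write_succ + 1)

def output_latency(latest_write_pred, earliest_write_succ):
    return max(0, latest_write_pred - earliest_write_succ + 1)

# e.g., a 2-cycle load feeding an add gives a flow latency of 2, while the
# typical anti/output cases collapse to 1 cycle.
lat = flow_latency(latest_write_pred=2)
```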

33 Dependence Edge Latencies (2)
- Memory dependences, a -> b (all types: flow, anti, output)
  - latency = latest_serialization_latency(a) – earliest_serialization_latency(b) + 1 (generally this is 1)
- Control dependences
  - branch -> b: op b cannot issue until the prior branch has completed; latency = branch_latency
  - a -> branch: op a must be issued before the branch completes; latency = 1 – branch_latency (can be negative)
  - Conservatively: latency = MAX(0, 1 – branch_latency)

34 Class Problem
Machine model latencies: add: 1; mpy: 3; load: 2 (sync 1); store: 1 (sync 1)

r1 = load(r2)
r2 = r2 + 1
store (r8, r2)
r3 = load(r2)
r4 = r1 * r3
r5 = r5 + r4
r2 = r6 + 4
store (r2, r5)

1. Draw the dependence graph
2. Label the edges with types and latencies

