Penn ESE535 Spring 2015 -- DeHon 1 ESE535: Electronic Design Automation Day 16: March 25, 2015 C  RTL.

Slides:

Advertisements

Similar presentations

P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.

Advertisements

Chapter 9 Code optimization Section 0 overview 1.Position of code optimizer 2.Purpose of code optimizer to get better efficiency –Run faster –Take less.

1 CS 201 Compiler Construction Lecture 3 Data Flow Analysis.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

Compilation 2011 Static Analysis Johnni Winther Michael I. Schwartzbach Aarhus University.

Course Outline Traditional Static Program Analysis –Theory Compiler Optimizations; Control Flow Graphs Data-flow Analysis – today’s class –Classic analyses.

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

Chapter 10 Code Optimization. A main goal is to achieve a better performance Front End Code Gen Intermediate Code source Code target Code user Machine-

1 Code Optimization Code produced by compilation algorithms can often be improved (ideally optimized) in terms of run-time speed and the amount of memory.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.

Components of representation Control dependencies: sequencing of operations –evaluation of if & then –side-effects of statements occur in right order Data.

Program Representations. Representing programs Goals.

1 Data flow analysis Goal : collect information about how a procedure manipulates its data This information is used in various optimizations For example,

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 23, 2008 C  RTL.

Representing programs Goals. Representing programs Primary goals –analysis is easy and effective just a few cases to handle directly link related things.

CS 536 Spring Intermediate Code. Local Optimizations. Lecture 22.

1 Intermediate representation Goals: –encode knowledge about the program –facilitate analysis –facilitate retargeting –facilitate optimization scanning.

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 21: April 15, 2009 Routing 1.

Class canceled next Tuesday. Recap: Components of IR Control dependencies: sequencing of operations –evaluation of if & then –side-effects of statements.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 12: March 2, 2009 Sequential Optimization (FSM Encoding)

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 11, 2009 Dataflow.

Intermediate Code. Local Optimizations

Prof. Fateman CS164 Lecture 211 Local Optimizations Lecture 21.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

Machine-Independent Optimizations Ⅰ CS308 Compiler Theory1.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 18, 2009 Static Timing Analysis and Multi-Level Speedup.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 2: January 21, 2009 C  RTL.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 13, 2008 Retiming.

Topic #10: Optimization EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.

Predicated Static Single Assignment (PSSA) Presented by AbdulAziz Al-Shammari

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 24: April 18, 2011 Covering and Retiming.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 10: February 18, 2015 Architecture Synthesis (Provisioning, Allocation)

1 Code optimization “Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated object code”

Cleaning up the CFG Eliminating useless nodes & edges C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon,

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 23: April 20, 2015 Static Timing Analysis and Multi-Level Speedup.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 2: January 20, 2010 Universality, Gates, Logic Work Preclass Exercise.

Cleaning up the CFG Eliminating useless nodes & edges This lecture describes the algorithm Clean, presented in Chapter 10 of EaC2e. The algorithm is due.

3/2/2016© Hal Perkins & UW CSES-1 CSE P 501 – Compilers Optimizing Transformations Hal Perkins Autumn 2009.

CS412/413 Introduction to Compilers and Translators April 2, 1999 Lecture 24: Introduction to Optimization.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 20: April 4, 2011 Static Timing Analysis and Multi-Level Speedup.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 13, 2013 High Level Synthesis II Dataflow Graph Sharing.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 25: April 17, 2013 Covering and Retiming.

ESE532: System-on-a-Chip Architecture

Code Optimization Code produced by compilation algorithms can often be improved (ideally optimized) in terms of run-time speed and the amount of memory.

Code Optimization Overview and Examples

High-level optimization Jakub Yaghob

Code Optimization.

CS Spring 2008 – Lec #17 – Retiming - 1

ESE532: System-on-a-Chip Architecture

Optimization Code Optimization ©SoftMoore Consulting.

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

Optimizing Transformations Hal Perkins Autumn 2011

ESE535: Electronic Design Automation

Optimizing Transformations Hal Perkins Winter 2008

Code Optimization Overview and Examples Control Flow Graph

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

ESE535: Electronic Design Automation

ESE532: System-on-a-Chip Architecture

Presentation transcript:

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 16: March 25, 2015 C  RTL

Penn ESE535 Spring DeHon 2 Today See how get from a language (C) to dataflow Basic translation –Straight-line code –Memory –Basic Blocks –Control Flow –Looping Optimization –If-conversion –Hyperblocks –Common Optimizations –Pipelining –Unrolling Behavioral (C, MATLAB, …) RTL Gate Netlist Layout Masks Arch. Select Schedule FSM assign Two-level, Multilevel opt. Covering Retiming Placement Routing

Penn ESE535 Spring DeHon 3 DOMAIN SPECIFIC RTL GATE TRANSISTOR BEHAVIORAL Design Productivity by Approach a b s q 0 1 d clk GATES/WEEK (Dataquest) K - 2K 2K - 10K 8K - 12K Source: Keutzer (UCB EE 244) Day 1

Penn ESE535 Spring DeHon 4 C Primitives Arithmetic Operators Unary Minus (Negation) -a Addition (Sum) a + b Subtraction (Difference) a - b Multiplication (Product) a * b Division (Quotient) a / b Modulus (Remainder) a % b Things might have a hardware operator for…

Penn ESE535 Spring DeHon 5 C Primitives Bitwise Operators Bitwise Left Shift a << b Bitwise Right Shift a >> b Bitwise One's Complement ~a Bitwise AND a & b Bitwise OR a | b Bitwise XOR a ^ b Things might have a hardware operator for…

Penn ESE535 Spring DeHon 6 C Primitives Comparison Operators Less Than a < b Less Than or Equal To a <= b Greater Than a > b Greater Than or Equal To a >= b Not Equal To a != b Equal To a == b Logical Negation !a Logical AND a && b Logical OR a || b Things might have a hardware operator for…

Penn ESE535 Spring DeHon 7 Expressions: combine operators a*x+b A connected set of operators  Graph of operators * + ax b

Penn ESE535 Spring DeHon 8 Expressions: combine operators a*x+b a*x*x+b*x+c a*(x+b)*x+c ((a+10)*b < 100) A connected set of operators  Graph of operators

Penn ESE535 Spring DeHon 9 C Assignment Basic assignment statement is: Location = expression f=a*x+b * + ax b f

Penn ESE535 Spring DeHon 10 Straight-line code a sequence of assignments What does this mean? g=a*x; h=b+g; i=h*x; j=i+c; bc * ax g + h * i + j

Penn ESE535 Spring DeHon 11 Variable Reuse Variables (locations) define flow between computations Locations (variables) are reusable t=a*x; r=t*x; t=b*x; r=r+t; r=r+c;

Penn ESE535 Spring DeHon 12 Variable Reuse Variables (locations) define flow between computations Locations (variables) are reusable t=a*x; r=t*x; t=b*x; r=r+t; r=r+c; Sequential assignment semantics tell us which definition goes with which use. –Use gets most recent preceding definition.

Penn ESE535 Spring DeHon 13 Dataflow Can turn sequential assignments into dataflow graph through def  use connections t=a*x; r=t*x; t=b*x; r=r+t; r=r+c; ** * + + a x b c

Penn ESE535 Spring DeHon 14 Dataflow Height t=a*x; r=t*x; t=b*x; r=r+t; r=r+c; Height (delay) of DF graph may be less than # sequential instructions. ** * + + a x b c

Penn ESE535 Spring DeHon 15 Lecture Checkpoint Happy with –Straight-line code –Variables Graph for preclass f Next topic: Memory

Penn ESE535 Spring DeHon 16 C Memory Model One big linear address space of locations Most recent definition to location is value Sequential flow of statements Addr New value Current value

Penn ESE535 Spring DeHon 17 C Memory Operations Read/Use a=*p; a=p[0] a=p[c*10+d] Write/Def *p=2*a+b; p[0]=23; p[c*10+d]=a*x+b;

Penn ESE535 Spring DeHon 18 Memory Operation Challenge Memory is just a set of location But memory expressions can refer to variable locations –Does *q and *p refer to same location? –p[0] and p[c*10+d]? –*p and q[c*10+d]? –p[f(a)] and p[g(b)] ?

Penn ESE535 Spring DeHon 19 Pitfall P[i]=23 r=10+P[i] P[j]=17 s=P[j]*12 Value of r and s? Could do: P[i]=23; P[j]=17; r=10+P[i]; s=P[j]*12 ….unless i==j Value of r and s?

Penn ESE535 Spring DeHon 20 C Pointer Pitfalls *p=23 r=10+*p; *q=17 s=*q*12; Similar limit if p==q

Penn ESE535 Spring DeHon 21 C Memory/Pointer Sequentialization Must preserve ordering of memory operations –A read cannot be moved before write to memory which may redefine the location of the read Conservative: any write to memory Sophisticated analysis may allow us to prove independence of read and write –Writes which may redefine the same location cannot be reordered

Penn ESE535 Spring DeHon 22 Consequence Expressions and operations through variables (whose address is never taken) can be executed at any time –Just preserve the dataflow Memory assignments must execute in strict order –Ideally: partial order –Conservatively: strict sequential order of C

Penn ESE535 Spring DeHon 23 Forcing Sequencing Demands we introduce some discipline for deciding when operations occur –Could be a FSM –Could be an explicit dataflow token –Callahan uses control register Other uses for timing control –Control –Variable delay blocks –Looping

Penn ESE535 Spring DeHon 24 Scheduled Memory Operations Source: Callahan

Control Penn ESE535 Spring DeHon 25

Conditions If (cond) –DoA Else –DoB While (cond) –DoBody No longer straightline code Code selectively executed Data determines which computation to perform Penn ESE535 Spring DeHon 26

Penn ESE535 Spring DeHon 27 Basic Blocks Sequence of operations with –Single entry point –Once enter execute all operations in block –Set of exits at end begin: x=y; y++; z=y; t=z>20; brfalse t, finish y=4 finish: x=x*y end: BB0: x=y; y++; z=y; t=z>20 br(t,BB1,BB2) BB1: y=4; br BB2 BB2: x=x*y; Basic Blocks?

Penn ESE535 Spring DeHon 28 Basic Blocks Sequence of operations with –Single entry point –Once enter execute all operations in block –Set of exits at end Can dataflow schedule operations within a basic block –As long as preserve memory ordering

Penn ESE535 Spring DeHon 29 Connecting Basic Blocks Connect up basic blocks by routing control flow token –May enter from several places –May leave to one of several places

Penn ESE535 Spring DeHon 30 Connecting Basic Blocks Connect up basic blocks by routing control flow token –May enter from several places –May leave to one of several places BB0 BB1 BB2 begin: x=y; y++; z=y; t=z>20; brfalse t, finish y=4 finish: x=x*y end: BB0: x=y; y++; z=y; t=z>20 br(t,BB1,BB2) BB1: y=4; br BB2 BB2: x=x*y;

Penn ESE535 Spring DeHon 31 Basic Blocks for if/then/else Source: Callahan

Penn ESE535 Spring DeHon 32 Loops sum=0; for (i=0;i<imax;i++) sum+=i; r=sum<<2; sum=0; i=0; i<imax sum+=i; i=i+1; r=sum<<2;

Penn ESE535 Spring DeHon 33 Lecture Checkpoint Happy with –Straight-line code –Variables –Memory –Control Q: Satisfied with implementation this is producing?

Penn ESE535 Spring DeHon 34 Beyond Basic Blocks Basic blocks tend to be limiting Runs of straight-line code are not long For good hardware implementation –Want more parallelism

Penn ESE535 Spring DeHon 35 Simple Control Flow If (cond) { … } else { …} Assignments become conditional In simplest cases (no memory ops), can treat as dataflow node cond select thenelse

Penn ESE535 Spring DeHon 36 Simple Conditionals if (a>b) c=b*c; else c=a*c; a>bb*ca*c c

Penn ESE535 Spring DeHon 37 Simple Conditionals v=a; if (b>a) v=b; If not assigned, value flows from before assignment b>a ba v

Penn ESE535 Spring DeHon 38 Simple Conditionals max=a; min=a; if (a>b) {min=b; c=1;} else {max=b; c=0;} May (re)define many values on each branch. a>b b a min 10 maxc

Preclass G Finish drawing graph for preclass g Penn ESE535 Spring DeHon 39

Penn ESE535 Spring DeHon 40 Recall: Basic Blocks for if/then/else Source: Callahan

Mux Converted if (a>10) a++; else; a— x=a^0x07 Penn ESE535 Spring DeHon 41

Height Reduction Mux converted version has shorter path (lower latency) Why? Penn ESE535 Spring DeHon 42

Height Reduction Mux converted version has shorter path (lower latency) Can execute condition in parallel with then and else clauses Penn ESE535 Spring DeHon 43

Mux Conversion and Memory What might go wrong if we mux- converted the following: If (cond) –*a=0 Else –*b=0 Penn ESE535 Spring DeHon 44

Mux Conversion and Memory What might go wrong if we mux- converted the following: If (cond) –*a=0 Else –*b=0 Don’t want memory operations in non- taken branch to occur. Penn ESE535 Spring DeHon 45

Mux Conversion and Memory If (cond) –*a=0 Else –*b=0 Don’t want memory operations in non- taken branch to occur. Conclude: cannot mux-convert blocks with branches (without additional care) Penn ESE535 Spring DeHon 46

Penn ESE535 Spring DeHon 47 Hyperblocks Can convert if/then/else into dataflow –If/mux-conversion Hyperblock –Single entry point –No internal branches –Internal control flow provided by mux conversion –May exit at multiple points a>bb*ca*c c

Penn ESE535 Spring DeHon 48 Basic Blocks  Hyperblock Source: Callahan

Penn ESE535 Spring DeHon 49 Hyperblock Benefits More code  typically more parallelism –Shorter critical path Optimization opportunities –Reduce work in common flow path –Move logic for uncommon case out of path Makes smaller faster

Penn ESE535 Spring DeHon 50 Common Case Height Reduction Source: Callahan

Penn ESE535 Spring DeHon 51 Common-Case Flow Optimization Source: Callahan

Penn ESE535 Spring DeHon 52 Optimizations Constant propagation: a=10; b=c[a]; Copy propagation: a=b; c=a+d;  c=b+d; Constant folding: c[10*10+4];  c[104]; Identity Simplification: c=1*a+0;  c=a; Strength Reduction: c=b*2;  c=b<<1; Dead code elimination Common Subexpression Elimination: –C[x*100+y]=A[x*100+y]+B[x*100+y] –t=x*100+y; C[t]=A[t]+B[t]; Operator sizing: for (i=0; i<100; i++) b[i]=(a&0xff+i);

Penn ESE535 Spring DeHon 53 Additional Concerns? What are we still not satisfied with? Parallelism in hyperblock –Especially if memory sequentialized Disambiguate memories? Allow multiple memory banks? Only one hyperblock active at a time –Share hardware between blocks? Data only used from one side of mux –Share hardware between sides? Most logic in hyperblock idle? –Couldn’t we pipeline execution?

Penn ESE535 Spring DeHon 54 Pipelining for (i=0;i<MAX;i++) o[i]=(a*x[i]+b)*x[i]+c; If know memory operations independent i<MAX * + * + a b c +read write o x i

Unrolling Put several (all?) executions of loop into straight-line code in the body. for (i=0;i<MAX;i++) o[i]=(a*x[i]+b)*x[i]+c; for (i=0;i<MAX;i+=2) o[i]=(a*x[i]+b)*x[i]+c; o[i+1]=(a*x[i+1]+b)*x[i+1]+c; Penn ESE535 Spring DeHon 55

Unrolling If MAX=4: o[0]=(a*x[0]+b)*x[0]+c; o[1]=(a*x[1]+b)*x[1]+c; o[2]=(a*x[2]+b)*x[2]+c; o[3]=(a*x[3]+b)*x[3]+c; for (i=0;i<MAX;i++) o[i]=(a*x[i]+b)*x[i]+c; for (i=0;i<MAX;i+=2) o[i]=(a*x[i]+b)*x[i]+c; o[i+1]=(a*x[i+1]+b)*x[i+1]+c; Penn ESE535 Spring DeHon 56

Unrolling If MAX=4: o[0]=(a*x[0]+b)*x[0]+c; o[1]=(a*x[1]+b)*x[1]+c; o[2]=(a*x[2]+b)*x[2]+c; o[3]=(a*x[3]+b)*x[3]+c; Benefits? for (i=0;i<MAX;i++) o[i]=(a*x[i]+b)*x[i]+c; for (i=0;i<MAX;i+=2) o[i]=(a*x[i]+b)*x[i]+c; o[i+1]=(a*x[i+1]+b)*x[i+1]+c; Penn ESE535 Spring DeHon 57

Unrolling If MAX=4: o[0]=(a*x[0]+b)*x[0]+c; o[1]=(a*x[1]+b)*x[1]+c; o[2]=(a*x[2]+b)*x[2]+c; o[3]=(a*x[3]+b)*x[3]+c; Create larger basic block. More scheduling freedom. More parallelism. for (i=0;i<MAX;i++) o[i]=(a*x[i]+b)*x[i]+c; for (i=0;i<MAX;i+=2) o[i]=(a*x[i]+b)*x[i]+c; o[i+1]=(a*x[i+1]+b)*x[i+1]+c; Penn ESE535 Spring DeHon 58

Penn ESE535 Spring DeHon 59 Flow Review

Penn ESE535 Spring DeHon 60 Summary Language (here C) defines meaning of operations Dataflow connection of computations Sequential precedents constraints to preserve Create basic blocks Link together Optimize –Merge into hyperblocks with if-conversion –Pipeline, unroll Result is dataflow graph –(can schedule to RTL)

Penn ESE535 Spring DeHon 61 Big Ideas: Semantics Dataflow Mux-conversion Specialization Common-case optimization

Penn ESE535 Spring DeHon 62 Admin Project Assignment HW8 Reading for Monday on web