Lecture 14: Performance Optimization


UCSD ECE 111, Prof. Farinaz Koushanfar, Fall 2017
Some slides are courtesy of Prof. Patrick Schaumont

Area-Delay Product
[Figure: area (e.g. slices) plotted against delay (= 1/performance, e.g. cycles), showing optimal and suboptimal design points. Two further panels mark the best choice under an area constraint and the best choice under a delay constraint.]

Optimizing Performance - Overview
Timing constraints and timing analysis
Performance factors of a digital design: latency and throughput; Delay = Clock Period * Cycle Count; what determines the minimum clock period?
Performance optimizations you can do in Verilog: parallel computations, pipelining, retiming
Summary: optimizing area and performance

Timing
A flip-flop samples D at the clock edge, so D must be stable when it is sampled. As with a photograph, D must be stable around the clock edge. If it is not, metastability can occur.

Input Timing Constraints
Setup time: tsetup = time before the clock edge during which data must be stable (i.e. not changing)
Hold time: thold = time after the clock edge during which data must be stable
Aperture time: ta = time around the clock edge during which data must be stable (ta = tsetup + thold)

Output Timing Constraints
Propagation delay: tpcq = time after the clock edge by which the output Q is guaranteed to be stable (i.e., to have stopped changing)
Contamination delay: tccq = time after the clock edge at which Q might become unstable (i.e., start changing)

Dynamic Discipline
The inputs of a synchronous sequential circuit must be stable during the aperture (setup and hold) time around the clock edge. Specifically, inputs must be stable from at least tsetup before the clock edge until at least thold after the clock edge.
The delay between registers has a minimum and a maximum value, which depend on the delays of the circuit elements.

Setup Time Constraint
Depends on the maximum delay from register R1 through the combinational logic to R2. The input to register R2 must be stable at least tsetup before the clock edge.
Tc ≥ tpcq + tpd + tsetup
tpd ≤ Tc - (tpcq + tsetup)
(tpcq + tsetup) is the sequencing overhead.

Hold Time Constraint
Depends on the minimum delay from register R1 through the combinational logic to R2. The input to register R2 must be stable for at least thold after the clock edge.
thold < tccq + tcd
tcd > thold - tccq

Timing Analysis
Timing characteristics of the flip-flops: tccq = 30 ps, tpcq = 50 ps, tsetup = 60 ps, thold = 70 ps
Timing characteristics of each gate: tpd = 35 ps, tcd = 25 ps
Longest path (three gate delays): tpd = 3 x 35 ps = 105 ps
Shortest path (one gate delay): tcd = 25 ps
Setup time constraint: Tc ≥ (50 + 105 + 60) ps = 215 ps, so fc = 1/Tc = 4.65 GHz
Hold time constraint: tccq + tcd > thold? (30 + 25) ps > 70 ps? No!

Timing Analysis: add buffers to the short paths
The flip-flop and gate characteristics are unchanged.
Longest path: tpd = 3 x 35 ps = 105 ps
Shortest path (now two gate delays): tcd = 2 x 25 ps = 50 ps
Setup time constraint: Tc ≥ (50 + 105 + 60) ps = 215 ps, so fc = 1/Tc = 4.65 GHz
Hold time constraint: tccq + tcd > thold? (30 + 50) ps > 70 ps? Yes!
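
The setup and hold checks in this example are plain arithmetic, so they are easy to script. The following is a minimal sketch in Python using the numbers from the example; the function and constant names are illustrative and do not come from the slides.

```python
# Flip-flop and per-gate timing values from the example (all in picoseconds).
T_CCQ, T_PCQ, T_SETUP, T_HOLD = 30, 50, 60, 70
GATE_TPD, GATE_TCD = 35, 25


def min_clock_period(tpd):
    """Setup constraint: Tc >= tpcq + tpd + tsetup."""
    return T_PCQ + tpd + T_SETUP


def hold_ok(tcd):
    """Hold constraint: tccq + tcd > thold."""
    return T_CCQ + tcd > T_HOLD


tpd_longest = 3 * GATE_TPD               # longest path: three gate delays
tc_min = min_clock_period(tpd_longest)   # 215 ps
fc_ghz = 1000.0 / tc_min                 # 1 / (period in ps) expressed in GHz

print("Tc,min =", tc_min, "ps, fc =", round(fc_ghz, 2), "GHz")
print("hold met, original short path:", hold_ok(1 * GATE_TCD))
print("hold met, buffered short path:", hold_ok(2 * GATE_TCD))
```

Adding the buffer changes only the shortest path, so the setup result is the same in both cases while the hold result flips.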

Clock Skew
The clock does not arrive at all registers at the same time. Skew is the difference between two clock edges. Perform a worst-case analysis to guarantee that the dynamic discipline is not violated for any register (there are many registers in a system!).

Setup Time Constraint with Skew
In the worst case, CLK2 is earlier than CLK1.
Tc ≥ tpcq + tpd + tsetup + tskew
tpd ≤ Tc - (tpcq + tsetup + tskew)

Hold Time Constraint with Skew
In the worst case, CLK2 is later than CLK1.
tccq + tcd > thold + tskew
tcd > thold + tskew - tccq
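
To see how skew tightens both constraints, the sketch below reuses the flip-flop values from the timing analysis example and adds a skew of 20 ps. The 20 ps figure is a hypothetical value chosen only for illustration; the slides do not give a skew number, and the function names are likewise illustrative.

```python
# Flip-flop values from the timing analysis example (picoseconds).
T_CCQ, T_PCQ, T_SETUP, T_HOLD = 30, 50, 60, 70
T_SKEW = 20  # hypothetical skew, not taken from the slides


def min_clock_period_with_skew(tpd, tskew):
    """Setup with skew: Tc >= tpcq + tpd + tsetup + tskew."""
    return T_PCQ + tpd + T_SETUP + tskew


def hold_ok_with_skew(tcd, tskew):
    """Hold with skew: tccq + tcd > thold + tskew."""
    return T_CCQ + tcd > T_HOLD + tskew


def tcd_lower_bound(tskew):
    """tcd must be strictly greater than thold + tskew - tccq."""
    return T_HOLD + tskew - T_CCQ


tc_min = min_clock_period_with_skew(105, T_SKEW)  # longest path of 105 ps
buffered_ok = hold_ok_with_skew(50, T_SKEW)       # buffered short path of 50 ps
bound = tcd_lower_bound(T_SKEW)

print("Tc,min with skew =", tc_min, "ps")
print("hold met with tcd = 50 ps:", buffered_ok)
print("tcd must exceed", bound, "ps")
```

With this assumed skew, the minimum clock period grows by the skew, and the buffered short path that satisfied the hold constraint without skew no longer does.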

Violating the Dynamic Discipline
Asynchronous inputs (for example, user inputs) might violate the dynamic discipline.

Metastability
Bistable devices have two stable states and a metastable state between them. A flip-flop has two stable states (1 and 0) and one metastable state. If a flip-flop lands in the metastable state, it could stay there for an undetermined amount of time.

Flip-Flop Internals
A flip-flop has feedback: if Q is somewhere between 1 and 0, the cross-coupled gates drive the output to either rail (1 or 0). A signal is metastable if it has not yet resolved to 1 or 0. If the flip-flop input changes at a random time, the probability that the output Q is still metastable after waiting some time t is:
P(tres > t) = (T0/Tc) e^(-t/τ)
tres: time to resolve to 1 or 0
T0, τ: properties of the circuit

Metastability
Intuitively, in P(tres > t) = (T0/Tc) e^(-t/τ):
T0/Tc: probability that the input changes at a bad time (during the aperture)
τ: time constant for how fast the flip-flop moves away from metastability
In short, if a flip-flop samples a metastable input and you wait long enough (t), the output will have resolved to 1 or 0 with high probability.

Synchronizers
Asynchronous inputs are inevitable (user interfaces, systems with different clocks interacting, etc.). The goal of a synchronizer is to make the probability of failure (the output Q still being metastable) low. A synchronizer cannot make the probability of failure 0.

Synchronizer Internals
A synchronizer is built with two back-to-back flip-flops. Suppose D is transitioning when sampled by F1. The internal signal D2 then has (Tc - tsetup) time to resolve to 1 or 0.

Synchronizer Probability of Failure
For each sample, the probability of failure is:
P(failure) = (T0/Tc) e^(-(Tc - tsetup)/τ)

Synchronizer Mean Time Between Failures
If the asynchronous input changes once per second, the probability of failure per second is P(failure). If the input changes N times per second, the probability of failure per second is:
P(failure)/second = (N T0/Tc) e^(-(Tc - tsetup)/τ)
The synchronizer fails, on average, once every 1/[P(failure)/second] seconds. This is called the mean time between failures (MTBF):
MTBF = 1/[P(failure)/second] = (Tc/(N T0)) e^((Tc - tsetup)/τ)

Example Synchronizer
Suppose: Tc = 1/500 MHz = 2 ns, τ = 200 ps, T0 = 150 ps, tsetup = 100 ps, N = 10 events per second
What is the probability of failure? What is the MTBF?
P(failure) = (150 ps / 2 ns) e^(-(1.9 ns)/(200 ps)) = 5.6 × 10^-6
P(failure)/second = 10 × (5.6 × 10^-6) = 5.6 × 10^-5 / second
MTBF = 1/[P(failure)/second] ≈ 5 hours
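
The synchronizer example can be reproduced numerically. The sketch below evaluates the two formulas with the values given in the example; the function names are illustrative, and all times are in seconds.

```python
import math


def p_failure(tc, t0, tsetup, tau):
    """P(failure) = (T0/Tc) * exp(-(Tc - tsetup)/tau) for one sampled event."""
    return (t0 / tc) * math.exp(-(tc - tsetup) / tau)


def mtbf_seconds(n, tc, t0, tsetup, tau):
    """MTBF = 1 / (N * P(failure)), with N input events per second."""
    return 1.0 / (n * p_failure(tc, t0, tsetup, tau))


# Values from the example: Tc = 2 ns, T0 = 150 ps, tsetup = 100 ps, tau = 200 ps.
TC, T0, TSETUP, TAU, N = 2e-9, 150e-12, 100e-12, 200e-12, 10

p = p_failure(TC, T0, TSETUP, TAU)
mtbf_hours = mtbf_seconds(N, TC, T0, TSETUP, TAU) / 3600.0

print("P(failure) per event =", p)
print("MTBF in hours =", mtbf_hours)
```

Because the waiting time appears in the exponent, small changes in Tc or τ move the MTBF by large factors, which is why the example is worth recomputing for other clock rates.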

Performance (Delay) of a design
There are two common definitions for the performance of a design:
Latency: the time it takes to compute an output starting from a given input
Throughput: the rate at which new outputs are produced (or the rate at which new inputs are consumed)
Depending on the application, you will need to optimize one or the other. In these slides both latency and throughput are expressed as a time (seconds); throughput is given as the time between successive inputs. If you find a number in cycles, you have to find the clock period T before you know the throughput or latency.
Performance = 1 / Throughput or 1 / Latency
We will use the generic term Delay: Delay = 1 / Performance

Delay of a design: Latency
Latency = (cycles from input to output) * (clock period)
Example: the system clock frequency is 20 MHz (= 50 ns period), and each output is available 10 clock cycles after a new input is provided.
Latency = 10 * 50 ns = 500 ns

Delay of a design: Throughput
Throughput = (cycles between inputs) * (clock period)
Example: the system clock frequency is 20 MHz, and a new input is accepted every 5 clock cycles.
Throughput = 5 * 50 ns = 250 ns
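
Both examples apply the same conversion from cycles to time. A minimal sketch, with illustrative function names, that reproduces the two results:

```python
def clock_period_ns(f_mhz):
    """Clock period in ns for a clock frequency given in MHz."""
    return 1000.0 / f_mhz


def delay_ns(cycles, f_mhz):
    """Delay = cycle count * clock period (used for latency and throughput)."""
    return cycles * clock_period_ns(f_mhz)


latency = delay_ns(10, 20)     # 10 cycles from input to output at 20 MHz
throughput = delay_ns(5, 20)   # a new input every 5 cycles at 20 MHz

print("latency =", latency, "ns")
print("throughput =", throughput, "ns")
```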

Delay = Cycle Count * Clock Period
Delay (either latency or throughput) has two components:
Cycle Count: controlled by the kind of Verilog that the designer writes. The cycle count is fixed by the Verilog code.
Clock Period: determined by the synthesis tools that map the Verilog into an implementation. For a given design (Verilog), the clock period is fixed by technology parameters, e.g. gate delay.

Since a designer cannot choose the clock period directly, the designer constrains it: "Tools, I want you to implement this circuit with a clock period smaller than 25 ns." The tools will try to meet this constraint, but they may or may not achieve the desired clock period. If the tools cannot meet the constraint, they report a timing violation.

Optimizing the Delay
Delay = Cycle Count * Clock Period
So to decrease the system delay, we can do two things: decrease the cycle count, or decrease the clock period. Both options can be influenced by the Verilog designer. We will focus on how to influence the clock period (which seems the least obvious to do). Let's first consider the factors in a digital design that determine the minimum clock period.

Minimum Clock Period
[Figure: register A drives combinational logic, which drives register B; both registers are clocked by CLK.]
Tclk,min = Tclk->Q + TLogic + TRouting + TSetup
Tclk->Q: the time needed for the output of a flip-flop to switch to a new value after a clock edge has occurred.
TLogic + TRouting: the time needed for the logic to calculate a new output. The delay is caused by gates as well as wires.
TSetup: the time needed for the flip-flop to capture stable input data at the next clock edge. The next clock edge cannot come before the data has been stable for TSetup (the dashed line in the slide's timing diagram).

Minimum Clock Period
The timing of the system is OK when the actual clock period Tclk is larger than Tclk,min.
The margin between the actual clock period and the minimum clock period is called slack: Tslack = Tclk - Tclk,min
If the slack is negative, the system has a timing violation. Such a system will not perform as expected, since its clock frequency is too high.
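
The slack check is a one-line calculation once the four delay terms are known. The sketch below uses hypothetical delay values in picoseconds (the slides give none) to show one passing and one failing clock constraint; all names and numbers are assumptions for illustration.

```python
def tclk_min(t_clk_to_q, t_logic, t_routing, t_setup):
    """Tclk,min = Tclk->Q + TLogic + TRouting + TSetup."""
    return t_clk_to_q + t_logic + t_routing + t_setup


def slack(t_clk, t_min):
    """Tslack = Tclk - Tclk,min; a negative value is a timing violation."""
    return t_clk - t_min


# Hypothetical path delays in ps, chosen only for illustration.
t_min = tclk_min(t_clk_to_q=600, t_logic=5200, t_routing=2100, t_setup=500)

slack_at_10ns = slack(10000, t_min)  # 100 MHz clock
slack_at_8ns = slack(8000, t_min)    # 125 MHz clock

print("Tclk,min =", t_min, "ps")
print("slack at 10 ns:", slack_at_10ns, "ps")
print("slack at 8 ns:", slack_at_8ns, "ps (negative means a timing violation)")
```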

Minimum Clock Period
Once the technology is chosen, Tclk->Q and TSetup are fixed (the original slide shows an example from the Spartan 3E datasheet).
However, even after the technology is chosen, the designer can still influence TLogic and TRouting by modifying the Verilog code. Thus, if we want to decrease the minimum clock period, we need to consider these terms.

Delay optimization
We will consider three techniques that can be used to decrease the combinational delay of a system. If used properly, these may decrease the delay of a digital design.
Technique 1: Parallel Computations
Technique 2: Pipelining
Technique 3: Retiming

Technique 1: Parallel Computations
Hardware is concurrent, so it is easy to do several things in parallel. However, different combinational networks can have different delays while still implementing the same function.
[Figure: q = a + b + c + d built two ways. Parallel: (a + b) and (c + d) are computed side by side and then added. Not parallel: the three adders are chained one after another.]
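
The difference between the two adder networks can be expressed as the number of adders on the longest path. The sketch below is a simplified model that counts adder levels only and ignores routing and carry-chain effects; the function names are illustrative.

```python
def chain_depth(n_inputs):
    """Adders on the longest path when n inputs are summed one after another."""
    return n_inputs - 1


def tree_depth(n_inputs):
    """Adder levels in a balanced tree: ceil(log2(n)), computed with integers."""
    return (n_inputs - 1).bit_length()


def sum_chain(values):
    """((a + b) + c) + d: the non-parallel form."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def sum_tree(values):
    """(a + b) + (c + d): the parallel form, applied recursively."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_tree(values[:mid]) + sum_tree(values[mid:])


inputs = [3, 5, 7, 11]
print("same result:", sum_chain(inputs) == sum_tree(inputs))
print("adder levels, chain vs tree:", chain_depth(4), "vs", tree_depth(4))
```

For four inputs the chain has three adders in series and the tree has two levels; the gap widens as the number of inputs grows.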

Parallel Computations: Datapath Example
We need to write our Verilog such that as many gates as possible work in parallel.
Example: Karatsuba Multiplication. A 16-bit x 16-bit multiplication can be written using 8-bit multiplications as follows:
    q = a * b                 // a, b are 16 bit
    a = (a1 << 8) + a2        // a1, a2 are 8 bit
    b = (b1 << 8) + b2        // b1, b2 are 8 bit
    q = ((a1 << 8) + a2) * ((b1 << 8) + b2)
    q = ((a1 * b1) << 16) + ((a1 * b2 + a2 * b1) << 8) + (a2 * b2)
The parentheses around the shifts are required, because + binds more tightly than << in Verilog. (This form uses four 8-bit multiplications; Karatsuba's algorithm proper rearranges the middle term so that three suffice.)

Parallel Computations: Datapath Example
Karatsuba Multiplication creates a structure that enables parallel addition of partial products. Note that this can be applied recursively: an 8 bit * 8 bit multiplication can be split into four 4 bit * 4 bit multiplications, etc.
[Figure: a 16 bit x 16 bit multiplier built from four 8 bit x 8 bit multipliers, whose 16-bit partial products are added to form the 32-bit result.]
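
The decomposition into 8-bit partial products can be checked in software before committing it to hardware. The sketch below is a Python model, not the slide's Verilog; the function name and the choice of test operands are illustrative.

```python
import random


def mul16_from_8bit(a, b):
    """Multiply two 16-bit values using four 8-bit x 8-bit partial products."""
    a1, a2 = a >> 8, a & 0xFF   # high and low bytes of a
    b1, b2 = b >> 8, b & 0xFF   # high and low bytes of b
    # The four products are independent, so hardware can compute them in parallel.
    p_hh, p_hl, p_lh, p_ll = a1 * b1, a1 * b2, a2 * b1, a2 * b2
    return (p_hh << 16) + ((p_hl + p_lh) << 8) + p_ll


random.seed(0)
samples = [(0, 0), (0xFFFF, 0xFFFF), (0x1234, 0xABCD)]
samples += [(random.randrange(1 << 16), random.randrange(1 << 16)) for _ in range(1000)]

all_match = all(mul16_from_8bit(a, b) == a * b for a, b in samples)
print("decomposition matches a * b on all samples:", all_match)
```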

Parallel Computations: Control Example
Nested if-then-else statements in Verilog create a priority encoder. Each later branch depends on all earlier conditions, and gates with more inputs are slower.
    module selection(q, a, b);
      output [3:0] q;
      reg    [3:0] q;
      input  [3:0] a, b;
      // Combinational logic, so blocking assignments are used.
      always @(*) begin
        q = 0;
        if      (a[0]) q[0] = b[0];
        else if (a[1]) q[1] = b[1];
        else if (a[2]) q[2] = b[2];
        else if (a[3]) q[3] = b[3];
      end
    endmodule

Parallel Computations: Control Example
If you know (from the design specs) that this priority encoding is not needed, you can skip the else-if. The result is four 2-input AND gates in parallel.
    module selection(q, a, b);
      output [3:0] q;
      reg    [3:0] q;
      input  [3:0] a, b;
      // Each bit is decided independently of the others.
      always @(*) begin
        q = 0;
        if (a[0]) q[0] = b[0];
        if (a[1]) q[1] = b[1];
        if (a[2]) q[2] = b[2];
        if (a[3]) q[3] = b[3];
      end
    endmodule
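
Dropping the else-if is only safe when the specification guarantees that at most one bit of a is set. The sketch below is a behavioral model in Python of the two selection modules (the function names are illustrative); it shows that the two versions agree for one-hot a and differ otherwise.

```python
def select_priority(a, b):
    """Model of the if / else-if version: only the lowest set bit of a is served."""
    for i in range(4):
        if (a >> i) & 1:
            return b & (1 << i)
    return 0


def select_parallel(a, b):
    """Model of the independent-if version: q[i] = a[i] AND b[i] for every bit."""
    return a & b & 0xF


one_hot_or_zero = [0b0000, 0b0001, 0b0010, 0b0100, 0b1000]
agree = all(
    select_priority(a, b) == select_parallel(a, b)
    for a in one_hot_or_zero
    for b in range(16)
)
print("identical when at most one bit of a is set:", agree)

# With two bits of a set, the priority version serves only bit 0.
print(bin(select_priority(0b0011, 0b1111)), bin(select_parallel(0b0011, 0b1111)))
```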

Technique 2: Pipelining
Cut a long combinational path in half by inserting a register. This increases the latency cycle count of the design: to get from the input to the output, you will need an extra clock cycle.
[Figure: the adder tree q = (a + b) + (c + d) before and after inserting a register between the first and second adder levels; Tlogic after pipelining is shorter than Tlogic before.]

Technique 2: Pipelining in Verilog
It's easy to pipeline a Verilog program, but you have to pay attention to insert pipeline registers consistently!
Before pipelining:
    module add4a(q, a, b, c, d);
      output [127:0] q;
      input  [127:0] a, b, c, d;
      assign q = a + b + c + d;
    endmodule
After pipelining:
    module add4b(q, a, b, c, d, clk);
      output [127:0] q;
      input  [127:0] a, b, c, d;
      input          clk;
      reg    [127:0] pipe1, pipe2;
      // Both partial sums are registered in the same cycle.
      always @(posedge clk) begin
        pipe1 <= a + b;
        pipe2 <= c + d;
      end
      assign q = pipe1 + pipe2;
    endmodule

Technique 2: Pipelining in Verilog
The following example is WRONG.
Before pipelining:
    module add4a(q, a, b, c, d);
      output [127:0] q;
      input  [127:0] a, b, c, d;
      assign q = a + b + c + d;
    endmodule
After inconsistent pipelining:
    module add4c(q, a, b, c, d, clk);
      output [127:0] q;
      input  [127:0] a, b, c, d;
      input          clk;
      reg    [127:0] pipe1;
      always @(posedge clk) begin
        pipe1 <= a + b;
      end
      // This adds inputs from cycle N to a partial result from cycle N-1.
      assign q = pipe1 + c + d;
    endmodule
Inconsistent pipelining means that the pipelined module does not, in general, generate the same outputs as the original module, even after accounting for the latency effects of the pipeline registers.
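
The effect of inconsistent pipelining shows up immediately in a cycle-by-cycle model. The sketch below is a Python behavioral model of add4a, add4b and add4c, under two assumptions that are not in the slides: the pipeline registers start at 0, and the 128-bit wrap-around is ignored because the test values are small.

```python
def add4a(inputs):
    """Combinational reference: q = a + b + c + d in the same cycle."""
    return [a + b + c + d for (a, b, c, d) in inputs]


def add4b(inputs):
    """Consistent pipelining: both partial sums are registered."""
    pipe1 = pipe2 = 0                    # assumed reset value of the registers
    outputs = []
    for a, b, c, d in inputs:
        outputs.append(pipe1 + pipe2)    # q as seen during this cycle
        pipe1, pipe2 = a + b, c + d      # registers update at the clock edge
    return outputs


def add4c(inputs):
    """Inconsistent pipelining: only a + b is registered."""
    pipe1 = 0
    outputs = []
    for a, b, c, d in inputs:
        outputs.append(pipe1 + c + d)    # mixes cycle N-1 and cycle N data
        pipe1 = a + b
    return outputs


stimulus = [(1, 2, 3, 4), (10, 20, 30, 40), (100, 200, 300, 400)]
reference = add4a(stimulus)
delayed_reference = [0] + reference[:-1]  # reference shifted by one cycle

print("add4a:", reference)
print("add4b:", add4b(stimulus))
print("add4c:", add4c(stimulus))
```

In this model add4b reproduces the reference delayed by one cycle, while add4c matches neither the reference nor its delayed version.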

Simple rules for consistent pipelining
Assume a network of modules (the modules can be combinational or sequential). The following rules show how to move pipeline registers around while avoiding inconsistent pipelining. [The network diagrams from the slides are not reproduced here.]
You can always add a register in front. It increases the latency of the network by one cycle, but the network will have the same functionality.
You can absorb a register at a single input if you recreate it at ALL the outputs of the module. This transformation changes neither the latency nor the functionality of the network. Think of moving a register over a fork (from in to out1 and out2).
You can move a register over another module: absorb the register at the module inputs and recreate it at the module outputs.
You can move a register over the last module: absorb the registers at the module inputs (in1, in2) and recreate one at the module output (out). Think of joining two registers.
All of the networks obtained this way have the same behavior.

Simple rules for consistent pipelining
We can add multiple registers at the front and then redistribute them using consistent pipelining.

Simple rules for consistent pipelining
Three 30 ns modules in series with no registers between them: Tclk,min = 90 ns, latency = 1 cycle, throughput = 1 cycle
The same three 30 ns modules with the pipeline registers redistributed between them: Tclk,min = 30 ns, latency = 3 cycles, throughput = 1 cycle
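
Converting these cycle counts to time shows what pipelining buys. The sketch below follows the slide in counting only the module delays (register overhead such as clock-to-Q and setup time is neglected); the function name and the returned dictionary layout are illustrative.

```python
def pipeline_metrics(stage_delays_ns, pipelined):
    """Tclk,min, latency and throughput for modules placed in series.

    Only module delays are counted, as on the slide; register overhead
    is neglected in this simplified model.
    """
    if pipelined:
        tclk_min = max(stage_delays_ns)      # the slowest stage sets the clock
        latency_cycles = len(stage_delays_ns)
    else:
        tclk_min = sum(stage_delays_ns)      # one long combinational path
        latency_cycles = 1
    return {
        "tclk_min_ns": tclk_min,
        "latency_ns": latency_cycles * tclk_min,
        "throughput_ns": 1 * tclk_min,       # one new input per cycle
    }


before = pipeline_metrics([30, 30, 30], pipelined=False)
after = pipeline_metrics([30, 30, 30], pipelined=True)
print("before:", before)
print("after: ", after)
```

Under this model the latency in nanoseconds is unchanged, while the time between successive inputs drops from 90 ns to 30 ns.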

Simple rules for consistent pipelining
Following these rules, you'll find that you cannot pipeline loops (i.e. increase the number of registers in a feedback path). Consider a module (inputs in1 and in2, output out) with a feedback path that contains a single register:
To pipeline, add a register at the front.
To move that pipeline register to the module output, ALL the inputs need to absorb a register.
In the resulting network, there is still only one register in the loop.
Before another register can be added, the register in the loop (drawn in red on the slide) will need to be moved around the loop.

Technique 3: Retiming (or Register Balancing)
A pipeline register can cut a piece of combinational logic into smaller pieces. This reduces Tclk,min for the entire design. Example: a 100 ns path is cut into two 50 ns stages.
Sometimes you will find that this partitioning is not nicely 50/50, for example when the 100 ns path is cut into a 90 ns stage and a 10 ns stage. In that case the pipeline register reduces Tclk,min only slightly, since the design has to be operated at the speed of the slowest stage.
To maximize the benefit of the (pipeline) registers in a module, they should be balanced so that each stage of combinational logic has the same logic delay. This is called retiming or register balancing. In the example, moving some logic from the first stage to the second stage turns the 90 ns / 10 ns split into 50 ns / 50 ns.
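
Register balancing can be pictured as choosing where along a chain of gates to place the register. The sketch below models the 100 ns path as ten gates of 10 ns each, which is a hypothetical breakdown chosen only to reproduce the 90/10 and 50/50 splits; real synthesis tools work on a netlist rather than a flat list, and the function names are illustrative.

```python
def stage_delays(gate_delays, cut):
    """Delays of the two stages when the register sits after gate number `cut`."""
    return sum(gate_delays[:cut]), sum(gate_delays[cut:])


def tclk_min(gate_delays, cut):
    """The slower of the two stages sets the minimum clock period."""
    return max(stage_delays(gate_delays, cut))


def best_cut(gate_delays):
    """Register position that minimizes the slower stage."""
    return min(range(1, len(gate_delays)), key=lambda c: tclk_min(gate_delays, c))


gates = [10] * 10   # hypothetical: ten 10 ns gates forming the 100 ns path

unbalanced = stage_delays(gates, 9)            # register after the ninth gate
balanced_cut = best_cut(gates)
balanced = stage_delays(gates, balanced_cut)

print("unbalanced stages:", unbalanced, "Tclk,min =", tclk_min(gates, 9))
print("balanced stages:  ", balanced, "Tclk,min =", tclk_min(gates, balanced_cut))
```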

Retiming is supported by synthesis tools
You only need to make sure you have sufficient registers available. This is a 16-bit adder for four numbers, but registers have been added at the inputs and the output.
    module add4d(q, a, b, c, d, clk);
      output [15:0] q;
      reg    [15:0] q, qa, qb, qc, qd;
      input  [15:0] a, b, c, d;
      input         clk;
      always @(posedge clk) begin
        // Input registers
        qa <= a;
        qb <= b;
        qc <= c;
        qd <= d;
        // Output register
        q  <= qa + qb + qc + qd;
      end
    endmodule

Retiming is supported by synthesis
Results without register balancing: 6.708 ns, 46 LUTs, 80 flip-flops (= 5 * 16)
To enable register balancing, open the synthesis properties and enable register balancing.
Results with register balancing: 5.816 ns, 46 LUTs, 93 flip-flops
That Area-Delay trade-off again ... !

Summary: Optimizing Area and Performance
[Figure: the area versus delay design space, with arrows labeled "Optimization" (reduce area, reduce delay) and "Trade-off" (reduce area, increase delay).]
Resource sharing: trade-off (reduce area, increase delay)
Constant propagation: optimization (reduce area, reduce delay)
Parallel computation (rewrite Verilog for parallelism): optimization or trade-off
Pipelining (add pipeline registers): trade-off
Retiming (redistribute pipeline registers): trade-off
The designer needs to control this design space by modifying the Verilog and adjusting tool options as needed.