Digital Logic Design 234262 Alex Bronstein Lecture 2: Pipelines.

Slides:



Advertisements
Similar presentations
1 A latch is a pair of cross-coupled inverters –They can be NAND or NOR gates as shown –Consider their behavior (each step is one gate delay in time) –From.
Advertisements

Tutorial 2 Sequential Logic. Registers A register is basically a D Flip-Flop A D Flip Flop has 3 basic ports. D, Q, and Clock.
Lecture 4: CPU Performance
Logic Synthesis – 3 Optimization Ahmed Hemani Sources: Synopsys Documentation.
Introduction to CMOS VLSI Design Sequential Circuits
1 Lecture 28 Timing Analysis. 2 Overview °Circuits do not respond instantaneously to input changes °Predictable delay in transferring inputs to outputs.
Assume array size is 256 (mult: 4ns, add: 2ns)
ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Dr. Shi Dept. of Electrical and Computer Engineering.
ENGIN112 L28: Timing Analysis November 7, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 28 Timing Analysis.
Give qualifications of instructors: DAP
EE141 © Digital Integrated Circuits 2nd Timing Issues 1 Latch-based Design.
Sequential Logic 1  Combinational logic:  Compute a function all at one time  Fast/expensive  e.g. combinational multiplier  Sequential logic:  Compute.
Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase.
CS 151 Digital Systems Design Lecture 32 Hazards
CS 151 Digital Systems Design Lecture 28 Timing Analysis.
CS3350B Computer Architecture Winter 2015 Lecture 5.2: State Circuits: Circuits that Remember Marc Moreno Maza [Adapted.
Lecture 5. Sequential Logic 3 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System Education & Research.
Chapter 8 Problems Prof. Sin-Min Lee Department of Mathematics and Computer Science.
ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.
Timing Analysis Section Delay Time Def: Time required for output signal Y to change due to change in input signal X Up to now, we have assumed.
ELEC692 VLSI Signal Processing Architecture Lecture 3
How Computers Work Lecture 12 Page 1 How Computers Work Lecture 12 Introduction to Pipelining.
CS 61C: Great Ideas in Computer Architecture Sequential Elements, Synchronous Digital Systems 1 Instructors: Vladimir Stojanovic & Nicholas Weaver
EE3A1 Computer Hardware and Digital Design Lecture 9 Pipelining.
CS 110 Computer Architecture Lecture 9: Finite State Machines, Functional Units Instructor: Sören Schwertfeger School of.
Digital Design: Sequential Logic Principles
Digital Logic Design Alex Bronstein.
Digital Logic Design Alex Bronstein Lecture 3: Memory and Buses.
Computer Architecture & Operations I
Digital Integrated Circuits A Design Perspective
Lecture 11: Sequential Circuit Design
Chapter 3 Digital Design and Computer Architecture, 2nd Edition
CSE 140 – Discussion 7 Nima Mousavi.
Speed up on cycle time Stalls – Optimizing compilers for pipelining
Flip Flops Lecture 10 CAP
Chapter 7 Designing Sequential Logic Circuits Rev 1.0: 05/11/03
Timing and Verification
Sequential Logic Combinational logic:
Sequential circuit design with metastability
VLSI Testing Lecture 5: Logic Simulation
6. Sequential circuits Rocky K. C. Chang 17 October 2017.
CS Spring 2008 – Lec #17 – Retiming - 1
6. Sequential circuits (v2)
Instructor: Alexander Stoytchev
Introduction to CMOS VLSI Design Lecture 10: Sequential Circuits
332:437 Lecture 12 Finite State Machine Design
Sequential Circuits: Latches
Inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #21 State Elements: Circuits that Remember Hello to James Muerle in the.
Guerilla Section #4 10/05 SDS & MIPS Datapath
Elec 2607 Digital Switching Circuits
COMP541 Flip-Flop Timing Montek Singh Feb 23, 2010.
Day 26: November 1, 2013 Synchronous Circuits
Topics Performance analysis..
ECE 434 Advanced Digital System L05
Serial versus Pipelined Execution
CSE 370 – Winter Sequential Logic - 1
Topics Clocking disciplines. Flip-flops. Latches..
Sequential Circuits: Latches
Pipeline Principle A non-pipelined system of combination circuits (A, B, C) that computation requires total of 300 picoseconds. Comb. logic.
Chapter 8. Pipelining.
Lecture 14: Performance Optimization
CSE 370 – Winter Sequential Logic-2 - 1
ECE 352 Digital System Fundamentals
ECE 352 Digital System Fundamentals
COMP541 Sequential Logic Timing
Lecture 19 Logistics Last lecture Today
Fast Min-Register Retiming Through Binary Max-Flow
Instructor: Michael Greenbaum
Lecture 3: Timing & Sequential Circuits
Presentation transcript:

Digital Logic Design 234262 Alex Bronstein Lecture 2: Pipelines

Reminder: CL Timing Contamination delay (tCD) – minimum time following input change during which validity of previous output is guaranteed Propagation delay (tPD) – maximum time following input change after which the validity of new output is guaranteed Vin VIH VIL t Show hazard Vout tPD tPD VOH VOL t tCD tCD

Reminder: CL Timing Contamination delay (tCD) – minimum time following input change during which validity of previous output is guaranteed Propagation delay (tPD) – maximum time following input change after which the validity of new output is guaranteed in t Show hazard tPD tPD out t tCD tCD

Reminder: DFF Timing Positive edge-triggered D flip-flop (DFF) Logical characteristic: Output Q assumes the value of D after CLK rise Timing characteristic: If input is held constant tSU before and tH after clock rise then old output will stay valid at least tcCQ and new output will become valid at most tpCQ after clock rise D Q CLK CLK tSU tH D NEW tcCQ Q OLD NEW tpCQ

Latency and Throughput Latency: amount of time passing between the beginning of a calculation and its end (input-to-output delay) Throughput: rate at which calculations are performed (=calculations per unit of time)

Latency and Throughput Example: sum of N numbers t0=tPD of each adder ∑ ∑ ∑ ∑ ∑ MEM ∑ ∑ Sequential implementation ∑ Latency = N×t0 Throughput = 1/(N×t0) Combinational implementation Latency = log2N×t0 Throughput = 1/(log2N×t0)

Hardware Underutilization 4 adders are idle in [t0,3t0] 2 adders are idle in [0,t0] and [2t0,3t0] 1 adder is idle in [0,2t0] 66.7% “Unemployment”! ∑ ∑ ∑ ∑ ∑ ∑ t t0 2t0 3t0 INPUT OUTPUT

Hardware Underutilization ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ 33% utilization

Can we do better? Hardware can be better utilized if The calculation has to be evaluated on many inputs (potentially: an infinite input stream) System is compute-bound rather than I/O-bound: input throughput is higher than the calculation throughput How? Supply new input after t0 time

Better Hardware Utilization ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ 100% utilization asymptotically But… violation of CL timing rules! Input must stay valid until output is produced

Better Hardware Utilization ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ ∑ Solution: Latch intermediate results into registers

Hardware Underutilization Assembly Line Invention erroneously attributed to H. Ford. Earliest instance: Portsmouth Block Mills, ca. 1803

Wavefront Propagation

Pipeline Pipeline: sequential computational process operating simultaneously on multiple computations Clock period can be lowered to t0 (a little more counting register delays) New input is supplied at every clock Asymptotically (infinite input stream): Throughput  1/t0 Hardware fully utilized Very common in Practically any modern CPU Fast computationally-intensive logic such as graphics, DSP, etc.

Pipeline Timing ∑ ∑ ∑ IN a b c OUT tCLK 2tCLK 3tCLK 4tCLK 5tCLK tCLK 2tCLK 3tCLK 4tCLK 5tCLK tCLK ≥ tpCQ(REG) + tPD(CL) + tSU(REG)

Pipeline Design Naïve approach Design CL circuit first Break the flow by adding registers until max throughput is achieved Problem: unbalanced paths! xn x ∑ ∑ x+y xn-1+yn-1 yn y ∑ xn-2+yn-2+zn-1 xn-2+yn-2+zn-2 ∑ x+y+z zn zn-1 z

k-Pipeline k-pipeline is a feed-forward logical circle Comprising only CL and registers Every path from input to output contains exactly k registers CL is a 0-pipeline Pure pipeline is a k-pipeline with only one path from input to output Usually divided into stages More complicated pipelines may exist (e.g., with feedback loops or paths of unequal length)

k-Pipeline Design Input: CL circuit (0-pipeline) with correct functionality Output: k-pipeline with correct functionality Method: iterative application of two rules

Rule 1: Add Delay Add a register to every output of the entire system Same functionality Result delayed by one cycle W.l.o.g. can be applied to inputs instead of outputs k-pipeline (k+1)-pipeline

Rule 2: Retiming For a given subsystem, remove one register from every output and add one register to every input Functionally indistinguishable subsystems Actual evaluation will now occur later W.l.o.g. inputs can be switched with outputs k-pipeline k-pipeline

Example: Pure Pipeline Latency=2T Throughput=1/2T T T ADD 1-pipeline Latency=2T Throughput=1/2T T ADD 2-pipeline Latency=4T Throughput=1/2T T RETIME 2-pipeline Latency=2T Throughput=1/T T

Example: k-Pipeline 2T T 2T T 2T T 2T T 0-pipeline Latency=4T Throughput=1/4T 1-pipeline 2T T Latency=4T Throughput=1/4T ADD 2-pipeline 2T T Latency=8T Throughput=1/4T 2-pipeline 2T T Latency=6T Throughput=1/3T ADD RETIME

Example: k-Pipeline (cont.) Latency=6T Throughput=1/3T 2-pipeline 2T T Latency=4T Throughput=1/2T RETIME 3-pipeline 2T T Latency=6T Throughput=1/2T 3-pipeline 2T T Latency=6T Throughput=1/2T ADD RETIME

k-Pipeline Timing Pipeline is a sequential logic circuit Clock period is determined by tCLK ≥ tpCQ(REG) + tPD(CL) + tSU(REG) in every path from register to register Usually, tPD(CL) >> tpCQ(REG), tSU(REG) hence tCLK max{tPD(CL)} In the example: tCLK ≥ tpCQ(REG) + max{tPD(A), tPD(C)+tPD(B)} + tSU(REG)  max{tPD(A), tPD(C)+tPD(B)} Additionally, hold conditions must be satisfied tcCQ(REG) + tCD(CL) ≥ tH(REG) (normally tcCQ > tH by design) In the example: min{tCD(A), tCD(C)+tCD(B)} ≥ tH(REG) A B C

k-Pipeline Throughput Ignoring register times tCLK=2T Non asymptotic throughput: 1 input: 1/2tCLK = 1/4T 2 inputs: 2/(2tCLK+tCLK) = 2/6T 3 inputs: 3/(2tCLK+2tCLK) = 3/8T 4 inputs: 3/(2tCLK+3tCLK) = 4/10T … N inputs: 3/(2tCLK+(N-1)tCLK) = N/((N+1)T) Asymptotic throughput (when N): 1/T Input Stage 1 Output N= 0 1 2 3 4 5 6

k-Pipeline Latency Clock period tCLK determined by maximum CL delay If all outputs have registers and k>0: Latency = k tCLK If all inputs have registers and k>0: