ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.

Presentation transcript:

ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing

Techniques for improving performance
Performance is improved by exploiting parallelism. There are two ways:
- Pipelining
  - Pipeline latches are inserted to reduce the critical path delay.
  - The shorter critical path can be exploited either to increase the clock speed or sample speed, or to reduce power consumption at the same speed.
- Parallel processing
  - Multiple outputs are computed in parallel in a clock period using parallel hardware.
  - The effective sampling speed increases with the level of parallelism.
  - It can also be used to reduce power consumption.

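Example of a 3-tap FIR filter
[Figure: direct-form 3-tap FIR filter. The input x(n) passes through two delay elements (D) to give x(n-1) and x(n-2); the three samples are multiplied by a, b, and c and summed to give y(n).]
y(n) = a x(n) + b x(n-1) + c x(n-2)
Let $T_M$ be the delay of a multiplier and $T_A$ the delay of an adder. The critical path contains one multiplier and two adders, so the sampling period must satisfy $T_{sample} \ge T_M + 2T_A$.
How can we improve the performance?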

Pipelining of FIR digital filter
The filter is pipelined by adding latches.
[Figure: the original filter (critical path = $T_M + 2T_A$) and the pipelined filter with added latches (critical path = $T_M + T_A$).]

Schedule of events in the pipeline:

Clock  Input  Node 1         Node 2         Node 3   Output
0      x(0)   ax(0)+bx(-1)   -              -        -
1      x(1)   ax(1)+bx(0)    ax(0)+bx(-1)   cx(-2)   y(0)
2      x(2)   ax(2)+bx(1)    ax(1)+bx(0)    cx(-1)   y(1)
3      x(3)   ax(3)+bx(2)    ax(2)+bx(1)    cx(0)    y(2)
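The schedule above can be reproduced with a small behavioral model. This is a minimal sketch, not the lecture's own code: the function names `fir_direct` and `fir_pipelined` are hypothetical, inputs before clock 0 are assumed to be zero, and the latches are modeled as variables updated once per loop iteration.

```python
def fir_direct(x, a, b, c):
    """Unpipelined 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # x(n-1) and x(n-2), assumed zero before the first sample
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one extra latch on each edge of the cutset.

    Node 2 holds the latched value of node 1 and node 3 holds the latched
    c-tap product, so the output in clock n is y(n-1).
    """
    out = []
    x1 = x2 = 0
    node2 = node3 = None  # the pipeline latches are empty before clock 0
    for xn in x:
        # The output adder only sees latched values from the previous clock.
        out.append(None if node2 is None else node2 + node3)
        node1 = a * xn + b * x1          # computed during this clock
        node2, node3 = node1, c * x2     # latched for the next clock
        x1, x2 = xn, x1
    return out


samples = [1, 2, 3, 4, 5]
print(fir_direct(samples, 2, 3, 5))
print(fir_pipelined(samples, 2, 3, 5))
```

The pipelined output is the direct output delayed by one clock, which is the latency penalty paid for the shorter critical path.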

Pipelining properties
- M-level pipelining needs M-1 more delay elements in any path from input to output.
- The speed increases, with the following penalties:
  - increase in system latency
  - increase in the number of latches
- Pipelining latches can only be placed across a feed-forward cutset of the graph (signal flow graph, SFG, or data flow graph, DFG).

Cutset pipelining
- Cutset: a set of edges of a graph such that, if these edges are removed, the graph becomes disjoint.
- Feed-forward cutset: a cutset in which the data move in the forward direction on all the edges of the cutset, e.g. the dotted line in the previous slide.
- Latches can be placed on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm.
- Data move between the two disjoint sub-graphs only on the feed-forward cutset, so delaying or advancing the data movement along all edges of the cutset by the same amount of time does not change the behavior.
[Figure: sub-graphs SG1 and SG2 separated by a cutset, with delay D1.]

Feed-forward cutset
[Figure: three versions of a graph with nodes A1 to A6 and delay elements D: the original graph, a version marked "Not a valid pipelining", and a valid version annotated "Must place delays on all edges in the cutset" and "Critical path reduced to 2".]
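The feed-forward condition can be checked mechanically. The sketch below is only an illustration: it describes a cut by the set of nodes on the input side (SG1), and the node names in `FIR_EDGES` and `LOOP_EDGES` are hypothetical labels, not taken from the slides.

```python
def split_crossing_edges(edges, sg1):
    """Split the edges that cross the cut by direction.

    `edges` is a list of (source, destination) pairs and `sg1` is the set of
    nodes on one side of the cut; every other node belongs to SG2.
    """
    forward = [(u, v) for (u, v) in edges if u in sg1 and v not in sg1]
    backward = [(u, v) for (u, v) in edges if u not in sg1 and v in sg1]
    return forward, backward


def is_feed_forward_cutset(edges, sg1):
    """True if at least one edge crosses the cut and all crossings go SG1 -> SG2."""
    forward, backward = split_crossing_edges(edges, sg1)
    return len(forward) > 0 and len(backward) == 0


# Hypothetical edge list for a direct-form 3-tap FIR filter.
FIR_EDGES = [
    ("x", "mul_a"), ("x", "d1"), ("d1", "mul_b"), ("d1", "d2"),
    ("d2", "mul_c"), ("mul_a", "add1"), ("mul_b", "add1"),
    ("add1", "add2"), ("mul_c", "add2"), ("add2", "y"),
]
FIR_SG1 = {"x", "d1", "d2", "mul_a", "mul_b", "mul_c", "add1"}

# Hypothetical graph with a feedback loop through the delay "d".
LOOP_EDGES = [("in", "add"), ("add", "out"), ("add", "d"), ("d", "add")]
LOOP_SG1 = {"in", "add", "out"}

print(is_feed_forward_cutset(FIR_EDGES, FIR_SG1))    # all crossings are forward
print(is_feed_forward_cutset(LOOP_EDGES, LOOP_SG1))  # the loop crosses both ways
```

In the FIR case the crossing edges are exactly the ones that would each receive a latch; in the loop case no set of latches on the cut is a valid pipelining, because one crossing edge runs backward.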

Data-broadcast structures
- The critical path of the original 3-tap FIR filter can be reduced without introducing pipelining latches by transposing the structure.
- Transposition: reversing the direction of all the edges in a given SFG and interchanging the input and output ports preserves the functionality of the system.
[Figure: SFG of the FIR filter and transposed SFG of the FIR filter, each with input x(n), output y(n), two $z^{-1}$ delays, and coefficients a, b, c.]

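Data-broadcast structures
The data-broadcast structure is based on the transposed form: the input data are not stored but are broadcast to all the multipliers simultaneously.
[Figure: data-broadcast 3-tap FIR filter with input x(n), multipliers c, b, a, two delay elements D, and output y(n).]
Critical path delay = $T_M + T_A$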
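A short behavioral check can confirm that the transposed form computes the same output. This is a sketch under assumptions: the function names are hypothetical, and the two delay elements are modeled as the state variables `s1` and `s2`, both assumed to start at zero.

```python
def fir_direct(x, a, b, c):
    """Direct form: the input samples are stored in the delay line."""
    y = []
    x1 = x2 = 0
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_broadcast(x, a, b, c):
    """Transposed (data-broadcast) form: partial sums are stored instead.

    x(n) feeds all three multipliers at once, and each delay holds a partial
    sum, so every register-to-register path has one multiplier and one adder.
    """
    y = []
    s1 = s2 = 0  # contents of the two delay elements
    for xn in x:
        y.append(a * xn + s1)
        s1, s2 = b * xn + s2, c * xn
    return y


samples = [1, 2, 3, 4, 5]
print(fir_broadcast(samples, 2, 3, 5) == fir_direct(samples, 2, 3, 5))
```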

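Fine-grain pipelining
The functional units are broken down further by pipelining to increase performance, e.g. each multiplier is broken into 2 smaller units, m1 and m2.
[Figure: FIR filter with each multiplier split into m1 (delay 6) and m2 (delay 4), separated by latches D, and adders of delay 2.]
Critical path delay = $T_{m2} + T_A = 4 + 2 = 6$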
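The critical paths of the structures seen so far can be compared with a few lines of arithmetic. The sketch assumes that the full multiplier delay is the sum of its two stages (6 + 4 = 10 time units), which the slides do not state explicitly; the function name and dictionary keys are hypothetical.

```python
def critical_paths(t_m1, t_m2, t_a):
    """Critical path of each 3-tap FIR structure, in unit delays."""
    t_m = t_m1 + t_m2  # assumption: multiplier delay = sum of its two stages
    return {
        "direct form": t_m + 2 * t_a,
        "pipelined or data-broadcast": t_m + t_a,
        # After fine-grain pipelining the longest stage is either m1 alone
        # or m2 followed by one adder.
        "fine-grain pipelined": max(t_m1, t_m2 + t_a),
    }


for name, delay in critical_paths(6, 4, 2).items():
    print(f"{name}: {delay}")
```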

Parallel Processing
- Parallel processing and pipelining are duals of each other: both exploit the concurrency available in the computation.
- In parallel processing, multiple outputs are computed at the same time using duplicate hardware.

A Parallel FIR System
E.g. a 3-tap FIR filter, which is a single-input single-output (SISO) system:
y(n) = a x(n) + b x(n-1) + c x(n-2)
A parallel system with 3 inputs per clock cycle has level of parallel processing L = 3:
- y(3k) = a x(3k) + b x(3k-1) + c x(3k-2)
- y(3k+1) = a x(3k+1) + b x(3k) + c x(3k-1)
- y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)
[Figure: sequential SISO system with input x(n) and output y(n); 3-parallel multiple-input multiple-output (MIMO) system with inputs x(3k), x(3k+1), x(3k+2) and outputs y(3k), y(3k+1), y(3k+2).]
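The three block equations can be checked against the sequential filter. This is a minimal sketch with hypothetical function names; it assumes the input length is a multiple of 3 and that samples before n = 0 are zero.

```python
def fir_sequential(x, a, b, c):
    """SISO filter: one output per clock."""
    y = []
    x1 = x2 = 0
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_3_parallel(x, a, b, c):
    """MIMO filter: three outputs per clock from three inputs (L = 3)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of 3")
    y = []
    p1 = p2 = 0  # x(3k-1) and x(3k-2), carried over from the previous block
    for k in range(0, len(x), 3):
        x0, x1, x2 = x[k], x[k + 1], x[k + 2]
        # Each line corresponds to one copy of the filter hardware.
        y.append(a * x0 + b * p1 + c * p2)  # y(3k)
        y.append(a * x1 + b * x0 + c * p1)  # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)  # y(3k+2)
        p1, p2 = x2, x1
    return y


samples = [1, 2, 3, 4, 5, 6]
print(fir_3_parallel(samples, 2, 3, 5) == fir_sequential(samples, 2, 3, 5))
```

Only the two carried-over samples cross a block boundary, so the state of the parallel system is updated once per block rather than once per sample.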

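A Parallel FIR System
[Figure: 3-parallel FIR filter. The inputs x(3k), x(3k+1), and x(3k+2) feed three sets of multipliers a, b, c that produce y(3k), y(3k+1), and y(3k+2), with two delay elements D. Labels: parallel system, pipelined system.]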

Complete parallel processing system
[Figure: x(n), with sampling period T/4, enters a serial-to-parallel converter that feeds x(4k), x(4k+1), x(4k+2), and x(4k+3) to the MIMO system (clock period T). The outputs y(4k), y(4k+1), y(4k+2), and y(4k+3) pass through a parallel-to-serial converter (clock period T/4) to give y(n).]
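The data flow through the complete system can be sketched in software. This illustrates the converters only: the function names are hypothetical, the stream length is assumed to be a multiple of the block size, and the stand-in MIMO block is a stateless doubling rather than a real filter.

```python
def serial_to_parallel(x, level):
    """Group a serial stream into blocks of `level` samples."""
    if len(x) % level != 0:
        raise ValueError("stream length must be a multiple of the block size")
    return [x[k:k + level] for k in range(0, len(x), level)]


def parallel_to_serial(blocks):
    """Flatten the blocks back into a serial stream."""
    return [sample for block in blocks for sample in block]


def run_block_system(x, level, mimo):
    """Apply `mimo` once per block, i.e. once per slow clock period T."""
    return parallel_to_serial([mimo(block) for block in serial_to_parallel(x, level)])


def double_block(block):
    """Stand-in MIMO system for illustration only."""
    return [2 * v for v in block]


print(serial_to_parallel(list(range(8)), 4))
print(run_block_system(list(range(8)), 4, double_block))
```

For 8 samples and L = 4 the MIMO block runs twice, which mirrors the figure: the converters run at the sample rate while the MIMO system runs at one quarter of it.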

When should we use parallel processing over pipelining?
There is a fundamental limit to pipelining imposed by the input/output (I/O) bottlenecks.
[Figure: the output pad of Chip 1 connected to the input pad of Chip 2, with the communication time $T_{comm}$ and the computation time $T_{computation}$ marked.]
Communication-bounded systems:
- The communication time (input/output pad delay + wire delay) is larger than the computation delay.
- Pipelining can only reduce the critical-path computation delay, so it cannot help a communication-bound system.
- Only parallel processing can then be used to improve the performance.
- Further improvement can be achieved by combining pipelining and parallel processing.

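Low Power Signal Processing
Pipelining and parallel processing can be used either for higher speed or for lower power.
- Dynamic power consumption: $P = C_{total} V_0^2 f$
- Propagation delay: $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
where
- $C_{charge}$: the capacitance to be charged/discharged
- $V_0$: supply voltage; $V_t$: threshold voltage
- $k$: technology parameter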

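Pipelining for Low Power
- Sequential filter: $P_{seq} = C_{total} V_0^2 f$
- After pipelining, the critical path is reduced, hence a lower supply voltage $V' = \beta V_0$ can be used at the same clock frequency. The new power is $P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$.
- The factor $\beta$ can be found from the following delay comparison for M-level pipelining, in which the capacitance charged in one clock period falls to $C_{charge}/M$:
  - Sequential: $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
  - Pipelined: $T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
  - Setting $T_{pip} = T_{seq}$ gives $M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$.
[Figure: critical path of the sequential system ($T_{seq}$ at supply $V_0$) and of the pipelined system when M = 3 (at supply $\beta V_0$).]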

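Example
Assume:
- The capacitance of a multiplier, $C_M$, is 5 times that of an adder, $C_A$.
- Fine-grain pipelining is used, with $C_{m1} = 3C_A$ and $C_{m2} = 2C_A$.
- $V_{dd} = 5$ V and $V_t = 0.6$ V.
[Figure: the fine-grain pipelined filter with multipliers split into m1 (6) and m2 (4) and adders (2), and the original data-broadcast filter with multipliers c, b, a.]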

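Solution
- For the original filter, $C_{charge} = C_M + C_A = 6C_A$.
- For the pipelined filter, $C_{charge} = C_{m1} = C_{m2} + C_A = 3C_A$.
- Now M = 2, so $2(5\beta - 0.6)^2 = \beta (5 - 0.6)^2$. Solving this equation gives $\beta \approx 0.6033$ (the other root, $\beta \approx 0.0239$, would put the supply voltage below $V_t$).
- The voltage of the pipelined filter is $V_{pipe} = \beta V_0 \approx 3$ V.
- The power consumption ratio is $\beta^2 = 36.4\%$.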
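The quadratic in this solution can be solved numerically as a check. The sketch below assumes the relation $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ used in the solution; the function name `pipelined_beta` is hypothetical.

```python
import math


def pipelined_beta(m, v0, vt):
    """Solve m * (beta*v0 - vt)**2 == beta * (v0 - vt)**2 for beta.

    Only the root that keeps the scaled supply above the threshold
    voltage (beta * v0 > vt) is physically meaningful.
    """
    # Expanded form: qa*beta^2 + qb*beta + qc = 0
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    qc = m * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    feasible = [r for r in roots if r * v0 > vt]
    return max(feasible)


beta = pipelined_beta(m=2, v0=5.0, vt=0.6)
print(f"beta = {beta:.4f}")
print(f"V_pipe = {beta * 5.0:.2f} V")
print(f"power ratio = {beta ** 2:.3f}")
```

The feasible root reproduces the figures quoted in the solution: a supply of about 3 V and a power ratio of about 36.4%.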

Parallel Processing for low power
- In an L-parallel architecture, the charging capacitance $C_{charge}$ can be assumed to remain the same, but the total capacitance $C_{total}$ is increased L times.
- The clock frequency of the L-parallel architecture is reduced to 1/L of the original (i.e. $f = 1/(L\,T_{pd})$) to maintain the same sampling rate.
- The supply voltage can be reduced to $\beta V_0$, since more time is allowed to charge or discharge the same capacitance.

Parallel Processing for low power
[Figure: critical path of the sequential system ($T_{seq}$ at supply $V_0$) and of the parallel system when L = 3 ($3T_{seq}$ at supply $\beta V_0$).]
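The power saving implied by these assumptions can be worked out as follows. This is a sketch, not a transcription of the slide: it assumes the delay model $T_{pd} = C_{charge} V_0 / (k (V_0 - V_t)^2)$ and the dynamic power $P = C_{total} V_0^2 f$, with $P_{seq}$ the power of the sequential system and $f$ its clock frequency.

```latex
\begin{align*}
T_{seq} &= \frac{C_{charge} V_0}{k (V_0 - V_t)^2}, \\
T_{par} &= L\,T_{seq} = \frac{C_{charge}\,\beta V_0}{k (\beta V_0 - V_t)^2}
  \;\Longrightarrow\; L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2, \\
P_{par} &= (L\,C_{total}) (\beta V_0)^2 \frac{f}{L}
  = \beta^2 C_{total} V_0^2 f = \beta^2 P_{seq}.
\end{align*}
```

The L-fold increase in capacitance is cancelled by the L-fold lower clock frequency, so the whole saving comes from the reduced supply voltage.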

Example: Reduce power by parallel processing
Consider the following FIR filters.
[Figure: a sequential FIR filter with input x(n) and output y(n), and a 2-parallel FIR filter with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1).]
Assumptions:
- $C_M = 8C_A$
- $T_M = 8T_A$
- Both architectures operate at a sampling period of $9T_A$.
- Supply voltage = 3.3 V and $V_t = 0.45$ V.

Solution
$C_{charge}$:
- Sequential: $C_{charge} = C_M + C_A = 9C_A$
- Parallel: $C_{charge} = C_M + 2C_A = 10C_A$
Power ratio: $\beta^2 = 0.434$
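The quoted power ratio can be reproduced numerically. The intermediate step is an assumption, since it does not survive in the transcript: the 2-parallel filter is given twice the sequential clock period, so $2 T_{seq} = T_{par}$ with the two different charging capacitances, which leads to $9(\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$. The function name `parallel_beta` and its parameters are hypothetical.

```python
import math


def parallel_beta(level, c_seq, c_par, v0, vt):
    """Voltage scaling factor for an L-parallel system.

    Solves level * T_seq == T_par under the delay model
    T = C_charge * V / (k * (V - Vt)**2), which reduces to
    ratio * (beta*v0 - vt)**2 == beta * (v0 - vt)**2
    with ratio = level * c_seq / c_par.
    """
    ratio = level * c_seq / c_par
    qa = ratio * v0 ** 2
    qb = -(2 * ratio * v0 * vt + (v0 - vt) ** 2)
    qc = ratio * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Keep only the root whose supply voltage stays above the threshold.
    return max(r for r in roots if r * v0 > vt)


# Capacitances are expressed in units of C_A: 9 (sequential), 10 (parallel).
beta = parallel_beta(level=2, c_seq=9, c_par=10, v0=3.3, vt=0.45)
print(f"beta = {beta:.4f}")
print(f"supply = {beta * 3.3:.2f} V")
print(f"power ratio = {beta ** 2:.3f}")
```

Under these assumptions the power ratio comes out at about 0.434, in agreement with the value in the solution.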