ADSP Lecture2 - Unfolding VLSI Signal Processing Lecture 2 Unfolding Transformation.

Slides:

Advertisements

Similar presentations

Chapter 8. FIR Filter Design

Advertisements

25 July, 2014 Martijn v/d Horst, TU/e Computer Science, System Architecture and Networking 1 Martijn v/d Horst

Mani Srivastava UCLA - EE Department Room: 6731-H Boelter Hall Tel: WWW: Copyright 2003.

1 ECE734 VLSI Arrays for Digital Signal Processing Loop Transformation.

1 ECE734 VLSI Arrays for Digital Signal Processing Chapter 3 Parallel and Pipelined Processing.

Instruction-Level Parallelism compiler techniques and branch prediction prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University March.

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:

Chapter 4 Retiming.

1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.

Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

Parallell Processing Systems1 Chapter 4 Vector Processors.

VLSI Communication SystemsRecap VLSI Communication Systems RECAP.

AGC DSP AGC DSP Professor A G Constantinides©1 A Prediction Problem Problem: Given a sample set of a stationary processes to predict the value of the process.

Digital Signal Processing (DSP) Fundamentals. Overview What is DSP? Converting Analog into Digital –Electronically –Computationally How Does It Work?

ELEC692 VLSI Signal Processing Architecture Lecture 4

ECE734 VLSI Arrays for Digital Signal Processing Algorithm Representations and Iteration Bound.

Courseware Path-Based Scheduling Sune Fallgaard Nielsen Informatics and Mathematical Modelling Technical University of Denmark Richard Petersens Plads,

Compiler Challenges, Introduction to Data Dependences Allen and Kennedy, Chapter 1, 2.

Instruction Level Parallelism (ILP) Colin Stevens.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

Digital Kommunikationselektronik TNE027 Lecture 4 1 Finite Impulse Response (FIR) Digital Filters Digital filters are rapidly replacing classic analog.

VHDL Coding Exercise 4: FIR Filter. Where to start? AlgorithmArchitecture RTL- Block diagram VHDL-Code Designspace Exploration Feedback Optimization.

1 Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

VLSI DSP 2008Y.T. Hwang3-1 Chapter 3 Algorithm Representation & Iteration Bound.

ECE Synthesis & Verification 1 ECE 667 ECE 667 Synthesis and Verification of Digital Systems Retiming.

Algorithmic Transformations

Chapter 5 Unfolding.

The Processor Data Path & Control Chapter 5 Part 1 - Introduction and Single Clock Cycle Design N. Guydosh 2/29/04.

1 Real time signal processing SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

EDA (CS286.5b) Day 18 Retiming. Today Retiming –cycle time (clock period) –C-slow –initial states –register minimization.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 13, 2008 Retiming.

Pipelining. Overview Pipelining is widely used in modern processors. Pipelining improves system performance in terms of throughput. Pipelined organization.

© 2003 Xilinx, Inc. All Rights Reserved Multi-rate Systems.

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Using Programmable Logic to Accelerate DSP Functions 1 Using Programmable Logic to Accelerate DSP Functions “An Overview“ Greg Goslin Digital Signal Processing.

ELEC692 VLSI Signal Processing Architecture Lecture 1

Chapter 6-2 Multiplier Multiplier Next Lecture Divider

L7: Pipelining and Parallel Processing VADA Lab..

Copyright © 2001, S. K. Mitra Digital Filter Structures The convolution sum description of an LTI discrete-time system be used, can in principle, to implement.

This material exempt per Department of Commerce license exception TSU Multi-rate Systems.

Carnegie Mellon Lecture 14 Loop Optimization and Array Analysis I. Motivation II. Data dependence analysis Chapter , 11.6 Dror E. MaydanCS243:

CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 7: February 3, 2002 Retiming.

1 Implementation in Hardware of Video Processing Algorithm Performed by: Yony Dekell & Tsion Bublil Supervisor : Mike Sumszyk SPRING 2008 High Speed Digital.

LIST OF EXPERIMENTS USING TMS320C5X Study of various addressing modes of DSP using simple programming examples Sampling of input signal and display Implementation.

EE2174: Digital Logic and Lab Professor Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University CHAPTER 8 Arithmetic.

ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.

Folding Technique: Compromising in Special Purpose Hardware Design

ELEC692 VLSI Signal Processing Architecture Lecture 3

Exploiting Parallelism

Fast Parallel Algorithms for Edge-Switching to Achieve a Target Visit Rate in Heterogeneous Graphs Maleq Khan September 9, 2014 Joint work with: Hasanuzzaman.

Carnegie Mellon Lecture 8 Software Pipelining I. Introduction II. Problem Formulation III. Algorithm Reading: Chapter 10.5 – 10.6 M. LamCS243: Software.

L9 : Low Power DSP Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab.

Reconfigurable Computing - Options in Circuit Design John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound,

1 VLSI Algorithm & Computing Structures Chapter 1. Introduction to DSP Systems Younglok Kim Dept. of Electrical Engineering Sogang University Spring 2007.

DSP Design – Lecture 7 Unfolding cont. & Folding Fredrik Edman fredrik

Embedded Systems Design

By: Mohammadreza Meidnai Urmia university, Urmia, Iran Fall 2014

Serial Multipliers Prawat Nagvajara

102-1 Under-Graduate Project Techniques in VLSI design

Digital Systems Section 14 Registers. Digital Systems Section 14 Registers.

Digital Signal Processors

Digital Filter Design Tools

101-1 Under-Graduate Project Techniques in VLSI design

Multiplier-less Multiplication by Constants

Lecture 17 Logistics Last lecture Today HW5 due on Wednesday

Lecture 17 Logistics Last lecture Today HW5 due on Wednesday

Real time signal processing

Murugappan Senthilvelan May 4th 2004

Presentation transcript:

ADSP Lecture2 - Unfolding VLSI Signal Processing Lecture 2 Unfolding Transformation

ADSP Lecture2 - Unfolding Multiple-Data Processing Create a program with more than one iteration, e.g. J loops unrolling Example: Loop unrolling + software pipelining clock cycle operation clock cycle

ADSP Lecture2 - Unfolding Basic Ideas Parallel processingPipelined processing a1a2a3a4 b1b2b3b4 c1c2c3c4 d1d2d3d4 a1b1c1d1 a2b2c2d2 a3b3c3d3 a4b4c4d4 P1 P2 P3 P4 P1 P2 P3 P4 time

ADSP Lecture2 - Unfolding Data Dependence Parallel processing requires NO data dependence between processors Pipelined processing will involve inter-processor communication P1 P2 P3 P4 P1 P2 P3 P4 time

ADSP Lecture2 - Unfolding Parallel Processing In a J-unfolded system, each delay is J-slow. That is, if input to a delay element is x(kJ+m), then the output is x((k-1)J+m) = x(kJ+m-J)

ADSP Lecture2 - Unfolding Parallel Processing Block processing –the number of inputs processed in a clock cycle is referred to as the block size –at the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)

ADSP Lecture2 - Unfolding I/O Conversion Serial to parallel converter Parallel to serial converter

ADSP Lecture2 - Unfolding General approach for block processing

ADSP Lecture2 - Unfolding Mathematical Formulation e.g. y(n) = ay(n-9) + x(n) 2-parallel Y(2k) = ay(2k-9) + x(2k) Y(2k+1) = ay(2k-8) + x (2k+1) In 2-parallel SDFG, one active clock edge leads two samples Y(2k) = ay(2(k-5)+1) + x(2k) Y(2k+1) = ay(2(k-4)+0) + x(2k+1) Dependency with less than # parallelism of sample delays can be implemented with internal routing

ADSP Lecture2 - Unfolding Unfolding the DFG T=T s T=J T s Not trivial, even for a simple graph

ADSP Lecture2 - Unfolding Block Processing for FIR Filter One form of vectorized parallel processing of DSP algorithms. (Not the parallel processing in most general sense) Block vector: [x(3k) x(3k+1) x(3k+2)] Clock cycle: can be 3 times longer Original (FIR filter): Rewrite 3 equations at a time:

ADSP Lecture2 - Unfolding Block Processing

ADSP Lecture2 - Unfolding Block Processing for IIR Digital Filter Original formulation: Rewrite: Vector formulation: n: sample period k: processor period T sample ≠T clk

ADSP Lecture2 - Unfolding Block IIR Filter D D S/PP/S + +   x(2k) x(2k+1) y(2k+1) y(2k) x(n)y(n) y(2(k  1)) y(2(k  1)+1) clock period not equal to sampling period

ADSP Lecture2 - Unfolding Timing Comparison Pipelining Block processing 1234 x(1)x(2)x(3)x(4) y(1)y(2)y(3)y(4) x(1)x(2)x(3)x(4)x(5)x(6)x(7) MAC y(1)y(2)y(3)y(4)y(5)y(6)y(7) Add a y(1) Mul x(2)x(4)x(6)x(8) x(1)x(3)x(5)x(7)

ADSP Lecture2 - Unfolding Definitions Unfolding is the process of unfolding a loop so that several iterations are unrolled into the same iteration. Also known as (a.k.a.) –Loop unrolling (in compilers for parallel programs) –Block processing Applications –Reducing sampling period to achieve iteration bound (desired throughput rate) T . –Parallel (block processing) to execute several iterations concurrently. –Digit-serial or bit-serial processing

ADSP Lecture2 - Unfolding Unfolding the DFG y(n)=ay(n-9)+x(n) Rewrite the algorithm formulation: y(2k)=ay(2k-9)+x(2k) y(2k+1)=ay(2k-8)+x(2k+1) y(2k)=ay(2(k-5)+1)+x(2k) y(2k+1)=ay(2(k-4))+x(2k+1) After J-folded unfolding, the clock period T = J T s, where T s is the data sampling period.

ADSP Lecture2 - Unfolding Timing Diagram Above timing diagram is obtained assuming that the sampling period T s remains unchanged. Thus, the clock period T is increased J-fold. Since 9/2 is not an integer, output (y(0), y(1)) will be needed by two different future iterations, 4T and 5T later. y(0)y(1)y(2)y(3)y(4)y(5)y(6)y(7)y(8)y(9)y(10)y(11)y(12)y(13) T=T s y(0)y(2)y(4)y(6)y(8)y(10)y(12) y(1)y(3)y(5)y(7)y(9)y(11)y(13) T=2T s 9 T 4T 5T 9 T

ADSP Lecture2 - Unfolding Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 Q1Q1 S1S1 T1T1 R1R1 J=2 T  =3 iw(i+w)%J Step 1. Duplicate J copies of each node

ADSP Lecture2 - Unfolding Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 Q1Q1 S1S1 T1T1 R1R1 J=2 T  =3 iw(i+w)%J Step 2. Add all edges with 0 delay on them.

ADSP Lecture2 - Unfolding Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 D Q1Q1 S1S1 T1T1 R1R1 D D J=2 T  =3 T  =6 iw(i+w)%J Step 3. Use table on the left to figure out edges with delays.

ADSP Lecture2 - Unfolding Unfolding Transformation For each node U in the original DFG, draw J node U 0, U 1,…, U J-1 For each edge UV with w delays in the original DFG, draw the J edges U i V (i + w)%J with floor[(i+w)/J] delays for i=0,1,…, J-1 Example Unfolding of an edge with w delays in the original DFG produces J- w edges with no delays and w edges with 1delay in J-unfolded DFG for w < J Unfolding preserves precedence constraints of a DSP algorithm

ADSP Lecture2 - Unfolding Precedence Preservation

ADSP Lecture2 - Unfolding Delay Preservation Unfolding preserves the number of delays in a DFG Let, where

ADSP Lecture2 - Unfolding Example Unfold the following DFG using folding factor 2 and 5

ADSP Lecture2 - Unfolding Properties of Unfolding Unfolding preserves the number of registers (delays) in a DFG For a loop with w delays in a DFG that has been unfolded J times, it leads to –g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing W/(g.c.d.(w,J)) delays and J/(g.c.d.(w,J)) copies of each node that appear in the original loop. Unfolding a DFG with iteration bound T  results in a J-folded DFG with iteration bound JT . A path with w (< J) delays in a DFG will lead to J - w paths with no delays, and w paths with 1 delay each in the J-unfolded DFG. Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG and followed by J-unfolding.

ADSP Lecture2 - Unfolding When a Loop is Unfolded A loop ℓ with w delays in a DFG Travel the loop A~>A p times  also a loop with pw delays In J-unfolded DFG, consider the path A i  A (i+pw)%J. It is a loop if i=(i+ pw)%J. This implies that J | pw The smallest p = J/gcd(J, w). That is, in J-unfolded DFG, one can travel the loop A~>A J/gcd(J, w) times. Recall that there are totally J copies of node A. Hence, there are J/(J/gcd(J,w))=gcd(J, w) loops and each loop contains w/ gcd(J, w) delays. The iteration bound in J-unfolded DFG is then

ADSP Lecture2 - Unfolding When a Path is Unfolded If w<J, then a path containing w delays within a DFG will lead to (J-w) paths with no delays and w paths with 1 delay in the J-unfolded DFG. If w≥J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical. Assume that the critical path of the J-unfolded DFG is c. If D(U,V)≥c, then W r (UV)=W(UV)+r(V)-r(U) ≥ J Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG can be achieved by retiming the original DFG directly and followed by J-unfolding.

ADSP Lecture2 - Unfolding When a Path is Unfolded Suppose r’ is a legal retiming for the J-unfolded DFG, G J, which leads to critical path c. Let r(U) =  i r’(U i ), 0≤i≤J-1. –r is a feasible retiming for the original DFG, G. –The retiming leads to a critical path c 0≤i≤J-1 ii

ADSP Lecture2 - Unfolding Sample Period Reduction Case1: A node in the DFG having computation time greater than T ∞ Case2: Iteration bound is not an integer Case3: Longest node computation is larger than the iteration T ∞, and T ∞ is not an integer

ADSP Lecture2 - Unfolding Case 1 Critical path dominates, since a node computation time is more than iteration bound Retiming cannot be used to reduce sample period

ADSP Lecture2 - Unfolding Sample Period Reduction Rule of Thumb: T ∞ =6, T critical =6

ADSP Lecture2 - Unfolding Case 2 Iteration period cannot not achieve the iteration bound

ADSP Lecture2 - Unfolding Sample Period Reduction

ADSP Lecture2 - Unfolding Case 3

ADSP Lecture2 - Unfolding Parallel Processing Parallel processing can be performed by unfolding

ADSP Lecture2 - Unfolding Bit-Level Parallel Processing

ADSP Lecture2 - Unfolding

ADSP Lecture2 - Unfolding Bit-Serial Adder

ADSP Lecture2 - Unfolding Unfolding of Switches

ADSP Lecture2 - Unfolding Example

ADSP Lecture2 - Unfolding Example

ADSP Lecture2 - Unfolding Example

ADSP Lecture2 - Unfolding Example

ADSP Lecture2 - Unfolding Switches with Delays

ADSP Lecture2 - Unfolding Switch with Delays

ADSP Lecture2 - Unfolding If Wordlength is not a Multiple of J