Algorithmic Transformations




Goals
The goal: get the DSP algorithm into an amenable form before synthesizing the design on the selected platform (FPGA or PDSP).
No changes are made to the actual algorithms, only to the way the algorithms are prepared for implementation.
This requires understanding aspects of timing, pipelining, and parallelism.
(C)2002-2004 Yu Hen Hu

Overview
Algorithm Representations and Iteration Bound
Parallelism and Pipelining
Retiming
Unfolding
Folding


Data Flow Graph
Node: a computation, associated with a computing time.
Directed edge: a data path, possibly carrying delay.
Delay: iteration count.
Example: y(n) = a*y(n-1) + b*u(n). The delay of 1 t.u. indicates that computing y(n+1) in the next iteration depends on the result y(n) of the present iteration.
Delays are labeled with D or a positive integer on edges.

DFG
Intra-iteration dependency: a directed edge without any delay.
Inter-iteration dependency: a directed edge with 1 or more delays.
Node computing delays are labeled in parentheses.
Critical path: the longest path between registers. Example: critical path delay = 4+2+2 = 8 t.u.
[Figure: DFG with input x(n), two delays D, multipliers M0 (4), M1 (4), M2 (4), adders A0 (2), A1 (2), and output y(n).]
Recursive DFG: contains loops. There must be at least one delay element along any loop; otherwise the algorithm is non-computable.

Loop bound and Iteration bound
[Figure: a loop through nodes A (2) and B (4) carrying 2D.]
Loop bound: T{A-B-A} = (2+4)/2 = 3 t.u.
[Figure: a DFG with nodes A (2), B (4), C (5) and a 2D delay.]
Iteration bound: T = max{(2+4)/2, (2+4+5)/1} = max{3, 11} = 11 t.u.
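The arithmetic on this slide can be reproduced with a short sketch. It assumes, as the divisors in the slide's formula suggest, that loop A-B-A carries 2 delays and loop A-B-C-A carries 1 delay; the function names `loop_bound` and `iteration_bound` are illustrative only.

```python
from fractions import Fraction

def loop_bound(node_times, num_delays):
    """Loop bound = total node computing time / number of delays on the loop."""
    if num_delays <= 0:
        raise ValueError("a loop needs at least one delay to be computable")
    return Fraction(sum(node_times), num_delays)

def iteration_bound(loops):
    """Iteration bound = maximum loop bound over all loops of the DFG."""
    return max(loop_bound(times, delays) for times, delays in loops)

# Loops read off the slide: A-B-A with 2 delays, A-B-C-A with 1 delay (assumed).
single = loop_bound([2, 4], 2)
bound = iteration_bound([([2, 4], 2), ([2, 4, 5], 1)])
print(single, bound)  # 3 11
```

Exact fractions are used so that non-integer loop bounds, such as 4/3, are compared without rounding.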


Solution
To achieve high speed, the length of the critical path can be reduced by pipelining and parallel processing.


Basic Ideas
Parallel processing: less inter-processor communication, more complicated processor hardware.
Pipelined processing: more inter-processor communication, simpler processor hardware.
[Figure: processor activity over time for each scheme. Colors: different types of operations performed. a, b, c, d: different data streams processed.]

Data Dependence
Parallel processing requires NO data dependence between processors.
Pipelined processing will involve inter-processor communication.
[Figure: processors P1 to P4 plotted against time for each scheme.]

Usage of Pipelined Processing
By inserting latches or registers between combinational logic circuits, the critical path can be shortened.
Consequence: reduced clock cycle time, increased clock frequency.
Suitable for DSP applications that have (infinitely) long data streams.
Method to incorporate pipelining: cut-set retiming.
Cut set: a set of edges of a graph such that, if these edges are removed from the original graph, the remaining graph becomes two separate graphs.
Retiming: the timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.

Pipelining

Pipelining of FIR filters

Pipelining

Fine-grain pipelining
To further reduce the critical path, the multiplier (delay TM) is itself pipelined into stages with delays TM1 and TM2.
Critical path = max{TM1, TM2, TA}

Graphic Transpose Theorem
The transfer function of a signal flow graph remains unchanged if:
the direction of each arc is reversed, and
the input and output labels are switched.
[Figure: a 3-tap FIR signal flow graph with input x[n], output y[n], coefficients h[0], h[1], h[2], and z^-1 delays, shown next to its transpose.]

Data broadcast structure
An algorithm transform may lead to a pipelined structure without adding delays.
Given an FIR filter SFG: critical path TM+2TA.
Use the graph transposition theorem: reverse all arcs, reverse input/output.
We obtain: critical path Max(TM, TA).
No additional delay added!
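A minimal sketch of the equivalence behind the transposition step, assuming a zero initial state and a generic tap list `h`. The function names are illustrative. The code only checks that both structures compute the same output; it does not model the critical path.

```python
def fir_direct(h, x):
    """Direct form: a delay line on the input, then multiply and sum."""
    taps = [0] * len(h)                 # taps[i] holds x(n - i)
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        y.append(sum(hi * ti for hi, ti in zip(h, taps)))
    return y

def fir_transposed(h, x):
    """Transposed (data-broadcast) form: x(n) feeds every multiplier and the
    delays sit between the adders."""
    state = [0] * (len(h) - 1)          # registers between the adders
    y = []
    for sample in x:
        products = [hi * sample for hi in h]
        y.append(products[0] + (state[0] if state else 0))
        # Each register takes the next product plus the next register's old value.
        state = [products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
                 for i in range(len(state))]
    return y

h = [1, 2, 3]
x = [1, 0, 2, -1, 4]
print(fir_direct(h, x))      # [1, 2, 5, 3, 8]
print(fir_transposed(h, x))  # [1, 2, 5, 3, 8]
```

Integer test data keeps the comparison exact; with floating-point inputs the two forms add in a different order and may differ by rounding.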

Block Processing
One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).
Block vector: [x(3k) x(3k+1) x(3k+2)]
Clock cycle: can be 3 times longer.
The slide starts from the original FIR filter, rewrites 3 equations at a time, defines the block vector, and arrives at the block formulation. [Equations not preserved in the transcript.]

Block Processing

General approach for block processing

Timing Comparison
[Figure: timing diagrams comparing a single MAC unit, pipelining (separate Mul and Add stages), and block processing, showing inputs x(n) and outputs y(n) against clock cycles.]


Definitions
Retiming is a mapping from a given DFG, G, to a retimed DFG, Gr, such that the corresponding transfer functions of G and Gr differ by a pure delay z^-L.
Purposes:
To facilitate pipelining to reduce clock cycle time.
To reduce the number of registers needed.

Cut Set Retiming

Cut set delay transfer

Cut-set delay transfer failure

Cut-set Retiming
Delay transfer theorem
Feed-forward cut-set: adding the same arbitrary non-negative number of delays to each edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.
Feed-back cut-set: transferring the same number of delays from the edges in one direction across a feed-back cut-set of a DFG to all edges in the opposite direction across the same cut-set will not alter the output, only its timing.

Feed-forward Cut-Set Retiming
Consider the FIR digital filter and its DFG: y(n) = b0*x(n) + b1*x(n-1)
Critical path length = TM+TA
Select a cut set and insert one delay on each edge in the cut set.
Retiming: ynew(n) = b0*x(n-1) + b1*x(n-2), so ynew(n) = y(n-1)
Critical path = Max(TM, TA)
[Figure: the filter DFG before and after retiming, with multipliers b0 and b1, an adder, the delay D producing x(n-1), and the two inserted delays D.]
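The relation ynew(n) = y(n-1) can be checked with a small simulation. This is a sketch under two assumptions: the cut set is the pair of edges between the multipliers and the adder (which is what a critical path of Max(TM, TA) suggests), and all registers start at zero. The function names are illustrative.

```python
def fir2(b0, b1, x):
    """Original filter: y(n) = b0*x(n) + b1*x(n-1), zero initial state."""
    prev = 0
    out = []
    for s in x:
        out.append(b0 * s + b1 * prev)
        prev = s
    return out

def fir2_pipelined(b0, b1, x):
    """Same filter with one register on each multiplier-to-adder edge."""
    prev = 0            # register holding x(n-1)
    reg0 = reg1 = 0     # pipeline registers inserted on the cut set
    out = []
    for s in x:
        out.append(reg0 + reg1)           # the adder sees last cycle's products
        reg0, reg1 = b0 * s, b1 * prev    # multipliers work in the same cycle
        prev = s
    return out

x = [1, 2, 3, 4]
y = fir2(2, 3, x)
y_new = fir2_pipelined(2, 3, x)
print(y)      # [2, 7, 12, 17]
print(y_new)  # [0, 2, 7, 12]
```

The pipelined output is the original output delayed by one sample, which is the only change the delay transfer theorem allows.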

Feed-back Cut Set Retiming
Consider an IIR digital filter y(n) = a·y(n-2) + x(n)
Before retiming: loop bound = (TM+TA)/2, clock cycle = TM+TA
Shift 1 delay to the other edge across a feed-back cut set. The filter remains unchanged.
After retiming: loop bound = (TM+TA)/2, clock cycle = Max(TM, TA)
[Figure: the filter loop with adder and multiplier a, first with 2D on one edge, then with one D on each edge across the cut set.]

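Feed-back Cut Set Retiming
Consider an IIR digital filter y(n) = a·y(n-1) + x(n)
Original: loop bound = (TM+TA), throughput = 1/(TM+TA)
Slowed down by 2 (D replaced by 2D): the slowed-down input x(m) carries the original samples at m = 2k-1 and zeros at m = 2k.
Clock period = (TM+TA), throughput = 1/[2(TM+TA)]
[Figure: the original loop with input x(n), output y(n), and one delay D, next to the slowed-down loop with input x(m), output y(m), and 2D.]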
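A sketch of the 2-slow-down, assuming zero initial state. The code uses 0-based indexing, so real samples sit at even positions and null samples at odd positions, which is the slide's arrangement shifted by one index. The function names are illustrative.

```python
def iir1(a, x):
    """Original: y(n) = a*y(n-1) + x(n), zero initial state."""
    y_prev = 0
    out = []
    for s in x:
        y_prev = a * y_prev + s
        out.append(y_prev)
    return out

def iir1_two_slow(a, x):
    """2-slow version: y(m) = a*y(m-2) + x(m), fed a zero-interleaved input."""
    slow_in = []
    for s in x:
        slow_in += [s, 0]       # a real sample followed by a null sample
    d = [0, 0]                  # d[0] = y(m-1), d[1] = y(m-2)
    out = []
    for s in slow_in:
        y = a * d[1] + s
        d = [y, d[0]]
        out.append(y)
    return out[::2]             # keep only outputs aligned with real samples

x = [1, 2, 3]
print(iir1(2, x))           # [1, 4, 11]
print(iir1_two_slow(2, x))  # [1, 4, 11]
```

The slowed-down filter needs twice as many clock cycles for the same outputs, which is where the factor of 2 in the throughput comes from.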

Time scaling

Slowing down the input rate

Loss of Efficiency

Slowdown + Retiming
Start with y(n) = a y(n-1) + x(n). After slowdown and retiming: clock cycle = Max(TM, TA), throughput = 1/[2 Max(TM, TA)].
Start with y(n) = a y(n-2) + x(n). After retiming: loop bound = (TM+TA)/2, clock cycle = Max(TM, TA), throughput = 1/Max(TM, TA).
[Figure: the two retimed loops, each with an adder, a multiplier a, and one delay D on each edge of the loop.]

Slow Down for Cut-Set Retiming

Example of retiming
Node delay = 1 t.u.
Before retiming: critical path a3 → a4 → a5 → a6; clock cycle time = 4; 2 delay units.
After cut-set retiming: critical paths a3 → a5 and a4 → a6; clock cycle time = 2; 6 delay units.
After additional retiming: critical path: none; clock cycle time = 1; 11 delay units.
[Figure: the DFG with nodes a1 to a6 and its delays (D, 2D) at each of the three stages.]

Node Retiming
Transfer delay through a node in the DFG:
$r(v)$ = # of delays transferred from outgoing edges to incoming edges of node $v$
$w(e)$ = # of delays on edge $e$
$w_r(e)$ = # of delays on edge $e$ after retiming
Retiming equation for an edge $e$ from node $u$ to node $v$: $w_r(e) = w(e) + r(v) - r(u)$, subject to $w_r(e) \ge 0$.
Let $p$ be a path from $v_0$ to $v_k$; then $w_r(p) = w(p) + r(v_k) - r(v_0)$, where $w(p)$ is the total number of delays on $p$.
[Figure: delay transfer through a node v with r(v) = 2, and a path p from v0 to vk through edges e0, e1, ...]

Invariant Properties
Retiming does NOT change the total number of delays in each cycle.
Retiming does not change the loop bound or iteration bound of the DFG.
If a constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.
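These invariants can be checked numerically. The sketch below uses the edge delays and retiming values of the four-node example in this deck (edges 2→1, 1→3, 1→4, 3→2, 4→2). The dictionary layout and the helper names `retime` and `cycle_delays` are assumptions made for illustration.

```python
def retime(edges, r):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge (u, v) -> w."""
    retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    if any(w < 0 for w in retimed.values()):
        raise ValueError("infeasible retiming: negative edge delay")
    return retimed

def cycle_delays(g, cycle):
    """Total number of delays along a cycle given as a list of nodes."""
    return sum(g[(cycle[i], cycle[(i + 1) % len(cycle)])]
               for i in range(len(cycle)))

# (u, v) -> w(e_uv) for the four-node example.
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
r = {1: 0, 2: 1, 3: 0, 4: 0}
g_r = retime(edges, r)
print(g_r)  # {(2, 1): 0, (1, 3): 1, (1, 4): 2, (3, 2): 1, (4, 2): 1}

# Delays per cycle are unchanged by retiming.
print(cycle_delays(edges, [1, 3, 2]), cycle_delays(g_r, [1, 3, 2]))  # 2 2
print(cycle_delays(edges, [1, 4, 2]), cycle_delays(g_r, [1, 4, 2]))  # 3 3

# Adding a constant to every r(v) gives the same retimed graph.
shifted = {v: rv + 5 for v, rv in r.items()}
print(retime(edges, shifted) == g_r)  # True
```

Because every cycle keeps its delay count and the node times do not change, the loop bounds and the iteration bound are preserved as well.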

Node Retiming Examples

DFG Illustration of the Example
Before retiming: T = max{(1+2+1)/2, (1+2+1)/3} = 2; critical path delay = 2+1 = 3 t.u.
After retiming: T = max{(1+2+1)/2, (1+2+1)/3} = 2; critical path delay = max{2, 2, 1+1} = 2 t.u.

Retiming for Minimizing Clock Period
Note that retiming will NOT alter the iteration bound $T$. The iteration bound is the theoretical minimum clock period to execute the algorithm.
Let edge $e$ connect node $u$ to node $v$. If the node computing time $t(u) + t(v) > T$ and $e$ carries no delay, then the clock period exceeds $T$. For such an edge, we require that $w_r(e) = w(e) + r(v) - r(u) \ge 1$.
To generalize, for any path $p$ from $v_0$ to $v_k$ whose computing time is larger than $T$, we have $w_r(p) = w(p) + r(v_k) - r(v_0) \ge 1$.
In other words, any possible critical path in the DFG that is longer than $T$ must carry at least one delay after retiming.

Retiming Example Revisited
$w_r(e_{21}) \ge 0$, since $t(2)+t(1) = 2 = T$.
$w_r(e_{13}) \ge 1$, since $t(1)+t(3) = 3 > T$.
$w_r(e_{14}) \ge 1$, since $t(1)+t(4) = 3 > T$.
$w_r(e_{32}) \ge 1$, since $t(3)+t(2) = 3 > T$.
$w_r(e_{42}) \ge 1$, since $t(4)+t(2) = 3 > T$.
Use the equation $w_r(e_{uv}) = w(e_{uv}) + r(v) - r(u)$:
$w(e_{21}) + r(1) - r(2) = 1 + r(1) - r(2) \ge 0$
$w(e_{13}) + r(3) - r(1) = 1 + r(3) - r(1) \ge 1$
$w(e_{14}) + r(4) - r(1) = 2 + r(4) - r(1) \ge 1$
$w(e_{32}) + r(2) - r(3) = 0 + r(2) - r(3) \ge 1$
$w(e_{42}) + r(2) - r(4) = 0 + r(2) - r(4) \ge 1$

Solution continues
Since the retimed graph Gr remains the same if the same constant is added to all node retiming values, we can set $r(1) = 0$. The inequalities become:
$1 - r(2) \ge 0$, or $r(2) \le 1$
$1 + r(3) \ge 1$, or $r(3) \ge 0$
$2 + r(4) \ge 1$, or $r(4) \ge -1$
$r(2) - r(3) \ge 1$, or $r(3) \le r(2) - 1$
$r(2) - r(4) \ge 1$, or $r(2) \ge r(4) + 1$
Since $r(2) \ge r(3) + 1 \ge 1$ and $r(2) \le 1$, one must have $r(2) = +1$. This implies $r(3) \le 0$. But we also have $r(3) \ge 0$. Hence $r(3) = 0$.
These leave $-1 \le r(4) \le 0$. Hence the two sets of solutions are: $r(3) = 0$, $r(2) = +1$, and $r(4) = 0$ or $-1$.
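As a cross-check of the hand derivation, the five inequalities can be enumerated by brute force over a small integer range with r(1) fixed to 0. The search range of -3 to 3 and the names `w`, `need`, and `feasible` are arbitrary choices for this sketch.

```python
from itertools import product

# Edge delays w(e_uv) of the example, keyed by (u, v), and the minimum
# retimed delay required on each edge.
w = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
need = {(2, 1): 0, (1, 3): 1, (1, 4): 1, (3, 2): 1, (4, 2): 1}

def feasible(r):
    """True if w(e) + r(v) - r(u) meets the requirement on every edge."""
    return all(w[(u, v)] + r[v] - r[u] >= need[(u, v)] for (u, v) in w)

solutions = []
for r2, r3, r4 in product(range(-3, 4), repeat=3):
    r = {1: 0, 2: r2, 3: r3, 4: r4}     # r(1) is fixed to 0
    if feasible(r):
        solutions.append((r2, r3, r4))

print(solutions)  # [(1, 0, -1), (1, 0, 0)]
```

The enumeration returns the same two solutions as the slide: r(2) = 1, r(3) = 0, and r(4) = -1 or 0.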

Systematic Solutions
Given a system of inequalities: $r(i) - r(j) \le k$; $1 \le i, j \le N$.
Construct a constraint graph:
Map each $r(i)$ to node $i$.
Add a node $N+1$.
For each inequality $r(i) - r(j) \le k$, draw an edge $e_{ji}$ from node $j$ to node $i$ such that $w(e_{ji}) = k$.
Draw $N$ edges $e_{N+1,i}$ from node $N+1$ to each node $i$, with $w(e_{N+1,i}) = 0$.
The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.
If a solution exists, one solution is the one where $r(i)$ is the minimum path length from node $N+1$ to node $i$.
Shortest path algorithms: Bellman-Ford algorithm, Floyd-Warshall algorithm.
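The construction above can be sketched with a plain Bellman-Ford pass. The function name `solve_difference_constraints` and the `(i, j, k)` triple format are assumptions of this sketch. The sample constraints are the five inequalities of the retiming example, rewritten in the form r(i) - r(j) ≤ k.

```python
def solve_difference_constraints(n, constraints):
    """Solve r(i) - r(j) <= k for nodes 1..n; constraints are (i, j, k) triples.
    Returns {i: r(i)}, or None if the constraint graph has a negative cycle."""
    source = n + 1
    # One edge j -> i of weight k per inequality, plus zero-weight edges from
    # the extra node N+1 to every other node.
    edges = [(j, i, k) for i, j, k in constraints]
    edges += [(source, i, 0) for i in range(1, n + 1)]
    dist = {v: float("inf") for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):                  # the graph has n + 1 nodes: n passes
        for u, v, k in edges:
            if dist[u] + k < dist[v]:
                dist[v] = dist[u] + k
    for u, v, k in edges:               # any further improvement: negative cycle
        if dist[u] + k < dist[v]:
            return None
    return {i: dist[i] for i in range(1, n + 1)}

# r(2)-r(1) <= 1, r(1)-r(3) <= 0, r(1)-r(4) <= 1, r(3)-r(2) <= -1, r(4)-r(2) <= -1
example = [(2, 1, 1), (1, 3, 0), (1, 4, 1), (3, 2, -1), (4, 2, -1)]
r = solve_difference_constraints(4, example)
print(r)  # {1: -1, 2: 0, 3: -1, 4: -1}
```

Adding 1 to every value gives r(1) = 0, r(2) = 1, r(3) = 0, r(4) = 0, one of the two solutions found by hand. An infeasible system such as r(1) - r(2) ≤ -1 together with r(2) - r(1) ≤ 0 makes the function return None.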


Definitions
Unfolding is the process of unfolding a loop so that several iterations are unrolled into the same iteration.
Also known as: loop unrolling (in compilers for parallel programs), block processing.
Applications:
Reducing the sampling period to achieve the iteration bound (desired throughput rate) T.
Parallel (block) processing to execute several iterations concurrently.
Digit-serial or bit-serial processing.

An example
Before unfolding:
For n = 0 to N-1, y(n) = a*y(n-9) + x(n) end
Unfolding once (J = 2):
For k = 0 to N/2-1, y(2k) = a*y(2k-9) + x(2k); y(2k+1) = a*y(2k-8) + x(2k+1) end
Unfolding twice (J = 3):
For k = 0 to N/3-1, y(3k) = a*y(3k-9) + x(3k); y(3k+1) = a*y(3k-8) + x(3k+1); y(3k+2) = a*y(3k-7) + x(3k+2) end
Block processing formulation
J = 3, 9/J = 3 (an integer): X(k) = [x(3k) x(3k+1) x(3k+2)]^T, Y(k) = [y(3k) y(3k+1) y(3k+2)]^T, Y(k) = a*Y(k-3) + X(k)
J = 2, 9/J = ? (not an integer): X(k) = [x(2k) x(2k+1)]^T, Y(k) = [y(2k) y(2k+1)]^T, Y(k) = a*Y(k-?) + X(k)

Unfolding the DFG
Rewrite the algorithm formulation:
y(2k) = a*y(2k-9) + x(2k) becomes y(2k) = a*y(2(k-5)+1) + x(2k)
y(2k+1) = a*y(2k-8) + x(2k+1) becomes y(2k+1) = a*y(2(k-4)) + x(2k+1)
After J-fold unfolding, the clock period T = J Ts, where Ts is the data sampling period.
[Figure: the original DFG, labeled T = Ts, and the unfolded DFG, labeled T = J Ts.]
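A sketch that runs the original loop and the J = 2 unfolded loop side by side, assuming y(n) = 0 for n < 0 and an even number of samples. The function names are illustrative.

```python
def original(a, x):
    """y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        past = y[n - 9] if n >= 9 else 0
        y.append(a * past + x[n])
    return y

def unfolded_by_2(a, x):
    """Two samples per iteration, using the rewritten indices from the slide.
    len(x) is assumed to be even."""
    y = [0] * len(x)

    def past(m):
        return y[m] if m >= 0 else 0

    for k in range(len(x) // 2):
        y[2 * k] = a * past(2 * (k - 5) + 1) + x[2 * k]
        y[2 * k + 1] = a * past(2 * (k - 4)) + x[2 * k + 1]
    return y

x = list(range(1, 21))
print(original(2, x) == unfolded_by_2(2, x))  # True
```

In the unfolded body, the even output depends on an odd output from 5 iterations earlier, and the odd output depends on an even output from 4 iterations earlier. These are the delay counts that appear on the edges of the unfolded DFG.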

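General DFG Unfolding Method
Define: $\lfloor x \rfloor$ is the largest integer not exceeding $x$, and $a \% b$ is the remainder of $a$ divided by $b$.
Step 1. For each node $U$ in the original DFG, draw $J$ nodes $\{U_i;\ 0 \le i \le J-1\}$ in the unfolded DFG.
Step 2. For each edge from $U$ to $V$ with $w$ delays, draw $J$ edges from $U_i$ to $V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays.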
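The two steps translate almost directly into code. This is a minimal sketch in which a DFG is assumed to be a list of `(U, V, w)` edges and unfolded nodes are represented as `(name, index)` pairs; both representations are choices made here, not part of the slides.

```python
def unfold(edges, J):
    """Unfold a DFG by a factor J.
    edges: list of (U, V, w) with w delays on the edge U -> V.
    Returns a list of ((U, i), (V, j), delays) for the unfolded DFG."""
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded

# The loop of y(n) = a*y(n-9) + x(n), collapsed to one edge with 9 delays.
result = unfold([("y", "y", 9)], 2)
print(result)
# [(('y', 0), ('y', 1), 4), (('y', 1), ('y', 0), 5)]
```

Copy 1 feeds copy 0 through 5 delays and copy 0 feeds copy 1 through 4 delays, matching y(2k) = a*y(2(k-5)+1) + x(2k) and y(2k+1) = a*y(2(k-4)) + x(2k+1). The delays still total 9.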

Another DFG Unfolding Example (J = 2)
[Figure: the original DFG with nodes Q, R, S, T and edges carrying 2D and 3D, labeled T=3; the unfolded DFG with nodes Q0, Q1, R0, R1, S0, S1, T0, T1, labeled T=6; a table of i, w, and (i+w)%J.]
Step 1. Duplicate J copies of each node.
Step 2. Add all edges with 0 delay on them.
Step 3. Use the table of i, w, and (i+w)%J to figure out the edges with delays.

Properties of Unfolding
Unfolding preserves the number of registers (delays) in a DFG.
A loop with w delays in a DFG that has been unfolded J times leads to g.c.d.(w, J) loops in the unfolded DFG, each containing w/(g.c.d.(w, J)) delays and J/(g.c.d.(w, J)) copies of each node that appears in the original loop.
Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.
A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays each. Therefore, it cannot create a critical path in the J-unfolded DFG.
Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG and then J-unfolding it.
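The g.c.d. property can be checked on the simplest case. The sketch assumes the original loop is collapsed to a single node with a self-edge of w delays, so that copy i connects to copy (i+w)%J. The function name `unfolded_loops` is illustrative.

```python
from math import gcd

def unfolded_loops(w, J):
    """Unfold a single-node loop with w delays by J and return the resulting
    loops as (node_indices, total_delays) pairs."""
    seen = set()
    loops = []
    for start in range(J):
        if start in seen:
            continue
        nodes, delays, i = [], 0, start
        while i not in seen:
            seen.add(i)
            nodes.append(i)
            delays += (i + w) // J      # delays on the edge leaving copy i
            i = (i + w) % J             # copy that this edge enters
        loops.append((nodes, delays))
    return loops

print(unfolded_loops(9, 2))  # [([0, 1], 9)]
print(unfolded_loops(6, 4))  # [([0, 2], 3), ([1, 3], 3)]

# Check the stated counts over a range of w and J.
for w in range(1, 13):
    for J in range(1, 7):
        g = gcd(w, J)
        loops = unfolded_loops(w, J)
        assert len(loops) == g
        assert all(len(nodes) == J // g and d == w // g for nodes, d in loops)
```

With w = 6 and J = 4, g.c.d. = 2, so there are two loops, each with 3 delays and 2 copies of the node, as the property states.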