Frame-Level Pipelined Motion Estimation Array Processor. Surin Kittitornkun and Yu Hen Hu, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 11, No. 2, Feb. 2001.

OUTLINE
1. Methodology for VLSI Array Processor Design
2. An Example: Frame-Level Block Matching Algorithm

Design levels: Sequential Algorithm → 1. DG Design → 2. SFG Design → 3. VLSI Array Design

Dependence Graph (DG)

DG: 1. Shift invariance (shift-invariant vs. shift-variant DGs). Example: DG for a sorting algorithm:

    for i = 1 to N
        for j = 1 to i
            m(i+1, j) <- max[x(i, j), m(i, j)]
            x(i, j+1) <- min[x(i, j), m(i, j)]
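To see what the sorting DG computes, the sketch below evaluates its nodes in loop order. Two boundary conditions are assumptions, since the slide does not state them: x(i, 1) is the i-th input value, and m(i, i) is an empty slot modelled as -inf. Under these assumptions row i inserts the i-th input into the list m(i, 1..i-1), so m(N+1, 1..N) comes out sorted, largest first. The function name `dg_sort` is hypothetical.

```python
import math

def dg_sort(values):
    """Evaluate the sorting DG node by node: row i = 1..N, column j = 1..i.

    Assumptions (not from the slide): x(i, 1) is the i-th input value and the
    boundary value m(i, i) is an empty slot, modelled as -inf.
    """
    n = len(values)
    m = [-math.inf] * (n + 1)        # m[j] holds m(i, j); index 0 unused
    for i in range(1, n + 1):
        x = values[i - 1]            # x(i, 1)
        for j in range(1, i + 1):
            # m(i+1, j) <- max[x(i, j), m(i, j)];  x(i, j+1) <- min[x(i, j), m(i, j)]
            m[j], x = max(x, m[j]), min(x, m[j])
        # x(i, i+1) = min(x(i, i), -inf) = -inf, so it is discarded
    return m[1:]                     # m(N+1, 1..N): sorted, largest first
```

For example, `dg_sort([3, 1, 2])` returns `[3, 2, 1]`. Every node has the same two outgoing arcs, (1, 0) for m and (0, 1) for x, which is the property the projection step relies on.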

DG: 2. Localization: broadcast vs. transmittent data
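As an illustration of localization, the sketch below uses matrix-vector multiplication, which is an example chosen here and not taken from the slide. In the broadcast form every node (i, j) reads x[j] directly; in the localized form node (i, j) receives its copy of x[j] from node (i-1, j), so the data is transmitted node to node and no node needs a global wire. Both functions are hypothetical names and compute the same result.

```python
def matvec_broadcast(a, x):
    # Every node (i, j) reads x[j] directly: x[j] is broadcast to all rows i.
    n = len(x)
    return [sum(a[i][j] * x[j] for j in range(n)) for i in range(len(a))]

def matvec_localized(a, x):
    # Localized DG: node (i, j) receives x[j] from node (i-1, j) (transmittent
    # data) and the partial sum from node (i, j-1); only row 0 sees the inputs.
    rows, n = len(a), len(x)
    xs = [[0] * n for _ in range(rows)]   # xs[i][j]: copy of x[j] held at node (i, j)
    y = [0] * rows
    for i in range(rows):
        acc = 0                           # y(i, 0) = 0
        for j in range(n):
            xs[i][j] = x[j] if i == 0 else xs[i - 1][j]   # x travels along i
            acc += a[i][j] * xs[i][j]                     # y(i, j+1) = y(i, j) + a_ij x_j
        y[i] = acc
    return y
```

With `a = [[1, 2], [3, 4]]` and `x = [5, 6]` both return `[17, 39]`; localization changes the communication pattern, not the result.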

DG: 3. Reversible arcs for associative operations. If the operation used in the recursion is associative, the directions of the arcs that carry the recursion may be reversed.
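A small sketch of why associativity is the condition: a forward recursion accumulates s(j) = op(s(j-1), v_j), while reversing the arcs accumulates from the other end as op(v_j, s(j+1)) with the operand order kept. The two agree for any associative operation, even a non-commutative one such as string concatenation, and generally disagree otherwise (subtraction). The function names are hypothetical.

```python
from functools import reduce

def forward_arcs(op, values):
    # Recursion runs j = 1..n: s(j) = op(s(j-1), v_j)
    return reduce(op, values)

def reversed_arcs(op, values):
    # Arcs reversed: recursion runs j = n..1: s(j) = op(v_j, s(j+1))
    acc = values[-1]
    for v in reversed(values[:-1]):
        acc = op(v, acc)
    return acc
```

For concatenation both directions give `"abc"` from `["a", "b", "c"]`; for subtraction on `[10, 3, 2]` the forward form gives `(10 - 3) - 2 = 5` but the reversed form gives `10 - (3 - 2) = 9`.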

DG: 4. Localization with intermediate variables involved: AR filtering algorithm

DG: 4. Localization with intermediate variables involved: AR filtering algorithm
- Spiral communication approach
- Local communication approach
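As a rough illustration only, assuming the AR model y(n) = x(n) + Σ_{k=1..p} a_k y(n-k) with zero initial conditions (the slide's exact formulation, and the details of its spiral and local schemes, are not given here), the sketch contrasts a direct evaluation, where each new output is needed by all p taps, with a form where intermediate variables y(n-k) sit in per-tap registers and shift one tap per step. This is a generic delay-line localization, not necessarily the scheme on the slide; the names are hypothetical.

```python
def ar_direct(a, x):
    # y(n) = x(n) + sum_{k=1..p} a[k-1] * y(n-k), with y(n) = 0 for n < 0
    p, y = len(a), []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * y[n - k]
        y.append(acc)
    return y

def ar_local(a, x):
    # Intermediate variables: tap k keeps its own register d[k-1] = y(n-k).
    # Each new output enters tap 1 only and moves one tap per step, instead of
    # being delivered to all p taps at once.
    p = len(a)
    d = [0] * p
    y = []
    for xn in x:
        acc = xn
        for k in range(p):           # partial sum passes from tap to tap
            acc += a[k] * d[k]
        y.append(acc)
        d = [acc] + d[:-1]           # shift y(n) into the delay line
    return y
```

With `a = [2, -1]` and `x = [1, 2, 3, 4]` both return `[1, 4, 10, 20]`.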

Signal Flow Graph (SFG). [Figure: SFG node with inputs Input(1), Input(2) and outputs Output(1), Output(2); a delay edge D maps x(n) to x(n-1).]

SFG projection procedure:
1. Choose a projection direction. The processor space is the subspace orthogonal to the projection direction, so all DG nodes on the same projection line map to one processor.
2. Replace the arcs in the DG with zero-delay or nonzero-delay edges between their corresponding processors.
3. Attach the input and output data to their corresponding processors.
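The procedure can be sketched as a linear projection in the style of S. Y. Kung's method. The schedule vector s is an addition not named on the slide: node n runs on processor p·n at time s·n, and a DG arc e becomes an SFG edge spanning p·e processors with s·e delays. The sketch assumes a 2-D DG projected onto a 1-D array; `project_dg` and its parameters are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_dg(nodes, arcs, d, s, p):
    """Linear projection of a 2-D DG onto a 1-D processor array (sketch).

    d: projection direction; s: schedule vector; p: processor-space basis
    vector, orthogonal to d.  Returns node -> (processor, time) and
    arc name -> (processor offset, number of delays).
    """
    if dot(p, d) != 0:
        raise ValueError("processor space must be orthogonal to d")
    if dot(s, d) <= 0:
        raise ValueError("nodes projected onto one processor need distinct times")
    edges = {}
    for name, e in arcs.items():
        if dot(s, e) < 0:
            raise ValueError(f"arc {name} would point backwards in time")
        edges[name] = (dot(p, e), dot(s, e))      # zero or nonzero delay edge
    placement = {n: (dot(p, n), dot(s, n)) for n in nodes}
    return placement, edges
```

Applied to the sorting DG with arcs m = (1, 0) and x = (0, 1), projecting along d = (1, 0) with p = (0, 1) and s = (1, 1) gives `{"m": (0, 1), "x": (1, 1)}`: m stays in its processor behind one delay, and x moves to the neighbouring processor with one delay, i.e. a linear array of N processors.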

Projection example. [Figure: projecting the sorting DG with different projection choices yields insertion sorting, selection sorting, and bubble sorting arrays.]

SFG to systolic array:
1. Replace each operation node with a processing element (PE).
2. Implement the delays on the edges and at the input/output pins with delay units (registers).
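A cycle-level sketch of the result for the sorting example, assuming the projection where each PE keeps one stored value and passes the smaller value to its right neighbour through a one-cycle register. Inputs enter PE 0 one per clock, and -inf is fed after the last input so the pipeline can drain. This models only the register-transfer behaviour, not any particular circuit; `systolic_sort` is a hypothetical name.

```python
import math

def systolic_sort(values):
    """Cycle-level model of a linear array of len(values) PEs (sketch)."""
    n = len(values)
    empty = -math.inf
    stored = [empty] * n        # register inside each PE (the m value)
    link = [empty] * n          # link[j]: delay register on the edge PE j -> PE j+1
    for t in range(2 * n):      # enough clock cycles to drain the pipeline
        nxt = [empty] * n
        for j in range(n):
            # PE 0 reads the input pin; PE j reads PE j-1's register from last cycle
            x = (values[t] if t < n else empty) if j == 0 else link[j - 1]
            stored[j], nxt[j] = max(x, stored[j]), min(x, stored[j])
        link = nxt              # clock edge: all link registers update together
    return stored
```

After 2N cycles PE j holds the (j+1)-th largest input, e.g. `systolic_sort([4, 8, 1, 6, 5])` returns `[8, 6, 5, 4, 1]`.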

An Example: Frame-Level Block Matching Algorithm

Six-level nested DO-loop formulation of full-search block matching (FSBM)
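A sketch of the six nested loops, under assumptions not spelled out on the slide: the matching criterion is the sum of absolute differences (SAD), displacements range over [-p, p] in each direction, both frames have the same size, and candidate blocks falling outside the reference frame are skipped. The loop order (block row v, block column h, displacement, pixel row i, pixel column j) is one common arrangement; `fsbm_six_loop` is a hypothetical name.

```python
def fsbm_six_loop(cur, ref, N, p):
    """Full-search block matching with six nested loops (SAD criterion).

    cur, ref: same-size 2-D lists (rows x cols). N: block size.
    p: search range, displacements -p..p. Returns {(v, h): (dy, dx)}.
    """
    rows, cols = len(cur), len(cur[0])
    Nv, Nh = rows // N, cols // N
    mv = {}
    for v in range(Nv):                          # loop 1: block row
        for h in range(Nh):                      # loop 2: block column
            best = None
            for dy in range(-p, p + 1):          # loop 3: vertical displacement
                for dx in range(-p, p + 1):      # loop 4: horizontal displacement
                    top, left = v * N + dy, h * N + dx
                    if top < 0 or left < 0 or top + N > rows or left + N > cols:
                        continue                 # candidate outside the frame
                    sad = 0
                    for i in range(N):           # loop 5: pixel row in block
                        for j in range(N):       # loop 6: pixel column in block
                            sad += abs(cur[v * N + i][h * N + j] - ref[top + i][left + j])
                    if best is None or sad < best[0]:
                        best = (sad, (dy, dx))
            mv[(v, h)] = best[1]                 # (0, 0) is always a valid candidate
    return mv
```

The innermost work is N² absolute differences per candidate, and there are (2p + 1)² candidates per block, which is why the array maps the displacement loops onto PEs.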

Two-level nested DO-loop formulation of FSBM
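One way to read the two-level form, offered as an assumption rather than the paper's exact loop: the outer level visits every current-frame pixel once, scanning block row v, block column h, pixel row i, pixel column j in lexicographic order, and the inner level updates one SAD accumulator per displacement (dy, dx). The (2p + 1)² inner updates are independent, so an array could perform them in parallel. Same assumptions as a plain FSBM sketch: SAD criterion, range [-p, p], equal frame sizes, out-of-frame candidates rejected. `fsbm_two_loop` is hypothetical.

```python
from itertools import product

def fsbm_two_loop(cur, ref, N, p):
    """FSBM with the loops collapsed into two levels (sketch)."""
    rows, cols = len(cur), len(cur[0])
    Nv, Nh = rows // N, cols // N
    disps = list(product(range(-p, p + 1), repeat=2))   # (dy, dx), dy outer
    mv, sad = {}, {}
    # Level 1: one step per current-frame pixel, (v, h, i, j) in lexicographic order
    for v, h, i, j in product(range(Nv), range(Nh), range(N), range(N)):
        if i == 0 and j == 0:
            sad = {d: 0 for d in disps}           # start of a new block
        x = cur[v * N + i][h * N + j]
        # Level 2: all (2p+1)^2 displacements, independent of one another
        for d in disps:
            r, c = v * N + i + d[0], h * N + j + d[1]
            if sad[d] is None or not (0 <= r < rows and 0 <= c < cols):
                sad[d] = None                     # candidate leaves the frame
            else:
                sad[d] += abs(x - ref[r][c])
        if i == N - 1 and j == N - 1:             # block finished: pick the minimum
            valid = [d for d in disps if sad[d] is not None]
            mv[(v, h)] = min(valid, key=lambda d: sad[d])
    return mv
```

Each outer step consumes exactly one current-frame pixel, so the outer loop runs N_v N_h N² times.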

The k-th clock cycle corresponds to pixel (i, j) of block (v, h): $k = (v-1)N_h N^2 + (h-1)N^2 + (i-1)N + j$.
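The clock-cycle formula is a mixed-radix (row-major) index over (v, h, i, j), all 1-based. The sketch below implements it together with an inverse, which is an addition for checking that every cycle maps to exactly one pixel. The function names are hypothetical.

```python
def clock_cycle(v, h, i, j, N, Nh):
    """k = (v-1) Nh N^2 + (h-1) N^2 + (i-1) N + j, with 1-based v, h, i, j."""
    return (v - 1) * Nh * N * N + (h - 1) * N * N + (i - 1) * N + j

def pixel_of_cycle(k, N, Nh):
    """Inverse mapping: recover (v, h, i, j) from the 1-based cycle index k."""
    r = k - 1
    v, r = divmod(r, Nh * N * N)
    h, r = divmod(r, N * N)
    i, j = divmod(r, N)
    return v + 1, h + 1, i + 1, j + 1
```

With N_v = 3, N_h = 2 and N = 2 (the sizes used for the localized DG of row 1), the cycles run from k = 1 for (1, 1, 1, 1) to k = 24 for (3, 2, 2, 2).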

2-D localized DG of row 1 ($v = 1$). [Figure: search-area and current-frame coordinates for $N_v = 3$, $N_h = 2$, $p = N/2 = 1$, with the $2p + 1$ search range marked.]

Linear SFG of $(2p + 1)^2$ PEs for $p = N/2 = 1$ (i.e., 9 PEs), obtained by systolic mapping of the 2-D DG.

Systolic array with spiral interconnections

Microarchitecture of PE

Scheduled search area data

Performance