A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms
Santanu Dutta and Wayne Wolf
IEEE Trans. on Circuits and Systems for Video Technology (CSVT), vol. 6, no. 1, Feb. 1996
- Introduction: VLSI design phases; generic processor vs. ASIC; programmable architectures
- Architecture design: PE architecture; parallel architecture; memory bandwidth
- Data-flow design: pipeline flow; control circuit (H/W or programmable)
[Diagram labels: VLSI design phases (specification, behavior, register transfer, logic, circuit, layout); architecture block diagram (controller unit, PE array, memory, data)]
Architecture of PE
SAD PE element
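As a rough behavioral reference (not the paper's actual PE datapath), a SAD processing element can be modeled as an absolute-difference unit feeding an accumulator, consuming one current/previous pixel pair per cycle; a minimal Python sketch:

```python
class SADPE:
    """Behavioral sketch of a SAD processing element: an absolute-difference
    unit feeding an accumulator, one pixel pair per cycle (assumed model,
    not the paper's RTL)."""

    def __init__(self):
        self.acc = 0

    def cycle(self, a, b):
        # a: current-frame pixel a(.), b: previous-frame pixel b(.)
        self.acc += abs(a - b)

    def result(self):
        # After block_size**2 cycles this is the SAD for one candidate shift.
        return self.acc
```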
A data-flow design for a full-search block-matching motion estimator
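For reference, the computation that the full-search data-flow maps onto hardware; a minimal software sketch, assuming 16x16 blocks and a +/-7 search range (typical values, not taken from the figure):

```python
import numpy as np

def sad(c, r):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def full_search(cur, prev, bx, by, block=16, rng=7):
    """Exhaustively test every shift in [-rng, rng]^2 and keep the best SAD."""
    c = cur[by:by + block, bx:bx + block]
    best, best_sad = (0, 0), None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= prev.shape[0] - block and 0 <= x <= prev.shape[1] - block:
                s = sad(c, prev[y:y + block, x:x + block])
                if best_sad is None or s < best_sad:
                    best_sad, best = s, (dx, dy)
    return best, best_sad
```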
The basic ideas
- A general-purpose interconnection network whose topology supports arbitrary paths from the ME's to the PE's.
- A memory-partitioning scheme that allows the required memory accesses.
- Programmable interconnect and PE's controlled by a stored-program controller.
An abstract architectural model for the proposed motion-estimator
Interconnection Networks
- Multistage networks: Benes, crossbar, Omega, etc.
- A simple combination of multiplexers, or a direct connection between the memory and the processing elements.
- Each frame memory can be implemented as either an interleaved set of multiple banks or a single block of dual-port RAM.
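A minimal sketch of the simplest option above, where each PE selects one memory-bank output through a multiplexer driven by controller-generated select signals (the function name and widths are illustrative assumptions):

```python
def mux_interconnect(mem_outputs, selects):
    """One cycle of a multiplexer-based interconnect: PE i receives the output
    of memory bank selects[i]. mem_outputs: the bank outputs this cycle;
    selects: per-PE bank indices driven by the controller."""
    return [mem_outputs[s] for s in selects]

# Example: 4 banks feeding 4 PEs, with PE0 and PE1 both reading bank 2 (a broadcast).
pe_inputs = mux_interconnect([10, 20, 30, 40], [2, 2, 0, 3])
assert pe_inputs == [30, 30, 10, 40]
```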
Data-flow design for TSS
- Nine processors, one per candidate shift, are needed for each step.
- Each step of the TSS takes 256 cycles.
- The size and cost of a memory increase considerably with the number of ports, so computer architects and circuit designers usually restrict the # of ports to two or three.
- Using a 9-port memory to implement the TSS is therefore highly impractical.
(A software sketch of the TSS follows the figure slides below.)
Nine shifts tested in step 1 of a three-step search
Data-flow for step 1 of a three-step search procedure
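As referenced above, a minimal software sketch of the three-step search that the step-1 data-flow maps to hardware (assuming 16x16 blocks and an initial step size of 4; the hardware evaluates the nine candidate shifts in parallel PE's rather than in a loop):

```python
import numpy as np

def sad(c, r):
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def three_step_search(cur, prev, bx, by, block=16, step=4):
    """Three-step search: test the nine shifts around the current centre,
    recentre on the winner, halve the step, and repeat (3 steps for step=4)."""
    c = cur[by:by + block, bx:bx + block]
    cx, cy = bx, by
    best_sad = sad(c, prev[cy:cy + block, cx:cx + block])
    while step >= 1:
        best = (cx, cy)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                y, x = cy + dy, cx + dx
                if 0 <= y <= prev.shape[0] - block and 0 <= x <= prev.shape[1] - block:
                    s = sad(c, prev[y:y + block, x:x + block])
                    if s < best_sad:
                        best_sad, best = s, (x, y)
        cx, cy = best
        step //= 2
    return (cx - bx, cy - by), best_sad
```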
Two solutions with different memory-partitioning schemes
- Broadcasting the previous-frame data
- Broadcasting the current-frame data
Broadcasting the Previous-Frame Data
- b(4,12) is required by PE 8 in cycle 0, by PE 5 in cycle 8, by PE 1 in cycle 4, and by other PE's in other cycles.
- The memory-bandwidth problem is solved by aligning the b(.) data carefully, so that at most two different b(.) values are needed in any cycle.
- Problems:
  - The TSS can no longer be completed in 768 cycles.
  - The a(.) data are now misaligned and therefore cause memory-access conflicts.
Revised data-flow for step 1 of a three-step search procedure (1)
Revised data-flow for step 1 of a three-step search procedure (2)
Broadcasting the Previous-Frame Data
- The a(.) data are partitioned into 16 smaller memory banks.
- A multistage, 16-port interconnection network connects the banks to the PE's.
- Supplying appropriate memory bandwidth is critical to maintaining the throughput of a BM architecture.
Two different conflicts
- Memory conflicts: arise when two different a(.) values that reside in the same memory bank are needed in the same cycle.
- Path conflicts: arise in an interconnection network when one path (a connection from a source to a destination through the switches) is blocked by another existing path.
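A small sketch of how memory conflicts could be counted for a candidate schedule: collect the a(.) addresses requested in each cycle, map them to banks, and flag cycles in which one bank is asked for two different values (the bank-mapping function is an input; the paper chooses the partitioning by simulation):

```python
from collections import defaultdict

def memory_conflicts(reads, bank_of):
    """reads: list of (cycle, pe, address) read events for the a(.) memory.
    bank_of: function mapping an address to its memory bank.
    Returns {cycle: [(bank, set of distinct addresses), ...]} for every cycle
    in which some bank is asked for two or more different values."""
    per_cycle_bank = defaultdict(lambda: defaultdict(set))
    for cycle, pe, addr in reads:
        per_cycle_bank[cycle][bank_of(addr)].add(addr)
    conflicts = {}
    for cycle, banks in per_cycle_bank.items():
        bad = [(b, addrs) for b, addrs in banks.items() if len(addrs) > 1]
        if bad:
            conflicts[cycle] = bad
    return conflicts
```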
Derivation of a conflict-free schedule
- A memory-partitioning scheme and a processor-assignment scheme are first chosen, by simulating different memory-partitioning and processor-assignment schemes, so that the number of conflicts is not prohibitively large.
- Cycles that do not have conflicts are left unchanged; the ones that have conflicts are recursively broken into sub-cycles.
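A sketch of the splitting idea under a simplifying assumption: a conflicting cycle's requests are packed greedily into sub-cycles so that no bank is asked for more than one distinct value per sub-cycle (the actual scheduler must also respect path conflicts and pipeline timing):

```python
def split_cycle(reads, bank_of):
    """reads: list of (pe, address) requests issued in one conflicting cycle.
    Greedily pack the requests into sub-cycles so that each bank serves at
    most one distinct address per sub-cycle (identical addresses may share a
    sub-cycle, since they can be broadcast)."""
    sub_cycles = []
    for pe, addr in reads:
        placed = False
        for sub in sub_cycles:
            bank = bank_of(addr)
            clash = any(bank_of(a) == bank and a != addr for _, a in sub)
            if not clash:
                sub.append((pe, addr))
                placed = True
                break
        if not placed:
            sub_cycles.append([(pe, addr)])
    return sub_cycles
```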
Motion estimator architecture: broadcasting previous-frame data
Broadcasting the Current-Frame Data
- Implements the original TSS data-flow.
- a(.) is broadcast; b(.) is partitioned into 16 memory banks.
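One way to express this partitioning in software, purely for illustration (the bank mapping below is hypothetical; the paper selects the actual assignment by simulation): a(.) travels on a broadcast bus visible to all PE's, while b(.) addresses are spread across 16 banks.

```python
def b_bank(x, y, banks=16):
    """Hypothetical interleaved assignment of previous-frame pixels b(x, y)
    to one of 16 banks; the paper's actual partitioning is chosen by simulation."""
    return (x + 4 * y) % banks

def broadcast_cycle(a_value, b_requests, b_mem):
    """One cycle of the current-frame-broadcast scheme: every PE sees the same
    broadcast a(.) value, and each PE's b(.) operand must come from a distinct
    bank (or be an identical, broadcastable address) to avoid a memory conflict."""
    served = {}
    for (x, y) in b_requests:
        bank = b_bank(x, y)
        if served.setdefault(bank, (x, y)) != (x, y):
            raise RuntimeError("bank conflict: one bank asked for two b(.) values")
    return [(a_value, b_mem[(x, y)]) for (x, y) in b_requests]
```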
Motion estimator architecture: broadcasting current-frame data
Performance of the motion estimator
The simulator takes as input:
- A data-flow description of a BMA, specifying the # of PE's and the ideal flow of the pixel data.
- A memory configuration, specifying the # of ME's and the # of memory ports.
- A network characterization, specifying the topology of the interconnection network between the PE's and the ME's.
- The pipelining information, specifying the number of pipeline stages in each PE and in the network.
The simulator determines the network-path and memory-access conflicts.
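The four inputs map naturally onto a small configuration record; a hypothetical sketch of how such a simulator might be parameterized (field names are assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SimulatorConfig:
    # Data-flow description of the BMA
    num_pes: int                                   # number of PE's
    dataflow: List[Tuple[int, int, tuple, tuple]]  # (cycle, pe, a_addr, b_addr)
    # Memory configuration
    num_mes: int                                   # number of ME's (memory banks)
    ports_per_me: int                              # number of ports on each ME
    # Network characterization
    network: str                                   # e.g. "crossbar", "omega", "benes"
    # Pipelining information
    pe_stages: int                                 # pipeline stages in each PE
    network_stages: int                            # pipeline stages in the network
```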
Interconnection networks
- Completely connected network: N² crosspoint switches are needed in a single-stage crossbar.
- N-port (N in, N out) multistage networks: it may not be possible to remove all path conflicts.
- Generalized Cube and Omega: N-port networks with log₂N stages of N/2 switches each.
- Benes: 2 log₂N − 1 switch stages.
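To make the path-conflict point concrete, a sketch of destination-tag routing on an Omega network that reports the stages where two requested paths would need the same link (a standard Omega-network property, not code from the paper):

```python
def omega_conflicts(pairs, n_bits):
    """pairs: (source, destination) requests on an N = 2**n_bits Omega network
    with destination-tag routing. Returns (stage, path1, path2) tuples for
    every stage at which two different paths would need the same output link."""
    conflicts, links = [], {}
    for s, d in pairs:
        for stage in range(1, n_bits + 1):
            # Link label after `stage` stages: the low (n_bits - stage) bits of
            # the source followed by the high `stage` bits of the destination.
            label = ((s & ((1 << (n_bits - stage)) - 1)) << stage) | (d >> (n_bits - stage))
            key = (stage, label)
            if key in links and links[key] != (s, d):
                conflicts.append((stage, links[key], (s, d)))
            else:
                links.setdefault(key, (s, d))
    return conflicts

# Example: on an 8-port Omega network, paths 0 -> 0 and 4 -> 1 collide in stage 1.
print(omega_conflicts([(0, 0), (4, 1)], n_bits=3))
```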
Memory-partitioning scheme without data duplication
Memory-partitioning scheme with data duplication
Simulation results for different networking and memory-partitioning schemes
Simulation results for different pixel distributions
Data-flow for step 1 of the conjugate-direction search
Data-flow for step 2 of the conjugate-direction search
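For context, a simplified one-at-a-time variant of the conjugate-direction search, sketched in software: it walks along x while the SAD improves, then along y, corresponding roughly to steps 1 and 2 in the figures (block size and range are assumed values; the figures' exact data-flow is not reproduced here):

```python
import numpy as np

def sad(c, r):
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def one_at_a_time_search(cur, prev, bx, by, block=16, rng=7):
    """Simplified conjugate-direction search: move one pixel at a time along x
    while the SAD keeps improving, then along y."""
    c = cur[by:by + block, bx:bx + block]
    dx, dy = 0, 0
    best = sad(c, prev[by:by + block, bx:bx + block])

    def try_shift(ddx, ddy):
        nonlocal dx, dy, best
        x, y = bx + dx + ddx, by + dy + ddy
        if 0 <= x <= prev.shape[1] - block and 0 <= y <= prev.shape[0] - block \
                and abs(dx + ddx) <= rng and abs(dy + ddy) <= rng:
            s = sad(c, prev[y:y + block, x:x + block])
            if s < best:
                dx, dy, best = dx + ddx, dy + ddy, s
                return True
        return False

    # Step 1: horizontal direction
    while try_shift(+1, 0) or try_shift(-1, 0):
        pass
    # Step 2: vertical direction
    while try_shift(0, +1) or try_shift(0, -1):
        pass
    return (dx, dy), best
```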
Conclusions
- An engine that can be adapted to multiple motion-estimation algorithms.