A systolic array for a 2D-FIR filter for image processing


A Systolic Array for a 2D-FIR Filter for Image Processing
Sebastian Siegel, ECE 734

Outline
- Why Systolic Arrays (SA)?
- Design Issues
- Approach
- Solution
- Result

Why Systolic Arrays? (1)
The 2D-FIR filter is a 4-level nested do-loop:
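The 4-level loop nest referred to here can be sketched as follows (image size, filter size, and "valid" border handling are illustrative assumptions, not from the slides):

```python
# 2D-FIR filtering written as a 4-level nested do-loop.
def fir2d(x, w):
    """Filter image x (list of lists) with kernel w; 'valid' borders."""
    n, k = len(x), len(w)
    m = n - k + 1                       # valid output size
    y = [[0.0] * m for _ in range(m)]
    for i in range(m):                  # output row
        for j in range(m):              # output column
            for a in range(k):          # filter row
                for b in range(k):      # filter column
                    y[i][j] += w[a][b] * x[i + a][j + b]
    return y
```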

Why Systolic Arrays? (2)
Sequential execution on a single MAC unit takes too long. Example: a 512x512 image with a 3x3 filter requires about 2.3 million MAC operations; at 10 MHz (one MAC per cycle) that is roughly 0.23 s.
Because the algorithm has a nested do-loop structure, it can be rewritten in Single Assignment Format, which makes parallel execution possible.
A systematic design approach is preferable to "rocket science".
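The operation count on this slide can be reproduced with a quick back-of-the-envelope calculation (one MAC per output pixel per filter tap, one MAC per clock cycle):

```python
# Sequential cost of the 2D FIR, using the numbers from the slide.
N, K, f_clk = 512, 3, 10e6       # image size, filter size, clock rate (Hz)
macs = N * N * K * K             # ~2.36 million MAC operations
t = macs / f_clk                 # execution time at one MAC per cycle
print(macs, t)                   # ~0.236 s, matching the slide's 0.23 s
```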

Design Issues
Recall:
- Avoid multiple accesses to the same data by pipelining it
- Minimize execution time and register count
- Maximize utilization of the Processing Elements (PEs)

Approach (1)
Steps:
1. Rewrite the algorithm in Single Assignment Format (SAF)
2. Draw and examine the Dependence Graph (DG)
3. Map the DG to an SA by generating suitable solutions and choosing an optimal one
Problem: the SA is too big, so it must be partitioned; data must then be re-accessed, or a cache is needed.
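In Single Assignment Format every variable instance is written exactly once, so the running accumulator of the nested loop becomes an indexed recurrence over partial sums. A sketch of step 1 (the indexing scheme is an illustrative assumption; the slides do not give one):

```python
# 2D FIR in single-assignment style: each partial sum s[i, j, t]
# is written exactly once, instead of updating one accumulator in place.
def fir2d_saf(x, w):
    n, k = len(x), len(w)
    m = n - k + 1
    taps = [(a, b) for a in range(k) for b in range(k)]
    s = {}                                   # s[i, j, t]: sum of first t+1 taps
    for i in range(m):
        for j in range(m):
            for t, (a, b) in enumerate(taps):
                prev = s[i, j, t - 1] if t > 0 else 0.0
                s[i, j, t] = prev + w[a][b] * x[i + a][j + b]
    return [[s[i, j, len(taps) - 1] for j in range(m)] for i in range(m)]
```

Making every write unique is what allows each assignment to be placed at its own node of the dependence graph in step 2.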

Approach (2)
Partitioning the DG generates even more (and better) solutions:
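One common way to partition such a dependence graph is to tile the output loop nest into blocks small enough for the physical array; a sketch (the tile size is an illustrative assumption):

```python
# Same 2D FIR, with the output loops partitioned into bs x bs tiles,
# as one would do when the dependence graph exceeds the array size.
def fir2d_tiled(x, w, bs=2):
    n, k = len(x), len(w)
    m = n - k + 1
    y = [[0.0] * m for _ in range(m)]
    for i0 in range(0, m, bs):                       # tile row
        for j0 in range(0, m, bs):                   # tile column
            for i in range(i0, min(i0 + bs, m)):     # rows inside the tile
                for j in range(j0, min(j0 + bs, m)):
                    y[i][j] = sum(w[a][b] * x[i + a][j + b]
                                  for a in range(k) for b in range(k))
    return y
```

Each tile reuses its input window locally, which is where the cache-size versus data-reuse trade-off mentioned above comes from.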

Solution

Result
- Fully pipelined SA
- 100% PE utilization
- The SA can be partitioned either with a relatively small cache and 100% data reuse, or without a cache and still high data reuse
- The PEs and their interconnections (number of registers per pipeline) are independent of the filter size
- Low latency for the results
- Constant I/O rate
- Fast MATLAB® implementation