A systolic array for a 2D-FIR filter for image processing


A Systolic Array for a 2D-FIR Filter for Image Processing
Sebastian Siegel, ECE 734

Outline
- Why Systolic Arrays (SA)?
- Design Issues
- Approach
- Solution
- Result

Why Systolic Arrays? (1)
The 2D-FIR filter is a 4-level nested do-loop:
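The 4-level loop nest referred to here can be sketched as follows (image size, filter size, and "valid" border handling are illustrative assumptions, not from the slides):

```python
# 2D-FIR filtering written as a 4-level nested do-loop.
def fir2d(x, w):
    """Filter image x (list of lists) with kernel w; 'valid' borders."""
    n, k = len(x), len(w)
    m = n - k + 1                       # valid output size
    y = [[0.0] * m for _ in range(m)]
    for i in range(m):                  # output row
        for j in range(m):              # output column
            for a in range(k):          # filter row
                for b in range(k):      # filter column
                    y[i][j] += w[a][b] * x[i + a][j + b]
    return y
```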

Why Systolic Arrays? (2)
Sequential execution on a single MAC unit takes too long. Example: a 512x512 image with a 3x3 filter requires about 2.3 million MAC operations; at 10 MHz (one MAC per cycle) that is roughly 0.23 s.
Because the algorithm has a nested do-loop structure, it can be rewritten in Single Assignment Format, which makes parallel execution possible.
A systematic design approach is preferable to "rocket science".
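The operation count on this slide can be reproduced with a quick back-of-the-envelope calculation (one MAC per output pixel per filter tap, one MAC per clock cycle):

```python
# Sequential cost of the 2D FIR, using the numbers from the slide.
N, K, f_clk = 512, 3, 10e6       # image size, filter size, clock rate (Hz)
macs = N * N * K * K             # ~2.36 million MAC operations
t = macs / f_clk                 # execution time at one MAC per cycle
print(macs, t)                   # ~0.236 s, matching the slide's 0.23 s
```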

Design Issues
Recall:
- Avoid multiple accesses to the same data by pipelining it
- Minimize execution time and register count
- Maximize utilization of the Processing Elements (PEs)

Approach (1)
Steps:
1. Rewrite the algorithm in Single Assignment Format (SAF)
2. Draw and examine the Dependence Graph (DG)
3. Map the DG to an SA by generating suitable solutions and choosing an optimal one
Problem: the SA is too big, so it must be partitioned; data must then be re-accessed, or a cache is needed.
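In Single Assignment Format every variable instance is written exactly once, so the running accumulator of the nested loop becomes an indexed recurrence over partial sums. A sketch of step 1 (the indexing scheme is an illustrative assumption; the slides do not give one):

```python
# 2D FIR in single-assignment style: each partial sum s[i, j, t]
# is written exactly once, instead of updating one accumulator in place.
def fir2d_saf(x, w):
    n, k = len(x), len(w)
    m = n - k + 1
    taps = [(a, b) for a in range(k) for b in range(k)]
    s = {}                                   # s[i, j, t]: sum of first t+1 taps
    for i in range(m):
        for j in range(m):
            for t, (a, b) in enumerate(taps):
                prev = s[i, j, t - 1] if t > 0 else 0.0
                s[i, j, t] = prev + w[a][b] * x[i + a][j + b]
    return [[s[i, j, len(taps) - 1] for j in range(m)] for i in range(m)]
```

Making every write unique is what allows each assignment to be placed at its own node of the dependence graph in step 2.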

Approach (2)
Partitioning the DG generates even more (and better) solutions:
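One common way to partition such a dependence graph is to tile the output loop nest into blocks small enough for the physical array; a sketch (the tile size is an illustrative assumption):

```python
# Same 2D FIR, with the output loops partitioned into bs x bs tiles,
# as one would do when the dependence graph exceeds the array size.
def fir2d_tiled(x, w, bs=2):
    n, k = len(x), len(w)
    m = n - k + 1
    y = [[0.0] * m for _ in range(m)]
    for i0 in range(0, m, bs):                       # tile row
        for j0 in range(0, m, bs):                   # tile column
            for i in range(i0, min(i0 + bs, m)):     # rows inside the tile
                for j in range(j0, min(j0 + bs, m)):
                    y[i][j] = sum(w[a][b] * x[i + a][j + b]
                                  for a in range(k) for b in range(k))
    return y
```

Each tile reuses its input window locally, which is where the cache-size versus data-reuse trade-off mentioned above comes from.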

Solution

Result
- Fully pipelined SA
- 100% PE utilization
- The SA can be partitioned either with a relatively small cache and 100% data reuse, or without a cache and still high data reuse
- The PEs and their interconnections (number of registers per pipeline) are independent of the filter size
- Low latency for the results
- Constant I/O rate
- Fast MATLAB® implementation