Wavelet “Block-Processing” for Reduced Memory Transfers


Wavelet “Block-Processing” for Reduced Memory Transfers
MAPLD 2005 Conference Presentation
William Turri (wturri@systranfederal.com)
Ken Simone (kcsim07@yahoo.com)
Systran Federal Corp.
4027 Colonel Glenn Highway, Suite 210, Dayton, OH 45431-1672
937-429-9008 x104

Research Goals
- Develop, test, and implement an efficient wavelet transform algorithm for fast hardware compression of SAR images
- The algorithm should make optimal use of available memory and minimize the number of memory access operations required to transform an image

Wavelet Transform Background
[Figure: original image → row-transformed image → wavelet-transformed image, separated into low-frequency (scaling) coefficients and high-frequency (wavelet) coefficients]

Multiple Resolution Levels
[Figure: wavelet-transformed images at MR-Level = 1, MR-Level = 2, and MR-Level = 3]

Results of Applying Wavelet Transform
[Figure: transformed image layout with scaling coefficients in the corner and wavelet coefficients for MR-Levels 1, 2, and 3]

Preliminary Investigation
- Prior work has used the Integer Haar Wavelet Transform
- This wavelet is computationally simple: each filter (low-pass and high-pass) has only two taps
- Haar does not provide a sharp separation between high and low frequencies
- More complex wavelets generally provide better quality, at the cost of increased computational complexity
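
The two-tap Haar filters above can be written as an integer lifting pair (the reversible "S-transform" form). The sketch below is illustrative only; function names are mine, not from the presentation, and a real design would implement this as a fixed-point datapath in HDL.

```python
def haar_forward(x):
    """One level of the integer Haar (S) transform on an even-length list.

    Each pair (x[2n], x[2n+1]) yields a detail d (two-tap high-pass
    difference) and an approximation s (two-tap truncated mean), both
    integers, so the transform is exactly reversible.
    """
    s, d = [], []
    for n in range(0, len(x), 2):
        diff = x[n + 1] - x[n]      # high-pass: pairwise difference
        avg = x[n] + (diff >> 1)    # low-pass: floor of the pair mean
        d.append(diff)
        s.append(avg)
    return s, d

def haar_inverse(s, d):
    """Exactly undo haar_forward, recovering the original integers."""
    x = []
    for avg, diff in zip(s, d):
        a = avg - (diff >> 1)
        x.extend([a, a + diff])
    return x
```

The arithmetic shift (`>> 1`) floors the mean, which is what makes the integer transform lossless: the inverse subtracts exactly the same floored term.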

Standard Memory Requirements
- The most basic implementation of the transform requires that all rows be transformed by the wavelet filters before the columns
- This approach requires an intermediate storage area, most likely SRAM or SDRAM in a hardware implementation
- These redundant memory access operations greatly reduce the performance of the overall implementation
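
To make the redundant traffic concrete, here is an illustrative software model of the standard row-then-column scheme using the integer Haar filters; the `inter` array stands in for the external SRAM/SDRAM buffer. All names and the even-dimension assumption are mine, for illustration only.

```python
def separable_transform(img):
    """Row-then-column integer Haar on a 2-D list (even dimensions assumed).

    The row pass writes a full intermediate image that the column pass
    must then read back -- exactly the redundant memory traffic the
    block-based approach on the next slides is designed to remove.
    """
    h, w = len(img), len(img[0])
    inter = [[0] * w for _ in range(h)]   # stands in for external SRAM/SDRAM
    # Row pass: approximations to the left half, details to the right half.
    for r in range(h):
        for n in range(w // 2):
            d = img[r][2 * n + 1] - img[r][2 * n]
            inter[r][n] = img[r][2 * n] + (d >> 1)
            inter[r][n + w // 2] = d
    # Column pass: must read the intermediate image back from memory.
    out = [[0] * w for _ in range(h)]
    for c in range(w):
        for n in range(h // 2):
            d = inter[2 * n + 1][c] - inter[2 * n][c]
            out[n][c] = inter[2 * n][c] + (d >> 1)
            out[n + h // 2][c] = d
    return out
```

Every element of `inter` is written once and read once; in hardware both of those operations cross the memory interface.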

Standard Memory Requirements
[Figure: original image → intermediate image (L | H halves) → transformed image (LL, LH, HL, HH quadrants); the intermediate image creates redundant memory access…]

Block-Based Approach
- This approach processes rows and columns together, in a single operation, eliminating the need for the intermediate storage area
- All memory writes to this intermediate area are eliminated
- All memory reads from this intermediate area are eliminated
- Performance is increased considerably

Block-Based Processing (1) Standard transform operations can be algebraically simplified…

Block-Based Processing (2)
[Figure: the four output coefficients LL, HL, LH, HH]
…to produce four fully transformed coefficients.
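
The algebraic simplification can be illustrated for the integer Haar case: folding the row and column lifting steps together maps one 2×2 input block directly to its four output coefficients, with no intermediate image. A sketch under my own naming (not the presentation's notation):

```python
def haar_block(a, b, c, d):
    """Transform one 2x2 block (a b / c d) straight to (LL, HL, LH, HH).

    The row and column Haar lifting steps are composed algebraically,
    so only the four input pixels are ever read -- nothing is written
    to an intermediate row-transformed image.
    """
    dt = b - a                  # top-row detail
    st = a + (dt >> 1)          # top-row approximation
    db = d - c                  # bottom-row detail
    sb = c + (db >> 1)          # bottom-row approximation
    lh = sb - st                # column detail of the approximations
    ll = st + (lh >> 1)         # fully smoothed coefficient
    hh = db - dt                # column detail of the details
    hl = dt + (hh >> 1)         # row detail, column-smoothed
    return ll, hl, lh, hh
```

On the block (1 3 / 5 7) this returns the same coefficients as the row-then-column model, confirming that only the memory traffic, not the result, changes.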

Other Wavelets?
- The Integer Haar wavelet is easy to reduce algebraically: only 4 pixels need to be read into the processor, managed, and transformed
- SFC’s actual SAR compression solution uses the more complex 5/3 wavelet transform, which was found experimentally to preserve more quality when used to compress SAR images
- The 5/3 transform requires 5 pixels per row/column operation, and would require 25 pixels to be fetched, managed, and transformed
- Reducing the 5/3 to a “block processing” approach incurs significant overhead for tracking the current location within the image
- It is more feasible to seek a pipelined solution than to apply the “block processing” approach; pipelining reduces the inefficiencies introduced when managing intermediate transform data
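
For reference, the integer 5/3 transform can be sketched in lifting form. This follows the JPEG 2000 reversible (LeGall 5/3) convention with symmetric boundary extension; SFC's actual implementation may differ in boundary handling, and the function names are mine.

```python
def cdf53_forward(x):
    """One level of the reversible integer 5/3 (LeGall) lifting transform.

    Predict step: each detail needs three input samples (x[2k], x[2k+1],
    x[2k+2]); symmetric extension folds the out-of-range index back.
    Update step: each approximation needs the two neighboring details.
    """
    n = len(x)
    assert n >= 4 and n % 2 == 0
    half = n // 2
    # Predict: high-pass (detail) coefficients at odd positions.
    d = [x[2*k + 1] - ((x[2*k] + x[2*k + 2 if 2*k + 2 < n else n - 2]) >> 1)
         for k in range(half)]
    # Update: low-pass (approximation) coefficients at even positions.
    s = [x[2*k] + ((d[k - 1 if k > 0 else 0] + d[k] + 2) >> 2)
         for k in range(half)]
    return s, d

def cdf53_inverse(s, d):
    """Exactly undo cdf53_forward (integer-reversible)."""
    half = len(s)
    e = [s[k] - ((d[k - 1 if k > 0 else 0] + d[k] + 2) >> 2)
         for k in range(half)]
    x = []
    for k in range(half):
        x.append(e[k])
        x.append(d[k] + ((e[k] + e[k + 1 if k + 1 < half else half - 1]) >> 1))
    return x
```

Because the update step mirrors the predict step term by term, the round trip is bit-exact, which is what makes the 5/3 suitable for lossless or near-lossless SAR compression.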

Prefetch/Pipeline Approach
- Another approach to improving performance is pipelining the row and column transform operations
- For a wavelet transform, the column transform can begin as soon as a minimum number of rows have been transformed; for the 5/3, that means once the first three rows have been transformed
- Intermediate memory transfers are greatly reduced, although not eliminated
- This approach depends on the specific processor and memory configuration used for implementation

Architecture
Our board, the Nallatech BenNUEY-PCI-4E, provides opportunities for parallel processing and pipelining:
- Nallatech BenNUEY-PCI-4E: Xilinx Virtex-II Pro (2VP50), 4 MB ZBT SRAM, Ethernet connectivity
- Nallatech BenDATA-DD module: Xilinx Virtex-II (2V3000), 1 GB SDRAM

Design Challenges
- Image data will be stored in the 1 GB SDRAM: original data can occupy up to half the total space, and transformed data occupies the other half
- Memory is addressable as 32-bit words, so each memory read/write involves four “packed” 8-bit pixel values
- Row challenge: each transform requires three pixels, but pixels can only be read in groups of four across a single row
- Column challenge: each transform requires three pixels, but they are packed by rows, not by columns
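
The "packed" 32-bit words can be modeled as follows. The byte order (first pixel in the low byte) is an assumption, since the slides don't specify it, and the actual board interface may pack in the opposite order.

```python
def pack_word(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit word, p0 in the low byte.

    Models one SDRAM word on the BenDATA-DD module; byte order here is
    an assumption for illustration.
    """
    return p0 | (p1 << 8) | (p2 << 16) | (p3 << 24)

def unpack_word(w):
    """Recover the four 8-bit pixels from a packed 32-bit word."""
    return tuple((w >> (8 * i)) & 0xFF for i in range(4))
```

This is why the row and column challenges arise: a transform consumes three pixels, but each memory transaction delivers exactly four, aligned along a row.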

Row Solution
- Prefetching/pipelining uses two 32-bit registers, WordA and WordB, which allow transform data to be prefetched so that the “pipe” is always full
- Prefetching/pipelining enables efficient utilization of available resources
- Prefetching/pipelining produces a deterministic, repeating pattern after only four operations, as shown on the following slides…

Row Operations (1)
[Figure: WordA/WordB register contents for the first four row operations; highlighted cells mark the values active in each operation]
- 1st op: fetch two words; load WordA (p0–p3) and WordB (p4–p7); perform 1st transform
- 2nd op: don’t fetch; preserve WordA and WordB; perform 2nd transform
- 3rd op: don’t fetch; preserve WordA and WordB; perform 3rd transform
- 4th op: fetch next word; load WordA (p8–p11); preserve WordB; perform 4th transform

Row Operations (2)
[Figure: WordA/WordB register contents for row operations 5–8]
- 5th op: don’t fetch; preserve WordA (p8–p11) and WordB (p4–p7); perform 5th transform
- 6th op: fetch next word; load WordB (p12–p15); preserve WordA; perform 6th transform
- 7th op: don’t fetch; preserve WordA and WordB; perform 7th transform
- 8th op: fetch next word; load WordA (p16–p19); preserve WordB; perform 8th transform
- Etc.
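
The repeating fetch pattern on these slides can be checked with a small software model. The sketch below is my own simulation, not the actual HDL: it keeps eight pixels in two word-sized registers, emits one three-pixel window per operation (stride 2, as the 5/3 predict step requires), and fetches a new word only when the window runs off the buffer.

```python
def row_ops(words):
    """Simulate the WordA/WordB prefetch pattern from the slides.

    words: iterable of 4-pixel tuples, as read from packed 32-bit memory.
    Returns a list of (fetched_this_op, three_pixel_window) pairs.
    After the two-word startup, one new word is fetched every other
    operation, i.e. one 32-bit read per two transforms.
    """
    stream = iter(words)
    buf = list(next(stream)) + list(next(stream))  # op 1 fetches two words
    base = 0                                       # pixel index of buf[0]
    n = 0
    ops = []
    while True:
        lo = 2 * n                 # window covers pixels 2n .. 2n+2
        fetched = n == 0           # the startup double-fetch counts as op 1
        if lo + 2 - base >= len(buf):      # window ran past the registers
            nxt = next(stream, None)
            if nxt is None:
                break
            buf = buf[4:] + list(nxt)      # retire oldest word, load new one
            base += 4
            fetched = True
        ops.append((fetched, tuple(buf[lo - base: lo - base + 3])))
        n += 1
    return ops
```

Running this on four consecutive words reproduces the slides' schedule: fetch on operations 1, 4, and 6, no fetch on 2, 3, 5, and 7.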

Column Solution
- Since four pixels must be read from across four columns, an efficient solution is to process four columns in parallel
- Rather than transforming one column completely, we transform four columns partially
- For efficiency, column processing begins as soon as three rows have been fully transformed
- Row processing continues after column processing has begun!
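
One column-transform step across four parallel columns can be sketched as below: because pixels are packed four per word along rows, a single read per row-transformed line supplies the operands for all four column pipelines at once. The function name, signature, and use of the 5/3 predict step are my illustrative assumptions.

```python
def column_detail_x4(row0, row1, row2, c):
    """5/3 predict-step details for four adjacent columns c .. c+3.

    row0, row1, row2 are three consecutive row-transformed lines (lists
    of coefficients). One packed-word read per row feeds all four
    column computations, matching the four-columns-in-parallel scheme.
    """
    return tuple(row1[c + j] - ((row0[c + j] + row2[c + j]) >> 1)
                 for j in range(4))
```

Each call completes only the first lifting step for those columns; the update step would follow once the next pair of rows is available, which is why row processing keeps running alongside column processing.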

Column Operations
[Figure: four snapshots of the image buffer, showing interleaved rows of d1 and r1 coefficients with the values processed by each operation highlighted]
- Operation 1: process first coefficients for columns 0–3
- Operation 2: process first coefficients for columns 4–7
- Operation 3: process first coefficients for columns 8–11
- Operation 4: process first coefficients for columns 12–15

Ideal Implementation
- An ideal implementation would use only the resources (registers and memory) available internally on the FPGA
- Eliminates slow interfaces between chips and external memory
- Provides great flexibility in how memory is managed
- Impractical in today’s FPGA devices: internal resources are too limited in single devices, and sufficient resources across multiple devices are prohibitively expensive

Alternate Implementation 1
- One alternate implementation would use only the FPGA and memory available on the BenDATA-DD module
- Simplifies the interface between the FPGA and the memory, since the FPGA and SDRAM are in close physical proximity with no intermediate devices
- A complete implementation of wavelet compression may not fit into the single FPGA on the BenDATA-DD module

Alternate 1, Level 1 WPT Pass 1a WPT Pass 1b

Alternate 1, Level 2 WPT Pass 2a WPT Pass 2b

Alternate 1, Level 3 WPT Pass 3a WPT Pass 3b

Alternate Implementation 2
- A second alternate implementation would distribute processing between multiple FPGAs (on the motherboard and the module) and between the SRAM (motherboard) and SDRAM (module)
- Allows a larger design to be distributed among multiple devices
- Increases opportunities for parallel processing
- Increases design complexity and decreases performance: data must be shared between the two FPGAs via some transport mechanism (such as a FIFO)

Wavelet Transform Level 1 WPT Pass 1a WPT Pass 1b Row-transformed coefficients (source for column transform)

Wavelet Transform Level 2 WPT Pass 2a WPT Pass 2b

Wavelet Transform Level 3 WPT Pass 3a WPT Pass 3b

Conclusions/Suggestions
- A prefetch/pipelined implementation of the 5/3 wavelet transform effectively removes the need for redundant access to intermediate data
- This implementation can be extended to other wavelet transforms (more or less complex than the 5/3)
- The final implementation of prefetching/pipelining will depend on the architecture being used and on details such as memory bus width