A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.

Presentation transcript:

A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions

MAPLD 2005/206Chiang 2 Presentation Outline
- Space-borne signal processing tasks
- FPOA architecture highlights, programmability and expandability
- System partition on FPOA device
- Spatial processing: 5x5 filter solution
- Temporal processing: motion estimation
- Internal bus and I/O throughput
- Resource utilization and future expansion

MAPLD 2005/206Chiang 3 A System of Digital Signal Processing
Input Data -> Data Extraction -> Spatial or Temporal Processing -> Frequency or Time Domain Processing -> Feature Extraction -> Characterization
- Data Extraction: mux/de-mux
- Spatial or Temporal Processing: average filter, min/max select, spatial edge filter, temporal difference filter
- Frequency or Time Domain Processing: time domain low/high/bandpass filter, frequency transformation, frequency domain low/high/bandpass filter
- Feature Extraction: apply equation that defines feature, checking threshold
- Characterization: analyze and characterize signals

MAPLD 2005/206Chiang 4 Processing Requirements
- High computation requirement on the basic operations add/sub and mul/mac
- Mixed control functions such as loop control and decision making
- High I/O bandwidth to balance processing against data input/output
- Large and fast temporary memory space to facilitate real-time processing
- Fast, programmable, direct data transfer to enable massively parallel processing

MAPLD 2005/206Chiang 5 FPOA Architecture Summary
- Heterogeneous Array of 16-bit Silicon Objects
  - MAC, ALU, Truth Tables, Register File, Internal RAM
  - Single Clock Cycle Execution for All Objects
- Homogeneous 2-Layer Programmable Interconnect Mesh
- Tightly Integrated Data and Control Flow
- Integrated DDRII RLDRAM & SRAM Controllers
- High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL

MAPLD 2005/206Chiang 6 Reconfigurable Interconnect Network
Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
- Nearest Neighbors
  - Range = 1 (N/E/S/W + diagonal)
- Party Lines
  - Single cycle range = hop to 3 (skip [...] 1 GHz
  - Extra clock cycles for digital retiming
    - 1 extra -> 25-object neighborhood
    - More clock cycles -> entire chip

MAPLD 2005/206Chiang 7 FPOA Solution
- Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 Gbit/s
- Two 250 MHz DDR 32-bit external memory interfaces with 32 Gbit/s bandwidth
- 400 Silicon Objects running at 1 GHz
  - ALU: add/sub, and combinational logic
  - MAC: mul/mac
  - Register File (RF): fast distributed data storage
  - Internal RAM (IRAM): intermediate data storage
- Party lines and muxes to support a flexible internal bus as well as dedicated connections
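
The two bandwidth figures on this slide follow from port count, bus width, and clock rate. A minimal Python check, assuming one GPIO transfer per clock cycle and two DDR transfers per clock cycle (the function names are illustrative and not from the presentation):

```python
def gpio_bandwidth_gbps(ports: int, bits_per_port: int, clock_mhz: float) -> float:
    """Aggregate GPIO bandwidth in Gbit/s, assuming one transfer per clock."""
    return ports * bits_per_port * clock_mhz / 1000.0


def ddr_bandwidth_gbps(interfaces: int, bus_bits: int, clock_mhz: float) -> float:
    """Aggregate DDR bandwidth in Gbit/s, assuming two transfers per clock."""
    return interfaces * bus_bits * clock_mhz * 2 / 1000.0


# Figures quoted on the slide: 4 x 44-bit GPIO at 100 MHz, 2 x 32-bit DDR at 250 MHz.
print(gpio_bandwidth_gbps(4, 44, 100))
print(ddr_bandwidth_gbps(2, 32, 250))
```

Under those assumptions the two calls reproduce the 17.6 Gbit/s and 32 Gbit/s quoted above.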

MAPLD 2005/206Chiang 8 Example FPOA Partition

MAPLD 2005/206Chiang 9 5x5 Convolution Filter
Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]

for i = 2; i < m - 2; i++
  for j = 2; j < n - 2; j++
    temp = 0;
    for k = -2; k < 3; k++
      for l = -2; l < 3; l++
        temp = D[i+k, j+l] * W[k+2, l+2] + temp
      end_of_l
    end_of_k
    Y[i, j] = temp;
  end_of_j
end_of_i
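
The loop nest above can be written as a short runnable Python sketch. It assumes the loops should visit every position where the mask fits entirely inside the array (rows 2 through m-3, columns 2 through n-3) and leaves the border outputs at zero. As on the slide, the mask is applied without flipping. The function name `conv5x5` is illustrative.

```python
def conv5x5(D, W):
    """Apply a 5x5 mask W to the 2D list D, skipping a 2-sample border."""
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]      # border outputs stay at zero
    for i in range(2, m - 2):            # rows where the mask fits
        for j in range(2, n - 2):        # columns where the mask fits
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = acc
    return Y


# Usage: an all-ones 6x6 array with an all-ones mask sums 25 samples per output.
D = [[1] * 6 for _ in range(6)]
W = [[1] * 5 for _ in range(5)]
print(conv5x5(D, W)[2][2])  # prints 25
```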

MAPLD 2005/206Chiang 10 Computation Requirements
- Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply-accumulate (MAC) operations for each filtered sample
- The whole convolution filter operation requires 25 * m * n MAC operations
- With standard 720x480 image data at 30 frames per second, the convolution filter operation requires about 259 MMAC per second
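
The MAC rate quoted on this slide can be reproduced with a few lines of Python. This is only an arithmetic check; it counts 25 MACs for every pixel and ignores the skipped border, which matches the 25 * m * n estimate. The function name is illustrative.

```python
def mac_rate(width: int, height: int, fps: int, mask_rows: int = 5, mask_cols: int = 5) -> int:
    """MAC operations per second for a full-frame mask filter."""
    return width * height * fps * mask_rows * mask_cols


print(mac_rate(720, 480, 30) / 1e6)  # prints 259.2 (million MACs per second)
```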

MAPLD 2005/206Chiang 11 Data Storage 2D data storage in a 1D linear memory where bit word can be accessed concurrently Example of an 8x8 2D matrix stored in a 1D memory
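
One common way to store a 2D array in a 1D memory is row-major order. The sketch below assumes that layout, which the presentation does not confirm, and groups addresses into aligned 4-word accesses, following the "4 words at a time" figure from the data access slide. Both function names are hypothetical.

```python
def linear_address(row: int, col: int, num_cols: int) -> int:
    """Row-major address of element (row, col); the layout is an assumption."""
    return row * num_cols + col


def words_in_access(address: int, words_per_access: int = 4) -> list:
    """Addresses fetched together with `address` in one aligned access."""
    base = (address // words_per_access) * words_per_access
    return list(range(base, base + words_per_access))


# Element (2, 3) of an 8x8 matrix and the aligned group that contains it.
addr = linear_address(2, 3, 8)
print(addr, words_in_access(addr))  # prints 19 [16, 17, 18, 19]
```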

MAPLD 2005/206Chiang 12 Data Access Analysis
- Samples are stored in the external memory, which has slower access speed
- Maximize data bandwidth by accessing 4 words at a time
- Use Register Files to store weights and sample data so that they can be reused without going out to external memory
- Perform the calculation on 4 pixels concurrently, rotating coefficients and samples so as to form the convolution operation

MAPLD 2005/206Chiang 13 Data Processing Analysis
Note 1: with a 5x5 filter, the first two rows and columns are skipped
Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52

MAPLD 2005/206Chiang 14 FPOA Solution
- Temporary data storage: 5 RFs, 3 ALUs
- Data access control: 3 ALUs
- Multiplier: 4 MACs
- Adder tree: 9 ALUs
- Temporary results: 2 RFs, 1 IRAM, 2 ALUs

MAPLD 2005/206Chiang 15 5x5 Convolution Filter Performance
- FPOA Resources
  - ALU: 17
  - RF: 7
  - MAC: 4
  - IRAM: 1
  - Total: 28 SOs + 1 IRAM
- Data throughput
  - 20 results every 125 cycles
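
To put the throughput figure in context, the following sketch converts "20 results every 125 cycles" into results per second and compares it with the pixel rate of the 720x480, 30 frames per second example. It assumes the 1 GHz Silicon Object clock quoted on the earlier FPOA Solution slide and a single filter instance; the function names are illustrative.

```python
def results_per_second(results: int, cycles: int, clock_hz: int) -> float:
    """Sustained output rate for a block producing `results` every `cycles`."""
    return results * clock_hz / cycles


def required_pixels_per_second(width: int, height: int, fps: int) -> int:
    """Pixel rate of the video stream that must be filtered."""
    return width * height * fps


CLOCK_HZ = 1_000_000_000  # assumption: 1 GHz clock from the FPOA Solution slide

supplied = results_per_second(20, 125, CLOCK_HZ)
required = required_pixels_per_second(720, 480, 30)
print(supplied, required, supplied / required)
```

Under these assumptions one filter instance delivers 160 million results per second against a requirement of about 10.4 million pixels per second.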

MAPLD 2005/206Chiang 16 Motion Estimation
- Identify the movement of a similar pattern over time
- The main computation is the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]

sum = 0;
for i = 0 to 7
  for j = 0 to 7
    temp = X[i, j] - Y[i, j]
    sum = sum + abs(temp)
  end_of_j
end_of_i
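
The SAD loop above maps directly onto a few lines of Python. This is a plain software sketch of the arithmetic, not the FPOA dataflow; the function name `sad` is illustrative, and blocks are assumed to be equally sized lists of rows.

```python
def sad(X, Y):
    """Sum of absolute differences between two equally sized 2D blocks."""
    return sum(
        abs(x - y)
        for row_x, row_y in zip(X, Y)
        for x, y in zip(row_x, row_y)
    )


# Usage: two constant 8x8 blocks that differ by 2 in every sample.
X = [[5] * 8 for _ in range(8)]
Y = [[3] * 8 for _ in range(8)]
print(sad(X, Y))  # prints 128
```

In motion estimation, the candidate block with the smallest SAD against the reference block is taken as the best match.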

MAPLD 2005/206Chiang 17 SAD Computation Dataflow
- 3 cycles throughput
- Generates two partial sums of positive differences

MAPLD 2005/206Chiang 18 SAD Performance
- FPOA Resources
  - ALU: 35
  - RF: 1
  - Total: 36 SOs
- Data throughput
  - 24 cycles per 8x8 block

MAPLD 2005/206Chiang 19 Internal System Bus
- Links all processing modules and the external host to the external system memory for data accesses
- Host-controlled round-robin access from module to module
- User-defined package format to utilize the 16-bit party line and minimize access overhead

MAPLD 2005/206Chiang 20 System Bus Implementation

MAPLD 2005/206Chiang 21 System Bus Performance
- FPOA Resources
  - ALU: 20
- Cycles
  - XRAM read: 4 cycles
  - XRAM write: 4 cycles
  - Module switch: 10 cycles

MAPLD 2005/206Chiang 22 Performance of an Example Space Satellite Application
- Processing throughput
  - About 10 million samples per second
- FPOA resources (% of a device with 400 SOs running at 400 MHz)
  - Cycle utilization: 21%
  - SO utilization: 51%
  - IRAM utilization: 25%
  - XRAM bandwidth: 49% (100 MHz DDR RLDRAM)