By: Daniel Barsky, Natalie Pistunovich. Supervisors: Rolf Hilgendorf, Inna Rivkin. 10/06/2010.

Presentation transcript:

- Overview
- Project objectives
- Hardware
- Introduction to CAD tools
- Detailed data flow through the system
- Hardware utilization summary
- Latencies summary
- Points of possible improvement:
  - Implementation
  - Architecture
  - Algorithm
- Selected improvement to be implemented
- Timeline
- Future plans
- Project status
- Gantt chart

- In the short run, optimize the algorithm to use minimal hardware, so that it fits on 2 FPGA chips while maintaining minimum latency.
- In the long run, determine an optimal architecture to be implemented on chip (ASIC).

- GIDEL PROCStar III card
  - 4 x Altera Stratix III FPGA
  - 1 GB DDR DRAM
- Altera Stratix-III EP3SE260
  - 255K Logic Elements
  - Maximum x18 bit multipliers *
  - Max. frequency: ~300MHz

* In FIR mode

- To get acquainted with the different CAD tools in use, we constructed a model design and ran it through the entire process, until it was burned onto the card and run there.

Blocks: Expander, CTF, Q-Frame, OMP, Pseudo Inverse, Memory, DSP (Reconstruction & Support Change Detection).

- Incoming samples arrive at 60MHz.
- Samples are filtered and decimated to 12 channels of 20MHz and sent to the Memory & Q-Frame.
- Q-Frame collects 70 samples, calculates the Q-Frame and sends it to the OMP.
- OMP calculates the support from the Q-Frame. Then it sends it to the Pseudo Inverse.
- Memory stores samples for later reconstruction, each with the appropriate support index.
- Reconstruction reconstructs data from input samples using the pseudo-inverse.
- SCD checks for a significant change in the support; if one is detected, it initiates the calculation of a new one.
- In Iteration Mode, samples are further filtered and decimated to 12 channels of 2MHz each iteration and sent to the CTF.
- Samples are also sent to the SCD to check for a change in the support.
- In Iteration Mode, a Q-Frame is constructed, and a support is calculated and accumulated for each iteration.
- Pseudo-Inverse recovers the columns of the support from matrix A, constructs their pseudo-inverse and sends it to the Reconstruction.

Expander, normal mode:

- The expander (master) sends 12 20MHz slices to the CTF (slave) each cycle.
- The expander sends new 20MHz slices to the Memory and to the DSP each cycle.

[Figure: Analog System + A/D feeding three 60 MHz, 12 bit inputs (-30 MHz to 30 MHz); cos/sin mixers and LPF producing slices of -10 MHz to 10 MHz; outputs to Memory, CTF (12 samples, 20MHz each) and DSP (20 MHz sample).]

Expander, iteration mode:

- The expander (master) sends 12 2MHz slices to the CTF (slave) each cycle.
- Once the CTF requests new slices, the expander changes and sends new ones.
- The expander sends new 20MHz slices to the Memory and to the DSP each cycle.

[Figure: Analog System + A/D feeding three 60 MHz, 12 bit inputs (-30 MHz to 30 MHz); cos/sin mixers and LPF producing slices of -10 MHz to 10 MHz; a second cos/sin and LPF stage producing slices of -1 MHz to 1 MHz; outputs to Memory, CTF (12 samples, 2MHz each) and DSP (20 MHz sample).]
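The mix, filter and decimate chain that the Expander figures suggest can be sketched in a few lines. This is a minimal illustration only: the function name `extract_slice`, the moving-average filter standing in for the LPF, and the test tone are all assumptions, not the project's implementation.

```python
import cmath
import math


def extract_slice(samples, fs_hz, shift_hz, decimation, taps=3):
    """Shift one band to baseband, low-pass it, keep every `decimation`-th sample."""
    # Mix: multiply by cos - j*sin at the centre frequency of the wanted slice.
    mixed = [
        x * cmath.exp(-2j * math.pi * shift_hz * n / fs_hz)
        for n, x in enumerate(samples)
    ]
    # Low-pass: a short moving average stands in for the real LPF (assumption).
    filtered = []
    for n in range(len(mixed)):
        window = mixed[max(0, n - taps + 1): n + 1]
        filtered.append(sum(window) / len(window))
    # Decimate: 60 MHz -> 20 MHz is a factor of 3.
    return filtered[::decimation]


fs = 60e6
# Hypothetical input: a pure tone sitting at the centre of the slice to extract.
tone = [cmath.exp(2j * math.pi * 20e6 * n / fs) for n in range(30)]
baseband = extract_slice(tone, fs, shift_hz=20e6, decimation=3)
print(len(baseband), abs(baseband[-1]))
```

In iteration mode the same idea would apply with a further reduction from 20MHz to 2MHz per slice, that is, an additional decimation factor of 10.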

[Table: multipliers in Normal and Iteration modes, at 180 MHz and 120 MHz.]

[Table: cycles in Normal and Iteration modes.]

Q-Frame Block:

- Constructs the Q-Frame for the support calculation, and sends it to the OMP.

[Figure: Q-Frame block. Components: Conversion To Complex, Vector Multiplier, Q-Frame Memory (Mem A 5kbit, Mem B 5kbit), Support Accumulator, Controller. Signals: Input Channels From Expander (3x2x18 bit), Q-Frame entries To OMP (12x12x18 bit complex), Support Vector From OMP (12 bit), Support Length Vector To DSP (4 bit), Support Indices To DSP (7x12 bit).]

Vector Multiplier:

- Receives a vector of 12 complex samples from the Expander (Y[1..12]).
- Calculates the product Y x Y^H (column vector Y1..Y12 times its conjugate transpose Y1^H..Y12^H) in 2 clock cycles.

Vector Multiplier, 1st cycle:

- Calculates and stores the first 3 columns.
- Requires: 33 18x18 complex multipliers.

[Figure: y1..y12 multiplied by y1^H, y2^H, y3^H give entries Q1,1 to Q12,3, written to Columns 1 to 3 of the 12-column Memory Bank. The figure pairs Q2,1 with Q1,2, Q3,1 with Q1,3, and Q3,2 with Q2,3.]

Vector Multiplier, 2nd cycle:

- Calculates and stores the last 9 columns.
- Requires: 45 18x18 complex multipliers.

[Figure: y1..y12 multiplied by y4^H..y12^H complete entries Q1,1 to Q12,12, written to Columns 4 to 12 of the Memory Bank. The figure pairs each entry Qi,j with Qj,i, for example Q5,4 with Q4,5 and Q12,4 with Q4,12.]
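One plausible reading of the 33 and 45 multiplier counts is that only the entries on and below the diagonal are actually multiplied, and each entry above the diagonal is obtained as the complex conjugate of its mirror. The sketch below checks that count for a 12-element vector; the function name `q_frame_products` and the sample data are hypothetical.

```python
def q_frame_products(y, first_cycle_columns=3):
    """Outer product Q = y * y^H, multiplying only on and below the diagonal."""
    n = len(y)
    q = [[0j] * n for _ in range(n)]
    products_per_cycle = [0, 0]
    for col in range(n):
        # Columns 1..3 belong to the 1st cycle, columns 4..12 to the 2nd.
        cycle = 0 if col < first_cycle_columns else 1
        for row in range(col, n):
            q[row][col] = y[row] * y[col].conjugate()
            # The mirrored entry is a conjugate, so it costs no multiplier.
            q[col][row] = q[row][col].conjugate()
            products_per_cycle[cycle] += 1
    return q, products_per_cycle


# Hypothetical sample vector with integer-valued complex entries.
y = [complex(k + 1, -k) for k in range(12)]
q, counts = q_frame_products(y)
print(counts)
```

Under this assumption the first three columns need 12 + 11 + 10 products and the remaining nine need 9 + 8 + ... + 1, which is where a figure of 45 multipliers for the whole unit would come from if the same multipliers are reused in both cycles.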

Q-Frame summary:

- Uses 45 18x18 bit complex multipliers (45 DSP half-blocks).
- Latency:
  - Normal Mode: 3.5 usec
  - Iteration Mode: 35 usec per iteration
- Independent of system clock frequency!
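The Q-Frame latency is independent of the system clock because it is set by how long it takes to collect the samples. A small check, assuming the latency is simply 70 samples divided by the per-channel sample rate (the function name is hypothetical):

```python
SAMPLES_PER_Q_FRAME = 70  # from the data flow description


def q_frame_latency_usec(channel_rate_hz, samples=SAMPLES_PER_Q_FRAME):
    """Time needed to collect one Q-Frame worth of samples, in microseconds."""
    return samples / channel_rate_hz * 1e6


normal = q_frame_latency_usec(20e6)     # 20MHz channels in normal mode
iteration = q_frame_latency_usec(2e6)   # 2MHz channels in iteration mode
print(normal, iteration)
```

These values agree with the 3.5 usec Q-Frame delay shown on the timeline and the 35 usec per iteration quoted in the OMP summary.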

OMP:

- Calculates the signal's support from the Q-Frame using the Orthogonal Matching Pursuit algorithm, over several iterations.

[Figure: OMP block. Components: Support Calculation, Support Merge, A Matrix Memory, Matrix Multiplier. Signals: Q-Frame entries from Q-Frame (12x12x18 bit complex), Support Vector to Q-Frame (12 bit).]

- Initialization: the Q-Frame is loaded into the residual matrix.
- 1 cycle

- Phase 1: Projection (Z is computed from A^H and the Residual)
- 101 cycles
- 18x18 complex multipliers

- Phase 2: Energy Calculation, Find Maximum Energy & Update Support
- 101 cycles
- 12 18x18 complex multipliers

[Figure: each row Z1..Z101 of Z is multiplied by its conjugate transpose to give |Z1|^2..|Z101|^2; the row with maximum energy updates the Current Support.]

- Phase 4: Vector Orthogonalization
- Number of cycles depends on the iteration (on the i-th iteration: 2i cycles)
- 12 18x18 complex multipliers

[Figure: the Current Support selects the vector V_support from A; it is orthogonalized against the Previous Orthogonal Vectors W_j.]

- Phase 5: Vector Normalization
- 2 cycles + (square root calculation time)
- 12 18x18 complex multipliers

[Figure: V_support^H x V_support gives the norm; the normalized W_support joins the Previous Orthogonal Vectors.]

- Phase 6: Residual Matrix Update
- 14 cycles
- 18x18 complex multipliers

[Figure: the Residual is updated using W_support and W_support^H.]

- Phase 7: Residual Matrix Energy Calculation & Stopping Condition Check
- 13 cycles
- 12 18x18 complex multipliers

[Figure: Residual, Calculate Column Energy, Calculate Overall Energy.]
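The phases above can be tied together in a compact software sketch. It follows the slide order (projection, energy and maximum search, orthogonalization, normalization, residual update, stopping check), but the function names, the stopping threshold and the tiny test matrices are assumptions made for illustration; it is not the hardware design.

```python
import math


def herm_dot(u, v):
    """Inner product u^H * v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def omp_support(a_cols, q_frame_cols, max_support, energy_threshold):
    """Return the support (column indices of A) found from the Q-Frame."""
    residual = [list(col) for col in q_frame_cols]   # initialization
    support, ortho = [], []
    while len(support) < max_support:
        # Projection: Z = A^H * Residual, one row of Z per column of A.
        z = [[herm_dot(a, r) for r in residual] for a in a_cols]
        # Energy of each row of Z, then pick the strongest unused column.
        energy = [sum(abs(e) ** 2 for e in row) for row in z]
        best = max((k for k in range(len(a_cols)) if k not in support),
                   key=lambda k: energy[k])
        support.append(best)
        # Orthogonalize the chosen column against the previous vectors W_j.
        v = list(a_cols[best])
        for w in ortho:
            c = herm_dot(w, v)
            v = [vi - c * wi for vi, wi in zip(v, w)]
        # Normalize (assumes the chosen column is not linearly dependent).
        norm = math.sqrt(sum(abs(x) ** 2 for x in v))
        w_new = [x / norm for x in v]
        ortho.append(w_new)
        # Residual update: R = R - W * (W^H * R), column by column.
        residual = [
            [ri - herm_dot(w_new, r) * wi for ri, wi in zip(r, w_new)]
            for r in residual
        ]
        # Stopping condition on the overall residual energy.
        if sum(abs(x) ** 2 for r in residual for x in r) < energy_threshold:
            break
    return sorted(support)


# Hypothetical 3x3 example: the signal lives on columns 0 and 2 of A.
a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
q = [[2, 0, 0], [0, 0, 3], [0, 0, 0]]
support = omp_support(a, q, max_support=3, energy_threshold=1e-9)
print(support)
```

In the hardware described here A would have 101 columns of 12 entries and the Q-Frame would be 12x12, so the projection produces the 101 rows Z1..Z101 mentioned in Phase 2.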

OMP summary:

- Uses 144 18x18 bit complex multipliers (144 DSP half-blocks).
- Latency:
  - Normal Mode: ~1100 clock cycles
    - 6.1 usec at 180MHz
    - 9.1 usec at 120MHz
  - Iteration Mode: ~2560 clock cycles per iteration
    - 14.2 usec per iteration at 180MHz
    - 21.3 usec per iteration at 120MHz
    - (this latency is contained in the Q-Frame construction latency for the next iteration, which is 35 usec per iteration)
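The latency figures follow from dividing the cycle count by the clock frequency. A minimal sketch of that conversion (the helper name `cycles_to_usec` is hypothetical); because the cycle counts are approximate, the results match the quoted values only to about a tenth of a microsecond:

```python
def cycles_to_usec(cycles, clock_hz):
    """Latency in microseconds for a given cycle count and clock frequency."""
    return cycles / clock_hz * 1e6


OMP_CYCLES = {"normal mode": 1100, "iteration mode (per iteration)": 2560}

for mode, cycles in OMP_CYCLES.items():
    for clock_mhz in (180, 120):
        usec = cycles_to_usec(cycles, clock_mhz * 1e6)
        print(f"{mode}: {usec:.2f} usec at {clock_mhz} MHz")
```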

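[Figure: DSP block and its surroundings. The Expander and Memory deliver Samples Y; the CTF delivers the Support; the Pseudo Inverse reads Matrix A from external memory; the DSP outputs the Reconstructed signal; the Support Change detector also receives Samples Y.]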

Pseudo Inverse, step 1: the Support selects columns of Matrix A to form A (m x n); QR Decomposition yields Q (m x m).

- 261 cycles
- 51 multipliers, 1 sqrt

Pseudo Inverse, step 2: R = Q^T x A.

- 12 cycles
- 144 multipliers

Pseudo Inverse, step 3: R is inverted to give R^-1.

- 156 cycles
- 1 multiplier, 1 divide

Pseudo Inverse, step 4: A† = R^-1 x Q^T.

- 12 cycles
- 144 multipliers

Reconstruction: Z = A† x Y, with the Y samples read from Memory.

- 1 cycle at 20MHz
- 144 multipliers
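The chain of steps (QR decomposition, inversion of R, A† = R^-1 x Q^T, then Z = A† x Y) can be sketched in software as follows. For brevity this sketch uses a thin QR computed by modified Gram-Schmidt, which is an assumption and not necessarily the decomposition used in the DSP; it also conjugates Q so that complex inputs work. All function names and the small test matrix are hypothetical.

```python
import math


def thin_qr(a_cols):
    """Modified Gram-Schmidt on the columns of A; returns Q columns and R (n x n)."""
    n = len(a_cols)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, col in enumerate(a_cols):
        v = list(col)
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk.conjugate() * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for vk, qk in zip(v, q)]
        r[j][j] = math.sqrt(sum(abs(vk) ** 2 for vk in v))
        q_cols.append([vk / r[j][j] for vk in v])
    return q_cols, r


def invert_upper(r):
    """Invert an upper-triangular matrix by back substitution."""
    n = len(r)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):                      # solve R * x = e_j for each column j
        for i in range(j, -1, -1):
            rhs = (1.0 if i == j else 0.0) - sum(
                r[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = rhs / r[i][i]
    return inv


def pseudo_inverse(a_cols):
    """A† = R^-1 * Q^H, returned as a list of n rows of length m."""
    q_cols, r = thin_qr(a_cols)
    r_inv = invert_upper(r)
    n, m = len(a_cols), len(a_cols[0])
    return [
        [sum(r_inv[i][k] * q_cols[k][c].conjugate() for k in range(n))
         for c in range(m)]
        for i in range(n)
    ]


def reconstruct(a_dagger, y):
    """Z = A† * Y for one sample vector Y."""
    return [sum(row[c] * y[c] for c in range(len(y))) for row in a_dagger]


# Hypothetical 3x2 example: Y = A * [2, -1], so reconstruction should return [2, -1].
a_support = [[1, 0, 0], [1, 1, 0]]          # the two support columns of A
y = [1, -1, 0]
z = reconstruct(pseudo_inverse(a_support), y)
print(z)
```

Once A† is available, each new sample vector only needs the final matrix-vector product, which is consistent with the reconstruction step being far cheaper in cycles than the pseudo-inverse calculation.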

[Figure: DSP computing Z = A† x Y, with the Support Change detector, the Pseudo Inverse and a "Support changed" signal.]

[Table: latencies summary, in cycles and usec.]

Hardware utilization: FPGA 1: 73%, FPGA 2: 98%, FPGA 3: 75%.

[Figure: blocks placed on the three FPGAs: Memory, Controller, A† (Pseudo-Inverse), CTF, Expander, DSP, Support Change Detector, Q-Frame, OMP.]

Timeline:

- New incoming sample
- Expander delay: 1.3 usec
- Q-Frame delay: 3.5 usec
- OMP delay: 6 usec
- Pseudo-Inverse delay: 2.4 usec
- Reconstruction delay
- Sample ready for reconstruction
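As a rough cross-check of the timeline, the stage delays can be summed, and the Pseudo-Inverse delay can be compared with the cycle counts of its four steps. The sketch assumes a 180MHz clock for that comparison and leaves out the Reconstruction delay, for which no value is given.

```python
STAGE_DELAYS_USEC = {
    "Expander": 1.3,
    "Q-Frame": 3.5,
    "OMP": 6.0,
    "Pseudo-Inverse": 2.4,
}

PSEUDO_INVERSE_CYCLES = {
    "QR decomposition": 261,
    "R = Q^T x A": 12,
    "R inversion": 156,
    "A-dagger = R^-1 x Q^T": 12,
}

# Delay from a new incoming sample until the pseudo-inverse is ready.
total_usec = sum(STAGE_DELAYS_USEC.values())

# Pseudo-inverse cycles converted to microseconds at an assumed 180MHz clock.
pinv_cycles = sum(PSEUDO_INVERSE_CYCLES.values())
pinv_usec = pinv_cycles / 180e6 * 1e6

print(f"total before reconstruction: {total_usec:.1f} usec")
print(f"pseudo-inverse: {pinv_cycles} cycles = {pinv_usec:.2f} usec at 180MHz")
```

The four pseudo-inverse steps add up to 441 cycles, which at 180MHz is close to the 2.4 usec shown on the timeline.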

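Points of possible improvement:

- Q-Frame and OMP: use a Matrix Multiplication Unit.
- Q-Frame: extend the Q-Frame calculation.
- DSP: reconstruction using the Matrix Multiplication Unit.

[Figure: current blocks (Memory, Controller, CTF, Expander, DSP, Support Change Detector, Q-Frame, OMP) annotated with the changes above; the figure also carries the label "1 divide".]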

[Figure: selected architecture on two FPGAs, labelled "FPGA#1 – Expander" and "FPGA#2 - Matrix Multiplication". Blocks shown: Memory, Expander, Matrix multiplication unit, DSP (Reconstruction & Support Change Detection), Pseudo-Inverse, CTF (Q-Frame, OMP), Controller.]

- Consider rank-1 updates for a change in the support.
- Consider changing the QR decomposition algorithm in the DSP: Householder → modified Gram-Schmidt.
- Consider another decomposition: QR → LQ, SVD, etc.
- Consider another MP algorithm: OMP → BMP, Convex Optimization, etc.

System Analysis (DONE):

- Studying the system's algorithm
- Understanding the algorithm implementation
- Analyzing hardware usage & latency
- Locating points of possible optimization

Current System Simulation (PRESENT):

- Creating a test environment for the entire current system
- Simulating the entire current system

System Optimization (FUTURE):

- Selecting optimizations to be implemented
- Implementing optimizations
- Simulating the optimized system

- The Memory Block consists of 2 Memory Banks, each of 12 columns, 12x18x2 bits wide.
- Each column can be written completely in 1 cycle, independently of the other columns.

[Figure: Memory Bank with Columns 1 to 12; each column holds 12 elements, each an 18 bit complex value. Input: 12x12x18x2 bit vector, data from the Vector Multiplier. Output: 12x12x18x2 bit vector, Q-Frame to OMP.]

- In iteration mode, the Support Accumulator accumulates the supports extracted from each iteration.
- When the iterations are finished, it sends the complete support to the DSP block.

[Figure: Support Accumulator. Input: iteration support from OMP (12x1 binary vector), accumulated through D-FF registers; Support Format Conversion produces the Support Length Vector to DSP (4 bit) and the Support Indices to DSP (7x12 bit).]