Analysis of FPGA based Kalman Filter Architectures. Arvind Sudarsanam. Dissertation Defense, 12 March 2010.

Presentation transcript:

Outline
- Introduction
- Literature review
- PolyFSA architecture
- Architecture analysis
  - Area analysis
  - Error analysis
  - Performance analysis
- Contributions
- Future work

Kalman filters for spacecraft navigation

Kalman filters

Research overview
- An FPGA-based polymorphic systolic array architecture is proposed to accelerate Kalman filters.
  - Portions of this architecture can be reused for other applications at run time.
- A comprehensive architecture analysis is presented. Results are given in terms of area savings for varying performance and precision error.

Hardware design for Kalman filters - Systolic arrays
- Yeh [7], M. Lu [8] and P. Rao [9] proposed systolic array architectures for Kalman filters based on the Faddeev algorithm.
- Cardoso et al. [11] proposed a hardware/software co-processor system.
  - Profiling is used to guide partitioning by the designer.
  - The C2H tool [12] from Altera is used to generate RTL designs.
- But these architectures are not scalable.
- Some efforts [15-20] target individual linear algebra operations, like matrix inverse.

Error analysis
- Initial efforts [28-35] were targeted towards analyzing variable-precision fixed-point arithmetic.
- Constantinides [36-45] proposed multiple ideas for error analysis of fixed-point arithmetic.
- The availability of FPGAs has caused a surge in work on variable-precision architectures, especially in the floating-point domain [46-53].

Performance and area analysis
- Existing performance and area estimation approaches target a parameter-specific architecture [72].
- Parameters include:
  - Overall data path width
  - Memory size
  - Number of processing elements
- The proposed research is also parameter-specific, but looks at latency, precision and input rates of floating-point arithmetic units.

Outline
- Introduction
- Literature review
- PolyFSA architecture
  - Application analysis
  - Mapping to systolic array
  - Architecture details
- Architecture analysis
- Contributions
- Future work

Extended Kalman Filter

Faddeev algorithm
- The Faddeev algorithm is a method for efficiently computing the Schur complement $D - CA^{-1}B$.
- Given matrices A, B, C, D, arrange them in a block matrix M.
- Reduce M to row echelon form, and $D - CA^{-1}B$ results in the lower right corner.
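The reduction described on this slide can be sketched in a few lines of Python. This is an illustrative software model only, not the PolyFSA hardware. The layout of M did not survive in the transcript, so the sketch assumes M = [[A, B], [C, D]], which is the arrangement consistent with the stated result $D - CA^{-1}B$ (the classical formulation is often written with -C in the lower left, giving $D + CA^{-1}B$). The function name `faddeev_schur` is hypothetical, and exact rational arithmetic is used so the result is free of rounding error.

```python
from fractions import Fraction


def faddeev_schur(A, B, C, D):
    """Return D - C A^{-1} B by row-reducing the block matrix [[A, B], [C, D]].

    A is n x n; B, C, D are lists of rows with compatible shapes.
    The block layout is an assumption (see the lead-in text).
    """
    n = len(A)
    # Assemble M: the rows of [A | B] on top, the rows of [C | D] below.
    M = [[Fraction(x) for x in ra + rb] for ra, rb in zip(A, B)]
    M += [[Fraction(x) for x in rc + rd] for rc, rd in zip(C, D)]

    for k in range(n):
        # Pivot only among the top n rows, so rows of [C | D] are never swapped in.
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[pivot][k] == 0:
            raise ValueError("A is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from every row below the pivot row.
        for r in range(k + 1, len(M)):
            factor = M[r][k] / M[k][k]
            M[r] = [x - factor * y for x, y in zip(M[r], M[k])]

    # The lower-left block is now zero; the lower-right block is the result.
    return [row[n:] for row in M[n:]]
```

Choosing B and C as identity matrices and D as zero makes the same routine return $-A^{-1}$, which is one reason a single Faddeev array can serve several linear algebra operations.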

Faddeev algorithm

Faddeev algorithm – Single node
- Boundary node
- Internal node

Mapping to systolic array
- Simplify data flow
- Mapping to 1-D systolic array
- Folding to make the systolic array scalable

Architecture details for boundary PE
- Details for the internal PE are similar

Control flow

Results
- Target FPGA: Xilinx Virtex 4 SX35
- Test case is derived from [Ronnback-2000]
- Performance is compared against a software implementation on a Virtutech Simics PowerPC 750 simulator (Thanks: Rob Barnes [79])

Performance of proposed PolyFSA
- Overall execution time of EKF on the PolyFSA-based system architecture and on PowerPC
- Estimated execution time of the Faddeev algorithm for varying numbers of PEs and Faddeev parameters

Architecture analysis
- At design time, each PE in the proposed PolyFSA is derived for best performance and highest precision.
- Question: by allowing for degradation in performance and/or tolerating precision error, can we reconfigure the existing PE with a set of smaller PEs?

Design parameters that can be varied
- Precision of
  - Adder unit (madd)
  - Multiplier unit (mmul)
  - Divider unit (mdiv)
- Latency of
  - Adder unit (LatAdd)
  - Multiplier unit (LatMul)
  - Divider unit (LatDiv)
- Input rate of the divider (c_rate)
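To make the precision parameters concrete, the following sketch emulates a reduced-precision floating-point unit in software. It assumes that madd and mmul denote the number of explicit mantissa (fraction) bits kept by the adder and multiplier; the slides do not define them further, so that reading is an assumption. The helper names `round_to_mantissa_bits` and `reduced_mac` are hypothetical, and the model ignores exponent range, latency and c_rate.

```python
import math


def round_to_mantissa_bits(x, m):
    """Round x to the nearest value with m explicit mantissa (fraction) bits."""
    if x == 0.0 or not math.isfinite(x):
        return x
    frac, exp = math.frexp(x)   # x == frac * 2**exp with 0.5 <= |frac| < 1
    scale = 2 ** (m + 1)        # m fraction bits plus the implicit leading bit
    return math.ldexp(round(frac * scale) / scale, exp)


def reduced_mac(acc, a, b, madd, mmul):
    """Multiply-accumulate with separately rounded multiplier and adder outputs.

    madd and mmul are taken here to be mantissa widths (an assumption).
    """
    product = round_to_mantissa_bits(a * b, mmul)
    return round_to_mantissa_bits(acc + product, madd)
```

For example, `round_to_mantissa_bits(1.3, 2)` yields 1.25, the nearest value representable with two fraction bits. Running a kernel once with such rounding and once in full double precision gives a simple software estimate of the error introduced by a narrower unit.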

Area analysis – Adder unit

Area analysis – Multiplier unit

Area analysis – Divider unit

Area analysis – Divider unit

Error analysis – Top-level flow

Faddeev algorithm - Error vs Precision

Error analysis for EKF

EKF – Area Savings vs Error

Performance analysis
- Major portion of execution time

Calculation of $T_{faddeev}$
- The execution time of the Faddeev algorithm on the proposed PolyFSA is computed using a simulation model.
- We are interested in observing the impact of performance degradation on resource utilization.
- Results are shown for the overall execution of EKF.

Performance analysis – Vary latency

Performance analysis – Vary c_rate

Area versus Performance

D Pareto curves

Summary
- An FPGA-based Polymorphic Faddeev Systolic Array (PolyFSA) architecture is proposed to accelerate the compute-intensive kernels of Kalman filters.
- A hierarchical analysis is presented of the error introduced in Kalman filter results by reduced precision.
- A simulation model is proposed to estimate the overall execution time of the Kalman filter algorithm.
- Results of the architecture analysis are presented in terms of Pareto curves.
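As an illustration of how Pareto curves can be extracted from a design-space sweep, the sketch below filters a list of design points down to the non-dominated ones. It assumes every objective (for example area, error, execution time) is to be minimized; the function name `pareto_front` is hypothetical, and any numbers passed to it are made-up placeholders rather than results from the dissertation.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted, with all objectives minimized.

    Each point is a tuple of objective values, e.g. (area, error, time).
    """
    def dominates(p, q):
        # p dominates q if it is no worse in every objective and better in one.
        no_worse = all(a <= b for a, b in zip(p, q))
        better = any(a < b for a, b in zip(p, q))
        return no_worse and better

    return sorted(p for p in points
                  if not any(dominates(q, p) for q in points))
```

With the placeholder (area, time) pairs `[(10, 5.0), (8, 7.0), (12, 4.0), (11, 6.0)]`, the point `(11, 6.0)` is dropped because `(10, 5.0)` is better in both objectives, and the remaining three points form the curve. The same function works unchanged for three objectives.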

Future work
- The proposed methodology (architecture design supported by analysis) can be applied to designs for other applications.
- Design goals can be extended to incorporate power consumption.
- Design parameters can be extended to include other options, such as implementation type and FPGA family type.