Presentation transcript:

NDA Confidential. Copyright ©2005, Nallatech.

Implementation of Floating-Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High-Level Languages
Dr. Malachy Devlin, Robin Bruce and Dr. Stephen Marshall

Overview
» This presentation details the initial stages of a project whose aim is to provide developers with an FPGA-based implementation of the VSIPL API
» Research to date has centred on the following key areas of FPGA design:
  » Implementation of pipelined floating-point algorithms in FPGA fabric
  » Use of novel high-level language tools to generate HDL
  » Resource sharing between functional units
» The example application is a basic 4096-point floating-point pulse compression architecture
  » Composed of FFT/IFFT and complex multiply

VSIPL – The Vector, Signal and Image Processing API
» VSIPL is a collaboration between industrial, academic and governmental partners
» VSIPL is intended as a standard API that gives HPEC developers write-once, run-anywhere portability
» It was developed as a C-based API targeting a single microprocessor
» VSIPL implementations are based on a profile, or subset, of the full API
» Implementations can have radically different underlying hardware and memory management systems and still allow applications to be ported between them
» This makes FPGA-implemented VSIPL possible

FPGAs & Floating Point
» VSIPL is best regarded as a floating-point arithmetic API, offering both single and double precision
» FPGA-based floating point has come of age
» The high gate density of the latest generation of FPGAs allows multiple floating-point arithmetic modules to be instantiated
» Up to 25 GFLOPS of peak floating-point performance is possible
» Need for high dynamic range and precision
» Fixed-point solutions are becoming difficult to justify, given the design time and designer expertise they require, in the face of increasing resource density
» Floating-point algorithms can be programmed in C and compiled to HDL code, giving pipelined architectures

Challenges with FPGA VSIPL
1.) Finite size of hardware-implemented algorithms vs. conceptually boundless vectors.
2.) Runtime reconfiguration options offered by microprocessor VSIPL.
3.) Vast scope of VSIPL API vs. limited fabric resource.
4.) Communication bottleneck arising from software-hardware partition.
5.) Difficulty in leveraging parallelism inherent in expressions presented as discrete operations.

Architectural Possibilities
[Figure: three block diagrams. (1) Initialisation, Function 1 Stub, Function 2 Stub, Function 3 Stub, Termination, alongside Function 1, Function 2, Function 3. (2) Initialisation, Data Transfer, Termination, alongside Function 1, Function 2, Function 3. (3) Data IO, alongside Initialisation, Function 1, Function 2, Function 3, Termination.]
1.) Function ‘stubs’: Each function selected for implementation on the fabric transfers data from the host to the FPGA before each operation of the function and brings it back to the host before the next operation. This model (figure 4) is the simplest to implement, but excessive data transfer greatly hinders performance. The work presented here is based on this first model.
2.) Full Application Algorithm Implementation: Rather than using the FPGA to process functions separately and independently, we can use high-level compilers to implement the application’s complete algorithm in C, making calls to FPGA VSIPL C functions. This can significantly reduce the overhead of data communications and FPGA configuration, and allows cross-function optimizations.
3.) FPGA Master Implementation: Traditional systems are considered to be processor-centric; however, FPGAs now have enough capability to create FPGA-centric systems. In this approach the complete application can reside on the FPGA, including program initialisation and the VSIPL application algorithm. The FPGA VSIPL-based application can use the host computer as an I/O engine, or the FPGA can have direct I/O attachments.

Point Pulse Compressor
» Single unit can carry out three different operations
» Data stored in block RAMs to which host has access
» Data stays in place on fabric while different modes are executed
» Host loads data and asserts control signals to start each mode
» Complex Multiply reuses resources from the FFT/IFFT, its presence adding only multiplexer logic
[Figure: Tri-mode Functional Unit. Mode 1 = FFT, Mode 2 = IFFT, Mode 3 = Complex Multiply. Labels: RealA/Result, ImagA/Result, RealB, ImagB, Roots_u1, Roots_u2, size, log2 size, mode, scale.]
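The resource sharing described on this slide can be pictured at the level of a single butterfly. The following C fragment is only a sketch under stated assumptions: the names `unit_mode`, `shared_cmul` and `datapath` are hypothetical, and the real unit's internal structure is not given in the slides. It shows one complex multiplier serving all three modes, with conditional selects standing in for the added multiplexer logic.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical mode encoding, mirroring the slide's Mode 1/2/3 labels. */
typedef enum { MODE_FFT = 1, MODE_IFFT = 2, MODE_CMUL = 3 } unit_mode;

/* The single complex multiplier that all three modes share. */
static float complex shared_cmul(float complex x, float complex y) {
    return x * y;
}

/*
 * One pass through the shared datapath (illustrative only).
 *   FFT / IFFT: radix-2 butterfly on (a, b) with twiddle w; IFFT conjugates w.
 *   CMUL:       out0 = a * b; the add/subtract stage is bypassed, out1 = 0.
 * The conditional selects play the role of the extra multiplexer logic.
 */
void datapath(unit_mode mode, float complex a, float complex b,
              float complex w, float complex *out0, float complex *out1) {
    assert(out0 && out1);

    /* Input multiplexers: choose what feeds the shared multiplier. */
    float complex lhs = (mode == MODE_CMUL) ? a : b;
    float complex rhs = (mode == MODE_CMUL) ? b
                      : (mode == MODE_IFFT) ? conjf(w) : w;

    float complex t = shared_cmul(lhs, rhs);

    if (mode == MODE_CMUL) {
        *out0 = t;          /* plain element-wise product */
        *out1 = 0;
    } else {
        *out0 = a + t;      /* butterfly upper output */
        *out1 = a - t;      /* butterfly lower output */
    }
}
```

In this sketch the multiply mode costs nothing beyond the selects, which is the point the slide makes about multiplexer logic. Any 1/n scaling for the inverse transform is left out.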

VSIPL use of HW Functions
» The functional unit is accessed from the host via the FUSE API
» The hardware functions can be accessed by C functions on the host
» The 1D FFT, IFFT and complex multiply VSIPL functions can access the hardware in a manner which is invisible to the application programmer
» Pulse compression can be performed in one of two ways:
  » As discrete VSIPL function calls to the hardware
  » As a custom function designed to make most efficient use of the hardware, leaving the data on the fabric until all operations have been performed
» The second approach is taken here
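The custom-function approach amounts to running forward transform, element-wise complex multiply and inverse transform back to back on the same buffers. A minimal host-side C sketch of that data flow follows. It does not use the FUSE API or any VSIPL call; the names `dft` and `pulse_compress` are hypothetical, a naive O(n²) DFT stands in for the hardware FFT, and the caller is assumed to supply a table `roots[k] = exp(-2πik/n)` for k = 0..n-1.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/*
 * Naive O(n^2) DFT standing in for the hardware FFT (assumption).
 * roots[k] must hold exp(-2*pi*i*k/n) for k = 0..n-1.
 * inverse != 0 conjugates the roots and scales the result by 1/n.
 */
static void dft(const float complex *in, float complex *out, size_t n,
                const float complex *roots, int inverse) {
    for (size_t k = 0; k < n; k++) {
        float complex acc = 0;
        for (size_t j = 0; j < n; j++) {
            float complex w = roots[(j * k) % n];
            acc += in[j] * (inverse ? conjf(w) : w);
        }
        out[k] = inverse ? acc / (float)n : acc;
    }
}

/*
 * Pulse compression as one fused operation:
 *   out = IFFT( FFT(signal) .* ref_spectrum )
 * The intermediate spectrum lives only in `work`, mirroring the idea of
 * leaving data on the fabric until all three steps are done.
 */
void pulse_compress(const float complex *signal,
                    const float complex *ref_spectrum,
                    float complex *work, float complex *out,
                    size_t n, const float complex *roots) {
    assert(signal && ref_spectrum && work && out && roots && n > 0);

    dft(signal, work, n, roots, 0);          /* step 1: forward transform */
    for (size_t k = 0; k < n; k++)
        work[k] *= ref_spectrum[k];          /* step 2: complex multiply  */
    dft(work, out, n, roots, 1);             /* step 3: inverse transform */
}
```

With discrete calls, each of the three steps would instead move its operands to the fabric and its result back to the host, which is the traffic the fused function is meant to avoid.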

FFT Algorithm
» The FFT algorithm was originally a radix-2 decimation-in-time with 3 nested loops
» Only the innermost loop was pipelined
» 1024-point pulse compression took clock cycles
» The two inner loops were combined into a single loop
» With the inner loop pipelined, 1024-point pulse compression takes only clock cycles, a reduction by a factor of 9.5
» With the algorithm maximally pipelined, the execution time increases linearly with the length of the data vectors
» Contrasts with the exponential increases for the 3-loop case
» The maximum size of pulse compression possible is limited only by the size of the memory structures used to store data. In this case we are limited to 4096 points, requiring 120 BlockRAMs
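The loop merge described on this slide can be sketched in plain C. This is an illustration, not the presenters' source: the function name `fft_radix2` is hypothetical, and the caller is assumed to supply a table `roots[k] = exp(-2πik/n)` for k = 0..n/2-1. The usual group loop and butterfly loop are replaced by one loop of n/2 iterations per stage, with the group and position recovered from the loop index by shifts and masks, so a pipelining compiler sees a single long inner loop.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/*
 * In-place radix-2 decimation-in-time FFT with the two inner loops merged.
 * n must equal 1 << log2n; roots[k] = exp(-2*pi*i*k/n) for k = 0..n/2-1.
 */
void fft_radix2(float complex *x, size_t n, unsigned log2n,
                const float complex *roots) {
    assert(x && roots && n == ((size_t)1 << log2n));

    /* Bit-reversal permutation so the butterflies can run in place. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float complex tmp = x[i];
            x[i] = x[j];
            x[j] = tmp;
        }
    }

    /* Outer loop over stages; one merged loop over all n/2 butterflies. */
    for (unsigned s = 0; s < log2n; s++) {
        size_t half = (size_t)1 << s;            /* butterfly span        */
        for (size_t i = 0; i < n / 2; i++) {
            size_t pos   = i & (half - 1);       /* index within group    */
            size_t group = i >> s;               /* which group           */
            size_t top   = (group << (s + 1)) + pos;
            size_t bot   = top + half;
            float complex w = roots[pos << (log2n - s - 1)];
            float complex t = w * x[bot];
            x[bot] = x[top] - t;
            x[top] = x[top] + t;
        }
    }
}
```

Each stage performs exactly n/2 butterflies regardless of how they are grouped, so the merged loop has a fixed trip count and no pipeline restart between groups.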

Resource Consumption
» Size of FFT limited only by available BlockRAM
» Using a different approach to memory access could allow for practically limitless FFTs, complex multiplies and pulse compression
» Implemented on a Virtex II 6000 FPGA on Nallatech BenNUEY PCI Card

Pipelined Hardware vs. PC Host
» Host-to-fabric communication delays dominate total calculation time
» Illustrates need to limit host-to-fabric communication
» Without pipelining to limit it, software calculation time increases exponentially with vector size
» Shows benefits of custom hardware to implement functions
» Design is clocked at 120 MHz
» HW Only:
  » Data transfer time not included
» HW&Host:
  » HW time + data transfer
» Host Only:
  » Software-only pulse compression

Conclusion & Results
» Multiple VSIPL functions can be implemented in a single unit that capitalises on pipelining possibilities and uses resource sharing to minimise hardware requirements
» Rapid development of other VSIPL functions is possible
» Scalable architectures can tackle the problem of large vector sizes
» Floating-point algorithms can be rapidly developed and efficiently implemented
» 4096-point single-precision floating-point pulse compression in 641.5 µs