NDA Confidential. Copyright ©2005, Nallatech.
Implementation of Floating-Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High-Level Languages
Dr. Malachy Devlin, Robin Bruce and Dr. Stephen Marshall
Overview
»This presentation details the initial stages of a project that aims to provide developers with an FPGA-based implementation of the VSIPL API
»Research to date has centred on the following key areas of FPGA design:
  »Implementation of pipelined floating-point algorithms in FPGA fabric
  »Use of novel high-level language tools to realise HDL
  »Resource sharing between functional units
»The example application is a basic 4096-point floating-point pulse compression architecture
  »Composed of FFT/IFFT and complex multiply
VSIPL – The Vector, Signal and Image Processing API
»VSIPL is a collaboration between industrial, academic and governmental partners
»VSIPL is meant as a standard API that gives HPEC developers write-once, run-anywhere portability
»It was developed as a C-based API targeting a single microprocessor
»VSIPL implementations are based on a profile, or subset, of the full API
»Implementations can have radically different underlying hardware and memory management systems and still allow applications to be ported between them
»This makes FPGA-implemented VSIPL possible
FPGAs & Floating Point
»VSIPL is best regarded as a floating-point arithmetic API, offering both single and double precision
»FPGA-based floating point has come of age
»High gate density of the latest generation of FPGAs allows multiple floating-point arithmetic modules to be instantiated
»Up to 25 GFLOPS of peak floating-point performance is possible
»Need for high dynamic range and precision
»As resource density increases, fixed-point solutions are becoming difficult to justify in terms of the design time and designer expertise they require
»Floating-point algorithms can be programmed in C and compiled to HDL code, giving pipelined architectures
Challenges with FPGA VSIPL
»1.) Finite size of hardware-implemented algorithms vs. conceptually boundless vectors.
»2.) Runtime reconfiguration options offered by microprocessor VSIPL.
»3.) Vast scope of VSIPL API vs. limited fabric resource.
»4.) Communication bottleneck arising from software-hardware partition.
»5.) Difficulty in leveraging parallelism inherent in expressions presented as discrete operations.
Architectural Possibilities
[Figure: block diagrams of the three models. Model 1: Initialisation, Function 1 Stub, Function 2 Stub, Function 3 Stub, Termination; Function 1, Function 2, Function 3. Model 2: Initialisation, Data Transfer, Termination; Function 1, Function 2, Function 3. Model 3: Data IO; Initialisation, Function 1, Function 2, Function 3, Termination.]
1.) Function ‘stubs’: Each function selected for implementation on the fabric transfers data from the host to the FPGA before each operation of the function and brings it back to the host before the next operation. This model (figure 4) is the simplest to implement, but excessive data transfer greatly hinders performance. The work presented here is based on this first model.
2.) Full Application Algorithm Implementation: Rather than using the FPGA to process functions separately and independently, we can use high-level compilers to implement the application’s complete algorithm in C, making calls to FPGA VSIPL C functions. This can significantly reduce the overhead of data communications and FPGA configuration, and opens the way to cross-function optimizations.
3.) FPGA Master Implementation: Traditional systems are considered to be processor-centric, but FPGAs now have enough capability to support FPGA-centric systems. In this approach the complete application can reside on the FPGA, including program initialisation and the VSIPL application algorithm. The FPGA VSIPL-based application can use the host computer as an I/O engine, or the FPGA can have direct I/O attachments.
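The difference in data movement between the stub model and the models that keep data on the fabric can be put into rough numbers. The sketch below is a hypothetical cost model, not a measurement: the function names are invented for illustration, each sample is assumed to be a complex single-precision value (8 bytes), and second operands such as a reference vector are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model (not from the slides): bytes moved between host
   and FPGA when n_ops functions are applied in sequence to one vector of
   n_points complex single-precision samples, assumed to be 8 bytes each. */

/* Model 1 (function stubs): every function sends the vector to the
   fabric and reads the result back before the next function runs. */
static size_t bytes_stub_model(size_t n_ops, size_t n_points)
{
    assert(n_points > 0);
    return n_ops * 2 * n_points * 8;
}

/* Data kept on the fabric between functions: one load before the first
   operation and one read-back after the last. */
static size_t bytes_on_fabric(size_t n_ops, size_t n_points)
{
    assert(n_points > 0);
    return n_ops ? 2 * n_points * 8 : 0;
}
```

Under these assumptions, three operations on a 4096-point vector move 196,608 bytes in the stub model and 65,536 bytes when intermediate results stay on the fabric.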
Point Pulse Compressor
»Single unit can carry out three different operations
»Data stored in block RAMs to which host has access
»Data stays in place on fabric while different modes are executed
»Host loads data and asserts control signals to start each mode
»Complex Multiply reuses resources from the FFT/IFFT, its presence adding only multiplexer logic
[Figure: Tri-mode Functional Unit. Mode 1 = FFT, Mode 2 = IFFT, Mode 3 = Complex Multiply. Memories: RealA/Result, ImagA/Result, RealB, ImagB, Roots_u1, Roots_u2. Control inputs: size, log2 size, mode, scale.]
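The behaviour of the tri-mode unit can be sketched as a small software model. This is an illustration under stated assumptions, not the hardware design: the `tm_` names and the memory layout are hypothetical, the root table is assumed to hold exp(-2*pi*i*k/size) and to be loaded by the host, and the 1/size scaling on the inverse transform is a guess at what the scale input does.

```c
#include <assert.h>
#include <stddef.h>

#define TM_MAX_POINTS 4096 /* largest size quoted on the slides */

/* Mode numbering follows the figure. */
typedef enum { TM_FFT = 1, TM_IFFT = 2, TM_CMUL = 3 } tm_mode;

/* Software stand-in for the on-fabric memories; the layout is illustrative. */
typedef struct {
    float re_a[TM_MAX_POINTS], im_a[TM_MAX_POINTS]; /* operand A, overwritten by the result */
    float re_b[TM_MAX_POINTS], im_b[TM_MAX_POINTS]; /* operand B, used by complex multiply */
    float re_w[TM_MAX_POINTS / 2], im_w[TM_MAX_POINTS / 2]; /* roots, loaded by the host */
    size_t size; /* number of points, a power of two */
} tm_unit;

/* The single complex multiplier that all three modes share. */
static void tm_cmul(float ar, float ai, float br, float bi, float *pr, float *pi)
{
    *pr = ar * br - ai * bi;
    *pi = ar * bi + ai * br;
}

/* In-place radix-2 decimation-in-time transform on memory A. */
static void tm_transform(tm_unit *u, int inverse)
{
    size_t n = u->size;
    for (size_t i = 1, j = 0; i < n; i++) { /* bit-reversal reordering */
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float t = u->re_a[i]; u->re_a[i] = u->re_a[j]; u->re_a[j] = t;
            t = u->im_a[i]; u->im_a[i] = u->im_a[j]; u->im_a[j] = t;
        }
    }
    for (size_t half = 1; half < n; half <<= 1) {
        size_t stride = n / (2 * half); /* step through the root table */
        for (size_t base = 0; base < n; base += 2 * half) {
            for (size_t k = 0; k < half; k++) {
                size_t top = base + k, bot = top + half;
                /* the inverse transform uses the conjugate roots */
                float wi = inverse ? -u->im_w[k * stride] : u->im_w[k * stride];
                float vr, vi;
                tm_cmul(u->re_a[bot], u->im_a[bot], u->re_w[k * stride], wi, &vr, &vi);
                u->re_a[bot] = u->re_a[top] - vr; u->im_a[bot] = u->im_a[top] - vi;
                u->re_a[top] += vr;               u->im_a[top] += vi;
            }
        }
    }
    if (inverse) /* assumption: the inverse is scaled by 1/size */
        for (size_t i = 0; i < n; i++) { u->re_a[i] /= (float)n; u->im_a[i] /= (float)n; }
}

/* Run one mode; data stays in the unit's memories between calls. */
static void tm_run(tm_unit *u, tm_mode mode)
{
    assert(u->size >= 1 && u->size <= TM_MAX_POINTS && (u->size & (u->size - 1)) == 0);
    if (mode == TM_CMUL) {
        for (size_t i = 0; i < u->size; i++)
            tm_cmul(u->re_a[i], u->im_a[i], u->re_b[i], u->im_b[i], &u->re_a[i], &u->im_a[i]);
    } else {
        tm_transform(u, mode == TM_IFFT);
    }
}
```

In this model a pulse compression would be three calls on the same unit, `tm_run(&u, TM_FFT)`, `tm_run(&u, TM_CMUL)` and `tm_run(&u, TM_IFFT)`, with the signal loaded into memory A and the frequency-domain reference into memory B beforehand. Routing the complex multiply through the same `tm_cmul` as the butterflies mirrors the resource sharing described above.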
VSIPL use of HW Functions
»The functional unit is accessed from the host via the FUSE API
»The hardware functions can be accessed by C functions on the host
»The 1D FFT, IFFT and complex multiply VSIPL functions can access the hardware in a manner that is invisible to the application programmer
»Pulse compression can be performed in one of two ways:
  »As discrete VSIPL function calls to the hardware
  »As a custom function designed to make most efficient use of the hardware, leaving the data on the fabric until all operations have been performed
»The second approach is taken here
FFT Algorithm
»FFT algorithm was originally a radix-2 decimation-in-time with 3 nested loops
  »Only the innermost loop was pipelined
  »1024 point pulse compression took clock cycles
»The two inner loops were combined into a single loop
  »With the inner loop pipelined, 1024 point pulse compression takes only clock cycles, a reduction by a factor of 9.5x
»With the algorithm maximally pipelined, the execution time increases linearly with the size of the data vectors
  »Contrasts with the exponential increases for the 3-loop case
»The maximum size of pulse compression possible is limited only by the size of the memory structures used to store data. In this case we are limited to 4096-point, requiring 120 block RAMs
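One way to combine the two inner loops of a radix-2 decimation-in-time FFT is to replace the group loop and the butterfly loop with a single flat loop over the size/2 butterflies of each stage, recovering the indices from the flat counter. The sketch below shows both forms side by side. It is an assumption about the general technique, not the presenters' source code: the function names are hypothetical, the input is assumed to be already in bit-reversed order, and `wr`/`wi` are assumed to hold the n/2 roots exp(-2*pi*i*k/n).

```c
#include <assert.h>
#include <stddef.h>

/* One radix-2 butterfly on split real/imaginary arrays. */
static void butterfly(float *re, float *im, size_t top, size_t bot, float wr, float wi)
{
    float vr = re[bot] * wr - im[bot] * wi;
    float vi = re[bot] * wi + im[bot] * wr;
    re[bot] = re[top] - vr; im[bot] = im[top] - vi;
    re[top] += vr;          im[top] += vi;
}

/* Three nested loops: stage, group within stage, butterfly within group.
   The innermost trip count changes from stage to stage (1, 2, 4, ...). */
static void fft_stages_3loop(float *re, float *im, const float *wr, const float *wi, size_t n)
{
    for (size_t half = 1; half < n; half <<= 1)
        for (size_t base = 0; base < n; base += 2 * half)
            for (size_t k = 0; k < half; k++)
                butterfly(re, im, base + k, base + k + half,
                          wr[k * (n / (2 * half))], wi[k * (n / (2 * half))]);
}

/* Two loops: the group and butterfly loops are fused into one flat loop
   that always runs n/2 times per stage. */
static void fft_stages_fused(float *re, float *im, const float *wr, const float *wi, size_t n)
{
    assert(n >= 1 && (n & (n - 1)) == 0); /* n must be a power of two */
    for (size_t half = 1; half < n; half <<= 1) {
        size_t stride = n / (2 * half);   /* step through the root table */
        for (size_t b = 0; b < n / 2; b++) {
            size_t k = b & (half - 1);    /* position inside the group */
            size_t top = 2 * (b - k) + k; /* group base plus k */
            butterfly(re, im, top, top + half, wr[k * stride], wi[k * stride]);
        }
    }
}
```

Both functions perform the same butterflies in the same order. The fused form gives the inner loop a fixed trip count of n/2 in every stage, for (n/2)·log2(n) iterations in total: 5,120 for a 1024-point transform and 24,576 for 4096 points. These are iteration counts for a single transform, not clock-cycle figures.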
Resource Consumption
»Size of FFT limited only by available block RAM
»A different approach to memory access could allow for practically limitless FFTs, complex multiplies and pulse compression
»Implemented on a Virtex-II 6000 FPGA on a Nallatech BenNUEY PCI card
Pipelined Hardware vs. PC Host
»Host-to-fabric communication delays dominate total calculation time
  »Illustrates need to limit host-to-fabric communication
»Without pipelining to limit it, software calculation time increases exponentially with vector size
  »Shows benefits of custom hardware to implement functions
»Design is clocked at 120 MHz
»HW Only:
  »Data transfer time not included
»HW&Host:
  »HW time + data transfer
»Host Only:
  »Software-only pulse compression
Conclusion & Results
»Multiple VSIPL functions can be implemented in a single unit that capitalises on pipelining possibilities and uses resource sharing to minimise hardware requirements
»Rapid development of other VSIPL functions is possible
»Scalable architectures can tackle the problem of large vector sizes
»Floating-point algorithms can be rapidly developed and efficiently implemented
»4096-point single-precision floating-point pulse compression in 641.5 µs