Hardware/Software Partitioning of Floating-Point Software Applications to Fixed-Point Coprocessor Circuits Lance Saldanha, Roman Lysecky Department of.

Presentation transcript:

Hardware/Software Partitioning of Floating-Point Software Applications to Fixed-Point Coprocessor Circuits
Lance Saldanha, Roman Lysecky
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, USA
{saldanha,

Roman Lysecky, University of Arizona

Slide 2. Introduction: Traditional HW/SW Partitioning
- Benefits of HW/SW Partitioning
    - Speedup of 2X to 10X
    - Speedup of 1000X possible
    - Energy reduction of 25% to 95%
- HW/SW Partitioning Challenges
    - Limited support for pointers
    - Limited support for dynamic memory allocation
    - Limited support for function recursion
    - Very limited support for floating-point operations
[Figure: partitioning flow. Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning into HW and SW. Target: µP with I$ and D$ plus HW coprocessor (ASIC/FPGA).]

Slide 3. Introduction: Floating Point Software Applications
- Floating Point Representation
    - Pros
        - IEEE standard 754
        - Convenience: supported within most programming languages (C, C++, Java, etc.)
    - Cons
        - Partitioning floating point kernels directly to hardware requires:
            - Large area resources
            - Multi-cycle latencies
- Alternatively, can use fixed point representation to support real numbers

    void Reference_IDCT(short* block) {
        int i, j, k, v;
        float part_prod, tmp[64];
        for (i=0; i<8; i++)
            for (j=0; j<8; j++) {
                part_prod = 0.0;
                for (k=0; k<8; k++) {
                    part_prod += c[k][j]*block[8*i+k];
                }
                tmp[8*i+j] = part_prod;
            }
        ...
    }

[Figure: Single Precision Floating Point layout: S, E (8 bits), M (23 bits).]
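As a rough illustration of the S/E/M layout on this slide, the following C sketch splits a single precision value into its three fields. The names `float_fields` and `decode_float` are hypothetical and not part of the presentation; the sketch assumes `float` is a 32-bit IEEE 754 single, which holds on most platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of a single precision value: S (sign), E (8 bits), M (23 bits). */
typedef struct {
    uint32_t sign;      /* 1 when the value is negative */
    uint32_t exponent;  /* stored exponent, biased by 127 */
    uint32_t mantissa;  /* fraction bits, without the implicit leading 1 */
} float_fields;

/* Hypothetical helper: split a float into its S, E and M fields. */
static float_fields decode_float(float value)
{
    uint32_t bits;
    float_fields f;

    assert(sizeof bits == sizeof value);  /* assumes a 32-bit float */
    memcpy(&bits, &value, sizeof bits);   /* copy the raw bit pattern */

    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, `decode_float(1.0f)` yields a sign of 0, a stored exponent of 127 and a mantissa of 0. A hardware converter needs all three fields before it can decide how far to shift the mantissa.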

Slide 4. Introduction: Fixed Point Software Applications
- Fixed Point Representation
    - Pros
        - Simple and fast hardware implementation
        - Mostly equivalent to integer operations
    - Cons
        - No direct support within most programming languages
        - Requires application to be converted to fixed point representation

(The floating point Reference_IDCT from the previous slide is shown again for comparison.)

[Figure: Fixed Point (32.20) layout: I (12 bits), F (20 bits).]

    typedef long fixed;
    #define PRECISION_AMOUNT 16

    void Reference_IDCT(short* block) {
        int i, j, k, v;
        fixed part_prod, tmp[64];
        long long prod;
        for (i=0; i<8; i++)
            for (j=0; j<8; j++) {
                part_prod = 0;
                for (k=0; k<8; k++) {
                    /* convert the integer sample to fixed point, then multiply */
                    prod = c[k][j]*( ((fixed)block[8*i+k]) << PRECISION_AMOUNT );
                    part_prod += prod >> (PRECISION_AMOUNT*2);
                }
                tmp[8*i+j] = part_prod;
            }
        ...
    }
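The claim that fixed point is "mostly equivalent to integer operations" can be sketched in C for the 12.20 layout shown in the figure. This is a minimal sketch, not the code from the slide: the names `fixed_t`, `FRAC_BITS`, `to_fixed`, `from_fixed`, `fixed_mul` and `fixed_dot8` are hypothetical, and the code assumes that right-shifting a negative 64-bit value is an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 20          /* F in the 12.20 layout */
typedef int32_t fixed_t;      /* 12 integer bits (sign included) + 20 fraction bits */

/* Scale a real value by 2^FRAC_BITS and round to the nearest integer. */
static fixed_t to_fixed(double x)
{
    double scaled = x * (double)(1 << FRAC_BITS);
    assert(scaled >= -2147483648.0 && scaled <= 2147483647.0); /* must fit 32 bits */
    return (fixed_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

static double from_fixed(fixed_t x)
{
    return (double)x / (double)(1 << FRAC_BITS);
}

/* Multiply in 64 bits, then drop the extra FRAC_BITS of the product. */
static fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* 40 fraction bits */
    return (fixed_t)(wide >> FRAC_BITS);      /* back to 20 fraction bits */
}

/* Eight-term dot product, shaped like the inner loop of Reference_IDCT. */
static fixed_t fixed_dot8(const fixed_t *coeff, const int16_t *samples)
{
    fixed_t acc = 0;
    for (int k = 0; k < 8; k++) {
        assert(samples[k] >= -2048 && samples[k] <= 2047); /* 12 integer bits */
        fixed_t s = (fixed_t)samples[k] * (1 << FRAC_BITS); /* integer to fixed */
        acc += fixed_mul(coeff[k], s);
    }
    return acc;
}
```

With these helpers, `fixed_mul(to_fixed(1.5), to_fixed(2.0))` equals `to_fixed(3.0)`. The only operation that differs from plain integer arithmetic is the shift after the multiply, which is why the hardware cost stays close to that of an integer datapath.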

Slide 5. Introduction: Converting Floating Point to Fixed Point
- Converting Floating Point SW to Fixed Point SW
    - Manually or automatically convert software to utilize fixed point representation
    - Need to determine appropriate fixed point representation
[Figure: tool flow. Software Application (Float), Float to Fixed Conversion, Software Application (Fixed), Application Profiling, Critical Kernels, Partitioning into HW and SW.]

Slide 6. Introduction: Converting Floating Point to Fixed Point
- Automated Tools for Converting Floating Point to Fixed Point
    - fixify: Belanovic, Rupp [RSP 2005]
        - Statistical optimization approach to minimize signal to quantization noise (SQNR) of fixed point code
    - FRIDGE: Keding et al. [DATE 1998]
        - Designer-specified annotations on key fixed point values can be interpolated to the remaining code
    - Cmar et al. [DATE 1999]
        - Annotate fixed point values with range requirements
        - Iterative, designer-guided simulation framework to optimize implementation
    - Menard et al. [CASES 2002], Kum et al. [ICASSP 1999]
        - Conversion for fixed-point DSP processors
[Figure: tool flow. Software Application (Float), Float to Fixed Conversion, Software Application (Fixed), Application Profiling, Critical Kernels, Partitioning into HW and SW.]

Slide 7. Introduction: Converting Floating Point to Fixed Point
- Converting Floating Point SW to Fixed Point HW
    - Convert resulting floating point hardware to fixed point software to utilize fixed point representation
        - Shi, Brodersen [DAC 2004]
        - Cmar et al. [DATE 1999]
    - Must still convert software to fixed point representation
[Figure: tool flow. SW (C/Matlab), Application Profiling, Critical Kernels (Float), Partitioning, SW (Float), HW, Float to Fixed Conversion, HW (Fixed), SW (Fixed).]

Slide 8. Partitioning Floating Point SW to Fixed Point HW: Separate Floating Point and Fixed Point Domains
- Proposed Partitioning for Floating Point SW to Fixed Point HW
    - Separate computation into floating point and fixed point domains
    - Floating Point Domain
        - Processor (SW), Caches, and Memory
        - All values in memory will utilize floating point representation
    - Fixed Point Domain
        - HW Coprocessors
    - Float-to-Fixed and Fixed-to-Float converters at the boundary between SW/Memory and HW will perform conversion
[Figure: FLOATING POINT DOMAIN (µP, I$, D$) and FIXED POINT DOMAIN (HW COPROCESSORS, ASIC/FPGA), joined by Float-to-Fixed and Fixed-to-Float converters.]

Slide 9. Partitioning Floating Point SW to Fixed Point HW: Separate Floating Point and Fixed Point Domains
- Potential Benefits
    - No need to re-write initial floating point software
    - Final software can utilize floating point
    - Efficient fixed point implementation
    - Can treat floating point values as integers during partitioning
- Still requires determining the appropriate fixed point representation
    - Can be accomplished using existing methods or directly specified by designer
[Figure: tool flow. Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning into HW (Integer) and SW (Float); Floating Point Profiling (Optional) supplies the Fixed Point Representation to Fixed Point Conversion, which produces HW (Fixed).]

Slide 10. Partitioning Floating Point SW to Fixed Point HW: Float-to-Fixed and Fixed-to-Float Converters
- Float-to-Fixed and Fixed-to-Float Converters
    - Implemented as configurable Verilog modules
    - Configurable Floating Point Options:
        - FloatSize
        - MantissaBits
        - ExponentBits
    - Configurable Fixed Point Options:
        - FixedSize
        - RadixPointSize
        - RadixPoint
    - RadixPoint can be implemented as input or parameter
[Figure: Float-to-Fixed converter block diagram. Inputs: Float (S, E, M; FloatSize) and RadixPoint (RadixPointSize). Blocks: Special Cases (Zero), Normal Cases, Shift Calc (Dir, Amount), Shifter, Overflow Calc. Outputs: Fixed (FixedSize), Overflow, Exception.]
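The converter datapath in the figure (special cases, shift calculation, shifter, overflow check) can be modeled in software. The C sketch below is an assumption-laden model, not the authors' Verilog: it fixes FloatSize at 32 (single precision) and FixedSize at 32, treats denormals as zero, flags infinities and NaNs as overflow, and truncates bits shifted out on the right. The function name `float_to_fixed` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a Float-to-Fixed converter: 32-bit float in, 32-bit fixed out.
   *overflow is set to 1 when the value cannot be represented. */
static int32_t float_to_fixed(float value, int radix_point, int *overflow)
{
    uint32_t bits;
    assert(sizeof bits == sizeof value);          /* assumes a 32-bit float */
    assert(radix_point >= 0 && radix_point <= 31);
    memcpy(&bits, &value, sizeof bits);

    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu);
    uint64_t mant = bits & 0x7FFFFFu;
    *overflow = 0;

    /* Special cases */
    if (exponent == 0)        /* zero or denormal: treated as zero (assumption) */
        return 0;
    if (exponent == 255) {    /* infinity or NaN: reported as overflow (assumption) */
        *overflow = 1;
        return 0;
    }

    /* Normal case: restore the implicit leading 1, then compute the shift.
       value = mant * 2^(exponent - 127 - 23); fixed = value * 2^radix_point. */
    mant |= (uint64_t)1 << 23;
    int shift = exponent - 127 - 23 + radix_point;  /* > 0: left, < 0: right */

    uint64_t magnitude;
    if (shift > 39) {                    /* far beyond 32 bits */
        *overflow = 1;
        return 0;
    } else if (shift >= 0) {
        magnitude = mant << shift;
    } else {
        magnitude = (-shift >= 24) ? 0 : (mant >> -shift);  /* truncates */
    }

    /* Overflow check against the two's complement range of 32 bits */
    uint64_t limit = sign ? 0x80000000u : 0x7FFFFFFFu;
    if (magnitude > limit) {
        *overflow = 1;
        return 0;
    }
    return (int32_t)(sign ? -(int64_t)magnitude : (int64_t)magnitude);
}
```

With a radix point of 20, `float_to_fixed(1.0f, 20, &ov)` returns 1048576, while 4096.0f sets the overflow flag because it needs more than 12 integer bits. Passing `radix_point` as a run-time argument corresponds loosely to the "input" option on the slide; a compile-time constant would correspond to the "parameter" option.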

Slide 11. Partitioning Floating Point SW to Fixed Point HW: Coprocessor Interface
- Hardware Coprocessor Interface
    - Integrates Float-to-Fixed and Fixed-to-Float converters with memory interface
    - All values read from memory are converted through Float-to-Fixed converter
        - Integer: IntDataIn
        - Fixed: FixedDataIn
    - Separate outputs for integer and fixed data
        - Integer: WrInt, IntDataOut
        - Fixed: WrFixed, FixedDataOut
[Figure: HW Coprocessor interface. Memory side: Addr, BE, Rd, Wr, DataIn, DataOut. Coprocessor side: IntDataIn, FixedDataIn, WrInt, IntDataOut, WrFixed, FixedDataOut, connected through the Float-to-Fixed and Fixed-to-Float converters.]

Slide 12. Partitioning Floating Point SW to Fixed Point HW: Partitioning Tool Flow
- HW/SW Partitioning of Floating Point SW to Fixed Point HW
    - Kernels initially partitioned as integer implementation
    - Synthesis annotations used to identify floating point values
[Figure: tool flow. Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning into HW (Integer) and SW (Float); Floating Point Profiling (Optional) supplies the Fixed Point Representation to Fixed Point Conversion, which produces HW (Fixed).]

    module Coprocessor (Clk, Rst, Addr, BE, Rd, Wr, DataOut, DataIn);
        input Clk, Rst;
        output [31:0] Addr;
        output BE, Rd, Wr;
        output signed [31:0] DataOut;
        input signed [31:0] DataIn;

        // syn_fixed_point (p:SP)
        reg signed [31:0] p;
        reg signed [31:0] c1;

        always @(posedge Clk) begin
            // syn_fixed_point (p:SP, DataIn:SP)
            p <= p * DataIn + c1;
        end
    endmodule

Slide 13. Partitioning Floating Point SW to Fixed Point HW: Partitioning Tool Flow
- HW/SW Partitioning of Floating Point SW to Fixed Point HW
    - Fixed point registers, computations, and memory accesses converted to specified representation
[Figure: tool flow. Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning into HW (Integer) and SW (Float); Floating Point Profiling (Optional) supplies the Fixed Point Representation to Fixed Point Conversion, which produces HW (Fixed).]

(The annotated integer Coprocessor module from the previous slide is shown again as the input to the conversion. The converted module follows.)

    module Coprocessor (Clk, Rst, Addr, BE, Rd, WrInt, WrFixed,
                        IntDataOut, FixedDataOut, IntDataIn, FixedDataIn);
        ...
        // Fixed point register
        reg signed [FixedSize-1:0] p;
        // Integer register
        reg signed [31:0] c1;

        always @(posedge Clk) begin
            // Fixed point multiplication and addition
            // with conversion from integer to fixed point
            p <= ((p * FixedDataIn) >> RadixPoint) + (c1 << RadixPoint);
        end
    endmodule
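The arithmetic performed by the converted register update can be checked with a small software model. The C sketch below is an illustration under assumptions, not the generated hardware: it fixes the fixed point width at 32 bits, uses the hypothetical name `fixed_mac`, and assumes an arithmetic right shift for negative values.

```c
#include <assert.h>
#include <stdint.h>

/* One update step p <- p * data_in + c1, where p and data_in are fixed point
   values with radix_point fraction bits and c1 is a plain integer. */
static int32_t fixed_mac(int32_t p, int32_t data_in, int32_t c1, int radix_point)
{
    assert(radix_point >= 0 && radix_point < 31);

    /* The raw product has 2 * radix_point fraction bits; shift back to radix_point. */
    int64_t product = ((int64_t)p * (int64_t)data_in) >> radix_point;

    /* An integer has no fraction bits, so scale it up before adding. */
    int64_t addend = (int64_t)c1 * ((int64_t)1 << radix_point);

    int64_t result = product + addend;
    assert(result >= INT32_MIN && result <= INT32_MAX);  /* must fit the register */
    return (int32_t)result;
}
```

With a radix point of 20, p = 1.5 (1572864), data_in = 2.0 (2097152) and c1 = 3, the result is 6291456, which is 6.0 in the same format. The two shifts are the only additions relative to the integer version of the update.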

Slide 14. Partitioning Floating Point SW to Fixed Point HW: Experimental Results
- Experimental Setup
    - 250 MHz MIPS processor with floating point support
    - Xilinx Virtex-5 FPGA
    - HW coprocessors execute at maximum frequency achieved by Xilinx ISE 9.2
- Benchmarks
    - MPEG2 Encode/Decode (MediaBench)
    - Epic (MediaBench)
    - FFT/IFFT (MiBench)
    - All applications require significant floating point operations
    - Partition both integer and floating point kernels

Slide 15. Partitioning Floating Point SW to Fixed Point HW: Experimental Results
- Floating Point and Fixed Point Representations
    - Utilized fixed point representations that provide results identical to the software floating point implementation
    - MPEG2 Encode/Decode (MediaBench)
        - Float: integer (memory), single precision (computation)
        - Fixed: 32-bit, radix of 20 (12.20)
    - Epic (MediaBench)
        - Float: single precision (memory), double precision (computation)
        - Fixed: 64-bit, radix of 47 (17.47)
    - FFT/IFFT (MiBench)
        - Float: single precision (memory), double precision (computation)
        - Fixed: 51-bit, radix of 30 (21.30)
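To get a feel for what the three formats on this slide can represent, the C sketch below computes the range limit and resolution of an I.F format. It assumes a signed two's complement interpretation in which the I integer bits include the sign bit; the slides do not state this, so treat it as an assumption. The names `fixed_format`, `two_to_the` and `describe_format` are hypothetical.

```c
#include <assert.h>

/* Representable values lie in [-limit, limit - step], in multiples of step. */
typedef struct {
    double limit;  /* 2^(int_bits - 1) */
    double step;   /* 2^(-frac_bits), the resolution */
} fixed_format;

/* Exact power of two in double precision, for moderate exponents. */
static double two_to_the(int n)
{
    double v = 1.0;
    for (; n > 0; n--) v *= 2.0;
    for (; n < 0; n++) v *= 0.5;
    return v;
}

/* Assumes signed two's complement with the sign counted among int_bits. */
static fixed_format describe_format(int int_bits, int frac_bits)
{
    fixed_format f;
    assert(int_bits >= 1 && frac_bits >= 0);
    f.limit = two_to_the(int_bits - 1);
    f.step = two_to_the(-frac_bits);
    return f;
}
```

Under that assumption, 12.20 covers [-2048, 2048) in steps of 2^-20, 17.47 covers [-65536, 65536) in steps of 2^-47, and 21.30 covers [-1048576, 1048576) in steps of 2^-30.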

Slide 16. Partitioning Floating Point SW to Fixed Point HW: Experimental Results, Float-to-Fixed and Fixed-to-Float Converters
- Fixed-to-Float and Float-to-Fixed Converter Performance (RadixPoint Parameter vs. Input)
    - Float-to-Fixed (RadixPoint Parameter):
        - 9% faster and 10% fewer LUTs compared to input version
    - Fixed-to-Float (RadixPoint Parameter):
        - 25% faster but requires 30% more LUTs than input version

Slide 17. Partitioning Floating Point SW to Fixed Point HW: Experimental Results, Application Speedup
- Application Speedup
    - RadixPoint Parameter Implementation:
        - Average speedup of 4.4X
        - Maximum speedup of 6.8X (fft/ifft)
    - RadixPoint Input Implementation:
        - Average speedup of 4.0X
        - Maximum speedup of 6.2X (fft/ifft)

Slide 18. Conclusions
- Presented a new partitioning approach for floating point software applications
    - No need to re-write initial floating point software
    - Hardware coprocessors utilize efficient fixed point implementation
    - Can treat floating point values as integers during partitioning
- Developed efficient, configurable Float-to-Fixed and Fixed-to-Float hardware converters
    - Implemented in Verilog with both parameter and input options for specifying RadixPoint
- Developed semi-automated HW/SW partitioning approach for floating point applications
    - Achieves average application speedup of 4.4X (max of 6.8X) compared to floating point software implementation
    - HW coprocessor area requirements similar to integer based coprocessor implementation

Slide 19. Current and Future Work
- Current Work
    - Dynamically adaptable fixed-point coprocessors
        - Float-to-Fixed and Fixed-to-Float converters open the door to dynamically adapting the fixed point representation at runtime
    - RadixGen Component
        - Responds to various overflows and dynamically adjusts RadixPoint
            - Float-to-Fixed conversion overflow
            - Integer-to-Fixed conversion overflow
            - Arithmetic overflow
    - Initial results achieve similar performance speedups compared to RadixPoint input implementation
[Figure: FLOATING POINT DOMAIN (µP, I$, D$) and FIXED POINT DOMAIN (Coprocessor), joined by Float-to-Fixed and Fixed-to-Float converters, with a RadixGen component receiving Arithmetic, Conv. and Integer overflow signals.]
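The slide does not describe how RadixGen chooses a new RadixPoint, so the C sketch below shows only one possible policy, invented here for illustration: when a 32-bit addition overflows, give up one fraction bit and rescale the result. The names `radix_gen` and `adaptive_add` are hypothetical, and a real design would also have to rescale every other value held in the old format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical run-time state: the current number of fraction bits. */
typedef struct {
    int radix_point;
} radix_gen;

/* Add two fixed point values that share gen->radix_point. On arithmetic
   overflow, lower the radix point by one (trading one bit of precision for
   one bit of range) and return the sum in the new format. */
static int32_t adaptive_add(radix_gen *gen, int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* cannot overflow 64 bits */
    assert(gen->radix_point >= 0);

    if ((sum > INT32_MAX || sum < INT32_MIN) && gen->radix_point > 0) {
        gen->radix_point -= 1;   /* one fewer fraction bit */
        sum /= 2;                /* rescale; the lowest bit is dropped */
    }

    /* No fraction bits left to give up: saturate instead. */
    if (sum > INT32_MAX) sum = INT32_MAX;
    if (sum < INT32_MIN) sum = INT32_MIN;
    return (int32_t)sum;
}
```

Because the sum of two 32-bit values exceeds the range by at most one bit, a single adjustment is always enough in this policy. Starting from a radix point of 20, adding 1536.0 to itself overflows the 12.20 range, so the radix point drops to 19 and the result 3072.0 is returned in 13.19 format.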

Slide 20. Current and Future Work
- Future Work
    - Optimization of fixed point coprocessor implementation
        - Utilize multiple fixed point representations within a single computation
        - Reduce area, improve performance, or reduce power?
    - Integrating proposed methodology with existing high-level synthesis tools
    - Further developing dynamically adaptable fixed-point representation
        - Can a dynamically adaptable fixed point representation provide the same dynamic range and precision as a floating point implementation?
- Code Release
    - Release of Verilog for Fixed-to-Float and Float-to-Fixed components in near future