A Parameterized Floating Point Library Applied to Multispectral Image Clustering Xiaojun Wang Dr. Miriam Leeser Rapid Prototyping Laboratory Northeastern.

Presentation transcript:

A Parameterized Floating Point Library Applied to Multispectral Image Clustering
Xiaojun Wang, Dr. Miriam Leeser
Rapid Prototyping Laboratory, Northeastern University

Wang P166/MAPLD
Outline
Project overview
Library hardware modules
Floating point divider and square root
K-means clustering application for multispectral satellite images using the floating point library
Conclusions and future work

Variable Precision Floating Point Library
A library of fully pipelined and parameterized floating point modules
Implementations well suited for state-of-the-art FPGAs
  – Xilinx Virtex II FPGAs and Altera Stratix devices
  – Embedded multipliers and Block RAM
Signal/image processing algorithms accelerated using this library

Questions to Answer
Why floating point (FP)?
Why parameterized FP?
Why FPGAs?

Why Floating Point (FP)?
Fixed point
  – Limited range
  – Number of bits grows for more accurate results
  – Easy to implement in hardware
Floating point
  – Dynamic range
  – Accurate results
  – More complex and higher cost to implement in hardware

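Floating Point Representation
Fields: sign (+/-) | biased exponent (e+bias) | fraction f, with significand 1.f (the leading 1 is hidden)
32 bits, IEEE single-precision format: 8-bit exponent, bias = 127; 23+1-bit significand
64 bits, IEEE double-precision format: 11-bit exponent, bias = 1023; 52+1-bit significand
Value: $(-1)^{s} \times 1.f \times 2^{e-\mathrm{BIAS}}$, where s is the sign bit and e is the stored (biased) exponent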
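
The field layout above can be illustrated with a small decoder. This is a minimal sketch, not the library's code: the function name `decode` and its parameters are hypothetical, and it assumes the IEEE-style bias of 2^(exp_bits-1) - 1 for every format, which the slides state only for the two IEEE formats. Zero, denormals, infinities, and NaN are ignored.

```python
def decode(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Decode a normalized value from a packed (sign, biased exponent, fraction) word.

    Assumption: bias = 2**(exp_bits - 1) - 1, as in the IEEE formats.
    """
    bias = (1 << (exp_bits - 1)) - 1
    frac = bits & ((1 << frac_bits) - 1)             # low bits: fraction f
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)  # middle bits: biased exponent
    s = (bits >> (frac_bits + exp_bits)) & 1         # top bit: sign
    significand = 1 + frac / (1 << frac_bits)        # restore the hidden leading 1
    return (-1) ** s * significand * 2.0 ** (e - bias)


if __name__ == "__main__":
    print(decode(0x3FC00000, 8, 23))  # IEEE single-precision word
    print(decode(0x3800, 4, 11))      # a 16-bit word with 4 exponent and 11 fraction bits
```

The same function covers any exponent/fraction split, which is the point of a parameterized format: only the two widths change.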

Why Parameterized FP?
Minimize the bitwidth of each signal in the datapath
  – Make more parallel implementations possible
  – Reduce the power dissipation
Further acceleration
  – Custom datapaths built in reconfigurable hardware using either fixed-point or floating-point arithmetic
  – Hybrid representations supported through fixed-to-float and float-to-fixed conversions

Why FPGA?
Flexible FPGA architecture
  – LUTs, BlockRAM or SRL16
  – Embedded multiplier, embedded PowerPC
Customize design architecture to suit algorithm
  – Parallel, serial, bitwidth
Fine-grained parallelism
  – High computational workload
Allow cost/performance tradeoffs
  – Small area, low latency, high throughput, low power dissipation
Faster than a general-purpose processor, more flexible and lower cost than an ASIC
High-performance signal processing: easy and affordable in FPGAs

Parameterized FP Modules
Arithmetic operation
  – fp_add: floating point addition
  – fp_sub: floating point subtraction
  – fp_mul: floating point multiplication
  – fp_div: floating point division
  – fp_sqrt: floating point square root
Format control
  – denorm: introducing implied integer digit
  – rnd_norm: rounding and normalizing
Format conversion
  – fix2float: converting from fixed point to floating point
  – float2fix: converting from floating point to fixed point

What Makes Our Library Unique?
A superset of all floating point formats
  – Including IEEE standard format
Parameterized for variable precision arithmetic
  – Supports custom floating point datapaths
  – Supports hybrid fixed and floating point implementations
Supports full pipelining
  – Synchronization signals
Complete
  – Separate normalization
  – Rounding ("round to zero" and "round to nearest")
  – Some error handling

Generic Library Component
Synchronization signals for pipelining
  – READY and DONE
Some error handling features
  – EXCEPTION_IN and EXCEPTION_OUT

One Example - Assembly of Modules
2 × denorm + 1 × fp_add + 1 × rnd_norm = 1 × IEEE single precision adder

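Another Example - Floating Point Multiplier
$\left[(-1)^{s_1} \times 1.f_1 \times 2^{e_1-\mathrm{BIAS}}\right] \times \left[(-1)^{s_2} \times 1.f_2 \times 2^{e_2-\mathrm{BIAS}}\right] = (-1)^{s_1 \oplus s_2} \times (1.f_1 \times 1.f_2) \times 2^{(e_1+e_2-\mathrm{BIAS})-\mathrm{BIAS}}$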
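
The multiplier identity can be traced on integer fields. The following is a sketch under stated assumptions rather than the fp_mul module itself: the name `fp_mul_fields` and the tuple layout are hypothetical, the bias is assumed to be 2^(exp_bits-1) - 1, normalization and truncation ("round to zero") are folded into one function for brevity, and overflow, underflow, and exceptions are not handled.

```python
def fp_mul_fields(a, b, exp_bits, frac_bits):
    """Multiply two normalized values given as (sign, biased_exp, frac) tuples."""
    bias = (1 << (exp_bits - 1)) - 1      # assumed IEEE-style bias
    s1, e1, f1 = a
    s2, e2, f2 = b

    sign = s1 ^ s2                        # sign: s1 xor s2
    exp = e1 + e2 - bias                  # biased exponent of the product

    hidden = 1 << frac_bits               # the implied integer digit
    prod = (hidden | f1) * (hidden | f2)  # 1.f1 * 1.f2, a value in [1, 4)

    # If the product is in [2, 4), shift right once and bump the exponent.
    if prod >> (2 * frac_bits + 1):
        prod >>= 1
        exp += 1

    # Keep the top frac_bits fraction bits (truncation, i.e. round to zero).
    frac = (prod >> frac_bits) & (hidden - 1)
    return sign, exp, frac


if __name__ == "__main__":
    # 1.5 * -1.5 with single-precision field widths
    print(fp_mul_fields((0, 127, 0x400000), (1, 127, 0x400000), 8, 23))
```

In the library as described on the slides, rounding and normalization live in the separate rnd_norm module; they are merged here only to keep the example short.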

Latency
Module                          Latency (clock cycles)
denorm                          0
rnd_norm                        2
fp_add / fp_sub                 4
fp_mul                          3
fp_div                          14
fp_sqrt                         14
fix2float (unsigned/signed)     4/5
float2fix (unsigned/signed)     4/5
Clock rate of each module is similar
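
As a worked example of how these figures might combine, the sketch below sums the latencies of modules chained in series, using the IEEE single precision adder assembled from denorm, fp_add, and rnd_norm. It assumes that stage latencies simply add and that the two denorm modules run in parallel; the slides do not state either point, and the helper name `chain_latency` is hypothetical.

```python
# Latencies in clock cycles, copied from the table above.
LATENCY = {
    "denorm": 0,
    "rnd_norm": 2,
    "fp_add": 4,
    "fp_sub": 4,
    "fp_mul": 3,
    "fp_div": 14,
    "fp_sqrt": 14,
}


def chain_latency(*stages: str) -> int:
    """Total latency of modules connected in series (assumes latencies add)."""
    return sum(LATENCY[stage] for stage in stages)


if __name__ == "__main__":
    # Two denorm modules in parallel count once on the critical path.
    print(chain_latency("denorm", "fp_add", "rnd_norm"))  # 6
```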

Algorithms for Division and Square Root
Division
  – P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn, "Fast division algorithm with a small lookup table," Asilomar Conference, 1999
Square root
  – M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand, "Reciprocation, square root, inverse square root, and some elementary functions using small multipliers," IEEE Transactions on Computers, vol. 2, pp , 2000

Why Choose These Algorithms?
Both algorithms are simple and elegant
  – Based on Taylor series
  – Use small table-lookup method with small multipliers
Very well suited to FPGA implementations
  – BlockRAM, distributed memory, embedded multiplier
  – Lead to a good tradeoff of area and latency
Can be fully pipelined
  – Clock speed similar to all other components

Division Algorithm
Dividend X and divisor Y are 2m-bit fixed-point numbers in [1,2)
Y is decomposed into a higher-order bit part and a lower-order bit part [defining equations not preserved in the transcript]

Division Algorithm, continued
Using Taylor series
Error less than ½ ulp
Two multipliers and one table lookup are required

Division – Data Flow
[Figure: blocks labeled Dividend X, Divisor Y, Multiplier, Lookup Table, Multiplier, Result; bus widths labeled 2m bits, m bits, 2m+2 bits]
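
A minimal sketch of a Taylor-series division of this kind, using exact rational arithmetic. It assumes the approximation X/Y ≈ X(Y_h − Y_l)/Y_h² and a split of Y after m fractional bits; both are assumptions here, since the slide equations are not reproduced in this transcript. The names `split_divisor` and `taylor_divide` are hypothetical, and the table entry 1/Y_h² is computed exactly instead of being read from a finite-width lookup table.

```python
from fractions import Fraction


def split_divisor(y: Fraction, m: int):
    """Split Y in [1, 2) into a high part (m fractional bits) and the remainder."""
    scale = 1 << m
    y_h = Fraction(y.numerator * scale // y.denominator, scale)  # truncate to m bits
    y_l = y - y_h                                                # 0 <= y_l < 2**-m
    return y_h, y_l


def taylor_divide(x: Fraction, y: Fraction, m: int) -> Fraction:
    """Approximate x / y as x * (y_h - y_l) / y_h**2."""
    y_h, y_l = split_divisor(y, m)
    inv_sq = 1 / (y_h * y_h)       # in hardware: one lookup addressed by y_h
    return x * (y_h - y_l) * inv_sq  # two multiplications


if __name__ == "__main__":
    m = 8
    x = Fraction(3, 2)
    y = Fraction(111411, 65536)    # about 1.7, with 2m = 16 fractional bits
    approx = taylor_divide(x, y, m)
    exact = x / y
    print(float(approx), float(exact), float((exact - approx) / exact))
```

With an exact table the relative error equals (Y_l/Y_h)², which is below 2^(-2m) because Y_l < 2^(-m) and Y_h ≥ 1. That is consistent with needing only one table lookup and two multipliers.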

Square Root – Data Flow
[Figure: three stages, Reduction, Evaluation, and Postprocessing (multiplier); signals Y, A (pieces A2, A3, A4), B = f(A), M, sqrt(Y)]
Reduction: reduce the input Y to a very small number A
Evaluation: compute first terms of Taylor series

Square Root – Reduction
[Figure: blocks labeled Table M, Table R, Multiplier; signals $Y$ (4k bits), $Y^{(k)}$ (k bits), $\hat{R}$ (k+1 bits), M, and A (4k bits, pieces A2, A3, A4)]
$A = Y\hat{R} - 1$

Square Root – Evaluation
[Figure: blocks labeled ^2 (squarer), Multiplier, Multiplier, Multiple Operand Signed Adder; inputs A and its pieces A2, A3, A4; intermediate products A2*A2, A2*A3, A2*A2*A2; output B; bus widths labeled k bits, 2k bits, 4k bits]
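
The three stages (reduction, evaluation, postprocessing) can be sketched in floating point as follows. This is an illustration under assumptions, not the fp_sqrt module: the function name is hypothetical, Y is assumed to lie in [1, 2), the tables are assumed to hold R̂ ≈ 1/Y^(k) rounded to k+1 fractional bits and M = 1/sqrt(R̂), and the Taylor terms are formed from the whole of A, where the slides build them from the pieces A2 and A3 only.

```python
import math


def sqrt_reduce_evaluate(y: float, k: int = 8) -> float:
    """Square root of y in [1, 2) via reduction, Taylor evaluation, postprocessing."""
    scale = 1 << k

    # Reduction: look up R_hat ~ 1 / Y^(k), where Y^(k) is Y truncated to k bits.
    y_k = math.floor(y * scale) / scale
    r_hat = round((2 * scale) / y_k) / (2 * scale)  # k+1 fractional bits
    m = 1.0 / math.sqrt(r_hat)                      # second table, same address
    a = y * r_hat - 1.0                             # A = Y * R_hat - 1, |A| is tiny

    # Evaluation: first terms of the Taylor series of sqrt(1 + A).
    b = 1.0 + a / 2 - a * a / 8 + a ** 3 / 16

    # Postprocessing: sqrt(Y) = M * sqrt(1 + A), since Y = (1 + A) / R_hat.
    return m * b


if __name__ == "__main__":
    for value in (1.0, 1.25, 1.7, 1.999):
        print(value, sqrt_reduce_evaluate(value), math.sqrt(value))
```

Because |A| is on the order of 2^(-k), the neglected Taylor terms start at A⁴, so a handful of small multiplications after the table lookups is enough.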

Our Experiments
Designs specified in VHDL
Mapped to Xilinx Virtex II FPGA (XC2V3000)
  – System clock rates up to 300 MHz
  – Density up to 8M system gates
  – 14,336 slices
  – 96 18x18 embedded multipliers
  – 96 18Kb BlockRAM (1,728 Kb)
  – 448 Kb distributed memory
Currently targeting Annapolis Wildcard-II

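Results - FP Divider on XC2V3000
Floating point format: 8(2,5) | 16(4,11) | 24(6,17) | 32(8,23)
# of slices: 69 (1%) | 110 (1%) | 254 (1%) | 335 (2%)
# of BlockRAM: 1 (1%), 7 (7%) [merged cells; column spans not preserved]
# of 18x18 embedded multipliers: 2 (2%), 8 (8%) [merged cells; column spans not preserved]
Clock period (ns): 8 | 10 | 9 | 9
Maximum frequency (MHz): [values not preserved]
# of clock cycles to obtain final results: 10, 14 [merged cells; column spans not preserved]
Latency (ns) = clock period x # of clock cycles: [values not preserved]
Throughput (million results/second): [values not preserved]
The last column is the IEEE single precision floating point format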

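Results - FP Square Root on XC2V3000
Floating point format: 8(2,5) | 16(4,11) | 24(6,17) | 32(8,23)
# of slices: 113 (1%) | 253 (1%) | 338 (2%) | 401 (2%)
# of BlockRAM: 3 (3%) [merged cells; column spans not preserved]
# of 18x18 embedded multipliers: 4 (4%), 5 (5%), 9 (9%) [merged cells; column spans not preserved]
Clock period (ns): [values not preserved]
Maximum frequency (MHz): [values not preserved]
# of clock cycles to obtain final results: 91213 [digits run together; column values not recoverable]
Latency (ns) = clock period x # of clock cycles: [values not preserved]
Throughput (million results/second): [values not preserved]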

Application: K-means Clustering for Multispectral Satellite Images

K-means – Iterative Algorithm
Each cluster has a center (mean value)
  – Initialized on host
  – Initialization done once for complete image processing
Cluster assignment
  – Distance (Manhattan norm) between each pixel and each cluster center
Accumulation of the pixel values of each cluster
Mean update by dividing the accumulator value by the number of pixels (done once per iteration)
  – Previously done on host
  – Moved to FPGA with fp_div
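
One iteration of the loop described above can be sketched in software as follows. This is an illustrative model, not the FPGA datapath: the function name, the list-of-tuples data layout, and the choice to leave an empty cluster's center unchanged are all assumptions.

```python
def kmeans_iteration(pixels, centers):
    """One K-means iteration: assign by Manhattan distance, accumulate, update means."""
    k = len(centers)
    dims = len(centers[0])
    sums = [[0] * dims for _ in range(k)]  # per-cluster accumulators
    counts = [0] * k                       # number of pixels per cluster
    labels = []

    for pixel in pixels:
        # Cluster assignment: Manhattan distance to every center, keep the nearest.
        dists = [sum(abs(p - c) for p, c in zip(pixel, center)) for center in centers]
        nearest = dists.index(min(dists))
        labels.append(nearest)

        # Accumulation of pixel values for the chosen cluster.
        counts[nearest] += 1
        for d in range(dims):
            sums[nearest][d] += pixel[d]

    # Mean update: one division per cluster and channel, once per iteration.
    new_centers = [
        [total / counts[j] for total in sums[j]] if counts[j] else list(centers[j])
        for j in range(k)
    ]
    return labels, new_centers


if __name__ == "__main__":
    pixels = [(0, 0), (2, 0), (10, 10), (12, 10)]
    centers = [(1, 1), (9, 9)]
    print(kmeans_iteration(pixels, centers))
```

The division in the last step is the operation the slides describe as moved from the host to the FPGA with fp_div; everything before it needs only subtraction, absolute value, addition, and comparison.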

K-means Clustering – Functional Units
[Figure: block diagram with blocks labeled Pixel Data, Pixel Shift, Cluster Centers, Datapath (Subtraction, Abs. Value, Addition, Comparison), Cluster Assignment, Accumulators, Mean Update, Division, Validity, Memory Acknowledge, Data Valid]

Conclusion
A library of fully pipelined and parameterized hardware modules for floating point arithmetic
Flexibility in forming custom floating point formats
New modules fp_div and fp_sqrt have small area and low latency, and are easily pipelined
K-means clustering algorithm applied to multispectral satellite images makes use of fp_div

Future Work
More applications using fp_div and fp_sqrt
New library modules
  – ACC, MAC, INV_SQRT
Use the floating point library to implement a floating point coprocessor on an FPGA with an embedded processor