200/MAPLD 2004 Craven1 Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? Stephen Craven Cameron Patterson Peter Athanas Configurable.

Presentation transcript:

200/MAPLD 2004 Craven1 Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? Stephen Craven Cameron Patterson Peter Athanas Configurable Computing Lab Virginia Tech

200/MAPLD 2004 Craven2 Outline
- Background
  - Large Integer Multiplication
  - GIMPS
- Algorithm Comparison
  - Floating-point FFT
  - All-integer FFT
  - Fast Galois Transform
- Accelerator Design
  - System Design
  - Operation
  - Performance
- Improvements & Future Work

200/MAPLD 2004 Craven3 Large Integer Multiplication
Complexity
- Grade school: O(N^2)
- Fourier transform: ~O(N log N)
Efficient FFT-based multiplication
- Divide integers into sequences of smaller digits, e.g. 86, 75, 30, 92, 46, 01.
- Convolution of two sequences is equivalent to multiplication.
- Element-wise multiplication in the frequency domain corresponds to convolution in the time domain.
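
The digit-splitting and convolution idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the accelerator's design: the recursive floating-point FFT, the base of 100, and the helper names `fft`, `to_digits` and `fft_multiply` are all assumptions made for the example. The operand 867530924601 is chosen so that its base-100 digits match the digits listed on the slide.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def to_digits(x, base):
    """Split x into base-`base` digits, least significant first."""
    digits = []
    while x:
        x, d = divmod(x, base)
        digits.append(d)
    return digits or [0]

def fft_multiply(a, b, base=100):
    """Multiply two non-negative integers via FFT-based convolution."""
    da, db = to_digits(a, base), to_digits(b, base)
    n = 1
    while n < len(da) + len(db):  # zero pad so the cyclic convolution is acyclic
        n *= 2
    fa = fft(da + [0] * (n - len(da)))
    fb = fft(db + [0] * (n - len(db)))
    # Element-wise product in the frequency domain = convolution of the digits.
    conv = fft([x * y for x, y in zip(fa, fb)], invert=True)
    coeffs = [round(c.real / n) for c in conv]
    # Carry release: evaluate the coefficient polynomial at `base` (Horner).
    result = 0
    for c in reversed(coeffs):
        result = result * base + c
    return result

print(fft_multiply(867530924601, 123456789) == 867530924601 * 123456789)
```

Floating-point rounding limits how large the digits and the transform can be before `round` returns a wrong coefficient, which is one reason a production implementation needs round-off error checks.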

200/MAPLD 2004 Craven4 GIMPS
Why multiply big numbers?
- Great Internet Mersenne Prime Search (GIMPS)
  - Primality testing algorithm for Mersenne numbers (2^q - 1) requires squaring of multi-million digit numbers.
  - Mersenne primes are the largest primes known; used in cryptography.
- Large integer convolution
- Performance comparison of Pentiums and FPGAs in traditional floating-point domains.
Lucas-Lehmer Primality Test
  Mq = 2^q - 1;
  v = 4;
  for i = 1:q-2,
    v = v^2 - 2 (mod Mq);
  if v == 0, Mq is prime
  else, Mq is composite
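
The Lucas-Lehmer pseudocode above translates almost directly into Python. The function name `lucas_lehmer` is an assumption for this sketch, and the test as written is only meaningful for odd prime exponents q. Python's built-in big integers stand in for the large squaring that the accelerator targets.

```python
def lucas_lehmer(q):
    """Return True if the Mersenne number 2**q - 1 is prime (q an odd prime)."""
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):       # i = 1 .. q-2
        v = (v * v - 2) % m      # the squaring dominates the run time
    return v == 0

exponents = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
print([q for q in exponents if lucas_lehmer(q)])
```

Each iteration squares a q-bit number, so for multi-million digit candidates nearly all of the time goes into the squaring step.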

200/MAPLD 2004 Craven5 Discrete Weighted Transform
Discrete Weighted Transform (DWT)
- Variable base: each sequence digit can contain differing numbers of bits.
- Creates power-of-two sequence needed by FFT.
- Eliminates need to zero pad to convert cyclic, FFT-based convolution into acyclic convolution needed for squaring.
Steps:
- Number to be multiplied divided into variable-length digits.
- Sequence multiplied by a weight sequence.
- FFT performed on new, power-of-two length weighted sequence.
Example for Mq = 2^37 - 1 with FFT length of 4:
- Bits / digit = { 10, 9, 9, 9 }
- To square 78,314,567,209 (mod Mq), our sequence would be { 553, 93, 381, 291 }:
  553 + 93 * 2^10 + 381 * 2^19 + 291 * 2^28 = 78,314,567,209
- Multiply sequence by weights then FFT.
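
The slide's example can be reproduced with a short sketch. The slide does not state how the digit widths or weights are chosen; the code below assumes the usual irrational-base DWT convention, where digit j starts at bit ceil(q*j/N) and has weight 2^(ceil(q*j/N) - q*j/N). That assumption does reproduce the slide's { 10, 9, 9, 9 } and { 553, 93, 381, 291 }. A direct O(N^2) DFT replaces the FFT for brevity, and all function names are hypothetical.

```python
import cmath

def bit_positions(q, n):
    """Starting bit of each digit: ceil(q*j/n), computed with integers."""
    return [-(-q * j // n) for j in range(n + 1)]

def split_digits(x, q, n):
    """Split x into n variable-width digits, least significant first."""
    b = bit_positions(q, n)
    return [(x >> b[j]) & ((1 << (b[j + 1] - b[j])) - 1) for j in range(n)]

def dft(seq, sign):
    """Direct DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(seq)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j, v in enumerate(seq)) for k in range(n)]

def dwt_square(x, q, n):
    """Square x modulo 2**q - 1 using a weighted cyclic convolution."""
    b = bit_positions(q, n)
    weights = [2.0 ** (b[j] - q * j / n) for j in range(n)]
    weighted = [d * w for d, w in zip(split_digits(x, q, n), weights)]
    spectrum = [c * c for c in dft(weighted, -1)]       # element-wise squaring
    conv = [c.real / n for c in dft(spectrum, +1)]      # back to the time domain
    coeffs = [round(c / w) for c, w in zip(conv, weights)]  # weight removal
    # Carry release and Mersenne reduction, done here with Python integers.
    return sum(c << b[j] for j, c in enumerate(coeffs)) % ((1 << q) - 1)

x = 78314567209
print(split_digits(x, 37, 4))                                # [553, 93, 381, 291]
print(dwt_square(x, 37, 4) == pow(x, 2, 2**37 - 1))
```

The weights are what let a length-4 cyclic convolution produce the result modulo 2^37 - 1 directly, without padding the sequence to length 8.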

200/MAPLD 2004 Craven6 Objective
- Compare performance of Pentium processors to FPGAs.
  - GIMPS chosen because highly optimized code exists.
  - GIMPS utilizes fast floating-point performance of Pentiums.
- Xilinx Virtex-II Pro 100 (2VP100) chosen as target device.
  - Largest available 2VP device.
  - Contains 444 17x17 unsigned multipliers and 888 kB of embedded Block RAM.
- Target 12 million digit numbers.
  - Reward for first prime above 10 million digits.

200/MAPLD 2004 Craven7 Floating-point FFT
- GIMPS implementation uses floating-point, which requires round-off error checks.
- Using near double-precision floating-point (51-bit mantissa):
  - 49 real multipliers can be placed on the 2VP device and grouped into complex multipliers
  - 12 million digit number -> 2 million point FFT
  - 44 million complex multiplies -> 3.7 million cycles
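
The 44 million figure is consistent with a simple operation count. The sketch below assumes a radix-2 FFT with one complex multiply per butterfly and counts one forward and one inverse transform per iteration; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def fft_complex_multiplies(points):
    """Radix-2 FFT: (N/2) * log2(N) butterflies, one complex multiply each."""
    stages = points.bit_length() - 1   # log2(points) for a power of two
    return (points // 2) * stages

n = 2 ** 21                                   # "2 million point FFT"
per_iteration = 2 * fft_complex_multiplies(n)  # forward + inverse transform
print(per_iteration)                           # 44040192, about 44 million
print(per_iteration / 3.7e6)                   # complex multiplies needed per cycle
```

Dividing the multiply count by the slide's 3.7 million cycles suggests roughly a dozen complex multiplies retiring every cycle under these assumptions.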

200/MAPLD 2004 Craven8 All-integer FFT
- Perform FFT modulo special prime.
  - Prime must have nice roots of one & two.
  - Reductions modulo prime should be simple.
- Primes of the form 2^k - 2^m + 1 meet requirements.
Table columns: Prime, # Multipliers, FFT Length, Iteration time. Only the iteration times survive in this transcript, one per candidate prime:
- 1.9M cycles
- 1.7M cycles
- 2.6M cycles
- 2.3M cycles
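
To illustrate why primes of the form 2^k - 2^m + 1 are convenient, the sketch below uses p = 2^64 - 2^32 + 1. This particular prime is an assumption chosen for the example; the primes actually evaluated in the talk are not recoverable from this transcript. Because 2^64 = 2^32 - 1 (mod p) and 2^96 = -1 (mod p), a 128-bit product can be reduced with shifts, adds and subtracts only, and small powers of two act as roots of unity.

```python
P = 2**64 - 2**32 + 1          # example prime with k = 64, m = 32 (assumed)
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def reduce128(x):
    """Reduce 0 <= x < 2**128 modulo P without a general division."""
    lo = x & MASK64             # bits 0..63
    mid = (x >> 64) & MASK32    # bits 64..95, weight 2**64 = 2**32 - 1 (mod P)
    hi = x >> 96                # bits 96..127, weight 2**96 = -1 (mod P)
    r = lo + (mid << 32) - mid - hi
    if r < 0:
        r += P
    while r >= P:               # at most two subtractions are needed
        r -= P
    return r

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
print(reduce128(a * b) == (a * b) % P)
print(pow(2, 96, P) == P - 1)   # so 2 is a root of unity of order 192
print(pow(8, 32, P) == P - 1)   # so 8 is a root of unity of order 64
```

When the twiddle factors are powers of two, the corresponding butterfly multiplications reduce to shifts.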

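200/MAPLD 2004 Craven9 Fast Galois Transform
- All-integer transform using complex numbers modulo a Mersenne prime: a + b*i (mod Mp)
- Real input sequence folded into complex input with half the length.
- Modular reductions via Mersenne primes are simple addition.
Table columns: Prime, # Multipliers, FFT Length, Iteration Time. The primes and multiplier counts did not survive in this transcript; the remaining entries are:
- (complex) multipliers, FFT length 1M, 3.5M cycles
- (complex) multipliers, FFT length 512K, 3.3M cycles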
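
A minimal sketch of the arithmetic behind this slide, assuming the Mersenne prime 2^61 - 1 as the modulus (the talk's actual choices are not recoverable here). Reduction modulo 2^q - 1 folds the high bits onto the low bits with an addition, and elements a + b*i multiply like ordinary complex numbers with each part reduced. The function names are hypothetical.

```python
Q = 61
MP = (1 << Q) - 1               # 2**61 - 1, a Mersenne prime (assumed modulus)

def mersenne_reduce(x):
    """Reduce x >= 0 modulo MP using only shift, mask and add."""
    while x > MP:
        x = (x & MP) + (x >> Q)  # 2**Q = 1 (mod MP), so high bits fold down
    return 0 if x == MP else x

def gaussian_mul(u, v):
    """Multiply (a + b*i) * (c + d*i) with both parts reduced modulo MP."""
    a, b = u
    c, d = v
    # a*c - b*d, kept non-negative by adding MP - (b*d mod MP)
    real = mersenne_reduce(a * c + (MP - mersenne_reduce(b * d)))
    imag = mersenne_reduce(a * d + b * c)
    return real, imag

print(mersenne_reduce(12345678901234567890123) == 12345678901234567890123 % MP)
print(gaussian_mul((3, 4), (5, 6)) == (MP - 9, 38))
```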

200/MAPLD 2004 Craven10 Algorithm Selection
Considered algorithms:
- Floating-point FFT: 3.7M cycles / iteration
- All-integer FFT: 1.7M cycles / iteration
- Galois Transform: 3.3M cycles / iteration
- Winograd Transform: no acceptable run lengths
- Chinese Remainder Theorem: added complexity

200/MAPLD 2004 Craven11 FFT Design Multipliers and adder generated by CoreGen. 10 cycle butterfly latency.

200/MAPLD 2004 Craven12 Complete Design 8-point FFTs lower cache throughput. Multiple caches allow for overlapping computation with memory reads and writes.

200/MAPLD 2004 Craven13 Performance Estimates
- XC2VP100-6ff1696, ISE version 6.2i
- Iteration time: 34 milliseconds
- FFT Engine frequency: 80 MHz
- 2VP100 utilization: 70% slices, 24% BRAMs, 86% multipliers
Iteration Stage                  Time (us)
Weighted sequence creation*      250
Forward FFT                      11,500
DFT coefficient squaring         250
Inverse FFT                      11,500
Weight removal*                  250
Carry releasing*                 5,000
Mersenne mod reduction*          5,000
* Not implemented
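
The 34 ms iteration time follows from summing the per-stage estimates. The snippet below simply adds the figures listed on the slide; the variable names are chosen for the example.

```python
stage_times_us = {
    "Weighted sequence creation": 250,
    "Forward FFT": 11_500,
    "DFT coefficient squaring": 250,
    "Inverse FFT": 11_500,
    "Weight removal": 250,
    "Carry releasing": 5_000,
    "Mersenne mod reduction": 5_000,
}

total_us = sum(stage_times_us.values())
fft_share = (stage_times_us["Forward FFT"] + stage_times_us["Inverse FFT"]) / total_us

print(total_us)                 # 33750 us, i.e. about 34 ms
print(f"{fft_share:.0%}")       # share of the iteration spent in the two FFTs
```

The two transforms account for roughly two thirds of the estimated iteration time, with the stages marked as not implemented making up most of the rest.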

200/MAPLD 2004 Craven14 Performance Comparison
- Pentium 4 performance:
  - Non-SIMD (64-bit multiplies) 6.4 GFLOPs
- All-integer transform leverages FPGA strengths:
  - 1.9 billion integer multiplies / sec
  - Transform performance exceeds P4.
- FPGA vs. Pentium 4:
  - 34 ms vs. 60 ms => 1.76x speed-up!
  - $10,000 vs. $500 => 20x more costly. ‡
  - 600 sq mm* vs. 146 sq mm => 4.1x more die area. †
‡ FPGAs would likely be less costly if volume equaled the P4.
† The P4 area estimate does not include the area required by all of the support chips.
* 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights.
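
The three ratios on this slide can be checked directly from the quoted figures. The last line is an extra derived number, not one given in the talk: it divides the quoted multiply rate by the 80 MHz FFT engine frequency from the performance estimates.

```python
fpga_ms, p4_ms = 34, 60
fpga_cost, p4_cost = 10_000, 500
fpga_area, p4_area = 600, 146            # sq mm

speedup = p4_ms / fpga_ms
cost_ratio = fpga_cost / p4_cost
area_ratio = fpga_area / p4_area

print(f"speed-up:   {speedup:.2f}x")     # 1.76x
print(f"cost ratio: {cost_ratio:.0f}x")  # 20x
print(f"area ratio: {area_ratio:.1f}x")  # 4.1x
print(1.9e9 / 80e6)                      # integer multiplies per 80 MHz cycle
```

A 1.76x speed-up means the FPGA finishes an iteration in about 57% of the Pentium 4's time, which is 76% faster rather than 176% faster.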

200/MAPLD 2004 Craven15 Improvements & Future Work
- Pentium assembly code is highly optimized, while the HW accelerator is a first draft.
- Algorithm exploration
  - Nussbaumer's method using 17-bit primes
  - Utilize "nice" form of prime to implement shift-only multiply for first two FFT stages.
- Cluster implementation
  - Configurable Computing Lab constructing a 16-node 2VP cluster with gigabit transceivers as interconnect.
- Alternative reduced-multiplier butterfly structures
- Floorplanning

200/MAPLD 2004 Craven16 Conclusions
- All-integer FFTs attractive for hardware implementations of filters / convolutions.
- GIMPS accelerator designed:
  - Operates at 80 MHz
  - 1.76x the speed of a 3.2 GHz Pentium 4 (76% faster)
- Cost of accelerator outweighs benefit in this application.