Results – Peak Streaming Performance Implementing Closed-Form Expressions on FPGAs Using the NAL, with Comparison to CUDA GPU and Cell BE Implementations.

Robin Bruce, Javier Setoain, Richard Chamberlain, Malachy Devlin & Rosa M. Badia

For more information, you may contact

Abstract

This poster outlines the Nallatech Accelerator Layer (NAL) FPGA programming environment and its relationship to Intel's Accelerator Abstraction Layer. A general look at FPGAs versus stored-program processors is given. Hardware platforms that support the NAL are presented: the Nallatech H101, the Intel FSB-FPGA Module and the BenOne PCIe. To demonstrate the NAL system, two closed-form expressions are implemented. These functions are single-precision floating-point, and make use of arithmetic operations and elementary functions. The functions selected were the probability density function (PDF) and the Black-Scholes-Merton options pricing formula (BSM). These functions were implemented on a dual-core Opteron; on a Nallatech H101 card (with a Xilinx Virtex-4 LX160 FPGA) and a BenOne PCIe card (LX160 FPGA), both using the NAL; on an NVIDIA G80 using CUDA; and on a Cell BE system using the CellSs programming environment. The aim was to use the same ANSI C code for the kernels in all computing environments. The GPU system showed the best silicon performance for the implementation of these kernels. Including data transfer times, the BenOne PCIe FPGA platform had the highest performance.
Probability Density Function & Black Scholes

    #include <math.h>

    #define SIZE 8192
    /* 1/sqrt(2*pi). The literal is missing from this transcript;
       the value here is assumed from the constant's name. */
    #define RECIP_ROOT2PI 0.3989422804f

    /* Standard normal density: exp(-x^2 / 2) / sqrt(2*pi) */
    float ndf(float x)
    {
        float x_sqrd;
        float exp_in;
        float result;

        x_sqrd = x * x;
        exp_in = ldexpf(x_sqrd, -1);   /* x^2 / 2 via exponent adjustment */
        result = RECIP_ROOT2PI * expf(-exp_in);
        return result;
    }

    /* Normal density with mean mu and standard deviation sigma */
    float pdf(float x, float mu, float sigma)
    {
        float recip_sigma;
        float ndf_in;
        float ndf_out;
        float result;

        recip_sigma = 1.0 / sigma;
        ndf_in = (x - mu) * recip_sigma;
        ndf_out = ndf(ndf_in);
        result = recip_sigma * ndf_out;
        return result;
    }

    /* Top-level streaming loop: one PDF evaluation per input triple */
    void pdf_main(float output[SIZE], float x[SIZE], float mu[SIZE], float sigma[SIZE])
    {
        int i = 0;

        for (i = 0; i < SIZE; i++) {
            output[i] = pdf(x[i], mu[i], sigma[i]);
        }
    }

DIME-C Code for Probability Density Function

FPGAs versus Stored-Program Processors

The diagram shows the efficiency of a selection of modern processing technologies for application data types that range from bit-level processing to symbolic processing. Efficiency is an admittedly subjective composite of size, weight, energy consumption, absolute performance and time to solution. The diagram reflects how, with each generation, stored-program processors and FPGAs are evolving from their respective symbolic and bit-level roots to become ever more capable vector/streaming processors.

Nallatech Reconfigurable Computing Platforms

The NAL is a set of C++ classes that functions as a system-level design environment for Nallatech reconfigurable computing platforms. The NAL was designed to complement Intel's Accelerator Abstraction Layer (AAL) as a programming environment for the FSB-FPGA accelerator. Nallatech's high-level language compiler DIME-C is a prominent component of the NAL. It allows for the compilation of ANSI C code to VHDL targeted at Xilinx FPGAs.
The NAL approach permits the system-level modeling of multiple DIME-C blocks, something that in the previous DIMETalk-based system could not be reliably modeled at the software level.

                       Probability Density Function       Black-Scholes-Merton
                       Without Data     With Data         Without Data     With Data
                       Transfer (MOPS)  Transfer (MOPS)   Transfer (MOPS)  Transfer (MOPS)
Opteron                N/A              4.75              N/A              3.4
H101 PCI-X (LX160)
Cell BE
G80
BenOne PCIe (LX160)

The formula gives the price C of a European call option with exercise price K on a stock currently trading at price S, i.e., the right to buy a share of the stock at price K after T years. The constant risk-free interest rate is r, and the constant stock volatility is σ. Φ is the standard normal cumulative distribution function, shown in equation (3). The error function, though not theoretically closed form, can be adequately evaluated in single-precision arithmetic by means of a Taylor expansion, making it closed-form from a computational perspective.

Black-Scholes-Merton Options Pricing Formula

The code used in DIME-C for both the PDF and Black-Scholes implementations was unchanged or virtually unchanged in the Opteron, CellSs and CUDA kernel implementations, though naturally the top-level code for each had to take into account the differing environments.

The scenario for the results presented here is that there is a fictional application running on the host processor(s) that has a constant stream of input values for which it needs output values from the closed-form function implemented on the attached accelerator.

The diagram shows the unique and shared properties of the FSB-FPGA platform, the H101 PCI-X card and the BenOne PCIe FPGA compute platforms. The NAL sits atop the AAL accelerator API when used to program the FSB-FPGA.
At present the NAL sits atop the FUSE API in the PCI-X and PCIe platforms.

Acknowledgements

The lead author's research is sponsored by Nallatech, and partially funded by the UK Engineering and Physical Sciences Research Council. The Institute for System Level Integration and Strathclyde University, both in Scotland, provide academic and logistical support. This work has also been supported by the Spanish government through the research contracts CICYT-TIN 2005/5619 and Ingenio 2010 Consolider CSD00C.

The Opteron had the weakest performance for both functions. The GPU had the strongest silicon performance, that is, performance excluding data transfer. Once data transfer is taken into account, the outcome depends on the interconnect used. Among the accelerators, the PCI-X H101 FPGA card had the lowest transfer-inclusive performance, followed by the Cell processor and then the GPU, with the BenOne implementation coming out on top. The Cell had the lowest silicon potential of the accelerators, but was the most balanced in terms of silicon potential and data transfer bandwidth.