Optimizing stencil code for FPGA

Optimizing stencil code for FPGA Yang Liu

Overall Motivation: Accelerate stencil code at both the software and hardware levels. Software optimization: algorithm-level optimization. Hardware optimization: data transfer rate, parallelism, and a specially designed memory controller.

Executive Summary: This project aims to optimize stencil code performance on an FPGA using the OpenCL framework.

SDAccel: Xilinx’s design acceleration tool, which enables faster development and better performance. It supports the standard OpenCL API to abstract the hardware and to optimize code for it. It is available on the AWS cloud.

SDAccel Design Flow

Stencil Algorithm Applications: computational fluid simulation, partial differential equations, and many more.

Stencil Algorithm: Each output point depends on its nearest neighbors. Shown for the 1D and 2D cases.
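The nearest-neighbor dependence can be sketched in plain C. This is a minimal illustration, not the project's OpenCL kernel: the function names, the coefficient values for ALPHA and BETA, the symmetric weighting, and the copy-through boundary handling are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.5f
#define BETA  0.25f

/* 1D: each interior point depends on its left and right neighbors.
   Boundary points (i = 0 and i = n - 1) are copied unchanged. */
void stencil_1d(const float *in, float *out, size_t n)
{
    assert(n >= 2);
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* 2D: each interior point depends on its four nearest neighbors
   (up, down, left, right) in a row-major rows x cols grid. */
void stencil_2d(const float *in, float *out, size_t rows, size_t cols)
{
    assert(rows >= 2 && cols >= 2);
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            size_t k = r * cols + c;
            if (r == 0 || c == 0 || r == rows - 1 || c == cols - 1) {
                out[k] = in[k];            /* boundary: copy through */
            } else {
                out[k] = ALPHA * (in[k - cols] + in[k + cols] +
                                  in[k - 1] + in[k + 1]) + BETA * in[k];
            }
        }
    }
}
```

In both cases the work per point is small and the memory access pattern is regular, which is what makes data movement, rather than arithmetic, the usual bottleneck.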

Why we need to improve

Current Progress: The 1-D and 2-D implementations of the stencil code are complete. Optimization of the 1-D and 2-D versions is halfway through. The project is on track to meet the goal of my proposal.

System Design: Data. The data set consists of 4096 bits of randomly generated data, produced with the C random function.

System Design: Program. The stencil program is handwritten. The OpenCL configuration code is based on the Xilinx SDAccel examples.

Loop Unrolling
Original: out[i] = ALPHA * in1[i - 1] + in1[i + 1] + BETA * in1[i];
Unrolled: Vout_buffer[j] = ALPHA^2 * (in1[j - 2] + v1_buffer[j + 2] + 2 * v1_buffer[j]) + BETA * ALPHA^2 * v1_buffer[j + 1] * v1_buffer[j - 1] + v1_buffer[j];
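The two statements on this slide suggest fusing two time steps of a 1-D stencil into a single pass. The following is a minimal sketch of that idea, assuming the single-step stencil is the symmetric form ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i]. That form is an assumption, since the slide's expression is parenthesized differently, so the coefficients derived here may differ from the slide's. The function names and the ALPHA and BETA values are hypothetical. Squares are written as products because ^ is bitwise XOR in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.5f
#define BETA  0.25f

/* One time step of the assumed symmetric stencil, interior points only. */
void step_once(const float *in, float *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* Two time steps fused into one pass. Valid only for 2 <= j < n - 2,
   because each output now reads in[j - 2] .. in[j + 2]. */
void step_twice_fused(const float *in, float *out, size_t n)
{
    assert(n >= 5);
    for (size_t j = 2; j + 2 < n; j++)
        out[j] = ALPHA * ALPHA * (in[j - 2] + in[j + 2] + 2.0f * in[j])
               + 2.0f * ALPHA * BETA * (in[j - 1] + in[j + 1])
               + BETA * BETA * in[j];
}
```

The fused version reads the input once and writes the output once for two time steps, at the cost of a wider dependence window (five points instead of three).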

Loop Unrolling Problem: The unused data at the boundary grows as the loop is unrolled. [Figure: compute area versus data area for the original kernel and for the kernel unrolled three times; boundary labeled 3]
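The boundary growth can be quantified with a small calculation. This sketch assumes a 1-D stencil of a given radius whose time steps are fused, so each fused step consumes that many points on each side of a tile. The function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of valid outputs when a tile of `tile` input points is processed
   by a stencil of radius `radius` fused over `steps` time steps.
   Each fused step consumes `radius` points on each side of the tile. */
size_t valid_outputs(size_t tile, size_t radius, size_t steps)
{
    assert(radius > 0 && steps > 0);
    size_t halo = 2 * radius * steps;   /* both sides together */
    return tile > halo ? tile - halo : 0;
}
```

For a radius-1 stencil on a 1024-point tile, one step leaves 1022 valid outputs and three fused steps leave 1018, so the overhead is small for large tiles but grows linearly with the unroll factor.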

Buffering: Data movement between the host and the board has very high latency. Resolution: a local buffer stores part of the data. [Figure labels: Host, Board, Original, Optimized, 4096, 1024]
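The buffering idea can be modeled in plain C. This is a sketch under stated assumptions, not the project's kernel: the `local` array stands in for on-chip memory, the chunk size of 1024 is taken from the figure label, and the function name, coefficients, symmetric stencil form, and boundary handling are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.5f
#define BETA  0.25f
#define CHUNK 1024          /* outputs produced per local-buffer fill */

/* Process `n` points in chunks. `local` stands in for on-chip memory and
   holds one chunk plus one halo point on each side. */
void stencil_1d_buffered(const float *in, float *out, size_t n)
{
    float local[CHUNK + 2];
    assert(n >= 2);
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t start = 1; start + 1 < n; start += CHUNK) {
        size_t count = n - 1 - start;            /* interior points left */
        if (count > CHUNK)
            count = CHUNK;
        /* One burst copy: the chunk plus its left and right halo points. */
        memcpy(local, in + start - 1, (count + 2) * sizeof(float));
        for (size_t k = 0; k < count; k++)
            out[start + k] = ALPHA * (local[k] + local[k + 2])
                           + BETA * local[k + 1];
    }
}
```

Each input point is fetched from the large memory once per chunk in a single contiguous copy, and the three reads per output are then served from the small buffer.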

Multiple Instances: Why use just one kernel instance when we can have plenty?
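One way to feed several kernel instances is to split the interior points of the data set into contiguous ranges, one per instance. The helper below is a hypothetical sketch of that partitioning for a 1-D radius-1 stencil. The type, the function name, and the even-split policy are assumptions, not the project's scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of interior points assigned to one instance. */
typedef struct { size_t begin, end; } range_t;

/* Split interior points 1 .. n-2 as evenly as possible across `instances`
   kernel instances; the first (interior % instances) get one extra point. */
range_t instance_range(size_t n, size_t instances, size_t id)
{
    assert(n >= 2 && instances > 0 && id < instances);
    size_t interior = n - 2;
    size_t base = interior / instances;
    size_t extra = interior % instances;
    range_t r;
    r.begin = 1 + id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}
```

Each instance also needs to read one halo point on either side of its range, so neighboring instances share the inputs at their common edge but write disjoint outputs.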

System Test: Platform. Based on Xilinx FPGAs. Local test: KCU1500. Future test environment: AWS F1 instance (VU9P).

Results: 1-D vs. 1-D Optimized (Stencil Only)

Results: 2-D vs. 2-D Optimized (Stencil Only)

Results: 1-D vs. 1-D Optimized (With Transfer)