Optimizing stencil code for FPGA

Optimizing stencil code for FPGA Yang Liu

Overall Motivation Accelerate stencil code at both the software and hardware levels. Software optimization: algorithm-level optimization. Hardware optimization: data transfer rate, parallelism, and a specially designed memory controller.

Executive summary This project aims to optimize stencil code performance on FPGAs using the OpenCL framework.

SDAccel Xilinx’s design acceleration tool enables faster development and better performance. Supports the standard OpenCL API to abstract the hardware and optimize code for it. Available on the AWS cloud.

SDAccel Design Flow

Stencil Algorithm Applications Computational fluid simulation, partial differential equations, and many more.

Stencil Algorithm Each output element depends on its nearest neighbors. (Figure: 1D and 2D stencil patterns.)
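The nearest-neighbor dependence can be sketched in plain C. This is a minimal illustration, not the project's kernel: it assumes a three-point 1D stencil and a five-point 2D stencil, and the values of ALPHA and BETA are placeholders chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHA 0.25f /* assumed neighbor weight, not taken from the slides */
#define BETA  0.5f  /* assumed center weight, not taken from the slides */

/* 1D three-point stencil: each interior output reads its left and right
 * neighbors plus itself. Boundary cells out[0] and out[n-1] are not written. */
void stencil_1d(const float *in, float *out, size_t n)
{
    assert(in != out); /* out-of-place update only */
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* 2D five-point stencil on a row-major rows x cols grid: each interior
 * output reads its four direct neighbors plus itself. */
void stencil_2d(const float *in, float *out, size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t r = 1; r + 1 < rows; r++) {
        for (size_t c = 1; c + 1 < cols; c++) {
            size_t k = r * cols + c;
            out[k] = ALPHA * (in[k - 1] + in[k + 1] +
                              in[k - cols] + in[k + cols]) +
                     BETA * in[k];
        }
    }
}
```

Both functions leave the boundary cells untouched, which is why the amount of boundary data matters once loops are unrolled.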

Why we need to improve

Current Progress The 1-D and 2-D implementations of the stencil code are complete. Optimization of the 1-D and 2-D versions is halfway through. The project is on track to meet the goal of my proposal.

System Design: Data The data set consists of 4096 bits of randomly generated data, produced with the C random function.

System Design: Program The stencil program is handwritten. The OpenCL configuration code is based on the Xilinx SDAccel examples.

Loop Unrolling
out[i] = ALPHA * in1[i - 1] + in1[i + 1] + BETA * in1[i];
Vout_buffer[j] = ALPHA ^2 *(in1[j - 2] + v1_buffer[j + 2] + 2 * v1_buffer[j]) + BETA * ALPHA^2 * v1_buffer[j + 1] * v1_buffer[j - 1] + v1_buffer[j];
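The two expressions on this slide read like a single stencil step and a version that folds two steps into one update. The sketch below works that idea through under an explicit assumption: the single step is taken to be ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i], with the parentheses added here. Substituting that step into itself gives the fused coefficients used in `stencil_fused2`; they follow from this assumption and may differ from the author's kernel. Note that `^` is bitwise XOR in C, so the squares are written as products. The names `stencil_step` and `stencil_fused2` and the coefficient values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHA 0.25f /* assumed value */
#define BETA  0.5f  /* assumed value */

/* One stencil step (assumed form), writing interior cells only. */
void stencil_step(const float *in, float *out, size_t n)
{
    assert(in != out);
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* Two steps fused into one pass. Expanding step(step(in)) gives
 *   ALPHA^2 * (in[j-2] + 2*in[j] + in[j+2])
 * + 2*ALPHA*BETA * (in[j-1] + in[j+1])
 * + BETA^2 * in[j]
 * so each output now reads five inputs and the untouched boundary
 * grows from one cell to two cells per side. */
void stencil_fused2(const float *in, float *out, size_t n)
{
    assert(in != out);
    for (size_t j = 2; j + 2 < n; j++)
        out[j] = ALPHA * ALPHA * (in[j - 2] + 2.0f * in[j] + in[j + 2]) +
                 2.0f * ALPHA * BETA * (in[j - 1] + in[j + 1]) +
                 BETA * BETA * in[j];
}
```

Calling `stencil_step` twice through a temporary array and calling `stencil_fused2` once should agree on cells 2 through n - 3, while the fused version makes a single pass over the input.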

Loop unrolling problem The region of unused data at the boundary grows with the unroll factor. (Figure: compute area versus data area, for the original code and for the code unrolled three times.)
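To put a number on the boundary cost, here is a small sketch. It assumes that fusing `unroll` steps of a stencil with the given radius widens the unusable border to `radius * unroll` cells on each side; the helper name `valid_outputs` is hypothetical.

```c
#include <stddef.h>

/* Number of output cells that can be computed from n input cells when
 * `unroll` stencil steps of the given radius are fused into one pass.
 * Assumes a 1D array with a border of radius * unroll cells per side. */
size_t valid_outputs(size_t n, size_t radius, size_t unroll)
{
    size_t halo = 2 * radius * unroll; /* both sides together */
    return n > halo ? n - halo : 0;
}
```

For a 4096-element array and a radius of 1, the original loop leaves 2 cells uncomputed, while unrolling three times leaves 6.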

Buffering Data movement between the host and the board has very high latency. Resolution: a local buffer stores part of the data. (Figure: host and board; 4096 in the original design, 1024 in the optimized design.)
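The buffering idea can be sketched in plain C as a tiled loop that copies one block of input into a small local array before computing on it. This is an illustration under assumptions: the tile size of 1024 is taken from the figure on this slide, the stencil form and coefficients are placeholders, and `stencil_tiled` is a hypothetical name. In an OpenCL kernel the `local` array would typically be placed in on-chip memory rather than on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE  1024  /* assumed local buffer size, from the slide's figure */
#define ALPHA 0.25f /* assumed value */
#define BETA  0.5f  /* assumed value */

/* Three-point stencil computed tile by tile. Each tile is copied once into
 * a local buffer (with one halo cell per side) and all reads then hit the
 * local copy instead of the large input array. */
void stencil_tiled(const float *in, float *out, size_t n)
{
    float local[TILE + 2]; /* tile plus left and right halo cells */

    assert(in != out);
    for (size_t start = 1; start + 1 < n; start += TILE) {
        size_t count = n - 1 - start; /* interior cells still to compute */
        if (count > TILE)
            count = TILE;

        /* One contiguous copy of cells start-1 .. start+count. */
        memcpy(local, in + start - 1, (count + 2) * sizeof(float));

        for (size_t i = 0; i < count; i++)
            out[start + i] = ALPHA * (local[i] + local[i + 2]) +
                             BETA * local[i + 1];
    }
}
```

Each input cell is fetched from the large array roughly once per tile instead of three times per output, at the cost of re-reading two halo cells per tile.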

Multiple instances Why use just one when we can have plenty?
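One plausible reading of this slide is that several kernel instances each process a slice of the array in parallel. The sketch below shows only the index arithmetic for such a split, assuming a three-point stencil whose interior cells are 1 through n - 2; the type `range_t` and the function `instance_range` are hypothetical and not part of any Xilinx API.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of output cells owned by one instance. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split the interior cells 1 .. n-2 into `instances` contiguous slices and
 * return the slice for instance k. Slices never overlap; trailing slices
 * may be shorter or empty. Each instance still reads one extra input cell
 * on either side of its slice. */
range_t instance_range(size_t n, size_t instances, size_t k)
{
    assert(n >= 2 && instances > 0 && k < instances);

    size_t interior = n - 2;
    size_t chunk = (interior + instances - 1) / instances; /* round up */
    range_t r;

    r.begin = 1 + k * chunk;
    r.end = r.begin + chunk;
    if (r.begin > n - 1)
        r.begin = n - 1;
    if (r.end > n - 1)
        r.end = n - 1;
    return r;
}
```

With 4096 elements and four instances, the slices come out as [1, 1025), [1025, 2049), [2049, 3073), and [3073, 4095).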

System Test: Platform Based on Xilinx FPGAs. Local test: KCU1500. Future test environment: AWS F1 instance (VU9P).

Results: 1-D vs. 1-D Optimized (Stencil Only)

Results: 2-D vs. 2-D Optimized (Stencil Only)

Results: 1-D vs. 1-D Optimized (With Transfer)