Optimizing stencil code for FPGA Yang Liu
Overall Motivation Accelerate stencil code at both the software and hardware levels. Software optimization: algorithm-level optimization. Hardware optimization: data transfer rate, parallelism, and a specially designed memory controller.
Executive summary This project aims to optimize stencil code performance on an FPGA using the OpenCL framework.
SDAccel Xilinx’s design acceleration tool enables faster development and better performance. Supports the standard OpenCL API to abstract the hardware and optimize code for it. Available on the AWS cloud.
SDAccel Design Flow
Stencil Algorithm Applications Computational fluid simulation, partial differential equations, and many more.
Stencil Algorithm Each output point depends on its nearest neighbors. (Figures: 2D and 1D stencils.)
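To make the nearest-neighbor dependence concrete, here is a minimal C sketch of a 1D three-point and a 2D five-point stencil. The coefficient values, the function names, and the copy-through boundary policy are assumptions made for illustration; they are not taken from the project's kernels.

```c
#include <stddef.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.25f
#define BETA  0.5f

/* 1D three-point stencil: each interior output depends on the input at
 * the same index and on its two nearest neighbors. Boundary points are
 * copied through unchanged (an assumed boundary policy). */
void stencil_1d(const float *in, float *out, size_t n)
{
    if (n == 0) return;
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* 2D five-point stencil on a rows x cols grid stored in row-major order.
 * Each interior output depends on the cell itself and on its left, right,
 * upper, and lower neighbors. Edge cells are copied through unchanged. */
void stencil_2d(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            size_t k = r * cols + c;
            if (r == 0 || c == 0 || r + 1 == rows || c + 1 == cols) {
                out[k] = in[k];
                continue;
            }
            out[k] = ALPHA * (in[k - 1] + in[k + 1] + in[k - cols] + in[k + cols])
                   + BETA * in[k];
        }
    }
}
```

Because every output reads only a small fixed neighborhood of the input, all interior points of one sweep can in principle be computed independently, which is what makes stencils a natural fit for parallel hardware.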
Why we need to improve
Current Progress The 1-D and 2-D implementations of the stencil code are complete. Optimization of the 1-D and 2-D versions is halfway through. The project is on track to meet the goal of my proposal.
System Design: Data The data set consists of 4096 bits of randomly generated data, produced with the C random function.
System Design: Program The stencil program is handwritten. The OpenCL configuration code is based on Xilinx SDAccel example code.
Loop Unrolling Original: out[i] = ALPHA * in1[i - 1] + in1[i + 1] + BETA * in1[i]; Unrolled: Vout_buffer[j] = ALPHA ^2 *(in1[j - 2] + v1_buffer[j + 2] + 2 * v1_buffer[j]) + BETA * ALPHA^2 * v1_buffer[j + 1] * v1_buffer[j - 1] + v1_buffer[j];
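One way to read the two expressions above is as a single sweep and as two sweeps fused into one update. The sketch below works that idea out under stated assumptions: the base stencil is taken to be ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i], the `^2` on the slide is read as mathematical squaring (in C, `^` is bitwise XOR, so squares are written as products), and the coefficient values and function names are hypothetical. The fused coefficients are derived algebraically from that assumed base stencil and may differ from the expression the project actually uses.

```c
#include <stddef.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.25f
#define BETA  0.5f

/* One sweep of the assumed three-point stencil. Writes indices [1, n-1). */
void sweep_once(const float *in, float *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* Two sweeps fused into one five-point update, obtained by substituting
 * the one-sweep formula into itself:
 *   ALPHA^2 * (in[j-2] + in[j+2] + 2*in[j])
 * + 2*ALPHA*BETA * (in[j-1] + in[j+1])
 * + BETA^2 * in[j]
 * Writes indices [2, n-2), since each output now reaches two cells away. */
void sweep_twice_fused(const float *in, float *out, size_t n)
{
    for (size_t j = 2; j + 2 < n; j++)
        out[j] = ALPHA * ALPHA * (in[j - 2] + in[j + 2] + 2.0f * in[j])
               + 2.0f * ALPHA * BETA * (in[j - 1] + in[j + 1])
               + BETA * BETA * in[j];
}
```

Calling `sweep_once` twice through a temporary array and calling `sweep_twice_fused` once give the same values on indices 2 to n - 3; the fused form trades the intermediate array for more arithmetic per output and a wider read window.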
Loop unrolling problem The amount of unused data at the boundary grows. (Figure: compute area and data area, original versus unrolled three times; label: 3.)
Buffering Data movement between the host and the board has very high latency. Resolution: a local buffer stores part of the data. (Figure: host-to-board transfer; Original: 4096, Optimized: 1024.)
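The growth of the boundary region can be put into numbers with a small helper. This is a sketch under the assumption that the unrolling fuses successive sweeps, so that a stencil of radius r fused over k sweeps reads r * k cells on each side; the function names are hypothetical.

```c
#include <stddef.h>

/* Boundary (halo) cells needed on each side when `steps` sweeps of a
 * stencil with the given radius are fused into a single pass. */
size_t halo_per_side(size_t radius, size_t steps)
{
    return radius * steps;
}

/* Number of outputs that can be computed from n inputs once the halo on
 * both sides is set aside; 0 if the input is too small. */
size_t computable_points(size_t n, size_t radius, size_t steps)
{
    size_t halo = 2 * halo_per_side(radius, steps);
    return n > halo ? n - halo : 0;
}
```

For a three-point stencil (radius 1) over 4096 inputs, one sweep leaves 4094 computable points, while three fused sweeps leave 4090, with a boundary three cells wide on each side.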
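The buffering idea can be sketched in plain C as tiled processing: copy a tile of the input plus its neighbors into a small local array, then compute only from that array. The tile size of 1024 is taken from the figure on the assumption that it denotes the local buffer size; the coefficients, the function name, and the use of `memcpy` to stand in for a burst transfer are likewise assumptions, not the project's kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.25f
#define BETA  0.5f
#define CHUNK 1024  /* assumed local buffer payload, in elements */

/* Three-point stencil over the interior of `in`, processed tile by tile.
 * Each tile copies up to CHUNK interior elements plus one neighbor on
 * each side into `local`, then computes from the local copy only. */
void stencil_1d_buffered(const float *in, float *out, size_t n)
{
    float local[CHUNK + 2];
    for (size_t start = 1; start + 1 < n; start += CHUNK) {
        size_t count = n - 1 - start;  /* interior points remaining */
        if (count > CHUNK) count = CHUNK;
        /* local[0] holds in[start - 1]; local[count + 1] holds in[start + count]. */
        memcpy(local, in + start - 1, (count + 2) * sizeof(float));
        for (size_t k = 1; k <= count; k++)
            out[start + k - 1] = ALPHA * (local[k - 1] + local[k + 1])
                               + BETA * local[k];
    }
}
```

With 4096 inputs this runs four tiles. On an FPGA the local array would typically map to on-chip memory, so the slow external memory is touched once per element in a long sequential read rather than three times per output.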
Multiple instances Why just one, when we can have plenty?
System Test: Platform Based on Xilinx FPGAs. Local test: KCU1500. Future test environment: AWS F1 instance (VU9P).
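A minimal sketch of how work might be divided among several instances: the interior of the array is split into contiguous ranges, and each instance computes one range while reading from the shared input. The instances run one after another here; on the board they would be separate compute units running concurrently. The function names, coefficients, and the even split are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical coefficients, chosen only for illustration. */
#define ALPHA 0.25f
#define BETA  0.5f

/* Work done by one instance: interior points in [begin, end). */
void stencil_1d_range(const float *in, float *out, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        out[i] = ALPHA * (in[i - 1] + in[i + 1]) + BETA * in[i];
}

/* Split the interior [1, n-1) into `instances` contiguous ranges and
 * process each one. The ranges write disjoint outputs, so they do not
 * depend on each other within a sweep. */
void stencil_1d_multi(const float *in, float *out, size_t n, size_t instances)
{
    if (n < 3 || instances == 0) return;
    size_t interior = n - 2;
    size_t per = (interior + instances - 1) / instances;  /* ceiling division */
    for (size_t id = 0; id < instances; id++) {
        size_t begin = 1 + id * per;
        if (begin >= n - 1) break;
        size_t end = begin + per;
        if (end > n - 1) end = n - 1;
        stencil_1d_range(in, out, begin, end);
    }
}
```

Neighboring ranges read one overlapping input element at each shared edge, which is the only data the instances have in common.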
Results: 1-D vs. 1-D Optimized (Stencil Only)
Results: 2-D vs. 2-D Optimized (Stencil Only)
Results: 1-D vs. 1-D Optimized (With Transfer)