
1 Bridging the FPGA Programmability-Portability Gap via Automatic OpenCL Code Generation and Tuning
Konstantinos Krommydas, Ruchira Sasanka (Intel), Wu-chun Feng

2 Motivation
FPGAs have been very important in HPC and are projected to be important on the path to exascale (performance + energy efficiency).
Problem: Programmability. FPGA programming is still low-level for traditional software developers and domain scientists.
Programming FPGAs spans a spectrum:
- "Low-level": VHDL, Verilog
- "High-level": Bluespec, Impulse C, Catapult C, Synphony C
- "Higher-level": FCUDA, OpenCL
Problem: Performance (of the higher-level approaches).

3 Extending GLAF* to Target the FPGA
GLAF: Grid-based Language and Auto-tuning Framework
- Auto-parallelization: loop-parallelism detection, OpenMP code annotation
- Visual programming framework: click/touch-based interface, minimal typing required
- Automatic code generation: C, Fortran; allows targeting CPU and Xeon Phi
- Auto-tuning: data layout transformations (AoS, SoA); loop interchange, collapse, etc.
*Krommydas et al., "GLAF: A Visual Programming and Auto-Tuning Framework for Parallel Computing", ICPP 2015

4 Background: OpenCL
Abstract model:
- Host (CPU) with host memory; device (GPU, MIC, FPGA, ...) with multiple compute units (CUs), each with local memory, plus global/constant memory shared by all CUs.
Workload partitioning (2-D NDRange example):
- Global size {8,4}: 32 work-items in total
- Local size {4,2}: 4 work-groups, arranged in a {2,2} grid of work-groups

5 Using OpenCL to Program FPGAs
1. Compile the OpenCL kernel file (device code) with the Altera SDK for OpenCL to obtain the FPGA binary (.aocx):
   $ aoc --board s5phq_d8 ocl_kernels.cl -v        → ocl_kernels.aocx
2. Compile the host (CPU) code with a standard C compiler to obtain an x86 binary:
   $ g++ (...library includes/linking) -o app app_ocl.c   → app
3. Run the x86 binary on the host:
   $ ./app
   The host programs the FPGA with the binary (over PCIe) and executes the kernel(s) on the device (FPGA).

6 Using GLAF-OCL to Program FPGAs
From a single GLAF program, code is automatically generated for:
- CPU/Xeon Phi: C/Fortran + OpenMP
- Altera FPGA: Altera OpenCL (host code C file + OpenCL kernel file)
The generated code then follows the same flow as before: the kernel file is compiled with the Altera SDK for OpenCL (aoc) into ocl_kernels.aocx, and the host file with a standard C compiler (g++) into the x86 executable app.

7 Methodology: Automatic OpenCL Code Generation
Host-side steps:
1. Convert loop indices to an OpenCL NDRange
2. Create a kernel for the loop body
3. Perform data transfers as needed (eliminating redundant ones)
4. Set the appropriate kernel arguments
5. Launch the kernel

8 Methodology: Automatic OpenCL Code Generation
Kernel-side transformation:
- The body of the loop is transformed into an OpenCL kernel
- The grids (variables) it uses are transformed into kernel parameters
- Data parallelism (SPMD) is achieved via work-item IDs
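As an illustration, a simple element-wise loop and a hand-written sketch of the OpenCL kernel it could be turned into (the kernel name and grid names `a`, `b`, `c` are hypothetical, not GLAF's actual generated identifiers):

```c
/* Original loop (host side):
 *   for (int i = 0; i < n; i++)
 *       c[i] = a[i] + b[i];
 */

/* Generated-style OpenCL kernel: the loop index i is replaced by the
 * work-item's global ID, and the grids become kernel parameters. */
__kernel void add_kernel(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);   /* SPMD: one work-item per iteration */
    c[i] = a[i] + b[i];
}
```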

9 Methodology: Automatic Altera OpenCL Optimizations
- Memory alignment: alignedMalloc() enables DMA transfers
- Restrict clause: restrict eliminates assumed memory dependencies, yielding more efficient designs
- Constant cache: __constant for read-only data

10 Methodology: Automatic Altera OpenCL Optimizations
- Kernel vectorization (SIMD): kernel annotation __attribute__((num_simd_work_items(N))) for increased throughput
- Multiple compute units (CUs): kernel annotation __attribute__((num_compute_units(N)))
- Single work-item kernels: equivalent to an OpenCL task; enables loop-pipelining inference
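A sketch of how these annotations look on an Altera OpenCL kernel (the kernel name, body, and chosen attribute values are illustrative; in the Altera SDK, num_simd_work_items also requires a reqd_work_group_size annotation):

```c
__attribute__((num_simd_work_items(16)))        /* vectorize 16 work-items  */
__attribute__((reqd_work_group_size(64, 1, 1))) /* required for SIMD        */
__attribute__((num_compute_units(2)))           /* replicate the CU twice   */
__kernel void scale(__global float *x, float alpha)
{
    int i = get_global_id(0);
    x[i] *= alpha;
}
```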

11 Methodology: Automatic Altera OpenCL Optimizations
- Initiation interval (II) reduction: loop relaxation reduces or eliminates the II for reduction operations (identified by GLAF)
- Shift register inference: similar to II reduction, but for sliding-window computations; induces the use of shift registers and couples them with loop unrolling

12 Experimental Setup
Hardware:
- CPU: Intel E5-2697 (12 cores @ 2.7 GHz)
- FPGA: Bittware S5-PCIe-HQ (Altera Stratix V GS)
Software:
- gcc: v4.8.2
- OpenCL: Altera OpenCL SDK v14.2
- OS: Debian Linux

13 Results Overview
Comparing:
- Parallel CPU implementation (auto-generated OpenMP via GLAF)
vs.
- FPGA implementations (auto-generated OpenCL via GLAF-OCL)
Three applications (from different domains):
- Electrostatic Surface Potential Calculation (NB)
- Gene Sequence Search (SS)
- Time-Domain FIR Filter (FF)
FPGA vs. CPU:
- Execution time: 1.6x, 2.3x, and 4.8x better
- Energy consumption: 10.2x, 10.5x, and 33.9x better

14 Results (NB)

Impl.  Type  CUs  SIMD  Const.
NB0    NDR   1          N
NB1    SWI
NB2               8
NB3               16
NB4          2
NB5          3
NB6†

- SIMD vectorization (NB0/NB2 and NB2/NB3): 8x SIMD length → 7.32x speed-up; 2x length → 1.81x speed-up (vectorizable code). Each 2x in SIMD length costs only ~1.35x resource utilization (only the datapath is replicated; control logic is shared across SIMD lanes).
- CU replication (NB3/NB4/NB5): 1.68x speed-up for 2 CUs, 2.27x for 3 CUs (increasing CUs increases the global memory bandwidth demand across CUs; doubling CUs also doubles resource utilization).
- NDR vs. SWI (NB0/NB1): no difference here (both are expressed via pipeline parallelism), but in the other applications they enable further optimizations.
- Resource-driven optimization (NB5/NB6): NB6 applies loop unrolling x32; our choice (SIMD 16 + 3 CUs) is 1.2x faster than that (each has advantages/disadvantages).

CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; †: resource-driven optimization

15 Results (FF)

Impl.  Type  CUs  SIMD  Const.
FF0    NDR   4    16    Y
FF1    SWI   1
FF2          10
FF3‡
FF4‡
FF5*

- The constant cache is used for the filter coefficients.
- In FF the results are counter-intuitive: fastest SWI vs. fastest NDR (FF5/FF0) gives a 20.5x speed-up.
- Optimizations like constant memory are essential to reach FF5 (otherwise the design would not fit!), and without the shift-register optimization, full loop unrolling would not be possible; i.e., the optimizations work in concert, facilitated by automatic code generation.
- More counter-intuitive effects (e.g., FF1/FF2 or FF3/FF4): more CUs can mean worse performance, and higher resource utilization does not equal better performance.
- Also, the II reduction in FF3 vs. FF1 (1 clock cycle vs. 8) does not bring better performance: the clock frequency is 1.36x slower, which degrades performance, too.

CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; ‡: initiation-interval reduction; *: shift-register inference + full loop unrolling

16 Conclusions
- Extension of GLAF → GLAF-OCL facilitates the use of FPGAs by domain experts and novice programmers
- Balances performance and programmability for FPGAs:
  - Visual programming framework
  - Automatic OpenCL code generation
  - Auto-parallelization
  - FPGA-specific code optimizations

17 Extra Slides

18 Methodology: Automatic Altera OpenCL Optimizations
Initiation Interval Reduction

19 Results (SS)

Impl.  Type  CUs  SIMD  Const.
SS0    NDR   4    8     N
SS1          2    16
SS2          1          Y
SS3               6
SS4✻
SS5*   SWI

CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; ✻: inner-loop unrolling; *: shift-register inference

