Konstantinos Krommydas, Ruchira Sasanka (Intel), Wu-chun Feng

Presentation transcript:

Bridging the FPGA Programmability-Portability Gap via Automatic OpenCL Code Generation and Tuning
Konstantinos Krommydas, Ruchira Sasanka (Intel), Wu-chun Feng

Motivation
FPGAs have been very important in HPC and are projected to be important on the path to exascale (performance + energy efficiency).
Problem: programmability. FPGAs are still low-level for traditional software developers and domain scientists.
Programming FPGAs, by abstraction level:
"Low-level": VHDL, Verilog
"High-level": Bluespec, Impulse C, Catapult C, Synphony C
"Higher-level": FCUDA, OpenCL
Problem: performance.

Extending GLAF* to Target the FPGA
GLAF: Grid-based Language and Auto-tuning Framework
Visual programming framework: click/touch-based interface, minimal typing required
Automatic code generation: C, Fortran; allows targeting CPU, Xeon Phi
Auto-parallelization: loop parallelism detection, OpenMP code annotation
Auto-tuning: data layout transformations (AoS, SoA), loop interchange, collapse, etc.
*Krommydas et al., "GLAF: A Visual Programming and Auto-Tuning Framework for Parallel Computing", ICPP 2015

Background: OpenCL
Abstract model: a host (CPU) with host memory drives a device (GPU, MIC, FPGA, ...); the device contains compute units (CUs), each with local memory, plus global/constant memory.
Workload partitioning: a 2D NDRange of global size {Gx, Gy} is split into work-groups of local size {Lx, Ly}.
Example: global size {8, 4} = 32 work-items; local size {4, 2} yields a {2, 2} grid of 4 work-groups.

Using OpenCL to Program FPGAs
1. Compile the OpenCL kernel file with the Altera SDK for OpenCL to obtain the FPGA binary (.aocx):
   $ aoc --board s5phq_d8 ocl_kernels.cl -v    (produces ocl_kernels.aocx)
2. Compile the host (CPU) code with a standard C compiler to obtain an x86 binary:
   $ g++ (...library includes/linking) -o app app_ocl.c    (produces app)
3. Run the x86 binary on the host:
   $ ./app
   The host programs the FPGA with the binary over PCIe and executes the kernel(s) on the device (FPGA).

Using GLAF-OCL to Program FPGAs
From a single GLAF program, code is automatically generated for:
CPU/Xeon Phi: C/Fortran & OpenMP
Altera FPGA: Altera OpenCL (host C file + OpenCL kernel file)
The generated host and device code then follow the same flow as hand-written OpenCL: the standard C compiler builds the x86 executable, the Altera SDK for OpenCL builds the .aocx FPGA binary, and the host runs the kernels on the FPGA over PCIe.

Methodology: Automatic OpenCL Code Generation
Convert loop indices to an OpenCL NDRange
Create a kernel for the loop body
Perform data transfers as needed (eliminating redundant ones)
Set the appropriate kernel arguments
Launch the kernel

Methodology: Automatic OpenCL Code Generation
The body of the loop is transformed into an OpenCL kernel
Grids (variables) used are transformed into kernel parameters
Data parallelism (SPMD) is achieved via work-item IDs

Methodology: Automatic Altera OpenCL Optimizations
Memory alignment: alignedMalloc() → enables DMA transfers
Restrict clause: restrict → eliminates assumed memory dependencies → more efficient designs
Constant cache: __constant for read-only data → served from the on-chip constant cache

Methodology: Automatic Altera OpenCL Optimizations
Kernel vectorization (SIMD): kernel annotation __attribute__((num_simd_work_items(N))) → increased throughput
Multiple compute units (CUs): kernel annotation __attribute__((num_compute_units(N)))
Single work-item (SWI) kernels: equivalent to an OpenCL task → loop pipelining inference

Methodology: Automatic Altera OpenCL Optimizations
Initiation interval (II) reduction: loop relaxation reduces/eliminates the II for reduction operations (identified by GLAF)
Shift register inference: similar in spirit to II reduction, for sliding-window computations; induce the use of shift registers and couple with loop unrolling

Experimental Setup
Hardware:
CPU: Intel E5-2697 (12 cores @ 2.7 GHz)
FPGA: Bittware S5-PCIe-HQ (Altera Stratix V GS)
Software:
gcc v4.8.2
OpenCL: Altera OpenCL SDK v14.2
OS: Debian Linux (kernel version 3.2.46)

Results Overview
Comparing a parallel CPU implementation (auto-generated OpenMP via GLAF) vs. FPGA implementations (auto-generated OpenCL via GLAF-OCL) on three applications from different domains:
Electrostatic Surface Potential Calculation (NB): execution time 1.6x better on the FPGA, energy consumption 10.2x better
Gene Sequence Search (SS): execution time 2.3x better, energy consumption 10.5x better
Time-Domain FIR Filter (FF): execution time 4.8x better, energy consumption 33.9x better

Results: NB
Impl.  Type  CUs  SIMD  Const.
NB0    NDR   1    1     N
NB1    SWI   1    –     N
NB2    NDR   1    8     N
NB3    NDR   1    16    N
NB4    NDR   2    16    N
NB5    NDR   3    16    N
NB6†   –     –    –     –
Vectorization (NB0/NB2 and NB2/NB3): 8x SIMD length → 7.32x speed-up; 2x length → 1.81x speed-up (vectorizable code). Each 2x of SIMD length costs only ~1.35x resource utilization, since only the datapath is replicated and control logic is shared across SIMD lanes.
CU replication (NB3/NB4/NB5): 1.68x speed-up for 2 CUs, 2.27x for 3 CUs. Increasing CUs increases contention for global memory bandwidth across CUs, and doubling CUs roughly doubles resource utilization.
NDR vs. SWI (NB0/NB1): no difference here (both are expressed via pipeline parallelism), but in the other applications SWI enables further optimizations.
Resource-driven optimization (NB5/NB6): NB6 applies loop unrolling x32; our choice of SIMD16-CU3 is 1.2x faster than that (each approach has advantages and disadvantages).
CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; †: resource-driven optimization

Results: FF
Impl.  Type  CUs  SIMD  Const.
FF0    NDR   4    16    Y
FF1    SWI   1    –     –
FF2    SWI   10   –     –
FF3‡   SWI   –    –     –
FF4‡   SWI   –    –     –
FF5*   SWI   –    –     –
The constant cache is used for the filter coefficients.
Fastest SWI vs. fastest NDR (FF5/FF0): 20.5x speed-up.
Optimizations like constant memory are essential for FF5 to even fit on the device, and without the shift-register optimization full loop unrolling would not be possible; the optimizations work in concert, which is facilitated by automatic code generation.
The FF results are counter-intuitive in places: more CUs can mean worse performance (FF1/FF2, FF3/FF4), and higher resource utilization does not imply better performance.
Likewise, the II reduction in FF3 vs. FF1 (1 clock cycle vs. 8) does not bring better performance, because the clock frequency is 1.36x slower, which degrades performance too.
CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; ‡: initiation interval reduction; *: shift register inference + full loop unrolling

Conclusions
GLAF extended to GLAF-OCL facilitates the use of FPGAs by domain experts and novice programmers
A balance between performance and programmability for FPGAs, via:
Visual programming framework
Automatic OpenCL code generation
Auto-parallelization
FPGA-specific code optimizations

Extra Slides

Methodology: Automatic Altera OpenCL Optimizations Initiation Interval Reduction

Results: SS
Impl.  Type  CUs  SIMD  Const.
SS0    NDR   4    8     N
SS1    NDR   2    16    –
SS2    NDR   1    –     Y
SS3    NDR   6    –     –
SS4✻   NDR   –    –     –
SS5*   SWI   –    –     –
CUs: compute units; Const.: constant memory; NDR: NDRange; SWI: single work-item; ✻: inner loop unrolling; *: shift register inference