Relational Query Processing on OpenCL-based FPGAs Zeke Wang, Johns Paul, Hui Yan Cheah (NTU, Singapore), Bingsheng He (NUS, Singapore), Wei Zhang (HKUST, Hong Kong) 1

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 2

What is OpenCL? OpenCL stands for Open Computing Language. OpenCL has been developed for heterogeneous computing environments with a host-accelerator execution model. – CPU runs the control task. – FPGA runs the computing kernel. 3
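The talk itself does not show code, but as a minimal, hedged illustration of this host-accelerator model: the computation is written as an OpenCL C kernel, compiled offline by Altera's offline compiler into an FPGA image (.aocx), which the host program loads with clCreateProgramWithBinary and launches with clEnqueueNDRangeKernel. The kernel name and element type below are illustrative, not taken from the talk.

    /* Minimal OpenCL C kernel (illustrative): element-wise A = B + C. */
    __kernel void vec_add(__global const int *B,
                          __global const int *C,
                          __global int *A) {
        int i = get_global_id(0);   /* one work item handles one element */
        A[i] = B[i] + C[i];
    }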

“Architectural Evolution” of FPGAs 4 Hardware-centric → fine-grained parallelism. Users need to program the FPGA with hardware description languages (HDL).

“Architectural Evolution” of FPGAs: From OpenCL’s Perspective 5 Users can program the FPGA with OpenCL. Software-centric → FPGA as a parallel architecture. [Figure: FPGA board diagram showing external DDR, memory blocks, and logic blocks]

Optimization Methods for OpenCL 6 Common optimizations – Thread Parallelism (TP) – Shared Memory (SM) – Memory Coalescing (MC) FPGA-specific optimizations – Compute Units (CU) – Kernel Vectorization (SIMD)

Optimization (CU) on FPGA 7 CU: replicate the compute unit for the kernel – Computing performance doubles. – Memory performance: local memory performance doubles (local memory is private to each CU); global memory performance depends on sharing (all CUs share global memory).
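As a hedged sketch (not taken from the talk) of how this optimization is typically expressed in the Altera/Intel FPGA OpenCL SDK, the num_compute_units kernel attribute replicates the whole kernel pipeline; the kernel body reuses the earlier illustrative A = B + C example:

    /* Illustrative only: generate two compute units for this kernel.
       Work groups are distributed across the compute units by the hardware scheduler. */
    __attribute__((num_compute_units(2)))
    __kernel void vec_add(__global const int *B,
                          __global const int *C,
                          __global int *A) {
        int i = get_global_id(0);
        A[i] = B[i] + C[i];
    }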

Optimization (SIMD) on FPGA 8 A = B + C Kernel Vectorization (SIMD): it allows multiple work items (or threads) to execute in single instruction multiple data (SIMD) fashion. No SIMD: each work item computes one element of A = B + C. With SIMD = 4: four work items execute as one vector operation.

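As a hedged sketch (again not from the talk) of how kernel vectorization is expressed in the Altera/Intel FPGA OpenCL SDK, the num_simd_work_items attribute vectorizes the kernel datapath so that several work items issue together; it has to be combined with a fixed work-group size that is divisible by the SIMD width. The work-group size below is an arbitrary illustrative choice:

    /* Illustrative only: SIMD = 4, so four work items execute
       A[i] = B[i] + C[i] as a single 4-wide vector operation. */
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __attribute__((num_simd_work_items(4)))
    __kernel void vec_add(__global const int *B,
                          __global const int *C,
                          __global int *A) {
        int i = get_global_id(0);
        A[i] = B[i] + C[i];
    }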

Problem OmniDB [1]: state-of-the-art OpenCL-based query processor on CPUs/GPUs – Kernel-based execution – Common optimization methods – Cost-based scheduling approach How does OmniDB perform on OpenCL-based FPGAs? 12 [1] Shuhao Zhang et al. OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures, VLDB’13.

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 13

Challenge (Large Exploration Space) A single SQL query can have many possible query execution plans on FPGAs. – Each query has multiple operators, and each operator consists of multiple OpenCL kernels. – Each OpenCL kernel can have different combinations of FPGA-specific optimizations. We also consider another dimension: using multiple FPGA images. 14

Challenge (Long Synthesis Time) 15 Compiling an OpenCL program into an FPGA image takes 2-4 hours, so running all the feasible query plans on real FPGAs is not practical. We need a cost model that determines the optimal query plan without synthesizing every candidate.

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 16

Observation 17 There is an FPGA-specific trade-off between the following two factors. – Optimization combination for each kernel – Reconfiguration overhead More aggressive optimizations for each kernel → more resources for each kernel → more resources for the entire query → more FPGA images → higher FPGA reconfiguration overhead.

Impact of Optimization Combination 18 Time and resource utilization of the scanLargeArrays kernel (scan) with 128M tuples: more aggressive optimizations → more resource utilization → higher performance.

FPGA Reconfiguration Overhead 19 FPGA reconfiguration overhead is significant on the current FPGA board. According to Altera, FPGA reconfiguration overhead comes from three sources. – Transferring the active contents (memory footprint) from FPGA memory to host memory via PCIe (roughly 2GB/s) – Fully reconfiguring the FPGA (roughly ms) – Transferring the active contents from host memory back to FPGA memory via PCIe (roughly 2GB/s)
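Putting the three sources together (a hedged reading of this slide, not a formula quoted from the talk), with $M$ the active memory footprint, $B_{PCIe}$ the effective PCIe bandwidth, and $T_{program}$ the time to fully reconfigure the device, the reconfiguration overhead is roughly

$$T_{reconf} \approx \frac{M}{B_{PCIe}} + T_{program} + \frac{M}{B_{PCIe}}$$

For example, with an illustrative 1GB footprint and about 2GB/s in each direction, the two PCIe transfers alone already cost around one second.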

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 20

Our Approach 21 Query processor: accelerated with FPGA-specific optimizations FPGA-specific cost model: determines the optimal query plan for the input query

Query Processor (Operator Kernel Level) 22 The layered design of the query processor contains four operators (which constitute the SQL query). – Selection (5 operator kernels) – Order-by (2 operator kernels) – Grouping and Aggregation (7 operator kernels) – Join (2 operator kernels)

Operator Kernel 23 We adopt the operator kernel implementations from OmniDB, which have already explored the common optimizations – Thread Parallelism (TP) – Shared Memory (SM) – Memory Coalescing (MC) Our work mainly focuses on the FPGA-specific optimizations – Compute Units (CU) – Kernel Vectorization (SIMD)

FPGA-specific Cost Model 24 We propose an FPGA-specific cost model to determine the optimal query plan for the input query. The cost model follows the layered design. – Unit Cost (for each operator kernel) – Optimal query plan generation (dynamic programming based approach)

Unit Cost for Each Operator Kernel 25

Query Plan Generation 26 A dynamic-programming-based approach is used, as sketched below.
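The transcript does not show the actual dynamic program, so the following C sketch only illustrates the general idea under assumed inputs: each operator kernel has a few profiled optimization combinations (execution time and resource usage; all numbers below are made up), every FPGA image has a resource budget, and starting a new image pays a fixed reconfiguration overhead. The DP picks one combination per kernel and an image assignment that minimizes the total estimated time.

    /* Toy sketch of a dynamic program over kernel optimization combinations
       and FPGA image boundaries. All constants are hypothetical. */
    #include <stdio.h>
    #include <float.h>

    #define NK 3        /* operator kernels in the plan (illustrative) */
    #define NC 2        /* optimization combinations per kernel (illustrative) */
    #define BUDGET 100  /* resource budget of one FPGA image, arbitrary units */

    static const double t_[NK][NC] = {{10, 6}, {8, 5}, {12, 7}};    /* exec time  */
    static const int    r_[NK][NC] = {{30, 60}, {25, 55}, {35, 70}}; /* resources  */
    static const double RECONF = 4.0;  /* reconfiguration overhead per new image */

    /* best[k][r] = minimal time to run the first k kernels, with r resource
       units already used inside the currently loaded FPGA image. */
    static double best[NK + 1][BUDGET + 1];

    int main(void) {
        for (int k = 0; k <= NK; k++)
            for (int r = 0; r <= BUDGET; r++)
                best[k][r] = DBL_MAX;
        best[0][0] = 0.0;

        for (int k = 0; k < NK; k++)
            for (int r = 0; r <= BUDGET; r++) {
                if (best[k][r] == DBL_MAX) continue;
                for (int c = 0; c < NC; c++) {
                    /* Option 1: place kernel k (combo c) in the current image. */
                    if (r + r_[k][c] <= BUDGET) {
                        double v = best[k][r] + t_[k][c];
                        if (v < best[k + 1][r + r_[k][c]])
                            best[k + 1][r + r_[k][c]] = v;
                    }
                    /* Option 2: start a new image and pay the reconfiguration cost. */
                    if (r_[k][c] <= BUDGET) {
                        double v = best[k][r] + RECONF + t_[k][c];
                        if (v < best[k + 1][r_[k][c]])
                            best[k + 1][r_[k][c]] = v;
                    }
                }
            }

        double ans = DBL_MAX;
        for (int r = 0; r <= BUDGET; r++)
            if (best[NK][r] < ans) ans = best[NK][r];
        printf("estimated cost of the best plan: %.1f\n", ans);
        return 0;
    }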

Benefit of Layered Design 27 Researchers can keep exploring other optimizations (e.g., kernel fusion) to further accelerate each operator kernel. When an operator kernel is further optimized: – Profile it to obtain its new unit cost. – Re-run the dynamic-programming-based approach to determine the optimal query plan for the queries that contain the optimized operator kernel.

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 28

Experimental Setup Platform: – Terasic DE5-Net board: Altera Stratix V A7 FPGA and 4GB two-bank DDR3 – PCI-e 2.0 (X8) – Altera OpenCL SDK version 14.0 Workloads: – Four queries (Q1, Q2, Q3 and Q4) – Tuple format: (key, payload); both keys and payloads are 4 bytes. 29 We use Q3 as the running example.

Details of Q3 SQL query: – SELECT S.key, SUM(S.payload) FROM S WHERE Lo ≤ S.payload ≤ Hi GROUP BY S.key 30 Q3: 12 operator kernels

Generation of Execution Plans 31 [Tables: estimated vs. measured LUTs, REGs, RAMs, DSPs, and frequency for the single FPGA image of execution plan 1 and for the three FPGA images of execution plan 2] Our cost model can roughly predict the resource utilization and frequency of each FPGA image.

Break-even Point for Execution Plans 32 [Figure: measured (real FPGA) and estimated (cost model) performance of execution plans 1 and 2 across table sizes, with the break-even point marked] Our cost model can roughly predict the performance of each execution plan and can recommend the optimal execution plan for different table sizes.

Comparison with OmniDB on FPGA 33 One FPGA image: FPGA reconfiguration overhead > benefit from the reduced execution time (from more aggressive optimizations for each involved kernel). OmniDB: one FPGA image without FPGA-specific optimizations.

Comparison with OmniDB on FPGA 34 Three FPGA images: FPGA reconfiguration overhead < benefit from the reduced execution time. OmniDB: one FPGA image without FPGA-specific optimizations.
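In other words (a hedged formalization of the two comparisons above, not a formula shown in the talk), a plan that splits the query across more FPGA images is preferable whenever

$$\sum_i T_{reconf,i} < T_{exec}(\text{fewer images}) - T_{exec}(\text{more images})$$

that is, when the accumulated reconfiguration overhead is outweighed by the execution time saved through more aggressive per-kernel optimizations.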

Outline: Background and Problem – Challenges – Observation – Our Solution – Experiment – Conclusion 35

Conclusion Since the architecture of FPGAs differs significantly from that of CPUs/GPUs, while OpenCL-based query processing has so far been designed for CPUs/GPUs, we need to revisit it on FPGAs. We develop an FPGA-specific cost model to determine the optimal query plan for the input query. Our proposed approach achieves significant speedups over OmniDB on the FPGA. 36

Wish List for Next-gen Databases on FPGAs – Larger DDR size, higher memory bandwidth – PCI-e 3.0 (X16) – Retaining DDR contents during FPGA reconfiguration – Partial reconfiguration while using OpenCL (I know it is tough.) 37

Q & A Our Terasic DE5-Net FPGA board was donated by the Altera University Program. We thank John Freeman (Altera) for his support. This work is supported by a MoE AcRF Tier 1 grant (MOE 2014-T ), an NUS startup grant, and an HKUST startup grant (R9336). Our research group: Xtra Computing Group